[BUG] Boba GRPO with vllm training failed - AttributeError: 'GRPOConfig' object has no attribute 'weight_update_mode'. #482

@zhshgmail

Description

Checklist

  • [NA] The error occurs when using our provided Docker image.
  • [Checked] I can consistently reproduce the bug across multiple trials or random seeds.
  • [Checked] If the error causes experiment abortion, I've verified that this error is the root
    cause, not a secondary error caused by peer workers.

We use conda to prepare the environment instead of the provided Docker image, but the root cause is straightforward and not related to the environment.

Detailed Information

Describe the bug

The Boba GRPO scripts with vLLM in the examples folder fail because of a configuration type mismatch.

20251025-06:18:12.733 RayLauncher ERROR: Job trainer:3 failed with error: ray::run_func() (pid=3820227, ip=90.91.103.32)
  File "/home/z00637938/workspace/AReaL/areal/launcher/ray.py", line 65, in run_func
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/z00637938/workspace/AReaL/examples/math/boba_grpo.py", line 127, in main
    weight_update_meta = get_model_update_meta(config)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/z00637938/workspace/AReaL/areal/utils/model.py", line 50, in get_model_update_meta
    if config.weight_update_mode == "disk":
       ^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'GRPOConfig' object has no attribute 'weight_update_mode'.

Root cause

examples.math.boba_grpo.main calls areal.utils.model.get_model_update_meta with config as a parameter. The type of config is areal.api.cli_args.GRPOConfig. However, get_model_update_meta() has no type hints and tries to access the attribute weight_update_mode directly on config. In fact, weight_update_mode only exists on config.actor, which is a PPOActorConfig (a subclass of TrainEngineConfig).

The process therefore fails with AttributeError because the attribute weight_update_mode cannot be found on GRPOConfig.

The suggested fix is to add type hints and access weight_update_mode via config.actor instead of config. The current implementation is:

def get_model_update_meta(config):
    if config.weight_update_mode == "disk":
        return WeightUpdateMeta.from_disk(
            config.experiment_name, config.trial_name, config.cluster.fileroot
        )
    else:
        return WeightUpdateMeta.from_fsdp_xccl(
            AllocationMode.from_str(config.allocation_mode)
        )
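A minimal, self-contained sketch of the suggested fix. The stub dataclasses below stand in for the real areal.api.cli_args.GRPOConfig and PPOActorConfig (they are illustrative assumptions, not the actual AReaL classes, which carry many more fields), and the string return values stand in for the WeightUpdateMeta constructors:

```python
from dataclasses import dataclass, field

# Stub configs mirroring only the relevant shape of areal.api.cli_args
# (illustrative; the real classes have many more fields).
@dataclass
class PPOActorConfig:
    weight_update_mode: str = "disk"

@dataclass
class GRPOConfig:
    actor: PPOActorConfig = field(default_factory=PPOActorConfig)

def get_model_update_meta(config: GRPOConfig) -> str:
    # Fixed lookup: weight_update_mode lives on config.actor, not on config.
    if config.actor.weight_update_mode == "disk":
        return "disk"       # stand-in for WeightUpdateMeta.from_disk(...)
    return "fsdp_xccl"      # stand-in for WeightUpdateMeta.from_fsdp_xccl(...)

# The reported bug: GRPOConfig itself has no such attribute.
assert not hasattr(GRPOConfig(), "weight_update_mode")
# The fixed function dispatches via the nested actor config instead.
assert get_model_update_meta(GRPOConfig()) == "disk"
```

With the type hint in place, a static checker (e.g. mypy) would also flag the original `config.weight_update_mode` access at analysis time instead of at runtime.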

Expected behavior

The example scripts should run without this type mismatch.

Full logs


(run_func pid=3820226) 20251025-06:18:08.596 Launcher Utils INFO: Found 2 rollout servers: 90.91.103.32:11451, 90.91.103.32:38927
(run_func pid=3820226) 20251025-06:18:08.596 [Remote Inference Engine Rank 2] INFO: Get server addresses from name_resolve.
(run_func pid=3820226) 20251025-06:18:08.596 [Remote Inference Engine Rank 2] INFO: Waiting for server ready...
(run_func pid=3820226) 20251025-06:18:08.602 [Remote Inference Engine Rank 2] INFO: Servers are all ready!
(run_func pid=3820225) [rank1]:[W1025 06:18:09.742729016 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(run_func pid=3820224) /root/miniconda3/envs/areal_async_vllm_zheng/lib/python3.11/site-packages/megatron/core/optimizer/clip_grads.py:29: UserWarning: Transformer Engine and Apex are not installed. Falling back to local implementations of multi_tensor_applier, multi_tensor_l2norm, and multi_tensor_scale [repeated 6x across cluster]
(run_func pid=3820224) warnings.warn( [repeated 12x across cluster]
(run_func pid=3820224) /root/miniconda3/envs/areal_async_vllm_zheng/lib/python3.11/site-packages/megatron/core/models/gpt/gpt_layer_specs.py:67: UserWarning: Apex is not installed. Falling back to Torch Norm [repeated 6x across cluster]
(run_func pid=3820224) warnings.warn("Apex is not installed. Falling back to Torch Norm") [repeated 6x across cluster]
20251025-06:18:12.731 RayLauncher ERROR: Job trainer:0 failed with error: ray::run_func() (pid=3820224, ip=90.91.103.32)
  File "/home/z00637938/workspace/AReaL/areal/launcher/ray.py", line 65, in run_func
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/z00637938/workspace/AReaL/examples/math/boba_grpo.py", line 127, in main
    weight_update_meta = get_model_update_meta(config)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/z00637938/workspace/AReaL/areal/utils/model.py", line 50, in get_model_update_meta
    if config.weight_update_mode == "disk":
       ^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'GRPOConfig' object has no attribute 'weight_update_mode'.

To Reproduce

Run the Boba GRPO vLLM script in the examples folder. It's 100% reproducible.

Commit ID

main's tip: 4a4abc6

Environment

torch 2.8.0
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90

Script

python3 -m areal.launcher.ray examples/math/boba_grpo.py --config examples/math/boba_grpo_vllm.yaml experiment_name=boba_grpo_vllm_16_gpus trial_name=trail_0
