Description
Checklist
- [ ] The error occurs when using our provided Docker image. (N/A)
- [x] I can consistently reproduce the bug across multiple trials or random seeds.
- [x] If the error causes experiment abortion, I've verified that this error is the root cause, not a secondary error caused by peer workers.
We use conda to prepare the environment instead of the provided Docker image, but the root cause of the problem is straightforward and not directly related to the environment.
Detailed Information
Describe the bug
The Boba GRPO script with vLLM in the examples folder fails because of a configuration attribute mismatch.
20251025-06:18:12.733 RayLauncher ERROR: Job trainer:3 failed with error: ray::run_func() (pid=3820227, ip=90.91.103.32)
File "/home/z00637938/workspace/AReaL/areal/launcher/ray.py", line 65, in run_func
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/z00637938/workspace/AReaL/examples/math/boba_grpo.py", line 127, in main
weight_update_meta = get_model_update_meta(config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/z00637938/workspace/AReaL/areal/utils/model.py", line 50, in get_model_update_meta
if config.weight_update_mode == "disk":
^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'GRPOConfig' object has no attribute 'weight_update_mode'.
Root cause
examples.math.boba_grpo.main calls areal.utils.model.get_model_update_meta with config as the parameter. The type of config is areal.api.cli_args.GRPOConfig. However, get_model_update_meta() has no type hints and tries to access the attribute weight_update_mode directly on config. In fact, weight_update_mode only exists on the actor attribute of config, which is of type PPOActorConfig (a subclass of TrainEngineConfig).
The process fails with an AttributeError because weight_update_mode cannot be found on GRPOConfig.
The suggested fix is to add type hints and access weight_update_mode from config.actor instead of config. For reference, the current implementation is:
def get_model_update_meta(config):
    if config.weight_update_mode == "disk":
        return WeightUpdateMeta.from_disk(
            config.experiment_name, config.trial_name, config.cluster.fileroot
        )
    else:
        return WeightUpdateMeta.from_fsdp_xccl(
            AllocationMode.from_str(config.allocation_mode)
        )
Expected behavior
Example scripts should run without the attribute mismatch error.
Full logs
(run_func pid=3820226) 20251025-06:18:08.596 Launcher Utils INFO: Found 2 rollout servers: 90.91.103.32:11451, 90.91.103.32:38927
(run_func pid=3820226) 20251025-06:18:08.596 [Remote Inference Engine Rank 2] INFO: Get server addresses from name_resolve.
(run_func pid=3820226) 20251025-06:18:08.596 [Remote Inference Engine Rank 2] INFO: Waiting for server ready...
(run_func pid=3820226) 20251025-06:18:08.602 [Remote Inference Engine Rank 2] INFO: Servers are all ready!
(run_func pid=3820225) [rank1]:[W1025 06:18:09.742729016 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(run_func pid=3820224) /root/miniconda3/envs/areal_async_vllm_zheng/lib/python3.11/site-packages/megatron/core/optimizer/clip_grads.py:29: UserWarning: Transformer Engine and Apex are not installed. Falling back to local implementations of multi_tensor_applier, multi_tensor_l2norm, and multi_tensor_scale [repeated 6x across cluster]
(run_func pid=3820224) warnings.warn( [repeated 12x across cluster]
(run_func pid=3820224) /root/miniconda3/envs/areal_async_vllm_zheng/lib/python3.11/site-packages/megatron/core/models/gpt/gpt_layer_specs.py:67: UserWarning: Apex is not installed. Falling back to Torch Norm [repeated 6x across cluster]
(run_func pid=3820224) warnings.warn("Apex is not installed. Falling back to Torch Norm") [repeated 6x across cluster]
20251025-06:18:12.731 RayLauncher ERROR: Job trainer:0 failed with error: ray::run_func() (pid=3820224, ip=90.91.103.32)
File "/home/z00637938/workspace/AReaL/areal/launcher/ray.py", line 65, in run_func
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/z00637938/workspace/AReaL/examples/math/boba_grpo.py", line 127, in main
weight_update_meta = get_model_update_meta(config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/z00637938/workspace/AReaL/areal/utils/model.py", line 50, in get_model_update_meta
if config.weight_update_mode == "disk":
^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'GRPOConfig' object has no attribute 'weight_update_mode'.
To Reproduce
Run the Boba GRPO vLLM script in the examples folder. It's 100% reproducible.
Commit ID
main's tip: 4a4abc6
Environment
torch 2.8.0
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
Script
python3 -m areal.launcher.ray examples/math/boba_grpo.py --config examples/math/boba_grpo_vllm.yaml experiment_name=boba_grpo_vllm_16_gpus trial_name=trail_0