
refactor: use callbacks to implement xccl weight transfer and avoid busy waiting during rollout #769

Merged
nuzant merged 17 commits into main from fw/controller-nccl-w on Dec 29, 2025

Conversation

@garrett4wade
Collaborator

@garrett4wade commented Dec 27, 2025

Description

This PR implements RolloutCallback, which redirects method calls on the RemoteInferenceEngine inside TrainEngine to the RolloutController. It has two major benefits:

  1. When updating weights, the controller passes method calls through callbacks to the engine, so the weight-update logic is no longer duplicated between the engine and the controller.
  2. Callbacks let the engine notify the controller when a rollout is finished, avoiding the original busy-waiting logic.

The code has been validated with the local and Slurm schedulers using FSDP.
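The callback pattern described above can be sketched as follows. This is a minimal illustration, not the actual AReaL code: the class bodies, method names (update_weights, on_rollout_finished), and the dict-based metadata are all hypothetical stand-ins for the real RolloutCallback/RolloutController interfaces.

```python
# Hypothetical sketch: a RolloutCallback forwards engine-side method calls
# back to the controller, so the update-weights logic lives in one place.
class RolloutController:
    def __init__(self):
        self.weight_versions = []
        self.finished = []

    def update_weights(self, meta):
        # Single authoritative implementation; no duplicate in the engine.
        self.weight_versions.append(meta)

    def notify_rollout_finished(self, task_id):
        self.finished.append(task_id)


class RolloutCallback:
    """Handed to the TrainEngine; redirects calls to the controller."""

    def __init__(self, controller):
        self._controller = controller

    def update_weights(self, meta):
        self._controller.update_weights(meta)

    def on_rollout_finished(self, task_id):
        self._controller.notify_rollout_finished(task_id)


# Usage: the engine only ever sees the callback, never the controller.
controller = RolloutController()
cb = RolloutCallback(controller)
cb.update_weights({"version": 1})
cb.on_rollout_finished("task-0")
```

The indirection is what removes the duplication: both the engine and the controller route weight updates through the same controller-side implementation.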

Future work:

  • Test with RayScheduler and remove the hard-coded branching logic (e.g., if 'rollout' in role: ...)
  • Test with FSDP + TP/CP and vLLM.
  • Test with the Megatron backend until the previous end-to-end experiment (e.g., tau2-bench) runs fine.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

@gemini-code-assist
Contributor

Summary of Changes

Hello @garrett4wade, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the weight transfer and rollout completion mechanisms to use an asynchronous callback-based approach. The primary goal is to improve the coordination between training engines and the rollout controller, making the process more efficient by avoiding busy waiting and centralizing weight update logic. This change enhances the system's responsiveness and scalability, particularly in distributed environments.

Highlights

  • Callback-based Weight Transfer: Introduced a new RolloutCallback mechanism to facilitate asynchronous, non-blocking communication for XCCL weight updates between training engines and the RolloutController. This refactors the weight update logic, centralizing it within the controller and eliminating duplication.
  • Elimination of Busy Waiting: Replaced the previous busy-waiting mechanism during rollout completion with an event-driven callback system. The RolloutController now waits on asyncio.Future objects that are resolved via HTTP callbacks from the inference engines, significantly improving efficiency and preventing deadlocks during collective operations.
  • Dynamic NCCL Configuration: The NCCL master address and port for weight updates are now dynamically assigned by the training engines (FSDP and Megatron) using gethostip() and find_free_ports(), rather than being hardcoded or pre-configured.
  • Simplified TrainController Logic: The TrainController has been streamlined by removing several internal methods related to distributed weight updates and bucket parameter specification, as these responsibilities are now handled by the new callback system and the RolloutController.
  • Removal of MockInferenceEngine: The MockInferenceEngine has been removed, indicating a shift towards a more integrated and callback-driven interaction model for inference engines.
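The "elimination of busy waiting" highlight above can be sketched as an event-driven completion tracker. This is a hedged illustration under assumed names (RolloutCompletionTracker, register, on_callback), not the actual AReaL API: the controller awaits a per-task asyncio.Future instead of polling, and the HTTP callback handler resolves it.

```python
import asyncio


class RolloutCompletionTracker:
    """Illustrative: one awaitable Future per in-flight rollout task."""

    def __init__(self) -> None:
        self._futures: dict[str, asyncio.Future] = {}

    def register(self, task_id: str) -> asyncio.Future:
        # Called when a rollout task is dispatched to an inference engine.
        fut = asyncio.get_running_loop().create_future()
        self._futures[task_id] = fut
        return fut

    def on_callback(self, task_id: str, result: str) -> None:
        # Invoked by the HTTP callback endpoint when the engine reports
        # completion; this wakes the awaiting coroutine.
        fut = self._futures.pop(task_id)
        if not fut.done():
            fut.set_result(result)


async def _demo() -> str:
    tracker = RolloutCompletionTracker()
    fut = tracker.register("task-0")
    # Simulate the engine's HTTP callback arriving a moment later.
    asyncio.get_running_loop().call_later(0.01, tracker.on_callback, "task-0", "done")
    return await fut  # no busy waiting: the event loop sleeps until resolved


result = asyncio.run(_demo())
```

Compared with a polling loop, the controller thread consumes no CPU while waiting, and completion latency is bounded by callback delivery rather than the polling interval.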



@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a significant and well-executed refactoring of the weight update and rollout process. By introducing RolloutCallback, it successfully decouples the TrainController from the intricate details of weight update coordination, moving that logic into the TrainEngine. This simplifies the TrainController and reduces code duplication. Furthermore, the switch from a busy-waiting polling mechanism to a callback-based approach for rollout completion is a great improvement for efficiency and system responsiveness. The changes are well-structured and consistent across the different components.

I have one suggestion for a minor refactoring to reduce code duplication in the new RolloutCallback class.

@garrett4wade added the safe-to-test label (Ready to run unit-tests in a PR) on Dec 27, 2025
@garrett4wade
Collaborator Author

/gemini review


@gemini-code-assist (bot) left a comment


Code Review

This pull request is a significant and well-executed refactoring of the weight update and rollout synchronization mechanisms. By introducing RolloutCallback and moving the weight update orchestration from TrainController into the respective TrainEngines, the design is much cleaner, more modular, and easier to maintain. The replacement of the busy-waiting loop in RolloutController with an asynchronous callback system is a major improvement for efficiency and performance. The changes are consistent across different engine types (FSDP, Megatron) and are supported by updated tests. I have a couple of suggestions for minor improvements to further enhance the robustness of the new callback mechanism.

@garrett4wade added and then removed the safe-to-test label on Dec 28, 2025

@nuzant left a comment


LGTM, only some minor issues to be addressed.

@garrett4wade
Collaborator Author

/gemini review


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new asynchronous callback mechanism for distributed weight updates and rollout task completion, primarily impacting the RolloutController and TrainController. The RolloutController now hosts a Flask HTTP server to receive non-blocking callbacks from inference engines, replacing a polling-based approach for task completion. This change involved adding a new RolloutCallback class for train workers to coordinate with the TrainController via HTTP. Resource allocation for GPUs in RolloutController was corrected to scale multiplicatively, and the WeightUpdateMeta structure was refined to allow None for NCCL master details, with these details now dynamically set by the engines. The BatchTaskDispatcher was enhanced to manage and send these callbacks using a dedicated thread pool. The MockInferenceEngine was removed, streamlining the architecture. Additionally, the SchedulerConfig type was updated to allow None, and setup_timeout was increased. In the vllm_worker_extension.py file, the weight update groups were refactored from a single instance to a dictionary, enabling the management of multiple named weight update groups for greater flexibility. Furthermore, tensor arguments in RPC calls are now explicitly moved to the correct device before broadcasting, improving robustness.
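The vllm_worker_extension.py change the review mentions — a single weight-update group replaced by a dictionary of named groups — can be sketched as below. This is a hedged illustration: the class name, method names, and the (addr, port) tuple standing in for a group handle are all assumptions; real code would initialize an NCCL process group here.

```python
class WorkerExtensionSketch:
    """Illustrative: manage multiple named weight-update groups."""

    def __init__(self) -> None:
        # name -> group "handle" (here just the rendezvous address/port)
        self._weight_update_groups: dict[str, tuple[str, int]] = {}

    def init_weight_update_group(self, name: str, master_addr: str, master_port: int) -> None:
        if name in self._weight_update_groups:
            raise ValueError(f"weight update group {name!r} already exists")
        # Placeholder for collective-backend (e.g. NCCL) group creation.
        self._weight_update_groups[name] = (master_addr, master_port)

    def destroy_weight_update_group(self, name: str) -> None:
        # Removing by name leaves other groups untouched.
        self._weight_update_groups.pop(name, None)


# Multiple named groups can now coexist, instead of a single instance.
ext = WorkerExtensionSketch()
ext.init_weight_update_group("actor", "10.0.0.1", 29500)
ext.init_weight_update_group("critic", "10.0.0.1", 29501)
```

Keying groups by name is what allows several concurrent weight-transfer channels without the workers clobbering one another's state.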

@garrett4wade added and then removed the safe-to-test label on Dec 29, 2025
@nuzant merged commit 8d41ab3 into main on Dec 29, 2025
7 checks passed
@nuzant deleted the fw/controller-nccl-w branch on Dec 29, 2025 at 06:22

Labels

safe-to-test Ready to run unit-tests in a PR.
