
refactor: use callbacks to implement xccl weight transfer and avoid busy waiting during rollout #769

Merged
nuzant merged 17 commits into main from fw/controller-nccl-w on Dec 29, 2025

Conversation

@garrett4wade
Collaborator

@garrett4wade commented Dec 27, 2025

Description

This PR implements RolloutCallback, which redirects method calls on the RemoteInferenceEngine inside TrainEngine to the RolloutController. It has two major benefits:

  1. When updating weights, the controller passes method calls through callbacks to the engine, so the weight-update logic is no longer duplicated between the engine and the controller.
  2. Callbacks let the engine notify the controller when a rollout is finished, avoiding the original busy-waiting logic.

The code has been validated with the local and Slurm schedulers using FSDP.
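The callback pattern described above can be sketched as follows. This is a minimal illustration, not the actual AReaL code: the class bodies, method names (update_weights, on_rollout_finished), and the dict-based metadata are all hypothetical stand-ins for the real RolloutCallback/RolloutController interfaces.

```python
# Hypothetical sketch: a RolloutCallback forwards engine-side method calls
# back to the controller, so the update-weights logic lives in one place.
class RolloutController:
    def __init__(self):
        self.weight_versions = []
        self.finished = []

    def update_weights(self, meta):
        # Single authoritative implementation; no duplicate in the engine.
        self.weight_versions.append(meta)

    def notify_rollout_finished(self, task_id):
        self.finished.append(task_id)


class RolloutCallback:
    """Handed to the TrainEngine; redirects calls to the controller."""

    def __init__(self, controller):
        self._controller = controller

    def update_weights(self, meta):
        self._controller.update_weights(meta)

    def on_rollout_finished(self, task_id):
        self._controller.notify_rollout_finished(task_id)


# Usage: the engine only ever sees the callback, never the controller.
controller = RolloutController()
cb = RolloutCallback(controller)
cb.update_weights({"version": 1})
cb.on_rollout_finished("task-0")
```

The indirection is what removes the duplication: both the engine and the controller route weight updates through the same controller-side implementation.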

Future work:

  • Test with RayScheduler and remove the hard-coded branching logic (e.g., if 'rollout' in role: ...)
  • Test with FSDP + TP/CP and vLLM.
  • Test with the Megatron backend until the previous end-to-end experiment (e.g., tau2-bench) runs fine.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

@gemini-code-assist
Contributor

Summary of Changes

Hello @garrett4wade, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the weight transfer and rollout completion mechanisms to use an asynchronous callback-based approach. The primary goal is to improve the coordination between training engines and the rollout controller, making the process more efficient by avoiding busy waiting and centralizing weight update logic. This change enhances the system's responsiveness and scalability, particularly in distributed environments.

Highlights

  • Callback-based Weight Transfer: Introduced a new RolloutCallback mechanism to facilitate asynchronous, non-blocking communication for XCCL weight updates between training engines and the RolloutController. This refactors the weight update logic, centralizing it within the controller and eliminating duplication.
  • Elimination of Busy Waiting: Replaced the previous busy-waiting mechanism during rollout completion with an event-driven callback system. The RolloutController now waits on asyncio.Future objects that are resolved via HTTP callbacks from the inference engines, significantly improving efficiency and preventing deadlocks during collective operations.
  • Dynamic NCCL Configuration: The NCCL master address and port for weight updates are now dynamically assigned by the training engines (FSDP and Megatron) using gethostip() and find_free_ports(), rather than being hardcoded or pre-configured.
  • Simplified TrainController Logic: The TrainController has been streamlined by removing several internal methods related to distributed weight updates and bucket parameter specification, as these responsibilities are now handled by the new callback system and the RolloutController.
  • Removal of MockInferenceEngine: The MockInferenceEngine has been removed, indicating a shift towards a more integrated and callback-driven interaction model for inference engines.
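The "elimination of busy waiting" highlight above can be sketched as an event-driven completion tracker. This is a hedged illustration under assumed names (RolloutCompletionTracker, register, on_callback), not the actual AReaL API: the controller awaits a per-task asyncio.Future instead of polling, and the HTTP callback handler resolves it.

```python
import asyncio


class RolloutCompletionTracker:
    """Illustrative: one awaitable Future per in-flight rollout task."""

    def __init__(self) -> None:
        self._futures: dict[str, asyncio.Future] = {}

    def register(self, task_id: str) -> asyncio.Future:
        # Called when a rollout task is dispatched to an inference engine.
        fut = asyncio.get_running_loop().create_future()
        self._futures[task_id] = fut
        return fut

    def on_callback(self, task_id: str, result: str) -> None:
        # Invoked by the HTTP callback endpoint when the engine reports
        # completion; this wakes the awaiting coroutine.
        fut = self._futures.pop(task_id)
        if not fut.done():
            fut.set_result(result)


async def _demo() -> str:
    tracker = RolloutCompletionTracker()
    fut = tracker.register("task-0")
    # Simulate the engine's HTTP callback arriving a moment later.
    asyncio.get_running_loop().call_later(0.01, tracker.on_callback, "task-0", "done")
    return await fut  # no busy waiting: the event loop sleeps until resolved


result = asyncio.run(_demo())
```

Compared with a polling loop, the controller thread consumes no CPU while waiting, and completion latency is bounded by callback delivery rather than the polling interval.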



@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a significant and well-executed refactoring of the weight update and rollout process. By introducing RolloutCallback, it successfully decouples the TrainController from the intricate details of weight update coordination, moving that logic into the TrainEngine. This simplifies the TrainController and reduces code duplication. Furthermore, the switch from a busy-waiting polling mechanism to a callback-based approach for rollout completion is a great improvement for efficiency and system responsiveness. The changes are well-structured and consistent across the different components.

I have one suggestion for a minor refactoring to reduce code duplication in the new RolloutCallback class.

@garrett4wade added the safe-to-test label (Ready to run unit-tests in a PR) on Dec 27, 2025
@garrett4wade
Collaborator Author

/gemini review


@gemini-code-assist (bot) left a comment


Code Review

This pull request is a significant and well-executed refactoring of the weight update and rollout synchronization mechanisms. By introducing RolloutCallback and moving the weight update orchestration from TrainController into the respective TrainEngines, the design is much cleaner, more modular, and easier to maintain. The replacement of the busy-waiting loop in RolloutController with an asynchronous callback system is a major improvement for efficiency and performance. The changes are consistent across different engine types (FSDP, Megatron) and are supported by updated tests. I have a couple of suggestions for minor improvements to further enhance the robustness of the new callback mechanism.

@garrett4wade added and then removed the safe-to-test label on Dec 28, 2025

@nuzant left a comment


LGTM, only some minor issues to be addressed.

@garrett4wade
Collaborator Author

/gemini review


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new asynchronous callback mechanism for distributed weight updates and rollout task completion, primarily impacting the RolloutController and TrainController. The RolloutController now hosts a Flask HTTP server to receive non-blocking callbacks from inference engines, replacing a polling-based approach for task completion. This change involved adding a new RolloutCallback class for train workers to coordinate with the TrainController via HTTP. Resource allocation for GPUs in RolloutController was corrected to scale multiplicatively, and the WeightUpdateMeta structure was refined to allow None for NCCL master details, with these details now dynamically set by the engines. The BatchTaskDispatcher was enhanced to manage and send these callbacks using a dedicated thread pool. The MockInferenceEngine was removed, streamlining the architecture. Additionally, the SchedulerConfig type was updated to allow None, and setup_timeout was increased. In the vllm_worker_extension.py file, the weight update groups were refactored from a single instance to a dictionary, enabling the management of multiple named weight update groups for greater flexibility. Furthermore, tensor arguments in RPC calls are now explicitly moved to the correct device before broadcasting, improving robustness.
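The vllm_worker_extension.py change the review mentions — a single weight-update group replaced by a dictionary of named groups — can be sketched as below. This is a hedged illustration: the class name, method names, and the (addr, port) tuple standing in for a group handle are all assumptions; real code would initialize an NCCL process group here.

```python
class WorkerExtensionSketch:
    """Illustrative: manage multiple named weight-update groups."""

    def __init__(self) -> None:
        # name -> group "handle" (here just the rendezvous address/port)
        self._weight_update_groups: dict[str, tuple[str, int]] = {}

    def init_weight_update_group(self, name: str, master_addr: str, master_port: int) -> None:
        if name in self._weight_update_groups:
            raise ValueError(f"weight update group {name!r} already exists")
        # Placeholder for collective-backend (e.g. NCCL) group creation.
        self._weight_update_groups[name] = (master_addr, master_port)

    def destroy_weight_update_group(self, name: str) -> None:
        # Removing by name leaves other groups untouched.
        self._weight_update_groups.pop(name, None)


# Multiple named groups can now coexist, instead of a single instance.
ext = WorkerExtensionSketch()
ext.init_weight_update_group("actor", "10.0.0.1", 29500)
ext.init_weight_update_group("critic", "10.0.0.1", 29501)
```

Keying groups by name is what allows several concurrent weight-transfer channels without the workers clobbering one another's state.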

@garrett4wade added and then removed the safe-to-test label on Dec 29, 2025
@nuzant merged commit 8d41ab3 into main on Dec 29, 2025
7 checks passed
@nuzant deleted the fw/controller-nccl-w branch on Dec 29, 2025 at 06:22

Labels

safe-to-test Ready to run unit-tests in a PR.
