refactor: use callbacks to implement xccl weight transfer and avoid busy waiting during rollout #769
Conversation
Summary of Changes (Gemini Code Assist): This pull request refactors the weight transfer and rollout completion mechanisms to use an asynchronous callback-based approach. The primary goal is to improve coordination between training engines and the rollout controller, making the process more efficient by avoiding busy waiting and centralizing weight-update logic. This change enhances the system's responsiveness and scalability, particularly in distributed environments.
Code Review
This pull request introduces a significant and well-executed refactoring of the weight update and rollout process. By introducing RolloutCallback, it successfully decouples the TrainController from the intricate details of weight update coordination, moving that logic into the TrainEngine. This simplifies the TrainController and reduces code duplication. Furthermore, the switch from a busy-waiting polling mechanism to a callback-based approach for rollout completion is a great improvement for efficiency and system responsiveness. The changes are well-structured and consistent across the different components.
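To make the efficiency argument concrete, here is a minimal sketch of the polling-to-callback switch the review describes. The class and method names (`RolloutController`, `on_task_done`, `wait`) are illustrative, not AReaL's actual API; the point is that a `threading.Event` lets the waiter sleep until the inference engine fires the callback, instead of spinning in a poll loop.

```python
import threading


class RolloutController:
    """Illustrative controller that waits for rollout completion via a callback."""

    def __init__(self):
        self._done = threading.Event()
        self._result = None

    def on_task_done(self, result):
        # Invoked by the inference engine when a rollout finishes.
        self._result = result
        self._done.set()

    def wait(self, timeout=None):
        # Blocks without spinning; the thread sleeps until the event fires.
        # The busy-waiting equivalent this replaces would look like:
        #     while not finished: time.sleep(0.1)
        if not self._done.wait(timeout=timeout):
            raise TimeoutError("rollout did not complete in time")
        return self._result


ctrl = RolloutController()
threading.Thread(target=ctrl.on_task_done, args=("ok",)).start()
result = ctrl.wait(timeout=5)  # returns "ok" once the callback fires
```

The callback thread pays no latency penalty either: `Event.set` wakes the waiter immediately, whereas a poll loop adds up to one sleep interval of delay per completion.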
I have one suggestion for a minor refactoring to reduce code duplication in the new RolloutCallback class.
/gemini review
Code Review
This pull request is a significant and well-executed refactoring of the weight update and rollout synchronization mechanisms. By introducing RolloutCallback and moving the weight update orchestration from TrainController into the respective TrainEngines, the design is much cleaner, more modular, and easier to maintain. The replacement of the busy-waiting loop in RolloutController with an asynchronous callback system is a major improvement for efficiency and performance. The changes are consistent across different engine types (FSDP, Megatron) and are supported by updated tests. I have a couple of suggestions for minor improvements to further enhance the robustness of the new callback mechanism.
…/AReaL into fw/controller-nccl-w
nuzant left a comment
LGTM, only some minor issues to be addressed.
/gemini review
Code Review
This pull request introduces a new asynchronous callback mechanism for distributed weight updates and rollout task completion, primarily impacting the RolloutController and TrainController. Key changes:

- The RolloutController now hosts a Flask HTTP server to receive non-blocking callbacks from inference engines, replacing a polling-based approach for task completion.
- A new RolloutCallback class lets train workers coordinate with the TrainController via HTTP.
- GPU resource allocation in RolloutController was corrected to scale multiplicatively.
- The WeightUpdateMeta structure was refined to allow None for NCCL master details, which are now set dynamically by the engines.
- The BatchTaskDispatcher was enhanced to manage and send these callbacks using a dedicated thread pool.
- The MockInferenceEngine was removed, streamlining the architecture.
- The SchedulerConfig type was updated to allow None, and setup_timeout was increased.
- In vllm_worker_extension.py, the weight update groups were refactored from a single instance to a dictionary, enabling the management of multiple named weight update groups for greater flexibility.
- Tensor arguments in RPC calls are now explicitly moved to the correct device before broadcasting, improving robustness.
…/AReaL into fw/controller-nccl-w
Description

This PR implements RolloutCallback, which redirects method calls of RemoteInferenceEngine inside TrainEngine to RolloutController. It has two major benefits: the TrainController is decoupled from the details of weight-update coordination, which now lives in the TrainEngine; and the busy-waiting poll for rollout completion is replaced by an asynchronous callback.

Code validated with local and slurm schedulers with FSDP.
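A minimal sketch of how such a call-forwarding proxy could look is below. The endpoint path (`/call`), payload shape, and use of `__getattr__` are assumptions for illustration, not AReaL's actual protocol; the idea is that any method invoked on the proxy is serialized and sent to the controller over HTTP, which replays it on the real engine.

```python
import json
import urllib.request


class RolloutCallback:
    """Illustrative proxy forwarding engine method calls to a controller over HTTP."""

    def __init__(self, controller_url: str):
        self._url = controller_url.rstrip("/")

    def _build_request(self, method_name, *args, **kwargs):
        # Serialize the call so the controller can replay it on the
        # real RemoteInferenceEngine.
        return {
            "url": f"{self._url}/call",
            "body": {"method": method_name, "args": list(args), "kwargs": kwargs},
        }

    def __getattr__(self, method_name):
        # Any unknown attribute access becomes a remote call.
        def _call(*args, **kwargs):
            req = self._build_request(method_name, *args, **kwargs)
            http_req = urllib.request.Request(
                req["url"],
                data=json.dumps(req["body"]).encode(),
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(http_req) as resp:
                return json.loads(resp.read())

        return _call
```

With this shape, `TrainEngine` code can call, say, `callback.update_weights(...)` exactly as it would on a local engine object, while the actual work happens wherever the controller routes it.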
Future works:
- Implement RayScheduler and remove the hard-coded branching logic (e.g., if 'rollout' in role: ...)

Type of Change

- Refactor (non-breaking; existing functionality continues to work as expected)

Checklist

- Docs build (jb build docs)
- Gemini review (/gemini review)