refactor: simplify rollout in training scripts with the connect_engine API #451

Merged
rchardx merged 11 commits into main from fw/simple-rollout on Oct 21, 2025

@garrett4wade
Collaborator

@garrett4wade garrett4wade commented Oct 14, 2025

This pull request introduces a major refactor to the distributed rollout workflow for both the FSDP and Megatron training engines, centralizing batch redistribution and broadcast logic within the engine classes. This change simplifies how rollout batches are prepared and distributed across workers, improving code maintainability and reducing redundancy in example scripts. Additionally, new validation checks ensure the rollout engine is properly connected before rollout or weight update operations.

Distributed Rollout and Batch Coordination Improvements:

  • Added _broadcast_and_redistribute_batch, rollout_batch, and prepare_batch methods to both FSDPEngine (fsdp_engine.py) and MegatronEngine (megatron_engine.py). These methods encapsulate batch redistribution, broadcasting, and synchronization, streamlining distributed rollout handling. [1] [2]
  • Introduced _check_rollout_engine_connected to both engines, enforcing that the rollout engine is connected before performing rollouts or weight updates, preventing runtime errors. [1] [2] [3] [4]
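
The connection check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `ToyTrainEngine` is a hypothetical stand-in for the engine classes, and the signatures of `connect_engine` and `rollout_batch` are assumptions; only the method names come from the description.

```python
class ToyTrainEngine:
    """Hypothetical stand-in for FSDPEngine / MegatronEngine."""

    def __init__(self):
        self.rollout_engine = None  # nothing connected yet

    def connect_engine(self, rollout_engine):
        # Assumed behaviour: remember the rollout engine for later calls.
        self.rollout_engine = rollout_engine

    def _check_rollout_engine_connected(self):
        # Fail early with a clear message instead of a later AttributeError.
        if self.rollout_engine is None:
            raise RuntimeError(
                "Rollout engine not connected; call connect_engine() first."
            )

    def rollout_batch(self, prompts):
        self._check_rollout_engine_connected()
        return [self.rollout_engine(p) for p in prompts]


engine = ToyTrainEngine()
try:
    engine.rollout_batch(["2+2="])  # no rollout engine connected yet
except RuntimeError as exc:
    error_message = str(exc)

engine.connect_engine(lambda prompt: prompt + "4")  # toy "rollout engine"
completions = engine.rollout_batch(["2+2="])
```

The point of the guard is that a missing connection surfaces at the call site with an actionable message.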

Example Script Simplification:

  • Refactored example scripts (countdown/train.py and experimental/dapo/gsm8k_dapo.py) to delegate rollout batch preparation and distribution to the engine methods, removing manual broadcasting, tensor transfer, and synchronization barriers. This results in cleaner, less error-prone code. [1] [2]
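
From the script's point of view, the delegation might look like the sketch below. `ToyActor`, `train_step`, and the `workflow=` keyword are hypothetical and exist only to make the example runnable; the PR confirms just that scripts now call the engine's `rollout_batch` / `prepare_batch` instead of broadcasting and synchronizing by hand.

```python
class ToyActor:
    """Hypothetical actor; only the name rollout_batch comes from the PR."""

    def rollout_batch(self, data, workflow):
        # In the real engines this call is described as also covering
        # redistribution, broadcast, and synchronization across ranks.
        return [workflow(item) for item in data]

    def train_step(self, batch):
        # Placeholder for the actual training update.
        return {"n_samples": len(batch)}


def run_training(actor, dataloader, workflow):
    stats = []
    for data in dataloader:
        # One call replaces the manual rollout + broadcast + barrier sequence.
        batch = actor.rollout_batch(data, workflow=workflow)
        stats.append(actor.train_step(batch))
    return stats


dataloader = [["a", "b"], ["c"]]  # toy data: two steps
stats = run_training(ToyActor(), dataloader, workflow=str.upper)
```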

Imports and Dependency Updates:

  • Updated imports in engine and example files to support new batch handling logic, including adding StatefulDataLoader, RolloutWorkflow, and various utility functions. [1] [2] [3] [4] [5] [6] [7] [8]

Code Cleanup:

  • Removed redundant batch handling code from example scripts, such as manual calls to broadcast_tensor_container, tensor_container_to, and synchronization barriers, since these are now handled by the engine methods. [1] [2]

These changes collectively make distributed rollout workflows more robust, modular, and easier to use in both core engine code and user-facing training scripts.

@gemini-code-assist
Contributor

Summary of Changes

Hello @garrett4wade, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the rollout process in training scripts by introducing a connect_engine API. It centralizes the complex distributed data handling, including batch redistribution and broadcasting, into the core engine classes (FSDPEngine and MegatronEngine). This change streamlines the training examples, making them more concise and easier to understand by abstracting away the low-level distributed synchronization and data movement operations.

Highlights

  • Centralized Distributed Rollout Logic: The complex logic for distributed rollout generation, including data parallel head checks, device transfer, redistribution, broadcasting, and synchronization, has been moved from individual training scripts into the FSDPEngine and MegatronEngine classes.
  • New Engine API Methods: New methods such as _check_rollout_engine_connected, _broadcast_and_redistribute_batch, rollout_batch, and prepare_batch have been introduced in the engine classes to encapsulate and standardize the distributed rollout process.
  • Simplified Training Scripts: All example training scripts have been updated to utilize the new actor.rollout_batch and actor.prepare_batch API, significantly reducing boilerplate code and improving readability by abstracting away complex distributed data handling.
  • Enhanced Code Reusability and Maintainability: By consolidating distributed rollout mechanisms within the engine, the overall codebase becomes cleaner, more modular, and easier to maintain, promoting better code reuse across different training setups.
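
As a rough picture of the engine-side flow summarized in these highlights, the following single-process simulation gathers the batches produced by the data-parallel heads, rebalances them, and hands every rank in a group its head's shard. Everything here is an assumption made for illustration: the round-robin split stands in for whatever balancing the real `redistribute` performs, and plain lists and dicts stand in for tensors and collective communication.

```python
def redistribute(samples, n_groups):
    """Round-robin split; a stand-in for the real balancing logic."""
    return [samples[i::n_groups] for i in range(n_groups)]


def broadcast_and_redistribute(head_batches, ranks_per_group):
    """head_batches holds one list of samples per data-parallel head."""
    # Gather everything the heads produced.
    all_samples = [s for batch in head_batches for s in batch]
    # Rebalance across data-parallel groups.
    shards = redistribute(all_samples, len(head_batches))
    # "Broadcast": every rank in a group receives a copy of its group's shard.
    return {
        (group, local_rank): list(shard)
        for group, shard in enumerate(shards)
        for local_rank in range(ranks_per_group)
    }


# Head 0 produced five samples, head 1 only one; two ranks per group.
per_rank = broadcast_and_redistribute([[1, 2, 3, 4, 5], [6]], ranks_per_group=2)
```

After the call, both groups hold three samples each, and ranks within a group see identical data.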

@garrett4wade garrett4wade marked this pull request as draft October 14, 2025 02:31
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the training scripts to simplify rollout generation by introducing a new connect_engine API and helper methods (rollout_batch, prepare_batch) in the FSDP and Megatron engines. This encapsulates the complex logic of distributed rollout, redistribution, and broadcasting, making the training scripts much cleaner and easier to read. The changes are a significant improvement.

However, I've found a critical issue in the new _broadcast_and_redistribute_batch method in both fsdp_engine.py and megatron_engine.py. The redistribute function returns a RedistributedData object, but the code fails to access the .data attribute, which will lead to a type error. I've also pointed out some minor inconsistencies in megatron_engine.py regarding a redundant check, an incomplete error message, and docstring style.
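
The kind of mistake described here can be reproduced in isolation. In this sketch the `RedistributedData` fields other than `data`, the body of `redistribute`, and the helper names are all hypothetical; the review only states that `redistribute` returns a `RedistributedData` object and that `.data` must be accessed.

```python
from dataclasses import dataclass


@dataclass
class RedistributedData:
    # Only the `.data` attribute is confirmed by the review;
    # `group_indices` is an invented extra field.
    data: dict
    group_indices: list


def redistribute(batch):
    # Stand-in for the real function: wraps the payload in a result object.
    return RedistributedData(data=dict(batch), group_indices=[0])


def buggy_redistribute_batch(batch):
    return redistribute(batch)  # returns the wrapper, not the payload


def fixed_redistribute_batch(batch):
    return redistribute(batch).data  # unwrap before passing it on


batch = {"input_ids": [1, 2, 3]}
payload = fixed_redistribute_batch(batch)

try:
    buggy_redistribute_batch(batch)["input_ids"]  # wrapper is not a dict
except TypeError as exc:
    failure = type(exc).__name__
```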

Comment thread areal/engine/fsdp_engine.py Outdated
Comment thread areal/engine/megatron_engine.py Outdated
Comment thread areal/engine/megatron_engine.py Outdated
Comment thread areal/engine/megatron_engine.py Outdated
@garrett4wade
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a significant and valuable refactor by centralizing the distributed rollout logic into the FSDPEngine and MegatronEngine. This greatly simplifies the training scripts, making them cleaner and more maintainable. The addition of _check_rollout_engine_connected is also a good defensive measure against runtime errors.

My main feedback focuses on a high-impact opportunity to reduce code duplication. The new rollout methods are nearly identical across both engine classes, and abstracting them into a shared mixin would significantly improve long-term maintainability. I've also noted a minor inconsistency and another opportunity for refactoring within the new methods. Overall, this is a strong improvement to the codebase's architecture.
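
One way to read the mixin suggestion is sketched below. The class names are hypothetical and the method bodies are deliberately trivial; the aim is only to show both engine classes inheriting a single copy of the rollout helpers.

```python
class RolloutMixin:
    """Hypothetical mixin holding the rollout helpers shared by both engines."""

    rollout_engine = None

    def connect_engine(self, rollout_engine):
        self.rollout_engine = rollout_engine

    def rollout_batch(self, data):
        if self.rollout_engine is None:
            raise RuntimeError("Rollout engine not connected.")
        return [self.rollout_engine(item) for item in data]


class ToyFSDPEngine(RolloutMixin):
    backend = "fsdp"


class ToyMegatronEngine(RolloutMixin):
    backend = "megatron"


# Both classes resolve rollout_batch to the same function object,
# so a fix made in the mixin applies to both engines at once.
shared = ToyFSDPEngine.rollout_batch is ToyMegatronEngine.rollout_batch

megatron = ToyMegatronEngine()
megatron.connect_engine(len)  # toy "rollout engine"
lengths = megatron.rollout_batch(["ab", "cde"])
```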

Comment thread areal/engine/fsdp_engine.py
Comment thread areal/engine/fsdp_engine.py
Comment thread areal/engine/megatron_engine.py Outdated
@garrett4wade garrett4wade changed the title from "[wip] refactor: simplify rollout in training scripts with the connect_engine API" to "refactor: simplify rollout in training scripts with the connect_engine API" on Oct 20, 2025
@garrett4wade garrett4wade marked this pull request as ready for review October 20, 2025 14:09
@garrett4wade
Collaborator Author

test_examples.py:

[image]

gsm8k GRPO performance comparison:

[image]

Comment thread areal/engine/megatron_engine.py Outdated
Collaborator

@rchardx rchardx left a comment

LGTM in general except for one minor issue.

@rchardx
Collaborator

rchardx commented Oct 21, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request is a significant and valuable refactoring. It successfully centralizes the distributed rollout logic into a DistRolloutCoordinator, which greatly simplifies the training scripts by removing boilerplate code for batch redistribution and broadcasting. The introduction of rollout_batch and prepare_batch methods in the FSDP and Megatron engines abstracts away the complexities of distributed data handling, improving maintainability and readability.
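
The coordinator arrangement mentioned above could be pictured as composition rather than inheritance: each engine holds a coordinator object and forwards rollout calls to it. This is a guess at the general shape, not the actual `DistRolloutCoordinator` interface; the constructor arguments, attribute names, and toy classes below are all assumptions.

```python
class ToyCoordinator:
    """Hypothetical stand-in for DistRolloutCoordinator."""

    def __init__(self, rollout_engine):
        self.rollout_engine = rollout_engine

    def rollout_batch(self, data):
        # The real coordinator is described as also handling batch
        # redistribution and broadcasting; this toy version only generates.
        return [self.rollout_engine(item) for item in data]


class ToyEngine:
    def __init__(self, backend):
        self.backend = backend
        self.rollout_coordinator = None

    def connect_engine(self, rollout_engine):
        # Assumed: connecting an engine creates the coordinator.
        self.rollout_coordinator = ToyCoordinator(rollout_engine)

    def rollout_batch(self, data):
        if self.rollout_coordinator is None:
            raise RuntimeError("Rollout engine not connected.")
        return self.rollout_coordinator.rollout_batch(data)


engines = [ToyEngine("fsdp"), ToyEngine("megatron")]
for engine in engines:
    engine.connect_engine(len)  # toy "rollout engine"

results = [engine.rollout_batch(["ab", "cde"]) for engine in engines]
```

With this layout the engines keep only a thin forwarding method, and the distributed logic lives in one place.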

My review has identified a critical logical error in the _check_rollout_engine_connected method in both FSDPEngine and MegatronEngine that would prevent the code from executing correctly. Once this is addressed, the changes will be a solid improvement to the codebase.

Comment thread areal/engine/fsdp_engine.py Outdated
Comment thread areal/engine/megatron_engine.py Outdated
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Collaborator

@rchardx rchardx left a comment

LGTM!

@rchardx rchardx merged commit ccba1bb into main Oct 21, 2025
1 of 4 checks passed
@rchardx rchardx deleted the fw/simple-rollout branch October 21, 2025 03:21
leandermaben pushed a commit to leandermaben/AReaL that referenced this pull request Mar 24, 2026
…ne` API (inclusionAI#451)

* add rollout methods in train engine to simplify data redistribution

* fix

* fix docstring

* .

* .

* refactor rollout logic into areal.core.dist_rollout

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: Wentai Zhang <zhangwentai.zwt@antgroup.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

2 participants