refactor: simplify rollout in training scripts with the `connect_engine` API by garrett4wade · Pull Request #451 · inclusionAI/AReaL

garrett4wade · 2025-10-14T02:30:33Z

This pull request introduces a major refactor to the distributed rollout workflow for both the FSDP and Megatron training engines, centralizing batch redistribution and broadcast logic within the engine classes. This change simplifies how rollout batches are prepared and distributed across workers, improving code maintainability and reducing redundancy in example scripts. Additionally, new validation checks ensure the rollout engine is properly connected before rollout or weight update operations.

Distributed Rollout and Batch Coordination Improvements:

Added `_broadcast_and_redistribute_batch`, `rollout_batch`, and `prepare_batch` methods to both `FSDPEngine` (`fsdp_engine.py`) and `MegatronEngine` (`megatron_engine.py`). These methods encapsulate batch redistribution, broadcasting, and synchronization, streamlining distributed rollout handling.
Introduced `_check_rollout_engine_connected` in both engines. It enforces that the rollout engine is connected before rollouts or weight updates run, so a missing connection is caught up front rather than surfacing later as an obscure runtime error.
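
The connection guard described above can be pictured roughly as follows. This is a minimal sketch, not the code from this PR: `ToyTrainEngine`, `ToyRolloutEngine`, the single-argument `connect_engine` signature, and the error message are illustrative assumptions. Only the method names `connect_engine`, `_check_rollout_engine_connected`, and `rollout_batch` come from the description.

```python
from typing import Any, List, Optional


class ToyRolloutEngine:
    """Hypothetical inference engine that 'generates' by tagging prompts."""

    def generate(self, prompts: List[str]) -> List[str]:
        return [p + " -> done" for p in prompts]


class ToyTrainEngine:
    """Hypothetical stand-in for FSDPEngine / MegatronEngine."""

    def __init__(self) -> None:
        self.rollout_engine: Optional[Any] = None

    def connect_engine(self, rollout_engine: Any) -> None:
        # Assumed signature; the real API may take more arguments.
        self.rollout_engine = rollout_engine

    def _check_rollout_engine_connected(self) -> None:
        # Fail early with a clear message instead of an AttributeError later.
        if self.rollout_engine is None:
            raise RuntimeError(
                "Rollout engine not connected; call connect_engine() first."
            )

    def rollout_batch(self, prompts: List[str]) -> List[str]:
        self._check_rollout_engine_connected()
        return self.rollout_engine.generate(prompts)
```

With this shape, calling `rollout_batch` on a fresh engine raises immediately, while the same call after `connect_engine(ToyRolloutEngine())` goes through.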

Example Script Simplification:

Refactored example scripts (`countdown/train.py` and `experimental/dapo/gsm8k_dapo.py`) to delegate rollout batch preparation and distribution to the engine methods, removing manual broadcasting, tensor transfer, and synchronization barriers. This results in cleaner, less error-prone code.
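
The pattern that moves out of the scripts can be sketched as "one head rank generates, every rank receives the same batch". The example below is a single-process simulation under stated assumptions: `FakeGroup` is a hypothetical stand-in for a real process group, and the free function `rollout_batch` only borrows its name from the engine method. The actual engines use distributed collectives rather than this in-memory slot.

```python
from typing import Callable, Dict, List, Optional

Batch = Dict[str, List[int]]


class FakeGroup:
    """In-process stand-in for a process group (hypothetical).

    The source rank deposits its object; every rank reads it back.
    """

    def __init__(self) -> None:
        self._slot: Optional[Batch] = None

    def broadcast(self, obj: Optional[Batch], rank: int, src: int) -> Batch:
        if rank == src:
            self._slot = obj
        assert self._slot is not None, "source rank must broadcast first"
        return self._slot


def rollout_batch(
    rank: int, group: FakeGroup, rollout_fn: Callable[[], Batch], head: int = 0
) -> Batch:
    """Only the head rank generates; every rank ends up with the same batch."""
    batch = rollout_fn() if rank == head else None
    return group.broadcast(batch, rank=rank, src=head)


calls: List[int] = []


def fake_rollout() -> Batch:
    calls.append(1)  # record how often generation actually runs
    return {"input_ids": [1, 2, 3]}


group = FakeGroup()
# Simulate three ranks sequentially, head rank first.
results = [rollout_batch(r, group, fake_rollout) for r in range(3)]
```

Here generation runs once on the head rank, and the training script only ever sees the single `rollout_batch` call.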

Imports and Dependency Updates:

Updated imports in engine and example files to support the new batch handling logic, including `StatefulDataLoader`, `RolloutWorkflow`, and various utility functions.

Code Cleanup:

Removed redundant batch handling code from example scripts, such as manual calls to `broadcast_tensor_container`, `tensor_container_to`, and synchronization barriers, since these are now handled by the engine methods.

These changes collectively make distributed rollout workflows more robust, modular, and easier to use in both core engine code and user-facing training scripts.

gemini-code-assist · 2025-10-14T02:31:01Z

Summary of Changes

Hello @garrett4wade, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the rollout process in training scripts by introducing a connect_engine API. It centralizes the complex distributed data handling, including batch redistribution and broadcasting, into the core engine classes (FSDPEngine and MegatronEngine). This change streamlines the training examples, making them more concise and easier to understand by abstracting away the low-level distributed synchronization and data movement operations.

Highlights

Centralized Distributed Rollout Logic: The complex logic for distributed rollout generation, including data parallel head checks, device transfer, redistribution, broadcasting, and synchronization, has been moved from individual training scripts into the FSDPEngine and MegatronEngine classes.
New Engine API Methods: New methods such as _check_rollout_engine_connected, _broadcast_and_redistribute_batch, rollout_batch, and prepare_batch have been introduced in the engine classes to encapsulate and standardize the distributed rollout process.
Simplified Training Scripts: All example training scripts have been updated to utilize the new actor.rollout_batch and actor.prepare_batch API, significantly reducing boilerplate code and improving readability by abstracting away complex distributed data handling.
Enhanced Code Reusability and Maintainability: By consolidating distributed rollout mechanisms within the engine, the overall codebase becomes cleaner, more modular, and easier to maintain, promoting better code reuse across different training setups.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request refactors the training scripts to simplify rollout generation by introducing a new connect_engine API and helper methods (rollout_batch, prepare_batch) in the FSDP and Megatron engines. This encapsulates the complex logic of distributed rollout, redistribution, and broadcasting, making the training scripts much cleaner and easier to read. The changes are a significant improvement.

However, I've found a critical issue in the new _broadcast_and_redistribute_batch method in both fsdp_engine.py and megatron_engine.py. The redistribute function returns a RedistributedData object, but the code fails to access the .data attribute, which will lead to a type error. I've also pointed out some minor inconsistencies in megatron_engine.py regarding a redundant check, an incomplete error message, and docstring style.
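
The class of bug described here can be illustrated with a small sketch. Everything below is hypothetical except the existence of a `RedistributedData` wrapper with a `.data` attribute, which is taken from the review text. The extra `group_size` field, the `redistribute` signature, and the `prepare_*` helpers are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

Batch = Dict[str, List[int]]


@dataclass
class RedistributedData:
    """Hypothetical wrapper; only the `.data` attribute is taken from the review."""

    data: Batch
    group_size: int  # illustrative extra field, an assumption


def redistribute(batch: Batch, group_size: int = 1) -> RedistributedData:
    # Placeholder: a real implementation would rebalance samples across ranks.
    return RedistributedData(data=batch, group_size=group_size)


def prepare_buggy(batch: Batch) -> Batch:
    # Returns the wrapper itself, so callers that index it will fail.
    return redistribute(batch)  # type: ignore[return-value]


def prepare_fixed(batch: Batch) -> Batch:
    # Unwrap the payload before handing it to downstream code.
    return redistribute(batch).data
```

Indexing the result of `prepare_buggy` raises a `TypeError` because a plain dataclass instance is not subscriptable, whereas `prepare_fixed` hands back the underlying dictionary.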

garrett4wade · 2025-10-20T07:20:39Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a significant and valuable refactor by centralizing the distributed rollout logic into the FSDPEngine and MegatronEngine. This greatly simplifies the training scripts, making them cleaner and more maintainable. The addition of _check_rollout_engine_connected is also a good defensive measure against runtime errors.

My main feedback focuses on a high-impact opportunity to reduce code duplication. The new rollout methods are nearly identical across both engine classes, and abstracting them into a shared mixin would significantly improve long-term maintainability. I've also noted a minor inconsistency and another opportunity for refactoring within the new methods. Overall, this is a strong improvement to the codebase's architecture.
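
As a rough picture of the mixin suggestion, the shared rollout methods could live in one base class that both engines inherit. This is a sketch under assumptions, not the design adopted in this PR: `RolloutMixin`, `ToyFSDPEngine`, `ToyMegatronEngine`, and the callable rollout engine are all hypothetical names.

```python
from typing import Any, List, Optional


class RolloutMixin:
    """Hypothetical mixin holding the rollout logic shared by both engines."""

    rollout_engine: Optional[Any] = None

    def connect_engine(self, rollout_engine: Any) -> None:
        self.rollout_engine = rollout_engine

    def rollout_batch(self, prompts: List[str]) -> List[str]:
        if self.rollout_engine is None:
            raise RuntimeError("rollout engine not connected")
        # The rollout engine is modeled as a plain callable for brevity.
        return [self.rollout_engine(p) for p in prompts]


class ToyFSDPEngine(RolloutMixin):
    name = "fsdp"


class ToyMegatronEngine(RolloutMixin):
    name = "megatron"
```

Both toy engines then share a single `rollout_batch` implementation, so a fix made in the mixin applies to each of them.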

garrett4wade · 2025-10-20T14:10:49Z

test_examples.py:

gsm8k grpo performance comparison:

rchardx

LGTM in general except for one minor issue.

rchardx · 2025-10-21T03:16:44Z

/gemini review

gemini-code-assist

Code Review

This pull request is a significant and valuable refactoring. It successfully centralizes the distributed rollout logic into a DistRolloutCoordinator, which greatly simplifies the training scripts by removing boilerplate code for batch redistribution and broadcasting. The introduction of rollout_batch and prepare_batch methods in the FSDP and Megatron engines abstracts away the complexities of distributed data handling, improving maintainability and readability.

My review has identified a critical logical error in the _check_rollout_engine_connected method in both FSDPEngine and MegatronEngine that would prevent the code from executing correctly. Once this is addressed, the changes will be a solid improvement to the codebase.

rchardx

LGTM!

@gemini-code-assist

…ne` API (inclusionAI#451)

* add rollout methods in train engine to simplify data redistribution
* fix
* fix docstring
* .
* .
* refactor rollout logic into areal.core.dist_rollout
* Apply suggestion from @gemini-code-assist[bot]
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Apply suggestion from @gemini-code-assist[bot]
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: Wentai Zhang <zhangwentai.zwt@antgroup.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

add rollout methods in train engine to simplify data redistribution

e3a0c2b

garrett4wade had a problem deploying to AReaL-unittests October 14, 2025 02:30 — with GitHub Actions Failure

garrett4wade marked this pull request as draft October 14, 2025 02:31

gemini-code-assist Bot reviewed Oct 14, 2025

Comment thread areal/engine/fsdp_engine.py Outdated

Comment thread areal/engine/megatron_engine.py Outdated

Comment thread areal/engine/megatron_engine.py Outdated

Comment thread areal/engine/megatron_engine.py Outdated

merge main

dd7f234

garrett4wade had a problem deploying to AReaL-unittests October 20, 2025 05:44 — with GitHub Actions Failure

fix

60bb307

garrett4wade had a problem deploying to AReaL-unittests October 20, 2025 06:58 — with GitHub Actions Failure

fix docstring

4f5cd0f

garrett4wade had a problem deploying to AReaL-unittests October 20, 2025 07:19 — with GitHub Actions Failure

gemini-code-assist Bot reviewed Oct 20, 2025

Comment thread areal/engine/fsdp_engine.py

Comment thread areal/engine/fsdp_engine.py

Comment thread areal/engine/megatron_engine.py Outdated

garrett4wade added 2 commits October 20, 2025 15:26

.

9dbc3a9

Merge branch 'fw/simple-rollout' of https://github.com/inclusionAI/AReaL

e71cd6c

into fw/simple-rollout

garrett4wade had a problem deploying to AReaL-unittests October 20, 2025 07:33 — with GitHub Actions Failure

.

7f64320

garrett4wade had a problem deploying to AReaL-unittests October 20, 2025 14:09 — with GitHub Actions Failure

garrett4wade changed the title ~~[wip] refactor: simplify rollout in training scripts with the connect_engine API~~ refactor: simplify rollout in training scripts with the connect_engine API Oct 20, 2025

garrett4wade marked this pull request as ready for review October 20, 2025 14:09

garrett4wade requested review from dhh1995, nuzant and rchardx October 20, 2025 14:11

rchardx reviewed Oct 21, 2025

Comment thread areal/engine/megatron_engine.py Outdated

rchardx requested changes Oct 21, 2025

garrett4wade added 2 commits October 21, 2025 10:54

Merge branch 'main' of https://github.com/inclusionAI/AReaL into fw/s…

5ba025b

…imple-rollout

refactor rollout logic into areal.core.dist_rollout

789362b

garrett4wade had a problem deploying to AReaL-unittests October 21, 2025 03:07 — with GitHub Actions Failure

gemini-code-assist Bot reviewed Oct 21, 2025

Comment thread areal/engine/fsdp_engine.py Outdated

Comment thread areal/engine/megatron_engine.py Outdated

Apply suggestion from @gemini-code-assist[bot]

5fd10c4

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

rchardx had a problem deploying to AReaL-unittests October 21, 2025 03:20 — with GitHub Actions Error

Apply suggestion from @gemini-code-assist[bot]

9328bc7

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

rchardx had a problem deploying to AReaL-unittests October 21, 2025 03:20 — with GitHub Actions Failure

rchardx self-requested a review October 21, 2025 03:21

rchardx approved these changes Oct 21, 2025

rchardx merged commit ccba1bb into main Oct 21, 2025
1 of 4 checks passed

rchardx deleted the fw/simple-rollout branch October 21, 2025 03:21
