refactor: add `export_stats` as engine's method by garrett4wade · Pull Request #643 · inclusionAI/AReaL

garrett4wade · 2025-11-28T04:11:02Z

Description

This refactor is required by #611 and has the following benefits:

- We no longer need to maintain a separate /export_stats endpoint in the RPC server. A single /call endpoint is sufficient.
- We can merge the parallelism-strategy-specific stats gathering logic into the engine, e.g., broadcasting stats from the last pipeline stage in MegatronEngine.
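To illustrate the first benefit, here is a minimal sketch of how a single generic handler could serve `export_stats` alongside any other engine method. `ToyEngine`, `handle_call`, and the payload shape are hypothetical names made up for this example; they are not AReaL's actual RPC server code.

```python
from typing import Any, Dict


class ToyEngine:
    """Hypothetical stand-in for an engine; not AReaL's actual class."""

    def __init__(self) -> None:
        self._stats: Dict[str, float] = {"loss": 0.5, "grad_norm": 1.25}

    def train_batch(self, n_samples: int) -> int:
        # Pretend to train and report how many samples were consumed.
        return n_samples

    def export_stats(self) -> Dict[str, float]:
        # Return a copy so callers cannot mutate engine state.
        return dict(self._stats)


def handle_call(engine: Any, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Generic handler: dispatch any public engine method by name."""
    name = payload["method"]
    if name.startswith("_"):
        return {"status": "error", "error": f"private method: {name}"}
    method = getattr(engine, name, None)
    if not callable(method):
        return {"status": "error", "error": f"unknown method: {name}"}
    result = method(*payload.get("args", []), **payload.get("kwargs", {}))
    return {"status": "ok", "result": result}


engine = ToyEngine()
stats_response = handle_call(engine, {"method": "export_stats"})
train_response = handle_call(engine, {"method": "train_batch", "args": [8]})
bad_response = handle_call(engine, {"method": "no_such_method"})
```

With this shape, exporting stats needs no dedicated route: it is just another method name passed to the same handler.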

Type of Change

- Bug fix (non-breaking change that fixes an issue)
- New feature (non-breaking change that adds functionality)
- Breaking change (fix or feature that would cause existing functionality to not work as expected)
- Documentation update
- Code refactoring (no functional changes)
- Performance improvement
- Test coverage improvement

Checklist

- I have read the Contributing Guide
- I have run formatting tools (pre-commit or manual)
- I have run relevant unit tests and they pass
- I have added tests for new functionality
- I have updated documentation if needed
- My branch is up to date with main
- This PR introduces breaking changes (if yes, fill out details below)
- If this PR changes documentation, I have built and previewed it locally with `jb build docs`
- No critical issues raised by AI reviewers (`/gemini review`)

gemini-code-assist · 2025-11-28T04:11:36Z

Summary of Changes

Hello @garrett4wade, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors how statistics are exported by integrating the export_stats functionality directly into the engine classes. This change simplifies the RPC server by consolidating stat export under a single /call endpoint and enables more flexible, parallelism-strategy-aware statistics gathering within the engine implementations. Additionally, it includes minor improvements to GPU device detection and clarifies LoRA support within the vLLM integration.

Highlights

- Refactor export_stats into Engine Method: The export_stats functionality has been moved from a standalone utility function to a method within the InferenceEngine and TrainEngine abstract classes, and implemented in concrete engine classes like FSDPEngine, MegatronEngine, RemoteSGLangEngine, and RemoteVLLMEngine.
- Streamlined RPC Server: The dedicated /export_stats endpoint in the RPC server has been removed. All engine method calls, including export_stats, now go through a single /call endpoint, simplifying the RPC interface.
- Enhanced Parallelism-Aware Stats Gathering: The new engine-level export_stats method allows for integrating parallelism-strategy-specific logic directly into the engine. For instance, MegatronEngine now broadcasts stats from the last pipeline stage.
- Updated GPU Device Detection: The LocalScheduler now includes a new utility function _get_device_count_safely to detect GPU devices without initializing a CUDA context, improving robustness.
- Refined LoRA Handling in vLLM: The vLLM remote engine now explicitly raises NotImplementedError for LoRA-related operations, clarifying its current lack of support for LoRA training and weight updates.
- Centralized Stats Logging Logic: The stats_logger has been updated to remove pipeline model parallel specific logic for broadcasting stats, as this is now handled by the engine's export_stats method. It also filters out internal __count keys before logging.
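The split of responsibilities described in the highlights can be sketched roughly as follows: the engine makes the last pipeline stage's stats available everywhere, and the logger only filters and logs. This is a single-process simulation with hypothetical helper names. A real engine would presumably use a collective such as `torch.distributed.broadcast_object_list`, and treating `__count` as a key suffix is an assumption, not something confirmed by this PR.

```python
from typing import Dict, List

# Assumption: internal counter keys carry this suffix.
COUNT_SUFFIX = "__count"


def broadcast_from_last_stage(
    per_stage_stats: List[Dict[str, float]],
) -> List[Dict[str, float]]:
    """Simulate a broadcast: every stage receives a copy of the last stage's stats."""
    if not per_stage_stats:
        return []
    source = per_stage_stats[-1]
    return [dict(source) for _ in per_stage_stats]


def drop_count_keys(stats: Dict[str, float]) -> Dict[str, float]:
    """Remove internal counter entries before the stats are logged."""
    return {k: v for k, v in stats.items() if not k.endswith(COUNT_SUFFIX)}


# Only the last pipeline stage computes the loss, so earlier stages start empty.
stages = [{}, {}, {"loss": 0.42, "loss__count": 16.0}]
synced = broadcast_from_last_stage(stages)
loggable = drop_count_keys(synced[0])
```

After the simulated broadcast, every stage holds the same dictionary, so the logger no longer needs to know anything about pipeline parallelism.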

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request effectively refactors the statistics exporting mechanism by encapsulating the logic within the engine classes. Moving export_stats into TrainEngine and InferenceEngine is a solid design choice that improves modularity and correctly assigns responsibility for handling parallelism-specific aggregations. The changes are consistently applied throughout the codebase, including the removal of the now-redundant /export_stats RPC endpoint, which simplifies the API.

However, I've noted one significant issue: the pull request, marked as a refactoring with no functional changes, also appears to disable LoRA support for the vLLM backend. This is a functional change that should be clarified and properly documented in the PR description.

Apart from this point, the refactoring is well-executed and improves the overall code structure.
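For readers unfamiliar with the pattern the review refers to, the following is a rough sketch of an engine that rejects LoRA operations explicitly instead of failing silently. `ToyRemoteEngine`, `update_weights`, and the `use_lora` flag are hypothetical and do not reflect the actual signatures in areal/engine/vllm_remote.py.

```python
class ToyRemoteEngine:
    """Hypothetical remote engine without LoRA support; not AReaL's class."""

    def update_weights(self, path: str, use_lora: bool = False) -> str:
        if use_lora:
            # Fail loudly so callers learn the feature is unsupported.
            raise NotImplementedError(
                "LoRA weight updates are not supported by this backend"
            )
        return f"loaded full weights from {path}"


def try_update(engine: ToyRemoteEngine, path: str, use_lora: bool) -> str:
    """Call update_weights and turn the unsupported case into a message."""
    try:
        return engine.update_weights(path, use_lora=use_lora)
    except NotImplementedError as exc:
        return f"unsupported: {exc}"


full_result = try_update(ToyRemoteEngine(), "/tmp/ckpt", use_lora=False)
lora_result = try_update(ToyRemoteEngine(), "/tmp/ckpt", use_lora=True)
```

Because a caller that previously relied on LoRA would now hit this exception, the behavior counts as a functional change, which is the reviewer's point.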

areal/engine/vllm_remote.py

garrett4wade · 2025-11-28T04:15:55Z

/gemini review

gemini-code-assist

Code Review

This pull request is a well-executed refactoring that moves the export_stats functionality into an engine method. This is a solid design improvement, as it centralizes the statistics gathering logic within the engine, which has the necessary context about parallelism strategies. The changes are applied consistently across the API, engine implementations, RPC server, tests, and example files.

I have a couple of suggestions: one is to fix a minor typo in a docstring, and the other is to add unit tests for the new GPU detection logic to ensure its robustness and improve maintainability. Overall, this is a great change that improves the codebase's structure.
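As a rough idea of what such unit tests could look like, the sketch below tests a CUDA-free device counter. `count_visible_devices` is a hypothetical helper that only parses `CUDA_VISIBLE_DEVICES`; it is not the `_get_device_count_safely` function from areal/scheduler/local.py, whose implementation is not shown in this thread.

```python
import unittest
from typing import Mapping


def count_visible_devices(env: Mapping[str, str], default: int = 0) -> int:
    """Count GPUs listed in CUDA_VISIBLE_DEVICES without touching CUDA."""
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Variable unset: the caller decides what the fallback should be.
        return default
    ids = [part.strip() for part in raw.split(",")]
    return len([part for part in ids if part])


class CountVisibleDevicesTest(unittest.TestCase):
    def test_lists_devices(self) -> None:
        env = {"CUDA_VISIBLE_DEVICES": "0,1,3"}
        self.assertEqual(count_visible_devices(env), 3)

    def test_empty_string_hides_all_devices(self) -> None:
        env = {"CUDA_VISIBLE_DEVICES": ""}
        self.assertEqual(count_visible_devices(env), 0)

    def test_unset_uses_default(self) -> None:
        self.assertEqual(count_visible_devices({}, default=8), 8)
```

Passing the environment in as a mapping keeps the function pure, so the tests need neither a GPU nor monkeypatching of `os.environ`. They can be run with `python -m unittest`.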

areal/api/engine_api.py

areal/scheduler/local.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

garrett4wade · 2025-11-28T06:39:45Z

FYI CI unit test is passing now.

nuzant

LGTM.

* add export stats as engine method and update examples
* Update areal/api/engine_api.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

gemini-code-assist bot reviewed Nov 28, 2025

areal/engine/vllm_remote.py (outdated, resolved)

add export stats as engine method and update examples

bc6828d

garrett4wade force-pushed the fw/export-stats branch from a57767f to bc6828d on November 28, 2025 04:15

garrett4wade added the safe-to-test Ready to run unit-tests in a PR. label Nov 28, 2025

garrett4wade mentioned this pull request Nov 28, 2025

[feat] impl rollout controller for single controller #611

Merged

16 tasks

gemini-code-assist bot reviewed Nov 28, 2025

areal/api/engine_api.py (outdated, resolved)

areal/scheduler/local.py (resolved)

garrett4wade temporarily deployed to AReaL-unittests on November 28, 2025 04:28 with GitHub Actions (inactive)

Update areal/api/engine_api.py

f1b052a

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

nuzant approved these changes Nov 28, 2025

nuzant merged commit c25525f into main Nov 28, 2025
1 check passed

nuzant deleted the fw/export-stats branch November 28, 2025 06:44

nuzant merged 2 commits into main from fw/export-stats