refactor: add export_stats as engine's method#643

Merged
nuzant merged 2 commits into main from fw/export-stats on Nov 28, 2025
Conversation

@garrett4wade
Collaborator

Description

This refactor is required by #611 and has the following benefits:

  1. We don't need to maintain a separate /export_stats endpoint in the RPC server. A single /call endpoint is sufficient.
  2. We can merge the parallelism-strategy-specific stats gathering logic into the engine, e.g., broadcasting stats from the last pipeline stage in MegatronEngine.
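
A minimal sketch of the first benefit, under stated assumptions: `ToyEngine` and `dispatch_call` are hypothetical names used only for illustration and are not the actual engine classes or RPC server code. The idea is that once `export_stats` is an ordinary engine method, a generic name-based dispatcher (standing in for the /call handler) can reach it without a dedicated route.

```python
from typing import Any


class ToyEngine:
    """Hypothetical stand-in for an engine; not the real engine API."""

    def __init__(self) -> None:
        self._stats: dict[str, float] = {"loss": 0.5, "grad_norm": 1.25}

    def export_stats(self) -> dict[str, float]:
        # A real engine could apply parallelism-specific gathering here.
        return dict(self._stats)


def dispatch_call(engine: Any, method: str, *args: Any, **kwargs: Any) -> Any:
    """Hypothetical body of a generic call handler: look up a public method by name."""
    if method.startswith("_"):
        raise AttributeError(f"refusing to call private method {method!r}")
    fn = getattr(engine, method, None)
    if not callable(fn):
        raise AttributeError(f"engine has no callable method {method!r}")
    return fn(*args, **kwargs)


# export_stats is reached through the same path as any other engine method.
stats = dispatch_call(ToyEngine(), "export_stats")
print(stats)
```

With this shape, adding a new engine method needs no change to the server side, which is the simplification the description points at.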

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

@gemini-code-assist
Contributor

Summary of Changes

Hello @garrett4wade, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors how statistics are exported by integrating the export_stats functionality directly into the engine classes. This change simplifies the RPC server by consolidating stat export under a single /call endpoint and enables more flexible, parallelism-strategy-aware statistics gathering within the engine implementations. Additionally, it includes minor improvements to GPU device detection and clarifies LoRA support within the vLLM integration.

Highlights

  • Refactor export_stats into Engine Method: The export_stats functionality has been moved from a standalone utility function to a method within the InferenceEngine and TrainEngine abstract classes, and implemented in concrete engine classes like FSDPEngine, MegatronEngine, RemoteSGLangEngine, and RemoteVLLMEngine.
  • Streamlined RPC Server: The dedicated /export_stats endpoint in the RPC server has been removed. All engine method calls, including export_stats, now go through a single /call endpoint, simplifying the RPC interface.
  • Enhanced Parallelism-Aware Stats Gathering: The new engine-level export_stats method allows for integrating parallelism-strategy-specific logic directly into the engine. For instance, MegatronEngine now broadcasts stats from the last pipeline stage.
  • Updated GPU Device Detection: The LocalScheduler now includes a new utility function _get_device_count_safely to detect GPU devices without initializing a CUDA context, improving robustness.
  • Refined LoRA Handling in vLLM: The vLLM remote engine now explicitly raises NotImplementedError for LoRA-related operations, clarifying its current lack of support for LoRA training and weight updates.
  • Centralized Stats Logging Logic: The stats_logger has been updated to remove pipeline model parallel specific logic for broadcasting stats, as this is now handled by the engine's export_stats method. It also filters out internal __count keys before logging.
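
The last highlight can be illustrated with a small sketch. It assumes that the internal bookkeeping entries are keys ending in `__count`; the summary only says that `__count` keys are filtered, so the exact matching rule and the helper name `drop_count_keys` are assumptions, not the stats_logger implementation.

```python
def drop_count_keys(stats: dict[str, float]) -> dict[str, float]:
    """Return a copy of stats without keys ending in '__count' (assumed convention)."""
    return {k: v for k, v in stats.items() if not k.endswith("__count")}


# Hypothetical exported stats: one bookkeeping key alongside two real metrics.
raw = {"reward": 0.8, "reward__count": 128.0, "loss": 0.31}
loggable = drop_count_keys(raw)
print(loggable)
```

Returning a new dictionary rather than deleting keys in place leaves the exported stats untouched for any other consumer.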

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively refactors the statistics exporting mechanism by encapsulating the logic within the engine classes. Moving export_stats into TrainEngine and InferenceEngine is a solid design choice that improves modularity and correctly assigns responsibility for handling parallelism-specific aggregations. The changes are consistently applied throughout the codebase, including the removal of the now-redundant /export_stats RPC endpoint, which simplifies the API.
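
To make the parallelism-specific aggregation concrete, here is a pure-Python simulation of the pattern, not the MegatronEngine code. In pipeline parallelism the loss is computed on the last stage, so earlier stages have nothing meaningful to report until the last stage's stats are shared. The function name `broadcast_from_last_stage` is hypothetical; a real engine would use a collective communication call instead of list copying.

```python
def broadcast_from_last_stage(
    per_stage_stats: list[dict[str, float]],
) -> list[dict[str, float]]:
    """Simulate a broadcast: every stage receives a copy of the last stage's stats."""
    if not per_stage_stats:
        return []
    source = per_stage_stats[-1]
    return [dict(source) for _ in per_stage_stats]


# Only the last of three simulated pipeline stages computed a loss.
before = [{}, {}, {"loss": 0.42}]
after = broadcast_from_last_stage(before)
print(after)
```

Keeping this step inside the engine means the caller of export_stats gets the same result on every rank without knowing which parallelism strategy is in use.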

However, I've noted one significant issue: the pull request, marked as a refactoring with no functional changes, also appears to disable LoRA support for the vLLM backend. This is a functional change that should be clarified and properly documented in the PR description.
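
The behavior being flagged can be sketched as follows. `ToyRemoteEngine` and `update_lora_weights` are hypothetical names chosen for illustration; the actual vLLM remote engine methods are not shown in this conversation.

```python
class ToyRemoteEngine:
    """Hypothetical engine stub; method names are illustrative only."""

    def export_stats(self) -> dict[str, float]:
        return {"num_requests": 4.0}

    def update_lora_weights(self, path: str) -> None:
        # Fail loudly instead of silently ignoring an unsupported feature.
        raise NotImplementedError("LoRA weight updates are not supported by this backend")


engine = ToyRemoteEngine()
try:
    engine.update_lora_weights("/tmp/adapter")
    lora_supported = True
except NotImplementedError as exc:
    lora_supported = False
    print(f"LoRA unavailable: {exc}")
```

A caller that previously relied on LoRA would now hit this exception, which is why the change reads as functional rather than a pure refactor.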

Apart from this point, the refactoring is well-executed and improves the overall code structure.

@garrett4wade
Collaborator Author

/gemini review

@garrett4wade garrett4wade added the safe-to-test label ("Ready to run unit-tests in a PR.") Nov 28, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a well-executed refactoring that moves the export_stats functionality into an engine method. This is a solid design improvement, as it centralizes the statistics gathering logic within the engine, which has the necessary context about parallelism strategies. The changes are applied consistently across the API, engine implementations, RPC server, tests, and example files.

I have a couple of suggestions: one is to fix a minor typo in a docstring, and the other is to add unit tests for the new GPU detection logic to ensure its robustness and improve maintainability. Overall, this is a great change that improves the codebase's structure.
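
As a rough illustration of what such unit tests could look like, the sketch below tests a hypothetical helper, `count_devices_from_env`, that counts GPUs from the `CUDA_VISIBLE_DEVICES` environment variable. This is an assumption made for the example: the implementation of `_get_device_count_safely` is not shown in this conversation and may work differently. Reading an environment variable does not create a CUDA context, which is the property the tests would want to preserve.

```python
import unittest
from typing import Mapping, Optional


def count_devices_from_env(env: Mapping[str, str]) -> Optional[int]:
    """Hypothetical helper: count GPUs listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, meaning the caller must
    fall back to another detection method.
    """
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return len([d for d in value.split(",") if d.strip()])


class CountDevicesFromEnvTest(unittest.TestCase):
    def test_lists_devices(self):
        self.assertEqual(count_devices_from_env({"CUDA_VISIBLE_DEVICES": "0,1,3"}), 3)

    def test_empty_string_hides_all_devices(self):
        self.assertEqual(count_devices_from_env({"CUDA_VISIBLE_DEVICES": ""}), 0)

    def test_unset_variable(self):
        self.assertIsNone(count_devices_from_env({}))


# Run the suite programmatically so the example does not call sys.exit().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CountDevicesFromEnvTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Passing the environment in as a mapping, instead of reading `os.environ` inside the helper, keeps the tests independent of the machine they run on.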

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@garrett4wade
Collaborator Author

FYI, the CI unit tests are passing now.

Collaborator

@nuzant nuzant left a comment

LGTM.

@nuzant nuzant merged commit c25525f into main Nov 28, 2025
1 check passed
@nuzant nuzant deleted the fw/export-stats branch November 28, 2025 06:44
Bruce-rl-hw pushed a commit to Bruce-rl-hw/AReaL-vllm that referenced this pull request Dec 4, 2025
* add export stats as engine method and update examples

* Update areal/api/engine_api.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>