fix: move pause/continue_generation operations into update_weights #446
garrett4wade merged 4 commits into main from
Conversation
Summary of Changes

Hello @rchardx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request refines the control mechanisms within the inference engine by creating a clear distinction between pausing the actual generation process and suspending request submission. It introduces dedicated API methods for pausing and continuing generation, which are integrated into the distributed weight update procedures to maintain system stability. The workflow API's terminology for request-submission control has also been updated for improved clarity and consistency.
Pull Request Overview
This PR refactors the pause/resume functionality to clearly distinguish between suspending request submission and pausing generation processes. It introduces new generation control methods and standardizes terminology across the codebase.
- Adds explicit `pause_generation` and `continue_generation` methods to inference engines for controlling generation during weight updates
- Renames workflow API methods from `pause`/`resume` to `suspend`/`resume` to clarify that they control request submission, not generation
- Integrates generation pausing into the distributed weight update processes in the FSDP and Megatron engines
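As a rough illustration of the first point, the new abstract methods on the base inference engine might look like the sketch below. The signatures and the `DummyEngine` subclass are assumptions for illustration, not the exact code from `areal/api/engine_api.py`.

```python
# Hypothetical sketch of the new abstract methods on the base
# InferenceEngine API (method names from the PR; signatures assumed).
from abc import ABC, abstractmethod


class InferenceEngine(ABC):
    @abstractmethod
    def pause_generation(self):
        """Pause in-flight generation, e.g. before a weight update."""

    @abstractmethod
    def continue_generation(self):
        """Resume generation after a weight update completes."""


class DummyEngine(InferenceEngine):
    """Toy subclass used only to illustrate the contract."""

    def __init__(self):
        self.paused = False

    def pause_generation(self):
        self.paused = True

    def continue_generation(self):
        self.paused = False
```

Concrete engines (remote SGLang/vLLM clients, FSDP, Megatron) would each provide their own implementations of this pair.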
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| areal/api/engine_api.py | Adds abstract pause_generation and continue_generation methods to base inference engine API |
| areal/api/workflow_api.py | Renames paused state to suspended and updates related method names for clarity |
| areal/engine/sglang_remote.py | Implements generation control methods and separates them from workflow suspension methods |
| areal/engine/vllm_remote.py | Implements generation control methods and separates them from workflow suspension methods |
| areal/engine/fsdp_engine.py | Integrates generation pausing into distributed weight update process |
| areal/experimental/megatron_engine.py | Integrates generation pausing into distributed weight update process |
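For the remote engines listed above, one plausible shape for the generation-control methods is to POST to pause/continue endpoints on the serving process. The endpoint paths and the injected `post` callable below are assumptions for illustration, not the exact code from `sglang_remote.py` or `vllm_remote.py`.

```python
# Sketch of a remote engine forwarding pause/continue to the
# inference server over HTTP. `post` is injected (e.g. requests.post
# in real code) so the sketch stays testable without a network.
class RemoteEngineSketch:
    def __init__(self, base_url, post):
        self.base_url = base_url
        self.post = post

    def pause_generation(self):
        # Ask the server to stop scheduling new decode steps.
        return self.post(f"{self.base_url}/pause_generation")

    def continue_generation(self):
        # Resume decoding once new weights are in place.
        return self.post(f"{self.base_url}/continue_generation")
```

Separating these HTTP calls from the workflow executor's `suspend`/`resume` is what lets the PR pause generation during weight updates without also blocking request submission bookkeeping.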
Code Review
This pull request does a good job of refactoring the API to create a clear distinction between suspending request submission and pausing the generation process. The introduction of pause_generation and continue_generation and renaming pause to suspend in the workflow API improves clarity.
However, I've found a few critical issues that need to be addressed:
- In `megatron_engine.py`, `pause_generation` is called twice and `continue_generation` is never called, which will leave the generation process paused indefinitely after a weight update.
- In `sglang_remote.py` and `vllm_remote.py`, there are calls to `workflow_executor.pause()`, but this method has been renamed to `suspend()` in this same PR, which will lead to a runtime `AttributeError`.
- There is also a minor documentation issue in `workflow_api.py` with a broken reference.
Please see the detailed comments for suggestions on how to fix these issues.
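The indefinite-pause bug flagged above can be avoided with a symmetric pattern: pause exactly once before the update and always continue in a `finally` block, so generation resumes even if the update raises. This is an illustrative sketch with a hypothetical `do_update` callback, not the PR's actual code.

```python
# Symmetric pause/continue around a weight update. The finally block
# guarantees continue_generation runs exactly once, even on failure.
def update_weights(engine, do_update):
    engine.pause_generation()
    try:
        do_update()
    finally:
        engine.continue_generation()
```

Keeping the pair inside one helper also makes a double `pause_generation` call, as reported for `megatron_engine.py`, much harder to reintroduce.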
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This pull request introduces a clear distinction between suspending request submission and pausing the actual generation process in inference engines. It adds new `pause_generation` and `continue_generation` methods to the inference engine APIs and ensures these are properly invoked during distributed weight updates. Additionally, it standardizes terminology across the workflow API, replacing "pause/resume" with "suspend/resume" for request submission, and updates related event and method names for clarity and consistency.

API and Method Naming Improvements:

- Added `pause_generation` and `continue_generation` methods to the base inference engine API (`InferenceEngine`) to explicitly control the pausing and resuming of the generation process, particularly during weight updates.
- Updated the remote engines (`sglang_remote.py`, `vllm_remote.py`) to implement `pause_generation` and `continue_generation`, sending appropriate HTTP requests to pause/resume generation on remote servers.

Workflow Suspension Refactor:

- Renamed the workflow API methods to `suspend`/`resume` and the state to `suspended` (was `paused`), clarifying that these control request submission, not generation itself. Updated all related logic and documentation.

Weight Update Process Enhancements: