refactor the async task submission logic in workflow executor into task runner#473
refactor the async task submission logic in workflow executor into task runner#473
Conversation
Summary of ChangesHello @garrett4wade, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant refactoring to the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request does a great job of refactoring the asynchronous task submission logic from WorkflowExecutor into a new, generic AsyncTaskRunner class. This significantly improves separation of concerns and makes the WorkflowExecutor cleaner and more focused on its domain-specific logic. The new AsyncTaskRunner is well-designed and comes with a comprehensive suite of unit tests.
My review includes a critical suggestion to improve the robustness of the AsyncTaskRunner by handling exceptions from individual tasks without terminating the entire runner. I've also suggested an improvement to the corresponding unit test to verify this resilient behavior. Overall, this is a high-quality refactoring that improves the codebase's maintainability.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request is a significant refactoring that extracts the asynchronous task execution logic from WorkflowExecutor into a new, generic AsyncTaskRunner class. This is a great improvement for modularity and reusability. The new AsyncTaskRunner is well-designed with good error handling and clean shutdown procedures. The WorkflowExecutor is correctly updated to use the new runner, and the addition of comprehensive unit tests for AsyncTaskRunner is excellent.
My main feedback is to improve the exception handling between AsyncTaskRunner and WorkflowExecutor by using a custom exception type instead of relying on string matching of a generic RuntimeError, which will make the code more robust and maintainable.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a generic asynchronous task runner and refactors the workflow executor to use it. The new AsyncTaskRunner class provides a reusable, thread-based async task executor with queue management and pause/resume control. The WorkflowExecutor is modified to leverage this new class, replacing its internal thread management with the AsyncTaskRunner. This change improves modularity and reusability of the code. I have identified some areas where improvements can be made, particularly around error handling and queue management.
…to fw/async-task-runner
…tecture This commit integrates the segment-wise decoupled PPO feature (from feature/segment-wise-ppo-v2) with the latest main branch (04ab601), which includes the AsyncTaskRunner refactoring (inclusionAI#473). ## Integration Base - **Source**: feature/segment-wise-ppo-v2 (8e7edcb7) - **Target**: origin/main (04ab601) - **Method**: Manual integration (rebase had conflicts due to AsyncTaskRunner refactoring) ## What's New - Segment-wise Decoupled PPO Feature ### Core Algorithm Implements async RLHF with per-token policy version tracking to handle staleness in distributed PPO training: - **Staleness Control**: Automatically filters stale samples when model updates - **Proximal Recomputation**: Recomputes logprobs for head tokens with old policy - **Dynamic Capacity**: Adjusts generation capacity based on filtered samples ### New Files Added (21 files) **API Layer** (abstractions): - areal/api/cache_api.py - RolloutCache interface - areal/api/queue_api.py - RolloutQueue interface - areal/api/staleness_control.py - StalenessControlStrategy base class - areal/api/proximal_recomputer.py - ProximalRecomputer interface - areal/api/workflow_factory.py - Factory for creating WorkflowExecutor with DI **Core Layer** (implementations): - areal/core/rollout_cache.py - LocalRolloutCache implementation - areal/core/rollout_queue.py - LocalRolloutQueue implementation - areal/core/staleness_strategies.py - StandardPPOStrategy & SegmentWisePPOStrategy - areal/core/proximal_recomputer.py - Concrete proximal logprob recomputer - areal/core/capacity_modifier.py - CapacityModifier interface - areal/core/filtered_capacity_modifier.py - FilteredSamplesCapacityModifier **Tests** (10 test files, 227 tests total): - areal/tests/sdp/ - Complete test suite covering all components **Example**: - examples/math/gsm8k_grpo_segment_wise.yaml - Example configuration ## Code Quality Improvements ### 1. Removed Unnecessary Type Quotes Fixed 24+ type annotations (with __future__ annotations, quotes are unnecessary) ### 2. Decoupled Factory Dependencies Applied Dependency Injection pattern to eliminate tight coupling in RemoteInfEngine and engine classes ### 3. Comprehensive Observability Added metrics and DEBUG/INFO logging for staleness filtering, capacity management, and system monitoring ## Testing **Test Results**: 219 passed, 8 skipped (96% pass rate) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Description
For the future implementation of
RolloutController, we need to separate the task submission logic into an independent class.Related Issue
Discussion: #469
Issue: #260
Type of Change
work as expected)
Checklist
jb build docs/gemini review)Breaking Change Details (if applicable):
N/A
Additional Context
Tests passed:
test_sglang_engine.pytest_vllm_engine.pytest_examples.pyLearning curve of gsm8k example: