feat: align perf_tracer with task hierarchy #569
Conversation
**Summary of Changes (Gemini Code Assist):** This pull request significantly refactors the performance tracing system to provide a more structured and insightful view of asynchronous workflows. By introducing a task-scoped session hierarchy, streamlining event tracking, and enhancing client-side limit enforcement, the changes aim to capture clearer lifecycle metrics and improve diagnostic capabilities. The addition of a dedicated visualization tool further empowers users to analyze performance data effectively, ensuring that rollout diagnostics are actionable and precise.
Code Review
This pull request is a significant and well-executed refactoring of the performance tracing system to align with a new task/session hierarchy. The changes introduce task-scoped session tracing, simplify the event lifecycle, and add powerful visualization tools. The code is well-structured, and the documentation has been updated thoroughly to reflect these changes. I have one piece of feedback regarding a potential loss of diagnostic information in an exception handling path. Overall, this is a high-quality contribution that greatly improves the observability of the system.
Pull Request Overview
This PR introduces a hierarchical task-session tracing model to improve performance diagnostics for distributed RL training. It refactors the session tracking system to align with the task hierarchy, where each dataset-level task can spawn multiple sample-level sessions (when n_samples > 1).
Key changes:
- Introduces a task-session hierarchy with separate `task_id` (dataset-level) and `session_id` (sample-level) tracking
- Refactors session lifecycle events from the multi-stage model (enqueued → execution_start → execution_end → consumed) to a simplified model with `mark_finalized`
- Adds `toolcall` phase tracking alongside the existing `generate` and `reward` phases
- Includes a new visualization tool (`plot_session_trace.py`) for analyzing session traces
- Improves OpenAI client token limit handling
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| docs/best_practices/perf_profiling.md | Updates documentation to reflect task-session hierarchy and new event model with code examples |
| areal/workflow/vision_rlvr.py | Refactors to use per-sample session registration and tracing within _collect_samples method |
| areal/workflow/rlvr.py | Similar refactoring to vision_rlvr.py for task-session hierarchy support |
| areal/utils/perf_tracer.py | Core tracer refactoring with task/session hierarchy, simplified event model, and context variable support |
| areal/tools/plot_session_trace.py | New visualization tool (1360 lines) for generating interactive HTML reports from session traces |
| areal/tools/perf_trace_converter.py | Adds intelligent output path inference based on input type (file/directory/glob) |
| areal/tests/test_perf_tracer.py | Updates tests to use new task-session registration API and FINALIZED event |
| areal/tests/test_perf_trace_converter.py | Adds tests for new output path inference behavior |
| areal/experimental/openai/client.py | Fixes max_tokens handling to avoid None type errors and adds default fallback |
| areal/core/workflow_executor.py | Updates to register tasks instead of sessions and use mark_finalized event |
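The `max_tokens` fix in `areal/experimental/openai/client.py` can be illustrated with a small sketch. The function name, the default value, and the clamping rule below are assumptions; the PR only states that `None` type errors are avoided and a default fallback is added:

```python
# Hypothetical None-safe token-budget resolution: never compare or
# clamp against None, and fall back to a default when nothing is set.
DEFAULT_MAX_TOKENS = 512  # assumed fallback value


def resolve_max_tokens(requested=None, server_limit=None):
    """Resolve an effective max_tokens from an optional request value
    and an optional server-side limit."""
    if requested is None and server_limit is None:
        return DEFAULT_MAX_TOKENS
    if requested is None:
        return server_limit
    if server_limit is None:
        return requested
    # Both present: honor the stricter of the two limits.
    return min(requested, server_limit)
```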
```python
import plotly.graph_objects as go
from plotly.colors import qualitative
from plotly.subplots import make_subplots
```
Maybe we should add plotly to pyproject.toml?
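If plotly ends up as an optional dependency instead, a guarded import at the top of `plot_session_trace.py` is one common pattern; this is a sketch, not code from the PR:

```python
# Sketch: fail with a clear message when the optional plotly
# dependency is not installed, instead of a raw ImportError.
def import_plotly():
    try:
        import plotly.graph_objects as go
        from plotly.subplots import make_subplots
    except ImportError as exc:
        raise SystemExit(
            "plot_session_trace.py requires plotly; "
            "install it with `pip install plotly`."
        ) from exc
    return go, make_subplots
```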
Description
Introduces task-scoped session tracing to retire direct session plumbing and capture clearer lifecycle metrics and reasons. Infers default trace outputs, hardens OpenAI client limits, and adds visualization tooling to keep rollout diagnostics actionable.
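The "infers default trace outputs" part (output path inference in `perf_trace_converter.py`, keyed on whether the input is a file, directory, or glob) might look roughly like the following. The output filenames and exact rules here are assumptions for illustration:

```python
# Hypothetical output-path inference for a trace converter:
# glob pattern -> merged file beside the pattern's directory,
# directory    -> merged file inside it,
# single file  -> sibling file with a derived suffix.
from pathlib import Path


def infer_output_path(input_spec: str) -> Path:
    p = Path(input_spec)
    if any(ch in input_spec for ch in "*?["):
        # Glob pattern: merge matches into one file next to them.
        return p.parent / "trace_merged.json"
    if p.is_dir():
        # Directory input: place the merged trace inside it.
        return p / "trace_merged.json"
    # Single file: derive the output name from the input file.
    return p.with_suffix(".trace.json")
```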