
feat: align perf_tracer with task hierarchy#569

Merged
garrett4wade merged 1 commit into main from rchardx/tracer
Nov 13, 2025

Conversation

@rchardx (Collaborator) commented Nov 13, 2025

Description

Introduces task-scoped session tracing to retire direct session plumbing and capture clearer lifecycle metrics and reasons. Infers default trace outputs, hardens OpenAI client limits, and adds visualization tooling to keep rollout diagnostics actionable.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

@rchardx rchardx changed the title Aligns perf tracer with task hierarchy feat: align perf_tracer with task hierarchy Nov 13, 2025
@gemini-code-assist (Contributor)

Summary of Changes

Hello @rchardx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the performance tracing system to provide a more structured and insightful view of asynchronous workflows. By introducing a task-scoped session hierarchy, streamlining event tracking, and enhancing client-side limit enforcement, the changes aim to capture clearer lifecycle metrics and improve diagnostic capabilities. The addition of a dedicated visualization tool further empowers users to analyze performance data effectively, ensuring that rollout diagnostics are actionable and precise.

Highlights

  • Task-Scoped Session Tracing: Introduced a new task-session hierarchy for performance tracing, allowing for more granular tracking of dataset-level tasks and their sample-level sessions. This replaces the previous direct session plumbing.
  • Refactored Session Events and Metrics: The SessionTraceEvent enum has been streamlined, removing old lifecycle events and introducing a unified FINALIZED event, along with TOOLCALL_START and TOOLCALL_END. Derived metrics have been updated to reflect this new structure, focusing on total_s, generate_s, reward_s, and toolcall_s.
  • Enhanced OpenAI Client Limits: Hardened the OpenAI client's max_new_tokens handling to ensure a default value of 512 is applied when not explicitly set, accompanied by a warning for clarity.
  • New Visualization Tool: A new Python script, plot_session_trace.py, has been added to generate interactive HTML plots for overall and step-wise session distributions, session lifecycles, and latency, leveraging the new task-session hierarchy.
  • Improved Trace Converter Output Inference: The perf_trace_converter.py tool now intelligently infers default output paths for Chrome Trace JSON files based on whether the input is a single file, a directory, or a glob pattern.
  • Workflow Integration and Documentation Updates: Workflows (RLVRWorkflow, VisionRLVRWorkflow) have been refactored to align with the new task-session tracing, and documentation (perf_profiling.md) has been thoroughly updated to explain the new hierarchy, API usage, and troubleshooting.
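To make the hierarchy described above concrete, here is a minimal, hypothetical sketch of a task-session tracer. The names SessionTraceEvent, FINALIZED, TOOLCALL_START/TOOLCALL_END, register_task, and mark_finalized come from the PR text, but the signatures and structure below are illustrative assumptions, not the actual API in areal/utils/perf_tracer.py.

```python
import time
from dataclasses import dataclass, field
from enum import Enum, auto


class SessionTraceEvent(Enum):
    # Streamlined event set named in the PR; the real enum may differ.
    GENERATE_START = auto()
    GENERATE_END = auto()
    TOOLCALL_START = auto()
    TOOLCALL_END = auto()
    REWARD_START = auto()
    REWARD_END = auto()
    FINALIZED = auto()  # unified terminal event replacing the old stages


@dataclass
class Session:
    """Sample-level trace record."""
    session_id: str
    events: list = field(default_factory=list)

    def record(self, event: SessionTraceEvent) -> None:
        self.events.append((event, time.monotonic()))


@dataclass
class Task:
    """Dataset-level task that can spawn multiple sessions (n_samples > 1)."""
    task_id: str
    sessions: dict = field(default_factory=dict)

    def register_session(self, session_id: str) -> Session:
        session = Session(session_id)
        self.sessions[session_id] = session
        return session


class PerfTracer:
    """Hypothetical tracer holding the task -> session hierarchy."""

    def __init__(self) -> None:
        self.tasks: dict = {}

    def register_task(self, task_id: str) -> Task:
        task = Task(task_id)
        self.tasks[task_id] = task
        return task

    def mark_finalized(self, task_id: str, session_id: str) -> None:
        # Single terminal event instead of the old multi-stage lifecycle.
        self.tasks[task_id].sessions[session_id].record(
            SessionTraceEvent.FINALIZED
        )


# Usage: one dataset-level task spawns one sample-level session,
# runs a generate phase, then finalizes.
tracer = PerfTracer()
task = tracer.register_task("task-0")
sess = task.register_session("task-0/sample-0")
sess.record(SessionTraceEvent.GENERATE_START)
sess.record(SessionTraceEvent.GENERATE_END)
tracer.mark_finalized("task-0", "task-0/sample-0")
```

The point of the two-level structure is that derived metrics such as generate_s, reward_s, and toolcall_s can be computed per session while still rolling up to the task that spawned them.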

@gemini-code-assist bot left a comment

Code Review

This pull request is a significant and well-executed refactoring of the performance tracing system to align with a new task/session hierarchy. The changes introduce task-scoped session tracing, simplify the event lifecycle, and add powerful visualization tools. The code is well-structured, and the documentation has been updated thoroughly to reflect these changes. I have one piece of feedback regarding a potential loss of diagnostic information in an exception handling path. Overall, this is a high-quality contribution that greatly improves the observability of the system.

Copilot AI left a comment

Pull Request Overview

This PR introduces a hierarchical task-session tracing model to improve performance diagnostics for distributed RL training. It refactors the session tracking system to align with the task hierarchy, where each dataset-level task can spawn multiple sample-level sessions (when n_samples > 1).

Key changes:

  • Introduces task-session hierarchy with separate task_id (dataset-level) and session_id (sample-level) tracking
  • Refactors session lifecycle events from the multi-stage model (enqueued → execution_start → execution_end → consumed) to a simplified model with mark_finalized
  • Adds toolcall phase tracking alongside existing generate and reward phases
  • Includes new visualization tooling (plot_session_trace.py) for analyzing session traces
  • Improves OpenAI client token limit handling
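The token-limit hardening in the last bullet can be sketched as follows. The default of 512 and the accompanying warning are stated in the PR summary; the function name and warning text here are assumptions, not the actual code in areal/experimental/openai/client.py.

```python
import warnings

DEFAULT_MAX_NEW_TOKENS = 512  # fallback value stated in the PR summary


def resolve_max_new_tokens(max_new_tokens):
    """Return a usable token limit, falling back to a default when unset."""
    if max_new_tokens is None:
        # A None limit previously propagated downstream and caused
        # type errors; warn so the implicit default is visible.
        warnings.warn(
            f"max_new_tokens not set; defaulting to {DEFAULT_MAX_NEW_TOKENS}"
        )
        return DEFAULT_MAX_NEW_TOKENS
    return max_new_tokens
```

Emitting a warning rather than silently defaulting keeps the behavior discoverable in logs, which matches the "accompanied by a warning for clarity" note in the Gemini summary.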

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file

  • docs/best_practices/perf_profiling.md: Updates documentation to reflect the task-session hierarchy and new event model, with code examples
  • areal/workflow/vision_rlvr.py: Refactors to use per-sample session registration and tracing within the _collect_samples method
  • areal/workflow/rlvr.py: Similar refactoring to vision_rlvr.py for task-session hierarchy support
  • areal/utils/perf_tracer.py: Core tracer refactoring with the task/session hierarchy, simplified event model, and context-variable support
  • areal/tools/plot_session_trace.py: New visualization tool (1360 lines) for generating interactive HTML reports from session traces
  • areal/tools/perf_trace_converter.py: Adds intelligent output path inference based on input type (file/directory/glob)
  • areal/tests/test_perf_tracer.py: Updates tests to use the new task-session registration API and FINALIZED event
  • areal/tests/test_perf_trace_converter.py: Adds tests for the new output path inference behavior
  • areal/experimental/openai/client.py: Fixes max_tokens handling to avoid None type errors and adds a default fallback
  • areal/core/workflow_executor.py: Updates to register tasks instead of sessions and use the mark_finalized event
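The output-path inference added to perf_trace_converter.py could look roughly like the sketch below: a single file, a directory, or a glob pattern each get a sensible default Chrome Trace JSON destination. The helper name and the exact naming scheme are assumptions for illustration, not the tool's verified behavior.

```python
import os


def infer_output_path(input_spec: str) -> str:
    """Pick a default .trace.json output for a file, directory, or glob input."""
    if os.path.isfile(input_spec):
        # Single file: write alongside it, swapping the extension.
        root, _ = os.path.splitext(input_spec)
        return root + ".trace.json"
    if os.path.isdir(input_spec):
        # Directory: merge all contained traces into one output inside it.
        return os.path.join(input_spec, "merged.trace.json")
    # Otherwise treat the spec as a glob pattern and write the merged
    # trace next to the matched files.
    base = os.path.dirname(input_spec) or "."
    return os.path.join(base, "merged.trace.json")
```

Inferring the destination this way means users only pass `--output` when they want to override the default, which keeps common invocations short.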


@rchardx rchardx added the safe-to-test Ready to run unit-tests in a PR. label Nov 13, 2025
Comment on lines +13 to +15
import plotly.graph_objects as go
from plotly.colors import qualitative
from plotly.subplots import make_subplots
Collaborator

Maybe we should add plotly in pyproject.toml?

@garrett4wade garrett4wade merged commit 35f39bb into main Nov 13, 2025
1 check passed
@garrett4wade garrett4wade deleted the rchardx/tracer branch November 13, 2025 09:30
Bruce-rl-hw pushed a commit to Bruce-rl-hw/AReaL-vllm that referenced this pull request Dec 4, 2025
