
feat(archon): add Context Parallelism (Ulysses SP) support#817

Merged
garrett4wade merged 1 commit into main from rchardx/archon_usp
Jan 12, 2026
Conversation

@rchardx
Collaborator

@rchardx rchardx commented Jan 9, 2026

Description

Implement Context Parallelism for Archon engine using Ulysses SP with
All-to-All communication pattern for distributed attention computation.

Key changes:

  • Add CP support to ArchonParallelDims with proper mesh configuration
  • Implement ulysses_slice_inputs/ulysses_gather_output for input/output handling
  • Integrate CP into Qwen2/Qwen3 attention modules via set_cp_group()
  • Add validation for TP/CP constraints on attention head counts
  • Enable page-aligned batching with LCM of page_size and seq_len_divisor
  • Strip ._orig_mod prefix in state dict adapters for torch.compile compat
  • Add incompatible config check for TP + AC + compile combination
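
The input-slicing half of the Ulysses pipeline can be sketched as follows. This is a hypothetical illustration on Python lists — the function name `ulysses_slice_inputs` comes from the PR description, but the real implementation operates on tensors and a CP process group, and its signature is an assumption here.

```python
# Hypothetical sketch of per-rank sequence slicing for Ulysses SP.
# The real ulysses_slice_inputs works on tensors and a process group;
# only the shard arithmetic is shown.

def ulysses_slice_inputs(tokens, cp_rank, cp_size):
    """Give each context-parallel rank a contiguous shard of the sequence.

    Assumes len(tokens) is divisible by cp_size; the engine's page-aligned
    batch padding is what guarantees this before slicing.
    """
    assert len(tokens) % cp_size == 0, "pad sequence to a multiple of cp_size"
    shard = len(tokens) // cp_size
    return tokens[cp_rank * shard : (cp_rank + 1) * shard]

# An 8-token sequence split across cp_size=4 ranks:
seq = list(range(8))
shards = [ulysses_slice_inputs(seq, r, 4) for r in range(4)]
print(shards)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

After the forward pass, the matching `ulysses_gather_output` would reassemble the per-rank shards in rank order.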

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

@gemini-code-assist
Contributor

Summary of Changes

Hello @rchardx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Archon Engine's capabilities by introducing Ulysses Sequence Parallelism (Context Parallelism). This new feature allows for more efficient training of models with very long sequences by distributing the attention computation across multiple GPUs. The changes involve fundamental updates to the parallel dimension management, model parallelization logic for Qwen2 and Qwen3, and improvements to data handling and testing infrastructure to support this advanced parallelism strategy.

Highlights

  • Context Parallelism (Ulysses SP) Support: Implemented Ulysses Sequence Parallelism for the Archon Engine, enabling efficient training of long sequences by distributing attention computation across GPUs using All-to-All communication. This includes adding a 'cp' dimension to ArchonParallelDims and integrating ulysses_slice_inputs and ulysses_gather_output for input/output handling.
  • 3D Device Mesh Configuration: The parallel dimensions now support a 3D mesh (data parallel, context parallel, tensor parallel) for more flexible and powerful distributed training setups. This involves updates to ArchonParallelDims to manage these new group structures and their ranks.
  • Model Parallelization Integration: Added apply_cp() functions for Qwen2 and Qwen3 models to configure Attention modules for Ulysses SP via a new set_cp_group() method, ensuring compatibility with existing Tensor Parallelism (TP) and Fully Sharded Data Parallelism (FSDP) strategies.
  • Page-Aligned Batch Padding: Introduced page-aligned batch padding using a batch_align_to parameter, which leverages DEFAULT_PAGE_SIZE_BYTES to optimize memory usage and ensure exact CP slicing without extra padding.
  • Enhanced Distributed Testing: Added new multi-GPU tests in test_distributed.py specifically for DP/TP/CP configurations, including a run_cp_forward.py torchrun script for CP validation. Existing weight sync tests were expanded to cover all supported model types and improve comparison logic for different padding strategies.
  • State Dictionary Adapter Improvements: The state_dict_adapter now strips the torch.compile wrapper prefix (_orig_mod) from parameter names, improving compatibility and simplifying weight synchronization with HuggingFace models.
  • GQA Support in SDPA: Enabled Grouped Query Attention (GQA) support in the Scaled Dot Product Attention (SDPA) wrapper by adding an enable_gqa flag, allowing for more efficient attention mechanisms.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces Ulysses Sequence Parallelism (Context Parallelism) to the Archon Engine, a significant feature for training models on long sequences. The implementation is comprehensive and well-executed, touching on parallelism configuration, engine logic, model implementation, and testing.

Key highlights of the changes include:

  • A robust refactoring of ArchonParallelDims to use a 3D device mesh (dp, cp, tp), which provides a clean and extensible foundation for managing complex parallelism strategies.
  • Seamless integration of Context Parallelism into the ArchonEngine, including input slicing, output gathering, and page-aligned batch padding.
  • Modifications to Qwen2 and Qwen3 models to support Ulysses SP via All-to-All communication in the attention mechanism.
  • A comprehensive new test suite for distributed execution (test_distributed.py and run_cp_forward.py) that validates DP, TP, and the new CP functionality.
  • Significant improvements to the weight synchronization tests, making them more robust and comprehensive by parameterizing across all supported model types and adding bidirectional completeness checks.

The code is of high quality, with clear documentation and strong validation logic. The addition of a check for incompatible configurations (TP + AC + compile) is particularly valuable for preventing user errors.

I have one suggestion for a minor refactoring to reduce code duplication. Overall, this is an excellent contribution.
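
The validation logic the review praises can be sketched roughly as below. This is a hedged reconstruction: the real check lives in the models' `apply_cp()` functions, and the function name, signature, and error messages here are assumptions.

```python
# Hedged sketch of the TP/CP head-count validation described in the PR.
# Ulysses SP scatters attention heads across CP ranks after the first
# all-to-all, so head counts must divide evenly by tp * cp.

def validate_head_counts(num_heads: int, num_kv_heads: int, tp: int, cp: int) -> None:
    if num_heads % (tp * cp) != 0:
        raise ValueError(
            f"num_heads={num_heads} must be divisible by tp*cp={tp * cp}"
        )
    if num_kv_heads % (tp * cp) != 0:
        # GQA models with few KV heads may repeat them instead of failing;
        # the qwen3 parallelize.py is described as doing GQA-aware repeating.
        raise ValueError(
            f"num_kv_heads={num_kv_heads} must be divisible by tp*cp={tp * cp}"
        )

validate_head_counts(32, 8, 2, 4)  # OK: 32 % 8 == 0 and 8 % 8 == 0
```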

Comment thread: areal/experimental/models/archon/qwen2/infra/parallelize.py
@rchardx rchardx force-pushed the rchardx/archon_usp branch from 6705c63 to dfcbfc2 on January 9, 2026 18:54
Contributor

Copilot AI left a comment


Pull request overview

This PR implements Ulysses Sequence Parallelism (Context Parallelism) for the Archon Engine, enabling efficient long-sequence training by distributing attention computation across GPUs via All-to-All communication.

Key changes:

  • Implements 3D mesh topology (dp, cp, tp) with proper process groups and rank accessors
  • Adds Ulysses SP utilities for input slicing and output gathering during forward/backward passes
  • Integrates CP into Qwen2/Qwen3 attention modules with validation of head count constraints
  • Enhances batch padding logic to align with page size and CP requirements using LCM
  • Expands test coverage with comprehensive multi-GPU distributed tests and 100% weight sync verification
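
The LCM-based padding in the list above can be illustrated with a short sketch. This is assumed behavior, not the engine's exact code: the alignment unit is taken to be the least common multiple of the page size (in tokens) and the CP sequence divisor, so a padded batch satisfies both page alignment and exact CP slicing at once.

```python
import math

# Assumed sketch of LCM-based batch alignment: a padded length must be a
# multiple of both the page size in tokens and the sequence-length
# divisor required by CP slicing, i.e. of their least common multiple.

def aligned_length(num_tokens: int, page_tokens: int, seq_len_divisor: int) -> int:
    align = math.lcm(page_tokens, seq_len_divisor)
    return math.ceil(num_tokens / align) * align

# With page_tokens=256 and seq_len_divisor=6, lcm is 768:
print(aligned_length(1000, 256, 6))  # 1536
```

A length already on the boundary is left unchanged, so no extra padding is added beyond what alignment requires.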

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated no comments.

Show a summary per file
File — Description
areal/utils/ulysses.py — Refactored Ulysses utilities; removed async_op, simplified All-to-All operations
areal/utils/data.py — Added batch_align_to parameter for CP-compatible padding alignment
areal/utils/constants.py — Added DEFAULT_PAGE_SIZE_BYTES constant for memory allocation
areal/experimental/utils/archon/parallel.py — Implemented 3D mesh (dp, cp, tp) with proper sub-meshes and process groups
areal/experimental/models/archon/ulysses.py — New file with CP-specific input slicing and output gathering utilities
areal/experimental/models/archon/qwen2/model/model.py — Added Ulysses SP support to Attention module with head scattering/gathering
areal/experimental/models/archon/qwen3/model/model.py — Added Ulysses SP support to Attention module with Q/K norm handling
areal/experimental/models/archon/qwen2/infra/parallelize.py — Added apply_cp() with head count validation and set_cp_group() integration
areal/experimental/models/archon/qwen3/infra/parallelize.py — Added apply_cp() with GQA-aware head repeating logic
areal/experimental/models/archon/qwen2/model/state_dict_adapter.py — Strip torch.compile _orig_mod prefix for weight sync compatibility
areal/experimental/models/archon/qwen3/model/state_dict_adapter.py — Strip torch.compile _orig_mod prefix for weight sync compatibility
areal/experimental/models/archon/model_spec.py — Updated ParallelizeFn protocol to include cp_group parameter
areal/experimental/models/archon/base.py — Added cu_seqlens and max_seqlen to forward signature
areal/experimental/models/archon/attention.py — Added enable_gqa flag to SDPA for GQA support
areal/experimental/engine/archon_engine.py — Integrated CP with input slicing, output gathering, and LCM-based padding
areal/api/alloc_mode.py — Removed CP validation error to enable CP configuration
areal/engine/fsdp_engine.py — Fixed import path for Scheduler (API module)
areal/tests/experimental/archon/test_distributed.py — New comprehensive multi-GPU tests for DP, TP, and CP
areal/tests/experimental/archon/torchrun/run_cp_forward.py — New CP forward test script with golden comparison
areal/tests/experimental/archon/test_weight_sync.py — Enhanced to test all supported model types with 100% weight verification
areal/tests/experimental/archon/test_forward.py — Moved multi-GPU tests to test_distributed.py
areal/tests/experimental/archon/test_grpo.py — Updated to use original batch length for padding-agnostic comparison
areal/tests/experimental/archon/utils.py — Added get_model_path_for_type() and dist cleanup in teardown
areal/tests/experimental/archon/torchrun/run_vs_fsdp.py — Updated comparison logic to handle different padding strategies
areal/tests/experimental/archon/torchrun/run_tp_forward.py — Added process group cleanup
areal/tests/experimental/archon/torchrun/run_forward.py — Added process group cleanup
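
The state-dict-adapter change in the table above boils down to a string transformation: `torch.compile` wraps a module so its parameter names gain an `_orig_mod` component, and removing it restores HuggingFace-compatible names. The adapters' real API may differ; this sketch shows only the string handling, and the key formats used are assumptions.

```python
# Sketch of the ._orig_mod prefix stripping noted for the state dict
# adapters. Handles both a leading "_orig_mod." prefix and an embedded
# "._orig_mod" component; the real adapter code may differ.

def strip_compile_prefix(state_dict: dict) -> dict:
    return {
        key.replace("._orig_mod", "").removeprefix("_orig_mod."): value
        for key, value in state_dict.items()
    }

sd = {"_orig_mod.model.layers.0.self_attn.q_proj.weight": 0}
print(strip_compile_prefix(sd))
# {'model.layers.0.self_attn.q_proj.weight': 0}
```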


@rchardx rchardx force-pushed the rchardx/archon_usp branch from dfcbfc2 to 454889a on January 9, 2026 19:06
@rchardx
Collaborator Author

rchardx commented Jan 9, 2026

/gemini review

@rchardx rchardx added the safe-to-test label (Ready to run unit-tests in a PR.) on Jan 9, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces significant new functionality by adding Context Parallelism (Ulysses SP) support to the Archon engine. The changes are extensive and well-structured, including a major refactoring of the parallelism handling with a clearer 3D device mesh, and the addition of dedicated utility modules for Ulysses and parallelism validation. The test suite has also been commendably enhanced with new distributed tests and parameterization of existing ones, which is crucial for a change of this complexity. I've found one critical issue in the new ulysses.py file that would cause a runtime error, which I've detailed below. Besides that, the implementation appears solid and thoughtfully designed.
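
For reviewers new to Ulysses SP, the attention-time data movement the review refers to can be summarized as shape bookkeeping. This is an assumption-laden conceptual sketch, not the PR's tensor code: before attention each CP rank holds a sequence shard with all heads; the first all-to-all trades sequence shards for head shards so each rank sees the full sequence for a subset of heads; a second all-to-all reverses the exchange afterwards.

```python
# Conceptual per-rank shape transitions for the Ulysses all-to-all
# (dimensions other than sequence and heads omitted for brevity).

def ulysses_shapes(seq_len: int, num_heads: int, cp: int):
    assert seq_len % cp == 0 and num_heads % cp == 0
    before = (seq_len // cp, num_heads)   # per-rank input: sequence shard, all heads
    during = (seq_len, num_heads // cp)   # inside attention: full sequence, head shard
    return before, during

print(ulysses_shapes(4096, 32, 4))  # ((1024, 32), (4096, 8))
```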

Comment thread: areal/experimental/models/archon/ulysses.py
@rchardx rchardx temporarily deployed to AReaL-unittests January 9, 2026 19:13 — with GitHub Actions Inactive
Comment thread: areal/experimental/models/archon/ulysses.py (outdated)
Comment thread: areal/utils/data.py
@rchardx rchardx force-pushed the rchardx/archon_usp branch from 454889a to 2542933 on January 10, 2026 15:15
@rchardx rchardx removed and re-added the safe-to-test label (Ready to run unit-tests in a PR.) on Jan 10, 2026
@rchardx rchardx temporarily deployed to AReaL-unittests January 10, 2026 15:19 — with GitHub Actions Inactive
Collaborator

@garrett4wade garrett4wade left a comment


LGTM

@garrett4wade garrett4wade merged commit 7927735 into main Jan 12, 2026
7 checks passed
@garrett4wade garrett4wade deleted the rchardx/archon_usp branch January 12, 2026 02:29
leandermaben pushed a commit to leandermaben/AReaL that referenced this pull request Mar 24, 2026
…nAI#817)

