
feat(archon): add Context Parallelism (Ulysses SP) support#817

Merged
garrett4wade merged 1 commit into main from rchardx/archon_usp
Jan 12, 2026
Conversation

@rchardx
Collaborator

@rchardx rchardx commented Jan 9, 2026

Description

Implement Context Parallelism for Archon engine using Ulysses SP with
All-to-All communication pattern for distributed attention computation.

Key changes:

  • Add CP support to ArchonParallelDims with proper mesh configuration
  • Implement ulysses_slice_inputs/ulysses_gather_output for input/output handling
  • Integrate CP into Qwen2/Qwen3 attention modules via set_cp_group()
  • Add validation for TP/CP constraints on attention head counts
  • Enable page-aligned batching with LCM of page_size and seq_len_divisor
  • Strip ._orig_mod prefix in state dict adapters for torch.compile compat
  • Add incompatible config check for TP + AC + compile combination
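
The input-slicing half of the Ulysses pipeline can be sketched as follows. This is a hypothetical illustration on Python lists — the function name `ulysses_slice_inputs` comes from the PR description, but the real implementation operates on tensors and a CP process group, and its signature is an assumption here.

```python
# Hypothetical sketch of per-rank sequence slicing for Ulysses SP.
# The real ulysses_slice_inputs works on tensors and a process group;
# only the shard arithmetic is shown.

def ulysses_slice_inputs(tokens, cp_rank, cp_size):
    """Give each context-parallel rank a contiguous shard of the sequence.

    Assumes len(tokens) is divisible by cp_size; the engine's page-aligned
    batch padding is what guarantees this before slicing.
    """
    assert len(tokens) % cp_size == 0, "pad sequence to a multiple of cp_size"
    shard = len(tokens) // cp_size
    return tokens[cp_rank * shard : (cp_rank + 1) * shard]

# An 8-token sequence split across cp_size=4 ranks:
seq = list(range(8))
shards = [ulysses_slice_inputs(seq, r, 4) for r in range(4)]
print(shards)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

After the forward pass, the matching `ulysses_gather_output` would reassemble the per-rank shards in rank order.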

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

@gemini-code-assist
Contributor

Summary of Changes

Hello @rchardx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Archon Engine's capabilities by introducing Ulysses Sequence Parallelism (Context Parallelism). This new feature allows for more efficient training of models with very long sequences by distributing the attention computation across multiple GPUs. The changes involve fundamental updates to the parallel dimension management, model parallelization logic for Qwen2 and Qwen3, and improvements to data handling and testing infrastructure to support this advanced parallelism strategy.

Highlights

  • Context Parallelism (Ulysses SP) Support: Implemented Ulysses Sequence Parallelism for the Archon Engine, enabling efficient training of long sequences by distributing attention computation across GPUs using All-to-All communication. This includes adding a 'cp' dimension to ArchonParallelDims and integrating ulysses_slice_inputs and ulysses_gather_output for input/output handling.
  • 3D Device Mesh Configuration: The parallel dimensions now support a 3D mesh (data parallel, context parallel, tensor parallel) for more flexible and powerful distributed training setups. This involves updates to ArchonParallelDims to manage these new group structures and their ranks.
  • Model Parallelization Integration: Added apply_cp() functions for Qwen2 and Qwen3 models to configure Attention modules for Ulysses SP via a new set_cp_group() method, ensuring compatibility with existing Tensor Parallelism (TP) and Fully Sharded Data Parallelism (FSDP) strategies.
  • Page-Aligned Batch Padding: Introduced page-aligned batch padding using a batch_align_to parameter, which leverages DEFAULT_PAGE_SIZE_BYTES to optimize memory usage and ensure exact CP slicing without extra padding.
  • Enhanced Distributed Testing: Added new multi-GPU tests in test_distributed.py specifically for DP/TP/CP configurations, including a run_cp_forward.py torchrun script for CP validation. Existing weight sync tests were expanded to cover all supported model types and improve comparison logic for different padding strategies.
  • State Dictionary Adapter Improvements: The state_dict_adapter now strips the torch.compile wrapper prefix (_orig_mod) from parameter names, improving compatibility and simplifying weight synchronization with HuggingFace models.
  • GQA Support in SDPA: Enabled Grouped Query Attention (GQA) support in the Scaled Dot Product Attention (SDPA) wrapper by adding an enable_gqa flag, allowing for more efficient attention mechanisms.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces Ulysses Sequence Parallelism (Context Parallelism) to the Archon Engine, a significant feature for training models on long sequences. The implementation is comprehensive and well-executed, touching on parallelism configuration, engine logic, model implementation, and testing.

Key highlights of the changes include:

  • A robust refactoring of ArchonParallelDims to use a 3D device mesh (dp, cp, tp), which provides a clean and extensible foundation for managing complex parallelism strategies.
  • Seamless integration of Context Parallelism into the ArchonEngine, including input slicing, output gathering, and page-aligned batch padding.
  • Modifications to Qwen2 and Qwen3 models to support Ulysses SP via All-to-All communication in the attention mechanism.
  • A comprehensive new test suite for distributed execution (test_distributed.py and run_cp_forward.py) that validates DP, TP, and the new CP functionality.
  • Significant improvements to the weight synchronization tests, making them more robust and comprehensive by parameterizing across all supported model types and adding bidirectional completeness checks.

The code is of high quality, with clear documentation and strong validation logic. The addition of a check for incompatible configurations (TP + AC + compile) is particularly valuable for preventing user errors.

I have one suggestion for a minor refactoring to reduce code duplication. Overall, this is an excellent contribution.
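
The validation logic the review praises can be sketched roughly as below. This is a hedged reconstruction: the real check lives in the models' `apply_cp()` functions, and the function name, signature, and error messages here are assumptions.

```python
# Hedged sketch of the TP/CP head-count validation described in the PR.
# Ulysses SP scatters attention heads across CP ranks after the first
# all-to-all, so head counts must divide evenly by tp * cp.

def validate_head_counts(num_heads: int, num_kv_heads: int, tp: int, cp: int) -> None:
    if num_heads % (tp * cp) != 0:
        raise ValueError(
            f"num_heads={num_heads} must be divisible by tp*cp={tp * cp}"
        )
    if num_kv_heads % (tp * cp) != 0:
        # GQA models with few KV heads may repeat them instead of failing;
        # the qwen3 parallelize.py is described as doing GQA-aware repeating.
        raise ValueError(
            f"num_kv_heads={num_kv_heads} must be divisible by tp*cp={tp * cp}"
        )

validate_head_counts(32, 8, 2, 4)  # OK: 32 % 8 == 0 and 8 % 8 == 0
```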

Comment thread: areal/experimental/models/archon/qwen2/infra/parallelize.py
@rchardx rchardx force-pushed the rchardx/archon_usp branch from 6705c63 to dfcbfc2 on January 9, 2026 18:54
Contributor

Copilot AI left a comment


Pull request overview

This PR implements Ulysses Sequence Parallelism (Context Parallelism) for the Archon Engine, enabling efficient long-sequence training by distributing attention computation across GPUs via All-to-All communication.

Key changes:

  • Implements 3D mesh topology (dp, cp, tp) with proper process groups and rank accessors
  • Adds Ulysses SP utilities for input slicing and output gathering during forward/backward passes
  • Integrates CP into Qwen2/Qwen3 attention modules with validation of head count constraints
  • Enhances batch padding logic to align with page size and CP requirements using LCM
  • Expands test coverage with comprehensive multi-GPU distributed tests and 100% weight sync verification
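
The LCM-based padding in the list above can be illustrated with a short sketch. This is assumed behavior, not the engine's exact code: the alignment unit is taken to be the least common multiple of the page size (in tokens) and the CP sequence divisor, so a padded batch satisfies both page alignment and exact CP slicing at once.

```python
import math

# Assumed sketch of LCM-based batch alignment: a padded length must be a
# multiple of both the page size in tokens and the sequence-length
# divisor required by CP slicing, i.e. of their least common multiple.

def aligned_length(num_tokens: int, page_tokens: int, seq_len_divisor: int) -> int:
    align = math.lcm(page_tokens, seq_len_divisor)
    return math.ceil(num_tokens / align) * align

# With page_tokens=256 and seq_len_divisor=6, lcm is 768:
print(aligned_length(1000, 256, 6))  # 1536
```

A length already on the boundary is left unchanged, so no extra padding is added beyond what alignment requires.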

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated no comments.

Show a summary per file
File — Description
areal/utils/ulysses.py — Refactored Ulysses utilities; removed async_op, simplified All-to-All operations
areal/utils/data.py — Added batch_align_to parameter for CP-compatible padding alignment
areal/utils/constants.py — Added DEFAULT_PAGE_SIZE_BYTES constant for memory allocation
areal/experimental/utils/archon/parallel.py — Implemented 3D mesh (dp, cp, tp) with proper sub-meshes and process groups
areal/experimental/models/archon/ulysses.py — New file with CP-specific input slicing and output gathering utilities
areal/experimental/models/archon/qwen2/model/model.py — Added Ulysses SP support to Attention module with head scattering/gathering
areal/experimental/models/archon/qwen3/model/model.py — Added Ulysses SP support to Attention module with Q/K norm handling
areal/experimental/models/archon/qwen2/infra/parallelize.py — Added apply_cp() with head count validation and set_cp_group() integration
areal/experimental/models/archon/qwen3/infra/parallelize.py — Added apply_cp() with GQA-aware head repeating logic
areal/experimental/models/archon/qwen2/model/state_dict_adapter.py — Strip torch.compile _orig_mod prefix for weight sync compatibility
areal/experimental/models/archon/qwen3/model/state_dict_adapter.py — Strip torch.compile _orig_mod prefix for weight sync compatibility
areal/experimental/models/archon/model_spec.py — Updated ParallelizeFn protocol to include cp_group parameter
areal/experimental/models/archon/base.py — Added cu_seqlens and max_seqlen to forward signature
areal/experimental/models/archon/attention.py — Added enable_gqa flag to SDPA for GQA support
areal/experimental/engine/archon_engine.py — Integrated CP with input slicing, output gathering, and LCM-based padding
areal/api/alloc_mode.py — Removed CP validation error to enable CP configuration
areal/engine/fsdp_engine.py — Fixed import path for Scheduler (API module)
areal/tests/experimental/archon/test_distributed.py — New comprehensive multi-GPU tests for DP, TP, and CP
areal/tests/experimental/archon/torchrun/run_cp_forward.py — New CP forward test script with golden comparison
areal/tests/experimental/archon/test_weight_sync.py — Enhanced to test all supported model types with 100% weight verification
areal/tests/experimental/archon/test_forward.py — Moved multi-GPU tests to test_distributed.py
areal/tests/experimental/archon/test_grpo.py — Updated to use original batch length for padding-agnostic comparison
areal/tests/experimental/archon/utils.py — Added get_model_path_for_type() and dist cleanup in teardown
areal/tests/experimental/archon/torchrun/run_vs_fsdp.py — Updated comparison logic to handle different padding strategies
areal/tests/experimental/archon/torchrun/run_tp_forward.py — Added process group cleanup
areal/tests/experimental/archon/torchrun/run_forward.py — Added process group cleanup
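
The state-dict-adapter change in the table above boils down to a string transformation: `torch.compile` wraps a module so its parameter names gain an `_orig_mod` component, and removing it restores HuggingFace-compatible names. The adapters' real API may differ; this sketch shows only the string handling, and the key formats used are assumptions.

```python
# Sketch of the ._orig_mod prefix stripping noted for the state dict
# adapters. Handles both a leading "_orig_mod." prefix and an embedded
# "._orig_mod" component; the real adapter code may differ.

def strip_compile_prefix(state_dict: dict) -> dict:
    return {
        key.replace("._orig_mod", "").removeprefix("_orig_mod."): value
        for key, value in state_dict.items()
    }

sd = {"_orig_mod.model.layers.0.self_attn.q_proj.weight": 0}
print(strip_compile_prefix(sd))
# {'model.layers.0.self_attn.q_proj.weight': 0}
```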


@rchardx rchardx force-pushed the rchardx/archon_usp branch from dfcbfc2 to 454889a on January 9, 2026 19:06
@rchardx
Collaborator Author

rchardx commented Jan 9, 2026

/gemini review

@rchardx rchardx added the safe-to-test label (Ready to run unit-tests in a PR.) on Jan 9, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces significant new functionality by adding Context Parallelism (Ulysses SP) support to the Archon engine. The changes are extensive and well-structured, including a major refactoring of the parallelism handling with a clearer 3D device mesh, and the addition of dedicated utility modules for Ulysses and parallelism validation. The test suite has also been commendably enhanced with new distributed tests and parameterization of existing ones, which is crucial for a change of this complexity. I've found one critical issue in the new ulysses.py file that would cause a runtime error, which I've detailed below. Besides that, the implementation appears solid and thoughtfully designed.
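
For reviewers new to Ulysses SP, the attention-time data movement the review refers to can be summarized as shape bookkeeping. This is an assumption-laden conceptual sketch, not the PR's tensor code: before attention each CP rank holds a sequence shard with all heads; the first all-to-all trades sequence shards for head shards so each rank sees the full sequence for a subset of heads; a second all-to-all reverses the exchange afterwards.

```python
# Conceptual per-rank shape transitions for the Ulysses all-to-all
# (dimensions other than sequence and heads omitted for brevity).

def ulysses_shapes(seq_len: int, num_heads: int, cp: int):
    assert seq_len % cp == 0 and num_heads % cp == 0
    before = (seq_len // cp, num_heads)   # per-rank input: sequence shard, all heads
    during = (seq_len, num_heads // cp)   # inside attention: full sequence, head shard
    return before, during

print(ulysses_shapes(4096, 32, 4))  # ((1024, 32), (4096, 8))
```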

Comment thread: areal/experimental/models/archon/ulysses.py
@rchardx rchardx temporarily deployed to AReaL-unittests January 9, 2026 19:13 — with GitHub Actions Inactive
Comment thread: areal/experimental/models/archon/ulysses.py (outdated)
Comment thread: areal/utils/data.py
@rchardx rchardx force-pushed the rchardx/archon_usp branch from 454889a to 2542933 on January 10, 2026 15:15
@rchardx rchardx removed and re-added the safe-to-test label (Ready to run unit-tests in a PR.) on Jan 10, 2026
@rchardx rchardx temporarily deployed to AReaL-unittests January 10, 2026 15:19 — with GitHub Actions Inactive
Collaborator

@garrett4wade garrett4wade left a comment


LGTM

@garrett4wade garrett4wade merged commit 7927735 into main Jan 12, 2026
7 checks passed
@garrett4wade garrett4wade deleted the rchardx/archon_usp branch January 12, 2026 02:29
leandermaben pushed a commit to leandermaben/AReaL that referenced this pull request Mar 24, 2026
…nAI#817)

