
feat(archon): add Expert Parallelism (EP) support for MoE models #833

Merged
garrett4wade merged 1 commit into main from rchardx/ep on Jan 18, 2026

Conversation

@rchardx (Collaborator) commented on Jan 17, 2026

Description

Implement EP for Mixture-of-Experts models in the Archon engine, tested with PyTorch 2.9.1. Key changes:

  • Add MoE module with router, grouped experts, and token dispatch/combine
  • Implement ExpertParallel style using PyTorch all-to-all collectives (a minimal sketch follows this list)
  • Add ArchonParallelDims for managing DP/TP/CP/EP mesh dimensions
  • Support Qwen3 MoE model with EP-aware parallelization
  • Add ReplicateParallel style for router gate computation
  • Refactor Ulysses all-to-all to support both FSDP and Archon backends
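
For reviewers, a minimal sketch of the dispatch/combine pattern named in the second bullet, assuming a flat [tokens, dim] layout and one expert owner per token. The function and variable names are illustrative only, not the actual ExpertParallel API, which additionally handles top-k duplication, router weighting, and autograd through the collectives.

```python
import torch
import torch.distributed as dist

def dispatch_combine(tokens, dest_rank, experts, ep_group):
    """tokens: [n, d]; dest_rank: [n] int64 EP rank owning each token's expert."""
    world = dist.get_world_size(ep_group)
    order = torch.argsort(dest_rank)               # group tokens by destination rank
    send = tokens[order]
    in_splits = torch.bincount(dest_rank, minlength=world)
    out_splits = torch.empty_like(in_splits)
    dist.all_to_all_single(out_splits, in_splits, group=ep_group)  # exchange counts
    recv = send.new_empty((int(out_splits.sum()), tokens.shape[-1]))
    dist.all_to_all_single(                         # dispatch: tokens -> expert owners
        recv, send,
        output_split_sizes=out_splits.tolist(),
        input_split_sizes=in_splits.tolist(),
        group=ep_group,
    )
    processed = experts(recv)                       # local expert computation
    back = torch.empty_like(send)
    dist.all_to_all_single(                         # combine: reverse all-to-all
        back, processed,
        output_split_sizes=in_splits.tolist(),
        input_split_sizes=out_splits.tolist(),
        group=ep_group,
    )
    out = torch.empty_like(tokens)
    out[order] = back                               # undo the destination sort
    return out
```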

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

@gemini-code-assist (Contributor)

Summary of Changes

Hello @rchardx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly extends the Archon engine's capabilities by introducing full support for Expert Parallelism (EP) in Mixture-of-Experts (MoE) models. It provides the foundational components for distributed MoE training, including new parallel styles, mesh management, and optimized communication patterns. The changes enable efficient scaling of MoE models across multiple devices and integrate seamlessly with existing parallelism strategies and torch.compile workflows.

Highlights

  • Expert Parallelism (EP) Support: Introduced comprehensive Expert Parallelism (EP) support for Mixture-of-Experts (MoE) models within the Archon engine. This includes new parallel styles, mesh dimension management, and core logic for token dispatch and combine using all-to-all collectives.
  • MoE Model Architecture Integration: Integrated MoE capabilities into the Qwen3 model architecture, allowing for configurable MoE layers, dynamic routing via TokenChoiceTopKRouter, and efficient expert computation using GroupedExperts with torch._grouped_mm (see the first sketch after this list).
  • Parallelism Dimension Management: Developed ArchonParallelDims to seamlessly manage and coordinate Data Parallelism (DP), Tensor Parallelism (TP), Context Parallelism (CP), and the newly added Expert Parallelism (EP) across device meshes.
  • Torch.compile Compatibility Improvements: Refactored ulysses_all_to_all to utilize all_to_all_single_autograd for enhanced torch.compile compatibility and implemented MoE-aware compilation strategies to handle dynamic shapes and FSDP hooks within compiled graphs.
  • State Dict Adapter Enhancements: Updated the Qwen3 state dict adapter to correctly handle the conversion of MoE expert weights (from 2D HuggingFace format to 3D Archon format) and router parameters, ensuring proper model loading and saving (see the second sketch after this list).
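
As a reading aid for the second highlight, here is a conceptual sketch of token-choice routing and grouped expert computation. The class names mirror the PR (TokenChoiceTopKRouter, GroupedExperts) but the bodies are assumptions: each token is shown with a single expert for simplicity (the real path repeats tokens top_k times and applies router weights when combining), and the Python loop stands in for the fused torch._grouped_mm kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenChoiceTopKRouterSketch(nn.Module):
    """Softmax gate + top-k selection; renormalization shown for illustration."""
    def __init__(self, dim: int, num_experts: int, top_k: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                           # x: [n_tokens, dim]
        scores = F.softmax(self.gate(x), dim=-1)
        weights, expert_ids = scores.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)
        return weights, expert_ids                  # both [n_tokens, top_k]

class GroupedExpertsSketch(nn.Module):
    """3D per-expert weights; reference loop in place of the fused grouped GEMM."""
    def __init__(self, dim: int, hidden: int, num_experts: int):
        super().__init__()
        self.w_gate = nn.Parameter(torch.randn(num_experts, dim, hidden) * 0.02)
        self.w_up = nn.Parameter(torch.randn(num_experts, dim, hidden) * 0.02)
        self.w_down = nn.Parameter(torch.randn(num_experts, hidden, dim) * 0.02)

    def forward(self, x, expert_ids):               # x: [n, dim], expert_ids: [n]
        out = torch.zeros_like(x)
        for e in range(self.w_gate.shape[0]):       # torch._grouped_mm fuses this loop
            m = expert_ids == e
            h = F.silu(x[m] @ self.w_gate[e]) * (x[m] @ self.w_up[e])
            out[m] = h @ self.w_down[e]
        return out
```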
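
And for the last highlight, a hedged sketch of the 2D-to-3D expert-weight conversion. The HuggingFace key pattern follows the Qwen3 MoE checkpoint layout; the Archon-side key names are invented for illustration.

```python
import torch

def stack_expert_weights(hf_sd: dict, num_layers: int, num_experts: int) -> dict:
    """Stack per-expert 2D HF weights into 3D [num_experts, ...] tensors."""
    out = {}
    for layer in range(num_layers):
        for proj in ("gate_proj", "up_proj", "down_proj"):
            per_expert = [
                hf_sd[f"model.layers.{layer}.mlp.experts.{e}.{proj}.weight"]
                for e in range(num_experts)
            ]
            # each entry is 2D [out_dim, in_dim]; stacking yields
            # 3D [num_experts, out_dim, in_dim]
            out[f"layers.{layer}.moe.experts.{proj}"] = torch.stack(per_expert)
    return out
```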



Copilot AI (Contributor) left a comment


Pull request overview

This PR implements Expert Parallelism (EP) for Mixture-of-Experts models in the Archon engine. It adds comprehensive MoE support including router, grouped experts, token dispatch/combine, and EP-aware parallelization for Qwen3 MoE models.

Changes:

  • Add complete MoE module implementation (router, experts, args, utils)
  • Implement ExpertParallel style using PyTorch all-to-all collectives
  • Add ArchonParallelDims with EP mesh dimension support
  • Support Qwen3 MoE model with EP-aware parallelization
  • Refactor Ulysses all-to-all to use all_to_all_single_autograd for torch.compile compatibility (see the sketch after this list)
  • Add comprehensive test coverage for MoE and EP
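
The Ulysses bullet above refers to the differentiable functional collective; a minimal sketch of the head-to-sequence exchange it enables is below. The shape convention and reshape order are assumptions, not the actual areal/models/fsdp/ulysses.py code.

```python
import torch
from torch.distributed._functional_collectives import all_to_all_single_autograd

def ulysses_scatter_heads(x: torch.Tensor, sp_group, sp_size: int) -> torch.Tensor:
    """[seq/sp, heads, dim] per rank -> [seq, heads/sp, dim] per rank."""
    seq, heads, dim = x.shape
    # split heads into sp_size chunks and make the destination rank the
    # leading dimension expected by all-to-all
    x = x.reshape(seq, sp_size, heads // sp_size, dim).transpose(0, 1).contiguous()
    # equal splits (None), so every rank exchanges same-sized chunks;
    # gradients flow back through a mirrored all-to-all
    y = all_to_all_single_autograd(x.flatten(0, 1), None, None, sp_group)
    return y.reshape(sp_size * seq, heads // sp_size, dim)
```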

Reviewed changes

Copilot reviewed 55 out of 55 changed files in this pull request and generated 1 comment.

Summary per file:

File                                        Description
areal/experimental/models/archon/moe/*      New MoE module implementation
areal/experimental/distributed/archon.py    EP support in ArchonParallelDims
areal/experimental/models/archon/qwen3/*    Qwen3 MoE model support
areal/models/fsdp/ulysses.py                Refactored all-to-all implementation
areal/tests/experimental/archon/*           Comprehensive MoE/EP tests
areal/utils/fsdp/parallel.py                Moved ReplicateParallel to distributed
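
For context on the ArchonParallelDims row in the table above, the sketch below shows one way such a helper can build its device meshes with PyTorch's DeviceMesh API. The dimension names, the dataclass shape, and the use of the private _flatten method are assumptions, not the Archon implementation.

```python
from dataclasses import dataclass
from torch.distributed.device_mesh import init_device_mesh

@dataclass
class ParallelDimsSketch:
    dp: int   # data parallel (FSDP) degree
    cp: int   # context parallel degree
    tp: int   # tensor parallel degree

    def build_mesh(self, device_type: str = "cuda"):
        # world_size must equal dp * cp * tp; EP does not add ranks but
        # regroups existing ones, here by flattening dp and cp (an
        # assumption about the actual grouping)
        mesh = init_device_mesh(
            device_type,
            (self.dp, self.cp, self.tp),
            mesh_dim_names=("dp", "cp", "tp"),
        )
        ep_mesh = mesh["dp", "cp"]._flatten(mesh_dim_name="ep")
        return mesh, ep_mesh
```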


Comment thread areal/tests/experimental/archon/test_weight_sync.py
@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces significant new functionality by adding Expert Parallelism (EP) support for Mixture-of-Experts (MoE) models within the Archon engine. The changes are extensive, including a new MoE module with a router and grouped experts, a refactored parallel dimension management system (ArchonParallelDims) to handle the new EP dimension, and MoE-aware parallelization logic for Qwen3 models. The implementation is robust, featuring a vectorized attention mask creation for performance, a refactoring of Ulysses all-to-all to be compatible with torch.compile, and a comprehensive suite of new tests. My review found a couple of minor points to address, primarily a documentation inconsistency in the MoE configuration and the removal of some shape assertions. Overall, this is a high-quality contribution that significantly enhances the engine's capabilities.
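
The review credits a vectorized attention-mask construction; the snippet below shows one standard vectorized formulation for packed sequences (block-causal document masking). It is an assumption about the technique, not a copy of the PR's code.

```python
import torch

def block_causal_mask(cu_seqlens: torch.Tensor, total_len: int) -> torch.Tensor:
    """cu_seqlens: [num_seqs + 1] cumulative boundaries, e.g. [0, 3, 7, 10]."""
    pos = torch.arange(total_len, device=cu_seqlens.device)
    # document id per position, computed without any Python-level loop
    doc = torch.bucketize(pos, cu_seqlens[1:], right=True)
    same_doc = doc[:, None] == doc[None, :]
    causal = pos[:, None] >= pos[None, :]
    return same_doc & causal                        # [total_len, total_len] bool
```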

Comment thread areal/experimental/models/archon/moe/args.py
Comment thread areal/experimental/models/archon/qwen2/model/rope.py
@rchardx force-pushed the rchardx/ep branch 4 times, most recently from 9079329 to 2b32e1b, on January 17, 2026 at 09:21
@garrett4wade (Collaborator) left a comment


Amazing work! I've left several comments to be addressed.

Comment thread areal/distributed/__init__.py Outdated
Comment thread areal/experimental/engine/archon_engine.py
Comment thread areal/experimental/models/archon/parallel_dims.py
Comment thread areal/experimental/models/archon/parallel_dims.py
Comment thread areal/experimental/engine/archon_engine.py Outdated
Comment thread areal/experimental/models/archon/qwen3/infra/parallelize.py
Comment thread areal/tests/experimental/archon/test_moe.py
Comment thread areal/tests/experimental/archon/torchrun/run_qwen3_parallelize.py Outdated
Comment thread areal/tests/experimental/archon/test_state_dict_adapter.py
Comment thread areal/experimental/distributed/archon/parallel_dims.py Outdated
Comment thread areal/experimental/engine/archon_engine.py Outdated
@garrett4wade (Collaborator) left a comment


LGTM!

@garrett4wade merged commit 141d523 into main on Jan 18, 2026
1 check passed
@garrett4wade deleted the rchardx/ep branch on January 18, 2026 at 08:41
leandermaben pushed a commit to leandermaben/AReaL that referenced this pull request on Mar 24, 2026 (…lusionAI#833)