
feat(archon): add Expert Parallelism (EP) support for MoE models #833

Merged
garrett4wade merged 1 commit into main from rchardx/ep on Jan 18, 2026

Conversation

@rchardx (Collaborator) commented on Jan 17, 2026

Description

Implement EP for Mixture-of-Experts models in the Archon engine, tested with PyTorch 2.9.1. Key changes:

  • Add MoE module with router, grouped experts, and token dispatch/combine
  • Implement ExpertParallel style using PyTorch all-to-all collectives (a minimal sketch follows this list)
  • Add ArchonParallelDims for managing DP/TP/CP/EP mesh dimensions
  • Support Qwen3 MoE model with EP-aware parallelization
  • Add ReplicateParallel style for router gate computation
  • Refactor Ulysses all-to-all to support both FSDP and Archon backends
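
For reviewers, a minimal sketch of the dispatch/combine pattern named in the second bullet, assuming a flat [tokens, dim] layout and one expert owner per token. The function and variable names are illustrative only, not the actual ExpertParallel API, which additionally handles top-k duplication, router weighting, and autograd through the collectives.

```python
import torch
import torch.distributed as dist

def dispatch_combine(tokens, dest_rank, experts, ep_group):
    """tokens: [n, d]; dest_rank: [n] int64 EP rank owning each token's expert."""
    world = dist.get_world_size(ep_group)
    order = torch.argsort(dest_rank)               # group tokens by destination rank
    send = tokens[order]
    in_splits = torch.bincount(dest_rank, minlength=world)
    out_splits = torch.empty_like(in_splits)
    dist.all_to_all_single(out_splits, in_splits, group=ep_group)  # exchange counts
    recv = send.new_empty((int(out_splits.sum()), tokens.shape[-1]))
    dist.all_to_all_single(                         # dispatch: tokens -> expert owners
        recv, send,
        output_split_sizes=out_splits.tolist(),
        input_split_sizes=in_splits.tolist(),
        group=ep_group,
    )
    processed = experts(recv)                       # local expert computation
    back = torch.empty_like(send)
    dist.all_to_all_single(                         # combine: reverse all-to-all
        back, processed,
        output_split_sizes=in_splits.tolist(),
        input_split_sizes=out_splits.tolist(),
        group=ep_group,
    )
    out = torch.empty_like(tokens)
    out[order] = back                               # undo the destination sort
    return out
```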

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

@gemini-code-assist (Contributor)

Summary of Changes

Hello @rchardx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly extends the Archon engine's capabilities by introducing full support for Expert Parallelism (EP) in Mixture-of-Experts (MoE) models. It provides the foundational components for distributed MoE training, including new parallel styles, mesh management, and optimized communication patterns. The changes enable efficient scaling of MoE models across multiple devices and integrate seamlessly with existing parallelism strategies and torch.compile workflows.

Highlights

  • Expert Parallelism (EP) Support: Introduced comprehensive Expert Parallelism (EP) support for Mixture-of-Experts (MoE) models within the Archon engine. This includes new parallel styles, mesh dimension management, and core logic for token dispatch and combine using all-to-all collectives.
  • MoE Model Architecture Integration: Integrated MoE capabilities into the Qwen3 model architecture, allowing for configurable MoE layers, dynamic routing via TokenChoiceTopKRouter, and efficient expert computation using GroupedExperts with torch._grouped_mm (see the first sketch after this list).
  • Parallelism Dimension Management: Developed ArchonParallelDims to seamlessly manage and coordinate Data Parallelism (DP), Tensor Parallelism (TP), Context Parallelism (CP), and the newly added Expert Parallelism (EP) across device meshes.
  • Torch.compile Compatibility Improvements: Refactored ulysses_all_to_all to utilize all_to_all_single_autograd for enhanced torch.compile compatibility and implemented MoE-aware compilation strategies to handle dynamic shapes and FSDP hooks within compiled graphs.
  • State Dict Adapter Enhancements: Updated the Qwen3 state dict adapter to correctly handle the conversion of MoE expert weights (from 2D HuggingFace format to 3D Archon format) and router parameters, ensuring proper model loading and saving (see the second sketch after this list).
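
As a reading aid for the second highlight, here is a conceptual sketch of token-choice routing and grouped expert computation. The class names mirror the PR (TokenChoiceTopKRouter, GroupedExperts) but the bodies are assumptions: each token is shown with a single expert for simplicity (the real path repeats tokens top_k times and applies router weights when combining), and the Python loop stands in for the fused torch._grouped_mm kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenChoiceTopKRouterSketch(nn.Module):
    """Softmax gate + top-k selection; renormalization shown for illustration."""
    def __init__(self, dim: int, num_experts: int, top_k: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                           # x: [n_tokens, dim]
        scores = F.softmax(self.gate(x), dim=-1)
        weights, expert_ids = scores.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)
        return weights, expert_ids                  # both [n_tokens, top_k]

class GroupedExpertsSketch(nn.Module):
    """3D per-expert weights; reference loop in place of the fused grouped GEMM."""
    def __init__(self, dim: int, hidden: int, num_experts: int):
        super().__init__()
        self.w_gate = nn.Parameter(torch.randn(num_experts, dim, hidden) * 0.02)
        self.w_up = nn.Parameter(torch.randn(num_experts, dim, hidden) * 0.02)
        self.w_down = nn.Parameter(torch.randn(num_experts, hidden, dim) * 0.02)

    def forward(self, x, expert_ids):               # x: [n, dim], expert_ids: [n]
        out = torch.zeros_like(x)
        for e in range(self.w_gate.shape[0]):       # torch._grouped_mm fuses this loop
            m = expert_ids == e
            h = F.silu(x[m] @ self.w_gate[e]) * (x[m] @ self.w_up[e])
            out[m] = h @ self.w_down[e]
        return out
```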
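
And for the last highlight, a hedged sketch of the 2D-to-3D expert-weight conversion. The HuggingFace key pattern follows the Qwen3 MoE checkpoint layout; the Archon-side key names are invented for illustration.

```python
import torch

def stack_expert_weights(hf_sd: dict, num_layers: int, num_experts: int) -> dict:
    """Stack per-expert 2D HF weights into 3D [num_experts, ...] tensors."""
    out = {}
    for layer in range(num_layers):
        for proj in ("gate_proj", "up_proj", "down_proj"):
            per_expert = [
                hf_sd[f"model.layers.{layer}.mlp.experts.{e}.{proj}.weight"]
                for e in range(num_experts)
            ]
            # each entry is 2D [out_dim, in_dim]; stacking yields
            # 3D [num_experts, out_dim, in_dim]
            out[f"layers.{layer}.moe.experts.{proj}"] = torch.stack(per_expert)
    return out
```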



Copilot AI (Contributor) left a comment


Pull request overview

This PR implements Expert Parallelism (EP) for Mixture-of-Experts models in the Archon engine. It adds comprehensive MoE support including router, grouped experts, token dispatch/combine, and EP-aware parallelization for Qwen3 MoE models.

Changes:

  • Add complete MoE module implementation (router, experts, args, utils)
  • Implement ExpertParallel style using PyTorch all-to-all collectives
  • Add ArchonParallelDims with EP mesh dimension support
  • Support Qwen3 MoE model with EP-aware parallelization
  • Refactor Ulysses all-to-all to use all_to_all_single_autograd for torch.compile compatibility (see the sketch after this list)
  • Add comprehensive test coverage for MoE and EP
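
The Ulysses bullet above refers to the differentiable functional collective; a minimal sketch of the head-to-sequence exchange it enables is below. The shape convention and reshape order are assumptions, not the actual areal/models/fsdp/ulysses.py code.

```python
import torch
from torch.distributed._functional_collectives import all_to_all_single_autograd

def ulysses_scatter_heads(x: torch.Tensor, sp_group, sp_size: int) -> torch.Tensor:
    """[seq/sp, heads, dim] per rank -> [seq, heads/sp, dim] per rank."""
    seq, heads, dim = x.shape
    # split heads into sp_size chunks and make the destination rank the
    # leading dimension expected by all-to-all
    x = x.reshape(seq, sp_size, heads // sp_size, dim).transpose(0, 1).contiguous()
    # equal splits (None), so every rank exchanges same-sized chunks;
    # gradients flow back through a mirrored all-to-all
    y = all_to_all_single_autograd(x.flatten(0, 1), None, None, sp_group)
    return y.reshape(sp_size * seq, heads // sp_size, dim)
```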

Reviewed changes

Copilot reviewed 55 out of 55 changed files in this pull request and generated 1 comment.

Summary per file:

File                                        Description
areal/experimental/models/archon/moe/*      New MoE module implementation
areal/experimental/distributed/archon.py    EP support in ArchonParallelDims
areal/experimental/models/archon/qwen3/*    Qwen3 MoE model support
areal/models/fsdp/ulysses.py                Refactored all-to-all implementation
areal/tests/experimental/archon/*           Comprehensive MoE/EP tests
areal/utils/fsdp/parallel.py                Moved ReplicateParallel to distributed
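
For context on the ArchonParallelDims row in the table above, the sketch below shows one way such a helper can build its device meshes with PyTorch's DeviceMesh API. The dimension names, the dataclass shape, and the use of the private _flatten method are assumptions, not the Archon implementation.

```python
from dataclasses import dataclass
from torch.distributed.device_mesh import init_device_mesh

@dataclass
class ParallelDimsSketch:
    dp: int   # data parallel (FSDP) degree
    cp: int   # context parallel degree
    tp: int   # tensor parallel degree

    def build_mesh(self, device_type: str = "cuda"):
        # world_size must equal dp * cp * tp; EP does not add ranks but
        # regroups existing ones, here by flattening dp and cp (an
        # assumption about the actual grouping)
        mesh = init_device_mesh(
            device_type,
            (self.dp, self.cp, self.tp),
            mesh_dim_names=("dp", "cp", "tp"),
        )
        ep_mesh = mesh["dp", "cp"]._flatten(mesh_dim_name="ep")
        return mesh, ep_mesh
```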


Comment thread areal/tests/experimental/archon/test_weight_sync.py
@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces significant new functionality by adding Expert Parallelism (EP) support for Mixture-of-Experts (MoE) models within the Archon engine. The changes are extensive, including a new MoE module with a router and grouped experts, a refactored parallel dimension management system (ArchonParallelDims) to handle the new EP dimension, and MoE-aware parallelization logic for Qwen3 models. The implementation is robust, featuring a vectorized attention mask creation for performance, a refactoring of Ulysses all-to-all to be compatible with torch.compile, and a comprehensive suite of new tests. My review found a couple of minor points to address, primarily a documentation inconsistency in the MoE configuration and the removal of some shape assertions. Overall, this is a high-quality contribution that significantly enhances the engine's capabilities.
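
The review credits a vectorized attention-mask construction; the snippet below shows one standard vectorized formulation for packed sequences (block-causal document masking). It is an assumption about the technique, not a copy of the PR's code.

```python
import torch

def block_causal_mask(cu_seqlens: torch.Tensor, total_len: int) -> torch.Tensor:
    """cu_seqlens: [num_seqs + 1] cumulative boundaries, e.g. [0, 3, 7, 10]."""
    pos = torch.arange(total_len, device=cu_seqlens.device)
    # document id per position, computed without any Python-level loop
    doc = torch.bucketize(pos, cu_seqlens[1:], right=True)
    same_doc = doc[:, None] == doc[None, :]
    causal = pos[:, None] >= pos[None, :]
    return same_doc & causal                        # [total_len, total_len] bool
```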

Comment thread areal/experimental/models/archon/moe/args.py
Comment thread areal/experimental/models/archon/qwen2/model/rope.py
@rchardx force-pushed the rchardx/ep branch 4 times, most recently from 9079329 to 2b32e1b, on January 17, 2026 at 09:21
@garrett4wade (Collaborator) left a comment


Amazing work! I've left several comments to be addressed.

Comment thread areal/distributed/__init__.py Outdated
Comment thread areal/experimental/engine/archon_engine.py
Comment thread areal/experimental/models/archon/parallel_dims.py
Comment thread areal/experimental/models/archon/parallel_dims.py
Comment thread areal/experimental/engine/archon_engine.py Outdated
Comment thread areal/experimental/models/archon/qwen3/infra/parallelize.py
Comment thread areal/tests/experimental/archon/test_moe.py
Comment thread areal/tests/experimental/archon/torchrun/run_qwen3_parallelize.py Outdated
Comment thread areal/tests/experimental/archon/test_state_dict_adapter.py
Comment thread areal/experimental/distributed/archon/parallel_dims.py Outdated
Comment thread areal/experimental/engine/archon_engine.py Outdated
@garrett4wade (Collaborator) left a comment


LGTM!

@garrett4wade merged commit 141d523 into main on Jan 18, 2026
1 check passed
@garrett4wade deleted the rchardx/ep branch on January 18, 2026 at 08:41
leandermaben pushed a commit to leandermaben/AReaL that referenced this pull request on Mar 24, 2026 (…lusionAI#833)