
[Fix] Use gloo group barriers for distributed synchronization #607

Merged
garrett4wade merged 6 commits into main from sxj/barrier on Nov 21, 2025

Conversation

@fishcrap (Collaborator)

Description

This PR switches distributed synchronization to the gloo backend across the codebase: every dist.barrier call now rendezvouses on a dedicated CPU process group (dist.barrier(group=...)) instead of a device-specific barrier (dist.barrier(device_ids=...)).
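For readers unfamiliar with the pattern, here is a minimal sketch of a gloo-backed group barrier. It uses a single-process (world_size=1) file-based rendezvous purely so the snippet is runnable standalone; real training jobs would launch many ranks via a launcher such as torchrun:

```python
import os
import tempfile

import torch.distributed as dist

# Single-process rendezvous purely for illustration; real jobs span many ranks.
init_file = os.path.join(tempfile.mkdtemp(), "rendezvous")
dist.init_process_group(
    backend="gloo",
    init_method=f"file://{init_file}",
    rank=0,
    world_size=1,
)

# The PR's pattern: barrier on an explicit gloo (CPU) group rather than
# dist.barrier(device_ids=[...]), which is only meaningful for NCCL.
cpu_group = dist.new_group(backend="gloo")
dist.barrier(group=cpu_group)
barrier_ok = True  # reaching this line means the barrier completed

dist.destroy_process_group()
```

Because gloo runs over CPU sockets, this barrier works identically on CPU-only hosts and GPU clusters, which is what makes it a safer default than a device-bound barrier.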

Related Issue

N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

N/A

Additional Context

N/A


Need help? Check the Contributing Guide or ask in
GitHub Discussions!

@gemini-code-assist (Contributor)

Summary of Changes

Hello @fishcrap, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant refactoring of the distributed synchronization logic across the codebase. It transitions all instances of dist.barrier from using device-specific IDs to utilizing a common CPU group barrier, specifically self.train_engine.cpu_group or actor.cpu_group. This change aims to enhance the consistency and robustness of synchronization points within distributed operations, leveraging the gloo backend for more unified coordination.

Highlights

  • Standardized Distributed Synchronization: The pull request standardizes the distributed synchronization mechanism by consistently using gloo group barriers (dist.barrier(group=...)) instead of device-specific barriers (dist.barrier(device_ids=...)). This change is applied across various core components and example scripts.
  • Code Refactoring: This change is purely a code refactoring, ensuring no functional changes to the existing distributed training and evaluation workflows. It aims to improve consistency and robustness in synchronization.
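The `actor.cpu_group` / `self.train_engine.cpu_group` handles mentioned above can be sketched as follows. This is a hypothetical single-process illustration, not AReaL's actual engine code; the `Engine` class and the file-based rendezvous are assumptions made so the example runs standalone:

```python
import os
import tempfile

import torch.distributed as dist

class Engine:
    """Hypothetical engine that keeps a gloo CPU group for barriers."""

    def __init__(self) -> None:
        # Real jobs would typically use NCCL as the default backend on GPU
        # clusters; gloo is used here so the sketch runs on CPU-only machines.
        store = os.path.join(tempfile.mkdtemp(), "store")
        dist.init_process_group(
            backend="gloo",
            init_method=f"file://{store}",
            rank=0,
            world_size=1,
        )
        # Dedicated CPU-side group, analogous to actor.cpu_group in this PR.
        self.cpu_group = dist.new_group(backend="gloo")

engine = Engine()
dist.barrier(group=engine.cpu_group)  # CPU rendezvous; no device_ids needed
cpu_world = dist.get_world_size(group=engine.cpu_group)
dist.destroy_process_group()
```

Keeping a separate gloo group alongside the default (possibly NCCL) group lets synchronization points avoid touching the accelerator at all.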
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (Bot) left a comment


Code Review

This pull request correctly refactors distributed synchronization by replacing deprecated dist.barrier(device_ids=...) calls with dist.barrier(group=...) using a gloo backend, which is a great improvement for robustness. The changes in the core library files also correctly adopt the pattern of calling current_platform.synchronize() before dist.barrier(). However, this improved pattern is not consistently applied across all the example files. I've added comments to address these inconsistencies for correctness and maintainability. Additionally, I've pointed out a few instances of redundant synchronization calls in the examples that can be removed.

Comment thread examples/alignment/hhrlhf_rw.py Outdated
Comment thread examples/alignment/hhrlhf_rw.py Outdated
Comment thread examples/camel/train.py Outdated
Comment thread examples/camel/train.py Outdated
Comment thread examples/camel/train.py Outdated
Comment thread examples/vlm/clevr_count_70k_grpo.py Outdated
Comment thread examples/vlm/clevr_count_70k_grpo.py Outdated
Comment thread examples/vlm/clevr_count_70k_grpo.py Outdated
Comment thread examples/vlm/clevr_count_70k_sft.py Outdated
Comment thread examples/vlm/clevr_count_70k_sft.py Outdated
@gemini-code-assist (Contributor)

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@fishcrap (Collaborator, Author)

/gemini review

@fishcrap fishcrap added the safe-to-test Ready to run unit-tests in a PR. label Nov 21, 2025
@gemini-code-assist (Bot) left a comment


Code Review

This pull request is a great refactoring effort to replace the deprecated dist.barrier(device_ids=...) with the recommended dist.barrier(group=...) using a gloo process group for CPU-level synchronization. This change improves correctness and portability, especially for multi-node environments. The changes are applied consistently across the codebase. I've made a couple of suggestions to use the platform-agnostic current_platform.synchronize() for better consistency and portability, instead of torch.cuda.synchronize().
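The "synchronize, then barrier" pattern the review recommends can be captured in a small helper. This is a hypothetical sketch, not AReaL's implementation: in the actual codebase the device flush would be `current_platform.synchronize()` from `areal.platforms`; the CUDA fallback below is an assumption so the sketch runs without that module:

```python
import torch
import torch.distributed as dist

def sync_then_barrier(cpu_group) -> None:
    """Flush pending device work, then rendezvous on a gloo CPU group.

    Hypothetical helper mirroring the pattern in this PR. AReaL would call
    current_platform.synchronize() here; we fall back to CUDA when available.
    """
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # ensure queued kernels finish first
    dist.barrier(group=cpu_group)  # CPU-level rendezvous across ranks
```

Flushing the device first guarantees that when a rank passes the CPU barrier, its accelerator work for the step is actually complete, not merely enqueued.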

# Prepare batch for training
batch = batch.to('cuda')
dist.barrier(device_ids=[actor.device.index])
torch.cuda.synchronize()
Contributor


Severity: medium

For consistency with the rest of the codebase and to promote platform-agnostic code, it would be better to use current_platform.synchronize() here instead of torch.cuda.synchronize(). The user of this guide would also need to import it: from areal.platforms import current_platform.

Suggested change:

- torch.cuda.synchronize()
+ current_platform.synchronize()

Comment thread examples/tir/train_tir.py
Comment on lines 242 to +250
torch.cuda.synchronize()
dist.barrier(group=actor.cpu_group)

# Upload statistics to the logger
stats = stats_tracker.export_all(reduce_group=actor.data_parallel_group)
stats_logger.commit(epoch, step, global_step, stats)

dist.barrier(device_ids=[actor.device.index])
torch.cuda.synchronize()
dist.barrier(group=actor.cpu_group)
Contributor


Severity: medium

For consistency and portability across different hardware platforms, it's better to use current_platform.synchronize() instead of torch.cuda.synchronize(). This aligns with the practice in other parts of the codebase. You'll need to add from areal.platforms import current_platform at the top of the file.

Suggested change:

Before:

    torch.cuda.synchronize()
    dist.barrier(group=actor.cpu_group)

    # Upload statistics to the logger
    stats = stats_tracker.export_all(reduce_group=actor.data_parallel_group)
    stats_logger.commit(epoch, step, global_step, stats)

    dist.barrier(device_ids=[actor.device.index])
    torch.cuda.synchronize()
    dist.barrier(group=actor.cpu_group)

After:

    current_platform.synchronize()
    dist.barrier(group=actor.cpu_group)

    # Upload statistics to the logger
    stats = stats_tracker.export_all(reduce_group=actor.data_parallel_group)
    stats_logger.commit(epoch, step, global_step, stats)

    current_platform.synchronize()
    dist.barrier(group=actor.cpu_group)

@garrett4wade (Collaborator) left a comment


LGTM

@garrett4wade garrett4wade merged commit 0a89bb1 into main Nov 21, 2025
7 checks passed
@garrett4wade garrett4wade deleted the sxj/barrier branch November 21, 2025 09:36
Bruce-rl-hw pushed a commit to Bruce-rl-hw/AReaL-vllm that referenced this pull request Dec 4, 2025
leandermaben pushed a commit to leandermaben/AReaL that referenced this pull request Mar 24, 2026