
[Fix] Use gloo group barriers for distributed synchronization #607

Merged
garrett4wade merged 6 commits into main from sxj/barrier on Nov 21, 2025

Conversation

@fishcrap (Collaborator)

Description

This PR switches distributed synchronization to the gloo backend across the codebase: every dist.barrier call now rendezvouses on a dedicated CPU process group (dist.barrier(group=...)) instead of a device-specific barrier (dist.barrier(device_ids=...)).
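For readers unfamiliar with the pattern, here is a minimal sketch of a gloo-backed group barrier. It uses a single-process (world_size=1) file-based rendezvous purely so the snippet is runnable standalone; real training jobs would launch many ranks via a launcher such as torchrun:

```python
import os
import tempfile

import torch.distributed as dist

# Single-process rendezvous purely for illustration; real jobs span many ranks.
init_file = os.path.join(tempfile.mkdtemp(), "rendezvous")
dist.init_process_group(
    backend="gloo",
    init_method=f"file://{init_file}",
    rank=0,
    world_size=1,
)

# The PR's pattern: barrier on an explicit gloo (CPU) group rather than
# dist.barrier(device_ids=[...]), which is only meaningful for NCCL.
cpu_group = dist.new_group(backend="gloo")
dist.barrier(group=cpu_group)
barrier_ok = True  # reaching this line means the barrier completed

dist.destroy_process_group()
```

Because gloo runs over CPU sockets, this barrier works identically on CPU-only hosts and GPU clusters, which is what makes it a safer default than a device-bound barrier.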

Related Issue

N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

N/A

Additional Context

N/A


Need help? Check the Contributing Guide or ask in
GitHub Discussions!

@gemini-code-assist (Contributor)

Summary of Changes

Hello @fishcrap, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant refactoring of the distributed synchronization logic across the codebase. It transitions all instances of dist.barrier from using device-specific IDs to utilizing a common CPU group barrier, specifically self.train_engine.cpu_group or actor.cpu_group. This change aims to enhance the consistency and robustness of synchronization points within distributed operations, leveraging the gloo backend for more unified coordination.

Highlights

  • Standardized Distributed Synchronization: The pull request standardizes the distributed synchronization mechanism by consistently using gloo group barriers (dist.barrier(group=...)) instead of device-specific barriers (dist.barrier(device_ids=...)). This change is applied across various core components and example scripts.
  • Code Refactoring: This change is purely a code refactoring, ensuring no functional changes to the existing distributed training and evaluation workflows. It aims to improve consistency and robustness in synchronization.
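The `actor.cpu_group` / `self.train_engine.cpu_group` handles mentioned above can be sketched as follows. This is a hypothetical single-process illustration, not AReaL's actual engine code; the `Engine` class and the file-based rendezvous are assumptions made so the example runs standalone:

```python
import os
import tempfile

import torch.distributed as dist

class Engine:
    """Hypothetical engine that keeps a gloo CPU group for barriers."""

    def __init__(self) -> None:
        # Real jobs would typically use NCCL as the default backend on GPU
        # clusters; gloo is used here so the sketch runs on CPU-only machines.
        store = os.path.join(tempfile.mkdtemp(), "store")
        dist.init_process_group(
            backend="gloo",
            init_method=f"file://{store}",
            rank=0,
            world_size=1,
        )
        # Dedicated CPU-side group, analogous to actor.cpu_group in this PR.
        self.cpu_group = dist.new_group(backend="gloo")

engine = Engine()
dist.barrier(group=engine.cpu_group)  # CPU rendezvous; no device_ids needed
cpu_world = dist.get_world_size(group=engine.cpu_group)
dist.destroy_process_group()
```

Keeping a separate gloo group alongside the default (possibly NCCL) group lets synchronization points avoid touching the accelerator at all.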
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (Bot) left a comment


Code Review

This pull request correctly refactors distributed synchronization by replacing deprecated dist.barrier(device_ids=...) calls with dist.barrier(group=...) using a gloo backend, which is a great improvement for robustness. The changes in the core library files also correctly adopt the pattern of calling current_platform.synchronize() before dist.barrier(). However, this improved pattern is not consistently applied across all the example files. I've added comments to address these inconsistencies for correctness and maintainability. Additionally, I've pointed out a few instances of redundant synchronization calls in the examples that can be removed.

Comment thread examples/alignment/hhrlhf_rw.py Outdated
Comment thread examples/alignment/hhrlhf_rw.py Outdated
Comment thread examples/camel/train.py Outdated
Comment thread examples/camel/train.py Outdated
Comment thread examples/camel/train.py Outdated
Comment thread examples/vlm/clevr_count_70k_grpo.py Outdated
Comment thread examples/vlm/clevr_count_70k_grpo.py Outdated
Comment thread examples/vlm/clevr_count_70k_grpo.py Outdated
Comment thread examples/vlm/clevr_count_70k_sft.py Outdated
Comment thread examples/vlm/clevr_count_70k_sft.py Outdated
@gemini-code-assist (Contributor)

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@fishcrap (Collaborator, Author)

/gemini review

@fishcrap fishcrap added the safe-to-test Ready to run unit-tests in a PR. label Nov 21, 2025
@gemini-code-assist (Bot) left a comment


Code Review

This pull request is a great refactoring effort to replace the deprecated dist.barrier(device_ids=...) with the recommended dist.barrier(group=...) using a gloo process group for CPU-level synchronization. This change improves correctness and portability, especially for multi-node environments. The changes are applied consistently across the codebase. I've made a couple of suggestions to use the platform-agnostic current_platform.synchronize() for better consistency and portability, instead of torch.cuda.synchronize().
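The "synchronize, then barrier" pattern the review recommends can be captured in a small helper. This is a hypothetical sketch, not AReaL's implementation: in the actual codebase the device flush would be `current_platform.synchronize()` from `areal.platforms`; the CUDA fallback below is an assumption so the sketch runs without that module:

```python
import torch
import torch.distributed as dist

def sync_then_barrier(cpu_group) -> None:
    """Flush pending device work, then rendezvous on a gloo CPU group.

    Hypothetical helper mirroring the pattern in this PR. AReaL would call
    current_platform.synchronize() here; we fall back to CUDA when available.
    """
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # ensure queued kernels finish first
    dist.barrier(group=cpu_group)  # CPU-level rendezvous across ranks
```

Flushing the device first guarantees that when a rank passes the CPU barrier, its accelerator work for the step is actually complete, not merely enqueued.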

# Prepare batch for training
batch = batch.to('cuda')
dist.barrier(device_ids=[actor.device.index])
torch.cuda.synchronize()
Contributor


Severity: medium

For consistency with the rest of the codebase and to promote platform-agnostic code, it would be better to use current_platform.synchronize() here instead of torch.cuda.synchronize(). The user of this guide would also need to import it: from areal.platforms import current_platform.

Suggested change:

- torch.cuda.synchronize()
+ current_platform.synchronize()

Comment thread examples/tir/train_tir.py
Comment on lines 242 to +250
torch.cuda.synchronize()
dist.barrier(group=actor.cpu_group)

# Upload statistics to the logger
stats = stats_tracker.export_all(reduce_group=actor.data_parallel_group)
stats_logger.commit(epoch, step, global_step, stats)

dist.barrier(device_ids=[actor.device.index])
torch.cuda.synchronize()
dist.barrier(group=actor.cpu_group)
Contributor


Severity: medium

For consistency and portability across different hardware platforms, it's better to use current_platform.synchronize() instead of torch.cuda.synchronize(). This aligns with the practice in other parts of the codebase. You'll need to add from areal.platforms import current_platform at the top of the file.

Suggested change:

Before:

    torch.cuda.synchronize()
    dist.barrier(group=actor.cpu_group)

    # Upload statistics to the logger
    stats = stats_tracker.export_all(reduce_group=actor.data_parallel_group)
    stats_logger.commit(epoch, step, global_step, stats)

    dist.barrier(device_ids=[actor.device.index])
    torch.cuda.synchronize()
    dist.barrier(group=actor.cpu_group)

After:

    current_platform.synchronize()
    dist.barrier(group=actor.cpu_group)

    # Upload statistics to the logger
    stats = stats_tracker.export_all(reduce_group=actor.data_parallel_group)
    stats_logger.commit(epoch, step, global_step, stats)

    current_platform.synchronize()
    dist.barrier(group=actor.cpu_group)

@garrett4wade (Collaborator) left a comment


LGTM

@garrett4wade garrett4wade merged commit 0a89bb1 into main Nov 21, 2025
7 checks passed
@garrett4wade garrett4wade deleted the sxj/barrier branch November 21, 2025 09:36
Bruce-rl-hw pushed a commit to Bruce-rl-hw/AReaL-vllm that referenced this pull request Dec 4, 2025
leandermaben pushed a commit to leandermaben/AReaL that referenced this pull request Mar 24, 2026