torchcomms is a new experimental communications API for PyTorch. This provides both the high level collectives API as well as several out of the box backends.
torchcomms requires the following software and hardware:
- Python 3.10 or higher
- PyTorch 2.8 or higher
- CUDA-capable GPU (for NCCL/NCCLX or RCCL backends)
- Intel XPU (for XCCL backend)
torchcomms is available on PyPI and can be installed using pip. Alternatively, you can build torchcomms from source.
You can install torchcomms and PyTorch (2.11+) using pip:
# Cuda 12.6
pip install torch torchcomms --index-url https://download.pytorch.org/whl/cu126
# Cuda 12.8
pip install torch torchcomms --index-url https://download.pytorch.org/whl/cu128
# Cuda 13.0
pip install torch torchcomms --index-url https://download.pytorch.org/whl/cu130You can install torchcomms and PyTorch nightly builds using pip:
# Cuda 12.6
pip install --pre torch torchcomms --index-url https://download.pytorch.org/whl/nightly/cu126
# Cuda 12.8
pip install --pre torch torchcomms --index-url https://download.pytorch.org/whl/nightly/cu128
# Cuda 13.0
pip install --pre torch torchcomms --index-url https://download.pytorch.org/whl/nightly/cu130- CMake 3.22 or higher
- Ninja 1.10 or higher
Alternatively, you can build torchcomms from source. If you want to build the NCCLX backend, we recommend building it under a virtual conda environment. Run the following commands to build and install torchcomms:
# Create a conda environment
conda create -n torchcomms python=3.10
conda activate torchcomms
# Clone the repository
git clone git@github.com:meta-pytorch/torchcomms.git
cd torchcommsNo build needed - uses the library provided by PyTorch
If you want to install the third-party dependencies directly from conda, run the following command:
USE_SYSTEM_LIBS=1 ./build_ncclx.shIf you want to build and install the third-party dependencies from source, run the following command:
./build_ncclx.shInstall some prerequisites
conda install conda-forge::glog=0.4.0 conda-forge::gflags conda-forge::fmt -y
Environment variables to find rocm/rccl headers
export ROCM_HOME=/opt/rocm
export RCCL_INCLUDE=$ROCM_HOME/include/rccl
./build_rccl.shInstall some prerequisites
conda install conda-forge::glog=0.4.0 conda-forge::gflags conda-forge::fmt -y
Environment variables to find rocm/rcclx headers
export BUILD_DIR=${PWD}/comms/rcclx/develop/build/release/build
export ROCM_HOME=/opt/rocm
export RCCLX_INCLUDE=${BUILD_DIR}/include/rccl
export RCCLX_LIB=${BUILD_DIR}/lib
./build_rcclx.shTIP: Default builds both gfx942 and gfx950 and can take 1hr+. Narrow to your GPU:
MI300X/MI325X (gfx942):
./build_rcclx.sh --amdgpu_targets gfx942MI350X/MI355X (gfx950):
./build_rcclx.sh --amdgpu_targets gfx950Detect your arch if unsure:
rocminfo | grep -m1 gfxSource Intel oneAPI environment (update path to your oneAPI installation)
export INTEL_ONEAPI=/path/to/intel/oneapi # e.g., /opt/intel/oneapi or ~/intel/oneapi
source $INTEL_ONEAPI/compiler/latest/env/vars.sh
source $INTEL_ONEAPI/ccl/latest/env/vars.shEnable XCCL backend and install
export USE_XCCL=ON
export USE_NCCL=OFF
export USE_NCCLX=OFF
export USE_TRANSPORT=OFF
pip install --no-build-isolation -v .Set backend env vars before installing. For RCCLX-only:
export USE_NCCL=OFF
export USE_NCCLX=OFF
export USE_GLOO=OFF
export USE_RCCL=OFF
export USE_RCCLX=ON(See Build Configuration below for defaults and other mixes.)
# Install PyTorch (if not already installed)
pip install -r requirements.txt
pip install --no-build-isolation -v .
# Note: For installing torchcomms with RCCL or RCCLX backend, make sure to turn off the other backends:
USE_NCCL=OFF USE_NCCLX=OFF USE_GLOO=OFF USE_RCCL=OFF USE_RCCLX=ON USE_TRANSPORT=OFF pip install --no-build-isolation -v .You can customize the build by setting environment variables before running pip install:
# Enable/disable specific backends (ON/OFF or 1/0)
export USE_NCCL=ON # Default: ON
export USE_NCCLX=ON # Default: ON
export USE_GLOO=ON # Default: ON
export USE_RCCL=OFF # Default: OFF
export USE_RCCLX=OFF # Default: OFF
export USE_XCCL=OFF # Default: OFFThen run:
# Install PyTorch (if not already installed)
pip install -r requirements.txt
pip install --no-build-isolation -v .Here's a simple example demonstrating synchronous AllReduce communication across multiple GPUs:
#!/usr/bin/env python3
# example.py
import torch
from torchcomms import new_comm, ReduceOp
def main():
# Initialize TorchComm with device-specific backend
device = torch.device("<device>") # cuda, xpu, etc
torchcomm = new_comm("<backend>", device, name="main_comm") # nccl, ncclx, rccl, xccl, etc
# Get rank and world size
rank = torchcomm.get_rank()
world_size = torchcomm.get_size()
# Calculate device ID
num_devices = torch.<device>.device_count()
device_id = rank % num_devices
target_device = torch.device(f"<device>:{device_id}")
print(f"Rank {rank}/{world_size}: Running on device {device_id}")
# Create a tensor with rank-specific data
tensor = torch.full(
(1024,),
float(rank + 1),
dtype=torch.float32,
device=target_device
)
print(f"Rank {rank}: Before AllReduce: {tensor[0].item()}")
# Perform synchronous AllReduce (sum across all ranks)
torchcomm.all_reduce(tensor, ReduceOp.SUM, async_op=False)
# Synchronize device stream
torch.<device>.current_stream().synchronize()
print(f"Rank {rank}: After AllReduce: {tensor[0].item()}")
# Cleanup
torchcomm.finalize()
if __name__ == "__main__":
main()To run this example with multiple processes (one per GPU):
# Using torchrun (recommended)
torchrun --nproc_per_node=2 example.py
# Or using python -m torch.distributed.launch
python -m torch.distributed.launch --nproc_per_node=2 example.pyTo run this example with multiple nodes:
- Node 0
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 --rdzv-endpoint="<master-node>:<master-port>" example.py- Node 1
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 --rdzv-endpoint="<master-node>:<master-port>" example.pyIn the example above, we perform the following steps:
new_comm()creates a communicator with the specified backend- Each process gets its unique rank and total world size
- Each rank creates a tensor with rank-specific values
- All tensors are summed across all ranks
- Clean up communication resources
torchcomms also supports asynchronous operations for better performance.
Here is the same example as above, but with asynchronous AllReduce:
import torch
from torchcomms import new_comm, ReduceOp
# Use the correct device and backend for TorchComms initialization
device = torch.device("<device>") # cuda, xpu, etc
torchcomm = new_comm("<backend>", device, name="main_comm") # nccl, ncclx, rccl, xccl, etc
rank = torchcomm.get_rank()
device_id = rank % torch.<device>.device_count()
target_device = torch.device(f"<device>:{device_id}")
# Create tensor
tensor = torch.full((1024,), float(rank + 1), dtype=torch.float32, device=target_device)
# Start async AllReduce
work = torchcomm.all_reduce(tensor, ReduceOp.SUM, async_op=True)
# Do other work while communication happens
print(f"Rank {rank}: Doing other work while AllReduce is in progress...")
# Wait for completion
work.wait()
print(f"Rank {rank}: AllReduce completed")
torchcomm.finalize()See the CONTRIBUTING file for how to help out.
Source code is made available under a BSD 3 license, however you may have other legal obligations that govern your use of other content linked in this repository, such as the license or terms of service for third-party data and models.
torchcomms backends include third-party source code may be using other licenses. Please check the directory and relevant files to verify the license.
For convenience some of them are listed below: