
[Feature] Ray Support for Multinode vLLM Instances and Proxy Server #963

@hlyli

Description


Checklist

  • This feature will maintain backward compatibility with the current APIs in
    areal/api/. If not, please raise a refactor issue first.

Background

This issue proposes three additions to the RayScheduler.

  1. Refactor the Ray scheduler and define placement strategies via CLI args; these will be used by feature 2. The refactor explicitly defines distinct placement strategies, such as a shared placement group for training ranks and separate placement groups for rollout instances. One functional change: for training, we will switch to one placement group of n_gpu single-GPU bundles instead of one bundle of size n_gpu. This prevents hangs caused by scheduling issues.

  2. Currently, RayScheduler cannot support vLLM instances that span multiple nodes. Such a feature is desirable for training very large (>100B-parameter) models. To support this, the vLLM command must be launched on each node, and the instances discover each other upon initialization (Example).

  3. RayScheduler does not support the proxy server. We will add support for this.
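The bundle-layout change in feature 1 can be sketched as a small helper. This is a minimal illustration, not AReaL code; the function name training_bundles is hypothetical, and the resulting spec is the kind of list that ray.util.placement_group accepts.

```python
def training_bundles(n_gpus: int, per_gpu_bundles: bool = True) -> list:
    """Bundle spec for the training placement group.

    per_gpu_bundles=True gives the proposed layout: one placement group
    of n_gpus single-GPU bundles, which Ray can satisfy across nodes.
    False gives the old layout: a single bundle of size n_gpus, which
    can hang whenever no single node has n_gpus GPUs free.
    """
    if per_gpu_bundles:
        return [{"GPU": 1} for _ in range(n_gpus)]
    return [{"GPU": n_gpus}]


# The spec would then feed into Ray, roughly:
#   from ray.util.placement_group import placement_group
#   pg = placement_group(training_bundles(8), strategy="PACK")
```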

Potential Solution

  1. This is a simple refactor that adds a CLI arg to explicitly select the placement strategy: normally "shared" for training and "separate" for rollout, meaning one shared PG for training and separate PGs for rollout instances. A "deferred" PG strategy is also defined for multinode instances.
  2. Under a "deferred" PG, RayScheduler will launch a RayRPCServer requesting 0 resources. It will use Ray to launch a vLLMMultinodeLauncher on each node we wish to deploy vLLM on. Each vLLMMultinodeLauncher reserves the resources it needs through Ray and manages the vLLM instance on its node. One vLLMMultinodeLauncher launches the vLLM head while the others launch in headless mode; the figure attached below illustrates the design. In addition, to support cross-node data parallelism, we must create a separate pip package and hook into the vLLM EngineCore as an extension. The current implementation in areal_vllm_server.py cannot support multinode because the other data-parallel heads cannot be hooked into by this file.

     (figure: multinode launcher design)

  3. The simplest solution would likely be a separate Ray actor that launches the proxy server through Popen and communicates with it over HTTP.
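The strategy selection from point 1 could look roughly like the following. This is a sketch under assumed names (PlacementStrategy, parse_strategy are illustrative, not the actual AReaL CLI surface):

```python
from enum import Enum


class PlacementStrategy(str, Enum):
    # Hypothetical enum mirroring the strategies described above.
    SHARED = "shared"      # one PG shared by all training ranks
    SEPARATE = "separate"  # one PG per rollout instance
    DEFERRED = "deferred"  # PGs created later by per-node launchers


def parse_strategy(value: str) -> PlacementStrategy:
    """Validate a CLI-provided placement strategy string."""
    try:
        return PlacementStrategy(value)
    except ValueError:
        raise ValueError(f"unknown placement strategy: {value!r}")
```

With argparse, the same validation falls out of `choices=[s.value for s in PlacementStrategy]`.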

Additional Information

We currently have working implementations for features 1 and 2 and are preparing PRs.
These changes will require Ray 2.53.0, since overriding __ray_shutdown__ is needed to safely destroy vLLM instances.
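The proxy-server actor from point 3 of the potential solution could be sketched as below. This is illustrative only: the class would carry @ray.remote in the real design, and the name ProxyServerLauncher and its methods are assumptions, not the actual AReaL API.

```python
import subprocess
import urllib.request


class ProxyServerLauncher:
    """Wraps the proxy server process; all interaction is over HTTP."""

    def __init__(self, cmd):
        # Launch the proxy server as a child process via Popen.
        self.proc = subprocess.Popen(cmd)

    def healthy(self, url: str) -> bool:
        # Probe the proxy over HTTP; any connection error counts as down.
        try:
            with urllib.request.urlopen(url, timeout=1) as resp:
                return resp.status == 200
        except OSError:
            return False

    def shutdown(self) -> None:
        # Terminate the child process and wait for it to exit.
        self.proc.terminate()
        self.proc.wait(timeout=10)
```

Keeping the subprocess behind a dedicated actor lets Ray place it independently and tear it down with the rest of the job.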
