
[Feature] Ray Support for Multinode vLLM Instances and Proxy Server #963

@hlyli

Description


Checklist

  • This feature will maintain backward compatibility with the current APIs in
    areal/api/. If not, please raise a refactor issue first.

Background

This issue proposes three additions to the RayScheduler.

  1. Refactor the Ray scheduler and define placement strategies via CLI args; these will be used by feature 2. The refactor explicitly defines distinct placement strategies, such as a shared placement group for training ranks and separate placement groups for rollout instances. One functional change: for training, we will switch to one placement group of n_gpu single-GPU bundles instead of one bundle of size n_gpu. This prevents hangs caused by scheduling issues.

  2. Currently, RayScheduler cannot support vLLM instances that span multiple nodes. Such a feature is desirable for training very large (>100B-parameter) models. To support this, the vLLM command must be launched on each node, and the instances discover each other upon initialization (Example).

  3. RayScheduler does not support the proxy server. We will add support for this.
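The bundle-layout change in feature 1 can be sketched as a small helper. This is a minimal illustration, not AReaL code; the function name training_bundles is hypothetical, and the resulting spec is the kind of list that ray.util.placement_group accepts.

```python
def training_bundles(n_gpus: int, per_gpu_bundles: bool = True) -> list:
    """Bundle spec for the training placement group.

    per_gpu_bundles=True gives the proposed layout: one placement group
    of n_gpus single-GPU bundles, which Ray can satisfy across nodes.
    False gives the old layout: a single bundle of size n_gpus, which
    can hang whenever no single node has n_gpus GPUs free.
    """
    if per_gpu_bundles:
        return [{"GPU": 1} for _ in range(n_gpus)]
    return [{"GPU": n_gpus}]


# The spec would then feed into Ray, roughly:
#   from ray.util.placement_group import placement_group
#   pg = placement_group(training_bundles(8), strategy="PACK")
```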

Potential Solution

  1. This is a simple refactor that adds a CLI arg to explicitly select the placement strategy: normally "shared" for training and "separate" for rollout, meaning one shared PG for training and separate PGs for rollout instances. A "deferred" PG strategy is also defined for multinode instances.
  2. Under a "deferred" PG, RayScheduler will launch a RayRPCServer requesting 0 resources. It will use Ray to launch a vLLMMultinodeLauncher on each node we wish to deploy vLLM on. Each vLLMMultinodeLauncher reserves the resources it needs through Ray and manages the vLLM instance on its node. One vLLMMultinodeLauncher launches the vLLM head while the others launch in headless mode; the figure attached below illustrates the design. In addition, to support cross-node data parallelism, we must create a separate pip package and hook into the vLLM EngineCore as an extension. The current implementation in areal_vllm_server.py cannot support multinode because the other data-parallel heads cannot be hooked into by this file.

     (figure: multinode launcher design)

  3. The simplest solution would likely be a separate Ray actor that launches the proxy server through Popen and communicates with it over HTTP.
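The strategy selection from point 1 could look roughly like the following. This is a sketch under assumed names (PlacementStrategy, parse_strategy are illustrative, not the actual AReaL CLI surface):

```python
from enum import Enum


class PlacementStrategy(str, Enum):
    # Hypothetical enum mirroring the strategies described above.
    SHARED = "shared"      # one PG shared by all training ranks
    SEPARATE = "separate"  # one PG per rollout instance
    DEFERRED = "deferred"  # PGs created later by per-node launchers


def parse_strategy(value: str) -> PlacementStrategy:
    """Validate a CLI-provided placement strategy string."""
    try:
        return PlacementStrategy(value)
    except ValueError:
        raise ValueError(f"unknown placement strategy: {value!r}")
```

With argparse, the same validation falls out of `choices=[s.value for s in PlacementStrategy]`.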

Additional Information

We currently have working implementations for features 1 and 2 and are preparing PRs.
These changes will require Ray 2.53.0, since overriding __ray_shutdown__ is needed to safely destroy vLLM instances.
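The proxy-server actor from point 3 of the potential solution could be sketched as below. This is illustrative only: the class would carry @ray.remote in the real design, and the name ProxyServerLauncher and its methods are assumptions, not the actual AReaL API.

```python
import subprocess
import urllib.request


class ProxyServerLauncher:
    """Wraps the proxy server process; all interaction is over HTTP."""

    def __init__(self, cmd):
        # Launch the proxy server as a child process via Popen.
        self.proc = subprocess.Popen(cmd)

    def healthy(self, url: str) -> bool:
        # Probe the proxy over HTTP; any connection error counts as down.
        try:
            with urllib.request.urlopen(url, timeout=1) as resp:
                return resp.status == 200
        except OSError:
            return False

    def shutdown(self) -> None:
        # Terminate the child process and wait for it to exit.
        self.proc.terminate()
        self.proc.wait(timeout=10)
```

Keeping the subprocess behind a dedicated actor lets Ray place it independently and tear it down with the rest of the job.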
