Description
Checklist
- This feature will maintain backward compatibility with the current APIs in `areal/api/`. If not, please raise a refactor issue first.
Background
This issue proposes adding three features to the `RayScheduler`.
1. Refactor `RayScheduler` and define placement strategies in the CLI args; these will be used by feature 2. The refactor explicitly defines different placement strategies, such as a shared placement group for training ranks and separate placement groups for rollout instances. One functional change: for training, we will switch to one placement group of `n_gpu` bundles instead of 1 bundle of size `n_gpu`. This will prevent hanging due to scheduling issues.
2. Currently, `RayScheduler` cannot support vLLM instances that span multiple nodes. This feature is desirable for training very large (>100B) models. To support it, the vLLM command must be launched from each node, and the instances discover each other upon initialization (see Example).
3. `RayScheduler` does not support the proxy server. We will add support for this.
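The bundle-layout change in feature 1 can be sketched as below. This is a hedged illustration, not the issue's final implementation: the helper names are invented here, and only the bundle shapes come from the issue text. With Ray, the resulting specs would be passed to `ray.util.placement_group`.

```python
# Sketch of the placement-group change from feature 1: one bundle per GPU
# instead of a single bundle of size n_gpu. Helper names are illustrative.

def make_training_bundles(n_gpu: int) -> list:
    """One shared placement group with n_gpu single-GPU bundles.

    Many small bundles can be scheduled across whichever nodes have free
    GPUs, whereas one bundle of size n_gpu must fit on a single node and
    can hang scheduling when no node has that many GPUs free.
    """
    return [{"GPU": 1, "CPU": 1} for _ in range(n_gpu)]


def make_rollout_bundles(n_instances: int, gpus_per_instance: int) -> list:
    """One bundle spec per rollout instance (separate placement groups)."""
    return [[{"GPU": 1, "CPU": 1}] * gpus_per_instance
            for _ in range(n_instances)]


# With Ray, roughly:
#   from ray.util.placement_group import placement_group
#   pg = placement_group(make_training_bundles(8), strategy="PACK")
#   ray.get(pg.ready())
```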
Potential Solution
1. This is a simple refactor that defines a CLI arg to explicitly set the placement strategy. This is normally "shared" for training and "separate" for rollout, meaning one shared placement group for training and separate placement groups for rollout instances. A "deferred" strategy is also defined for multi-node instances.
2. Under a "deferred" placement group, `RayScheduler` will launch a `RayRPCServer` requesting 0 resources. It will use Ray to launch a `vLLMMultinodeLauncher` for each node we wish to deploy vLLM on. Each `vLLMMultinodeLauncher` reserves the resources it needs through Ray and manages the vLLM instance on its node. One `vLLMMultinodeLauncher` will launch the vLLM head while the others launch headless. The figure attached below illustrates the design. In addition, to support cross-node data parallelism, we must create a separate pip package and hook into the vLLM `EngineCore` as an extension; the current implementation in `areal_vllm_server.py` cannot support multi-node because the other data-parallel heads cannot be hooked into by this file.
3. The simplest solution would likely be a separate Ray actor that launches the proxy server through `Popen` and communicates with it over HTTP.
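The head/headless split in solution 2 can be sketched as command construction per node. This is a sketch under assumptions: the flag names follow vLLM's multi-node data-parallel serving options (`--headless`, `--data-parallel-*`), but the exact flags and the launcher's interface are not specified in this issue, and the function name is invented here.

```python
# Sketch of the per-node command a vLLMMultinodeLauncher might build.
# The vLLM flags below are assumptions based on vLLM's data-parallel
# serving options, not this issue's final design.

def build_vllm_command(model: str, node_rank: int, num_nodes: int,
                       gpus_per_node: int, head_addr: str) -> list:
    cmd = [
        "vllm", "serve", model,
        "--data-parallel-size", str(num_nodes * gpus_per_node),
        "--data-parallel-size-local", str(gpus_per_node),
        "--data-parallel-address", head_addr,
        "--data-parallel-rpc-port", "13345",
    ]
    if node_rank == 0:
        # Head node: serves the API endpoint and coordinates the others.
        return cmd
    # Non-head nodes run headless and dial into the head on startup,
    # which is how the instances "discover each other upon initialization".
    return cmd + [
        "--headless",
        "--data-parallel-start-rank", str(node_rank * gpus_per_node),
    ]
```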
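Solution 3 can be sketched as a small actor wrapping the proxy process. This is a hedged sketch: the class name, health-check path, and port are illustrative assumptions; only "launch through `Popen` and communicate over HTTP" comes from the issue.

```python
# Sketch of a Ray actor that owns the proxy server subprocess (solution 3).
# Class name, /health path, and port are illustrative assumptions.
import subprocess
import urllib.request
from typing import Optional


class ProxyServerActor:  # would be decorated with @ray.remote(num_cpus=0)
    def __init__(self, port: int = 8080):
        self.port = port
        self.proc: Optional[subprocess.Popen] = None

    def start(self, cmd: list) -> None:
        # Launch the proxy via Popen; Ray ties its lifetime to this actor.
        self.proc = subprocess.Popen(cmd)

    def healthy(self) -> bool:
        # Process must be alive and the HTTP health endpoint reachable.
        if self.proc is None or self.proc.poll() is not None:
            return False
        try:
            url = "http://localhost:%d/health" % self.port
            with urllib.request.urlopen(url, timeout=2) as resp:
                return resp.status == 200
        except OSError:
            return False

    def stop(self) -> None:
        if self.proc is not None:
            self.proc.terminate()
            self.proc.wait(timeout=10)
```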
Additional Information
We currently have working implementations of features 1 and 2 and are preparing PRs.
These changes will require Ray 2.53.0, as overriding `__ray_shutdown__` is needed to safely destroy vLLM instances.