
Ray placement group refactor and preliminary architecture for multinode inference instances#966

Merged
garrett4wade merged 10 commits into inclusionAI:main from hlyli:ray-multinode-vllm
Mar 4, 2026

Conversation

@hlyli
Contributor

@hlyli hlyli commented Mar 2, 2026

Description

This PR addresses phase 1 of #963. We refactor the RayScheduler to support several distinct placement strategies, which will be a precursor to inference instances that span multiple nodes.

  1. Shared placements are similar to current training deployments where multiple training ranks share 1 placement group, each taking 1 bundle of the PG.
  2. Separate placements are for rollouts and will have 1 placement group per instance.
  3. Deferred placements are currently unused and will be for multinode instances.

We extract many of the placement group functions from areal/infra/scheduler/ray.py and move them into areal/infra/utils/ray_placement_group.py, where much of the logic of defining placement groups and bundles now resides.
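Concretely, the three strategies above can be sketched as a small strategy hierarchy. The class names follow this PR's changelog, but the method name and bundle-spec shape below are illustrative assumptions, not the actual areal API:

```python
from abc import ABC, abstractmethod


class RayPlacementStrategy(ABC):
    """Decides how bundles are grouped into placement groups (PGs)."""

    @abstractmethod
    def bundle_layout(self, n_ranks: int, gpus_per_rank: int) -> list[list[dict]]:
        """Return one inner list of bundle specs per placement group."""


class SharedRayPlacementStrategy(RayPlacementStrategy):
    # Training-style: all ranks share one PG, each taking one bundle.
    def bundle_layout(self, n_ranks, gpus_per_rank):
        return [[{"GPU": gpus_per_rank} for _ in range(n_ranks)]]


class SeparatedRayPlacementStrategy(RayPlacementStrategy):
    # Rollout-style: one PG per instance.
    def bundle_layout(self, n_ranks, gpus_per_rank):
        return [[{"GPU": gpus_per_rank}] for _ in range(n_ranks)]


class DeferredDeviceRayPlacementStrategy(RayPlacementStrategy):
    # Placeholder: PG creation is deferred until the multinode
    # instance topology is known.
    def bundle_layout(self, n_ranks, gpus_per_rank):
        raise NotImplementedError("reserved for multinode instances")
```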

Related Issue

#963

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality (I only updated existing tests)
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

Additional Context



@hlyli
Contributor Author

hlyli commented Mar 2, 2026

/gemini review

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the handling of Ray placement groups within the scheduling system, addressing phase 1 of issue #963. It introduces a flexible, strategy-based architecture that allows for different approaches—shared, separate, and deferred—to manage how resources are allocated to workers. This foundational change is crucial for enabling more advanced deployment patterns, particularly for multinode inference instances, by centralizing and abstracting the complex logic of resource bundling and placement.

Highlights

  • Ray Placement Strategy Refactoring: The core logic for managing Ray placement groups has been refactored into a new module, areal/infra/utils/ray_placement_group.py, introducing an abstract base class RayPlacementStrategy and concrete implementations for 'shared', 'separate', and 'deferred' strategies.
  • New ray_placement_strategy Field: A new ray_placement_strategy field has been added to SchedulingSpec, allowing users to specify how Ray resources should be allocated for different roles (e.g., training, rollout).
  • Rollout Controller Adaptation: The RolloutController has been updated to automatically enforce the 'separate' placement strategy for rollout jobs, ensuring each rollout instance receives its own dedicated placement group.
  • Enhanced Actor Cleanup: The RayScheduler's actor cleanup mechanism has been improved to include a more robust termination attempt using __ray_terminate__.remote() before resorting to ray.kill().
  • Support for Multinode Inference Architecture: The introduction of 'deferred' placement strategy and the overall refactoring lay the groundwork for future multinode inference instances by providing a flexible and extensible architecture for resource management.
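As a rough illustration of the new field, a minimal SchedulingSpec with choice validation might look like the following. The real dataclass in areal/api/cli_args.py has many more fields; this shape is an assumption for illustration only:

```python
from dataclasses import dataclass

_VALID_STRATEGIES = ("shared", "separate", "deferred")


@dataclass
class SchedulingSpec:
    # Only the new field is shown here.
    ray_placement_strategy: str = "shared"

    def __post_init__(self):
        if self.ray_placement_strategy not in _VALID_STRATEGIES:
            raise ValueError(
                f"ray_placement_strategy must be one of {_VALID_STRATEGIES}, "
                f"got {self.ray_placement_strategy!r}"
            )
```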


Changelog
  • areal/api/cli_args.py
    • Added ray_placement_strategy field to SchedulingSpec with choices 'shared', 'separate', and 'deferred'.
  • areal/infra/controller/rollout_controller.py
    • Modified initialize function to force ray_placement_strategy to 'separate' for rollout jobs if 'shared' is initially set.
  • areal/infra/scheduler/ray.py
    • Removed several internal helper methods related to bundle and actor resource specification, and placement group creation.
    • Imported new placement strategy classes (DeferredDeviceRayPlacementStrategy, RayPlacementStrategy, SeparatedRayPlacementStrategy, SharedRayPlacementStrategy) and ray_resource_type from areal.infra.utils.ray_placement_group.
    • Introduced _get_placement_strategy method to dynamically select the appropriate RayPlacementStrategy.
    • Updated _create_ray_workers to utilize the new RayPlacementStrategy for creating placement groups and actor resources.
    • Added a check in _create_forked_workers_internal to prevent colocation of multi-GPU workers.
    • Improved actor termination logic in _cleanup_workers by attempting __ray_terminate__.remote() before ray.kill().
  • areal/infra/utils/ray.py
    • Added create_resource_spec utility function to standardize resource dictionary creation.
  • areal/infra/utils/ray_placement_group.py
    • Added a new file defining an abstract base class RayPlacementStrategy and its concrete implementations: SharedRayPlacementStrategy, SeparatedRayPlacementStrategy, and DeferredDeviceRayPlacementStrategy.
    • Included helper functions ray_resource_type, _create_bundle_specs_split, _bundle_spec, _actor_resource_spec, and _create_placement_group.
  • tests/test_ray_scheduler.py
    • Updated test_create_delete_workers to explicitly set ray_placement_strategy="shared" for SchedulingSpec.
    • Removed assertions that verified separate placement groups for shared strategy, aligning with the new shared PG behavior.
    • Modified test_fork_creates_workers_on_same_placement_group to assert that forked workers share the same bundle index as their parent.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a well-structured refactoring of Ray placement group handling by using different placement strategies. This is a great improvement for modularity and future extensions like multi-node inference.

I've found a few issues that need attention:

  • There are a couple of critical bugs: one in resource specification that would prevent actors from being scheduled with the correct memory, and a syntax error in type hints.
  • There are also a couple of medium-severity suggestions to improve maintainability and restore lost configurability for timeouts.

Overall, the direction is excellent, and with these fixes, the implementation will be solid.

hlyli and others added 2 commits March 2, 2026 15:52
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant and well-executed refactoring of the Ray placement group logic. By abstracting placement strategies into separate classes, the code becomes more modular, readable, and extensible, which is an excellent foundation for supporting multi-node inference. The logic has been cleanly moved from RayScheduler to the new areal/infra/utils/ray_placement_group.py file. My review includes a few suggestions to address a bug in resource specification, a hardcoded timeout, and a minor type hint correction.

Comment on lines +590 to +596
try:
    actor.__ray_terminate__.remote()
except Exception:
    logger.warning(
        f"Could not destroy remote actor {actor}, force killing actor"
    )
    ray.kill(actor, no_restart=True)
Contributor


medium

While actor.destroy.remote() is the standard way to clean up actors, it can sometimes hang. The addition of a fallback to actor.__ray_terminate__.remote() before resorting to ray.kill() is a good improvement for robustness, as __ray_terminate__ provides a more graceful termination.

Suggested change

Before:

try:
    actor.__ray_terminate__.remote()
except Exception:
    logger.warning(
        f"Could not destroy remote actor {actor}, force killing actor"
    )
    ray.kill(actor, no_restart=True)

After:

except Exception:
    try:
        # Attempt a more graceful termination before force killing.
        actor.__ray_terminate__.remote()
    except Exception:
        logger.warning(
            f"Could not destroy remote actor {actor}, force killing actor"
        )
        ray.kill(actor, no_restart=True)

Contributor Author


This is the same code?

hlyli and others added 3 commits March 2, 2026 15:54
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@hlyli
Contributor Author

hlyli commented Mar 3, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-designed refactoring of the Ray scheduler's placement group handling. By introducing a strategy pattern for placement groups (Shared, Separate, Deferred), the code becomes more modular and extensible, paving the way for future features like multi-node inference. The logic has been cleanly extracted into a new ray_placement_group.py utility file. My review focuses on a couple of minor improvements to enhance user feedback and code clarity. Specifically, I've suggested adding a warning when a user's configuration is silently changed and replacing a magic number with a named constant for better readability.

@hlyli hlyli marked this pull request as ready for review March 3, 2026 00:14
Collaborator

@garrett4wade garrett4wade left a comment


⚠️ Additional Issue: TestUtilityFunctions Will Break (CRITICAL, Confidence 95)

tests/test_ray_scheduler.py:175 calls scheduler._create_bundle_list_gpu(1, 24, 1024), but this method was removed from RayScheduler in this PR (along with _bundle_spec, _actor_resource_spec, and _sum_resource_spec). This will raise AttributeError at runtime.

While the test file is skipped by default via pytestmark, it's still broken code.

Fix: Update TestUtilityFunctions to call the equivalent from the new module:

from areal.infra.utils.ray_placement_group import _create_bundle_specs_split
bundle_list = _create_bundle_specs_split(16, 1, 24, 1024)

return self._placement_groups

def actor_resources(
    self, spec: SchedulingSpec, gpu_multiplier=1
Collaborator


🔴 CRITICAL: GPU Multiplier Removed — Will Break Forked Workers (Confidence 90)

The old _actor_resource_spec() always applied a 0.9× GPU multiplier to leave headroom for forked workers (ref/proxy actors colocated via fork_workers()). The new SeparatedRayPlacementStrategy requests the full 1.0 GPU (gpu_multiplier=1), consuming the entire bundle budget.

When fork_workers() later creates ref/proxy workers on the same placement group (each requesting 0.01 GPU via _create_forked_workers_internal), the total GPU demand becomes 1.0 + 0.01 = 1.01, which exceeds the 1.0 GPU in the bundle. Ray will fail to schedule the forked actor.

Additionally, the raise RuntimeError on line 203 blocks any caller from passing a multiplier at all.

Old behavior (all workers got 0.9×):

if device == "GPU":
    res["num_gpus"] = float(gpu) * 0.9  # Leave room for forked workers

Suggested fix: Either apply MAIN_WORKER_GPU_FRAC_FOR_COLOCATION here too, or remove the raise and let callers choose:

def actor_resources(self, spec, gpu_multiplier=MAIN_WORKER_GPU_FRAC_FOR_COLOCATION):
    # Remove the RuntimeError guard
    options = _actor_resource_spec(spec.cpu, spec.gpu * gpu_multiplier, spec.mem)

Contributor Author


For this one, I will leave it as gpu_multiplier=1 and let the caller decide. I loosened the RuntimeError so that it only raises if gpu_multiplier != 1 and spec.gpu > 1, since Ray will throw an error regardless if it tries to schedule a worker with a fractional GPU request greater than 1.
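The loosened guard described in this reply can be sketched as a small helper; the function name and signature are illustrative, not the actual areal code:

```python
def check_gpu_multiplier(gpu: int, gpu_multiplier: float) -> float:
    """Ray rejects fractional GPU requests above 1, so only forbid a
    non-unit multiplier when the spec asks for more than one GPU."""
    if gpu_multiplier != 1 and gpu > 1:
        raise RuntimeError(
            "fractional GPU requests > 1 are not schedulable by Ray"
        )
    return gpu * gpu_multiplier
```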

Collaborator

@garrett4wade garrett4wade left a comment


LGTM!

@garrett4wade garrett4wade merged commit b79b3ac into inclusionAI:main Mar 4, 2026
5 checks passed