Ray placement group refactor and preliminary architecture for multinode inference instances#966
Conversation
…code to support multinode inference instances.
/gemini review
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the handling of Ray placement groups within the scheduling system, addressing phase 1 of issue #963. It introduces a flexible, strategy-based architecture that allows for different approaches—shared, separate, and deferred—to manage how resources are allocated to workers. This foundational change is crucial for enabling more advanced deployment patterns, particularly for multinode inference instances, by centralizing and abstracting the complex logic of resource bundling and placement.

Highlights
Code Review
This pull request introduces a well-structured refactoring of Ray placement group handling by using different placement strategies. This is a great improvement for modularity and future extensions like multi-node inference.
I've found a few issues that need attention:
- There are a couple of critical bugs: one in resource specification that would prevent actors from being scheduled with the correct memory, and a syntax error in type hints.
- There are also a couple of medium-severity suggestions to improve maintainability and restore lost configurability for timeouts.
Overall, the direction is excellent, and with these fixes, the implementation will be solid.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Code Review
This pull request introduces a significant and well-executed refactoring of the Ray placement group logic. By abstracting placement strategies into separate classes, the code becomes more modular, readable, and extensible, which is an excellent foundation for supporting multi-node inference. The logic has been cleanly moved from RayScheduler to the new areal/infra/utils/ray_placement_group.py file. My review includes a few suggestions to address a bug in resource specification, a hardcoded timeout, and a minor type hint correction.
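To make the strategy split concrete, here is a minimal sketch of how shared versus separate bundle construction might look. The function names and the exact bundle schema are assumptions for illustration only; the real implementations live in areal/infra/utils/ray_placement_group.py and may differ. The dicts use Ray's standard resource keys (CPU, GPU, memory in bytes) and would ultimately be passed to ray.util.placement_group().

```python
# Hypothetical sketch of the shared vs. separate placement strategies
# described in the review; names and schema are illustrative, not the
# PR's actual code. Ray placement group bundles are plain dicts mapping
# resource names to quantities.

def shared_strategy_bundles(n_workers: int, cpu: int, gpu: int, mem_mb: int):
    """One placement group with one bundle per worker; all workers share it."""
    return [
        {"CPU": float(cpu), "GPU": float(gpu), "memory": mem_mb * 1024**2}
        for _ in range(n_workers)
    ]

def separate_strategy_bundles(cpu: int, gpu: int, mem_mb: int):
    """A single-bundle placement group, created once per worker."""
    return [{"CPU": float(cpu), "GPU": float(gpu), "memory": mem_mb * 1024**2}]
```

In the shared case the scheduler would call ray.util.placement_group(bundles) once for the whole group; in the separate case, once per worker; a deferred strategy would postpone the call until the worker's requirements are known.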
```python
try:
    actor.__ray_terminate__.remote()
except Exception:
    logger.warning(
        f"Could not destroy remote actor {actor}, force killing actor"
    )
    ray.kill(actor, no_restart=True)
```
While actor.destroy.remote() is the standard way to clean up actors, it can sometimes hang. The addition of a fallback to actor.__ray_terminate__.remote() before resorting to ray.kill() is a good improvement for robustness, as __ray_terminate__ provides a more graceful termination.
Suggested change:

```diff
-try:
-    actor.__ray_terminate__.remote()
-except Exception:
-    logger.warning(
-        f"Could not destroy remote actor {actor}, force killing actor"
-    )
-    ray.kill(actor, no_restart=True)
+except Exception:
+    try:
+        # Attempt a more graceful termination before force killing.
+        actor.__ray_terminate__.remote()
+    except Exception:
+        logger.warning(
+            f"Could not destroy remote actor {actor}, force killing actor"
+        )
+        ray.kill(actor, no_restart=True)
```
This is the same code?
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
/gemini review
Code Review
This pull request introduces a significant and well-designed refactoring of the Ray scheduler's placement group handling. By introducing a strategy pattern for placement groups (Shared, Separate, Deferred), the code becomes more modular and extensible, paving the way for future features like multi-node inference. The logic has been cleanly extracted into a new ray_placement_group.py utility file. My review focuses on a couple of minor improvements to enhance user feedback and code clarity. Specifically, I've suggested adding a warning when a user's configuration is silently changed and replacing a magic number with a named constant for better readability.
garrett4wade left a comment
⚠️ Additional Issue: TestUtilityFunctions Will Break (CRITICAL, Confidence 95)
tests/test_ray_scheduler.py:175 calls scheduler._create_bundle_list_gpu(1, 24, 1024), but this method was removed from RayScheduler in this PR (along with _bundle_spec, _actor_resource_spec, and _sum_resource_spec). This will raise AttributeError at runtime.
While the test file is skipped by default via pytestmark, it's still broken code.
Fix: Update TestUtilityFunctions to call the equivalent from the new module:
```python
from areal.infra.utils.ray_placement_group import _create_bundle_specs_split

bundle_list = _create_bundle_specs_split(16, 1, 24, 1024)
```

```python
        return self._placement_groups

    def actor_resources(
        self, spec: SchedulingSpec, gpu_multiplier=1
```
🔴 CRITICAL: GPU Multiplier Removed — Will Break Forked Workers (Confidence 90)
The old _actor_resource_spec() always applied a 0.9× GPU multiplier to leave headroom for forked workers (ref/proxy actors colocated via fork_workers()). This new SeparatedRayPlacementStrategy requests full 1.0 GPU (gpu_multiplier=1), consuming the entire bundle budget.
When fork_workers() later creates ref/proxy workers on the same placement group (each requesting 0.01 GPU via _create_forked_workers_internal), the total GPU demand becomes 1.0 + 0.01 = 1.01, which exceeds the 1.0 GPU in the bundle. Ray will fail to schedule the forked actor.
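The over-subscription the review describes can be checked with plain arithmetic (the 0.01-GPU figure for forked workers comes from the review itself):

```python
# Bundle GPU budget vs. total demand after the refactor, per the review.
BUNDLE_GPU = 1.0      # each bundle provisions one GPU
MAIN_WORKER = 1.0     # main worker now requests the full GPU (gpu_multiplier=1)
FORKED_WORKER = 0.01  # each colocated ref/proxy worker

demand = MAIN_WORKER + FORKED_WORKER  # 1.01 exceeds the 1.0 GPU budget
```

Under the old 0.9x multiplier, demand would have been 0.9 + 0.01 = 0.91, which fits.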
Additionally, the raise RuntimeError on line 203 blocks any caller from passing a multiplier at all.
Old behavior (all workers got 0.9×):

```python
if device == "GPU":
    res["num_gpus"] = float(gpu) * 0.9  # Leave room for forked workers
```

Suggested fix: Either apply MAIN_WORKER_GPU_FRAC_FOR_COLOCATION here too, or remove the raise and let callers choose:

```python
def actor_resources(self, spec, gpu_multiplier=MAIN_WORKER_GPU_FRAC_FOR_COLOCATION):
    # Remove the RuntimeError guard
    options = _actor_resource_spec(spec.cpu, spec.gpu * gpu_multiplier, spec.mem)
```
For this one, I will leave it as gpu_multiplier=1 and let the caller decide. I loosened the RuntimeError such that it will only raise if gpu_multiplier != 1 and spec.gpu > 1 as Ray will throw an error regardless if it tries to schedule a worker with a fractional GPU > 1.
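The loosened guard described in this reply might look like the following sketch. The function name and error message are assumptions; only the triggering condition (gpu_multiplier != 1 and spec.gpu > 1) comes from the reply. The rationale is that Ray refuses any single request for a fractional GPU amount greater than 1.

```python
# Sketch of the loosened RuntimeError guard; the shape is assumed,
# the condition is taken from the reply above.
def effective_gpus(gpu: int, gpu_multiplier: float = 1.0) -> float:
    if gpu_multiplier != 1 and gpu > 1:
        # e.g. 2 * 0.9 = 1.8: a fractional GPU request greater than 1,
        # which Ray will refuse to schedule.
        raise RuntimeError(
            "gpu_multiplier != 1 with gpu > 1 yields a fractional "
            "GPU request > 1, which Ray cannot schedule"
        )
    return float(gpu) * gpu_multiplier
```

With this guard, effective_gpus(1, 0.9) is allowed (0.9 GPU is a valid fractional request), while effective_gpus(2, 0.9) raises.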
Description
This PR is to address phase 1 of #963. We refactor the RayScheduler to have various distinct placement strategies. These will be a precursor to inference instances that span multiple nodes.
We extract many of the placement group functions from areal/infra/scheduler/ray.py and move them into areal/infra/utils/ray_placement_group.py, where much of the logic of defining placement groups and bundles now resides.

Related Issue
#963
Type of Change
Checklist
Breaking Change Details (if applicable):
Additional Context
Need help? Check the Contributing Guide or ask in GitHub Discussions!