[Feature] Add patch to accelerate SGLang weight loading #324
Merged
garrett4wade merged 35 commits into main on Oct 13, 2025
…zy/antcode/optimize-sglang-load
Collaborator
/gemini review
Contributor
Code Review
This pull request introduces an effective optimization for SGLang weight loading by applying a custom patch. The changes are well-structured, adding new configuration options in cli_args.py and the patching logic in launcher.py. The performance improvement from ~60s to ~30s for weight updates is significant.
My review focuses on the integration of the patch. I've identified a critical issue in the patching logic for editable installations that could cause it to fail, along with a couple of medium-severity suggestions to improve robustness and logging. Once these points are addressed, this PR will be a great addition to accelerate model loading.
garrett4wade approved these changes on Oct 13, 2025
What does this PR do?
This PR adds options that apply a patch to SGLang to accelerate its weight loading.
Option 1
enable_multithread_load: This is a native SGLang weight-loading acceleration that the original SGLang code does not apply when updating weights from disk. The patch in this PR fixes this issue. This option is available for all models.
Option 2
enable_fast_load: This option enables an optimized, custom weight-loading implementation that the patch in this PR adds to SGLang. It is faster than enable_multithread_load, but is only available for Qwen3 and Qwen3MoE models.
Why do we need this PR?
Disk weight loading is simpler and more flexible than NCCL weight loading. It has great advantages in supporting complex scenarios in the future, such as RL with elastic inference servers or heterogeneous hardware.
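To illustrate the general idea behind a multithreaded load such as enable_multithread_load, here is a minimal sketch that reads several checkpoint shards concurrently with a thread pool. It is not SGLang's implementation: load_shard, load_all_shards, demo, and the shard file names are hypothetical stand-ins, and a real loader would deserialize tensors rather than read raw bytes.

```python
from __future__ import annotations

import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_shard(path: Path) -> dict[str, bytes]:
    """Hypothetical stand-in for reading one checkpoint shard from disk."""
    return {path.stem: path.read_bytes()}


def load_all_shards(paths: list[Path], num_threads: int = 4) -> dict[str, bytes]:
    """Read shards concurrently and merge them into one name -> data mapping."""
    merged: dict[str, bytes] = {}
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map yields results in input order, so the merge is deterministic.
        for shard in pool.map(load_shard, paths):
            merged.update(shard)
    return merged


def demo() -> dict[str, bytes]:
    """Write a few dummy shard files (assumed layout) and load them in parallel."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(8):
            shard_path = Path(tmp) / f"shard_{i}.bin"
            shard_path.write_bytes(bytes([i]) * 16)
            paths.append(shard_path)
        return load_all_shards(paths)
```

Threads suit this kind of work because file reads are I/O-bound and release the GIL while waiting on disk; whether that carries over to a given checkpoint format is an assumption to verify.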
Example Usage
Add one of the following options in the YAML config or on the command line:
sglang.enable_multithread_load=true or sglang.enable_fast_load=true.
Performance and Correctness
On Qwen3-30B-A3B with allocation mode sglang:d4p1t4+megatron:(attn:d1p4t2c2|ffn:d1p4t2e2), this PR reduces weight-update time from ~60s to ~30s while maintaining correctness.
Performance under other conditions has not been tested yet.
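As a rough illustration of the dotted overrides shown under Example Usage, the sketch below maps strings such as sglang.enable_fast_load=true onto boolean fields. Only the two field names come from this PR; the SGLangLoadOptions class and the apply_overrides helper are hypothetical and do not reflect how cli_args.py actually parses its configuration.

```python
from dataclasses import dataclass, fields


@dataclass
class SGLangLoadOptions:
    # Field names mirror the options in this PR; the class itself is hypothetical.
    enable_multithread_load: bool = False
    enable_fast_load: bool = False


def apply_overrides(opts, overrides, prefix="sglang."):
    """Apply 'sglang.<name>=<bool>' strings to opts and return it."""
    valid = {f.name for f in fields(opts)}
    for item in overrides:
        key, sep, value = item.partition("=")
        if not sep or not key.startswith(prefix):
            continue  # not an override for this config section
        name = key[len(prefix):]
        if name not in valid:
            raise KeyError(f"unknown option: {key}")
        setattr(opts, name, value.strip().lower() == "true")
    return opts


opts = apply_overrides(SGLangLoadOptions(), ["sglang.enable_fast_load=true"])
```

Both flags default to off in this sketch, so a run that passes neither override would keep the unpatched loading path; that default is an assumption, not something the PR description states.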
Update
The patch has been upgraded to and tested on SGLang v0.5.2, and its performance matches the previous results on v0.4.9.post2.