
feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos #9686

Merged
mudler merged 1 commit into mudler:master from richiejp:feat/sglang-conf-mtp on May 7, 2026

Conversation

richiejp (Collaborator) commented May 6, 2026

This allows all of SGLang's options to be used and adds some MTP models to the gallery.

I have only successfully tested MiMo-7B; Gemma 4 with MTP needs a new Transformers release before it will work.

Also added CUDA13 support in the process.



mudler previously approved these changes on May 6, 2026.
richiejp force-pushed the feat/sglang-conf-mtp branch 3 times, most recently from d22d1ef to 1ec0599 on May 7, 2026 06:21.
richiejp (Collaborator, Author) commented May 7, 2026

OK, seems like it's good to go

Bring the sglang Python backend up to feature parity with vllm by adding
the same engine_args:-map plumbing the vLLM backend already has. Any
ServerArgs field (~380 in sglang 0.5.11) becomes settable from a model
YAML, including the speculative-decoding flags needed for Multi-Token
Prediction. Validation matches the vllm backend's: keys are checked
against dataclasses.fields(ServerArgs), unknown keys raise ValueError
with a difflib close-match suggestion at LoadModel time, and the typed
ModelOptions fields keep their existing meaning with engine_args
overriding them.
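
For illustration, a minimal sketch of that validation path (the helper
name matches the PR; the exact signature, import path, and surrounding
wiring are assumptions, not the shipped code):

    import dataclasses
    import difflib

    from sglang.srt.server_args import ServerArgs

    def _apply_engine_args(server_args, engine_args):
        # Overlay user-supplied engine_args onto the typed ServerArgs;
        # unknown keys fail fast with a close-match hint, mirroring the
        # vllm backend's behaviour.
        valid = {f.name for f in dataclasses.fields(ServerArgs)}
        for key, value in engine_args.items():
            if key not in valid:
                hint = difflib.get_close_matches(key, valid, n=1)
                suggestion = f" (did you mean '{hint[0]}'?)" if hint else ""
                raise ValueError(f"unknown engine_args key '{key}'{suggestion}")
            # engine_args wins over the typed ModelOptions-derived value
            setattr(server_args, key, value)
        return server_args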

Backend code:
* backend/python/sglang/backend.py: add _apply_engine_args, import
  dataclasses/difflib/ServerArgs, call from LoadModel; rename Seed ->
  sampling_seed (sglang 0.5.11 renamed the SamplingParams field).
* backend/python/sglang/test.py + test.sh + Makefile: six unit tests
  exercising the helper directly (no engine load required).
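
A hypothetical test in the same spirit as those six, reusing the sketch
above (test names and assertions are illustrative, not the shipped
test.py):

    import unittest

    class TestApplyEngineArgs(unittest.TestCase):
        def test_unknown_key_suggests_close_match(self):
            args = ServerArgs(model_path="dummy")
            with self.assertRaises(ValueError) as ctx:
                # deliberate typo should surface the real field name
                _apply_engine_args(args, {"mem_fraction_statc": 0.7})
            self.assertIn("mem_fraction_static", str(ctx.exception))

        def test_engine_args_override_typed_fields(self):
            args = ServerArgs(model_path="dummy")
            _apply_engine_args(args, {"mem_fraction_static": 0.7})
            self.assertEqual(args.mem_fraction_static, 0.7)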

Build / CI / backend gallery (cuda13 + l4t13 paths are now first-class):
* backend/python/sglang/install.sh: add --prerelease=allow because
  sglang 0.5.11 hard-pins flash-attn-4 which only ships beta wheels;
  add --index-strategy=unsafe-best-match for cublas12 so the cu128
  torch index wins over default-PyPI's cu130; new pyproject.toml-driven
  l4t13 install path so [tool.uv.sources] can pin torch/torchvision/
  torchaudio/sglang to the jetson-ai-lab index without forcing every
  transitive PyPI dep through the L4T mirror's flaky proxy (mirrors the
  equivalent fix in backend/python/vllm/install.sh).
* backend/python/sglang/pyproject.toml (new): L4T project spec with
  explicit-source jetson-ai-lab index. Replaces requirements-l4t13.txt
  for the l4t13 BUILD_PROFILE; other profiles still go through the
  requirements-*.txt pipeline via libbackend.sh's installRequirements.
* backend/python/sglang/requirements-l4t13.txt: removed; superseded
  by pyproject.toml.
* backend/python/sglang/requirements-cublas{12,13}{,-after}.txt: pin
  sglang>=0.5.11 (Gemma 4 floor); add cu130 torch index for cublas13
  (new files) and cu128 torch index for cublas12 (PyPI now ships
  cu130 torch wheels by default, which breaks cu12 hosts).
* backend/index.yaml: add cuda13-sglang and cuda13-sglang-development
  capability mappings + image entries pointing at
  quay.io/.../-gpu-nvidia-cuda-13-sglang.
* .github/workflows/backend.yml: new cublas13 sglang matrix entry,
  mirroring vllm's cuda13 build.

Model gallery + docs:
* gallery/sglang.yaml: base sglang config template, mirrors vllm.yaml.
* gallery/sglang-gemma-4-{e2b,e4b}-mtp.yaml: Gemma 4 MTP demos
  transcribed verbatim from the SGLang Gemma 4 cookbook MTP commands.
* gallery/sglang-mimo-7b-mtp.yaml: MiMo-7B-RL with built-in MTP heads
  + online fp8 weight quantization, verified end-to-end on a 16 GB
  RTX 5070 Ti at ~88 tok/s. Uses mem_fraction_static: 0.7 because the
  MTP draft worker's vocab embedding is loaded unquantised and OOMs
  the static reservation at sglang's 0.85 default (see the sketch
  after this list).
* gallery/index.yaml: three new entries (gemma-4-e2b-it:sglang-mtp,
  gemma-4-e4b-it:sglang-mtp, mimo-7b-mtp:sglang).
* docs/content/features/text-generation.md: new SGLang section with
  setup, engine_args reference, MTP demos, version requirements.
* .agents/sglang-backend.md (new): agent one-pager covering the flat
  ServerArgs structure, the typed-vs-engine_args precedence, the
  speculative-decoding cheatsheet, and the mem_fraction_static gotcha
  documented above.
* AGENTS.md: index entry for the new agent doc.
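
To make the mem_fraction_static note concrete, this is roughly the
shape of the speculative-decoding knobs involved, written as the
engine_args mapping that a model YAML would feed into ServerArgs (the
values and the algorithm name are illustrative assumptions, not the
exact shipped gallery config):

    # Illustrative MTP-style engine_args; each key is a ServerArgs field.
    engine_args = {
        "speculative_algorithm": "EAGLE",   # MTP rides the speculative path
        "speculative_num_steps": 1,
        "speculative_num_draft_tokens": 2,
        "mem_fraction_static": 0.7,         # below the 0.85 default: the draft
                                            # worker's unquantised vocab embedding
                                            # would OOM the default reservation
    }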

Known limitation: the two Gemma 4 MTP gallery entries ship a recipe
that doesn't yet run on stock libraries. The drafter checkpoints
(google/gemma-4-{E2B,E4B}-it-assistant) declare
model_type: gemma4_assistant / Gemma4AssistantForCausalLM, which
neither transformers (<=5.6.0, including the SGLang cookbook's pinned
commit 91b1ab1f... and main HEAD) nor sglang's own model registry
(<=0.5.11) registers as of 2026-05-06. They will start working when
HF or sglang upstream registers the architecture -- no LocalAI
changes needed. The MiMo MTP demo and the non-MTP Gemma 4 paths work
today on this build (verified on RTX 5070 Ti, 16 GB).

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch] [WebSearch]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
richiejp force-pushed the feat/sglang-conf-mtp branch from 1ec0599 to 4d5d43a on May 7, 2026 08:17.
mudler merged commit c894d9c into mudler:master on May 7, 2026; 53 checks passed.
localai-bot added the enhancement (New feature or request) label on May 9, 2026.
Labels: enhancement (New feature or request), ready-to-merge
3 participants