62 changes: 62 additions & 0 deletions .agents/sglang-backend.md
@@ -0,0 +1,62 @@
# Working on the SGLang Backend

The SGLang backend lives at `backend/python/sglang/backend.py` (async gRPC). It wraps SGLang's `Engine` (`sglang.srt.entrypoints.engine.Engine`), translating LocalAI's gRPC `PredictOptions` into SGLang sampling params and SGLang's outputs into `Reply.chat_deltas`. Structurally it mirrors `backend/python/vllm/backend.py` — keep them shaped the same so changes in one have an obvious analog in the other.

## `engine_args` is the universal escape hatch

A small fixed set of fields on `ModelOptions` is mapped to typed SGLang kwargs in `LoadModel` (model, quantization, load_format, gpu_memory_utilization → mem_fraction_static, trust_remote_code, enforce_eager → disable_cuda_graph, tensor_parallel_size → tp_size, max_model_len → context_length, dtype). **Everything else** flows through the `engine_args:` YAML map.
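
For example, a model YAML that mixes both (the model name and values here are illustrative, not a shipped gallery entry):

```yaml
name: my-sglang-model
backend: sglang
parameters:
  model: some-org/some-model   # hypothetical model id
gpu_memory_utilization: 0.85   # typed field, mapped to mem_fraction_static
trust_remote_code: true        # typed field
engine_args:                   # anything else ServerArgs exposes
  attention_backend: flashinfer
  enable_metrics: true
```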

Validation happens in `_apply_engine_args`. Keys are checked against `dataclasses.fields(ServerArgs)` (`sglang.srt.server_args.ServerArgs` is a flat `@dataclass` with ~380 fields). Unknown keys raise `ValueError` at LoadModel time with a `difflib.get_close_matches` suggestion — same shape as the vLLM backend.

**Precedence:** typed `ModelOptions` fields populate `engine_kwargs` first, then `engine_args` overrides them. So a YAML that sets both `gpu_memory_utilization: 0.9` and `engine_args.mem_fraction_static: 0.5` ends up at `0.5`. Document this when answering "why didn't my YAML field stick?".
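
Concretely:

```yaml
gpu_memory_utilization: 0.9   # typed field, written into engine_kwargs first
engine_args:
  mem_fraction_static: 0.5    # merged second, so the engine sees 0.5
```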

**ServerArgs is flat.** Unlike vLLM, where speculative decoding is nested under `engine_args.speculative_config: {...}`, SGLang exposes flat top-level fields: `speculative_algorithm`, `speculative_draft_model_path`, `speculative_num_steps`, `speculative_eagle_topk`, `speculative_num_draft_tokens`, `speculative_dflash_block_size`, etc. There is no `speculative_config:` dict. Same goes for compilation, kv-transfer, attention — all flat.
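
So a speculative recipe is spelled with flat keys, e.g. (the step/topk/token counts here are illustrative; the drafter is the EAGLE3 one from the cheatsheet below):

```yaml
engine_args:
  speculative_algorithm: EAGLE3
  speculative_draft_model_path: jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B
  speculative_num_steps: 5          # illustrative value
  speculative_eagle_topk: 4         # illustrative value
  speculative_num_draft_tokens: 8   # illustrative value
  # no nested speculative_config: {...} here; that key does not exist on ServerArgs
```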

The canonical reference is `python/sglang/srt/server_args.py:ServerArgs` (line ~304). When SGLang adds new flags, no LocalAI code change is needed — they're automatically available via `engine_args:`. The validator picks them up because it introspects the live dataclass.

## Speculative decoding cheatsheet

`--speculative-algorithm` accepts `EAGLE`, `EAGLE3`, `NEXTN`, `STANDALONE`, `NGRAM`, `DFLASH`. `NEXTN` is silently rewritten to `EAGLE` in `ServerArgs.__post_init__` (`server_args.py:3286-3287`). MTP (Multi-Token Prediction) is the same EAGLE path with `num_steps=1, eagle_topk=1, num_draft_tokens=2` against a target whose architecture has multi-token heads (e.g. MiMo-7B-RL, DeepSeek-V3-MTP).

| Algorithm | Drafter requirement | Gallery demo target | Gallery demo drafter |
|-----------|--------------------|---------------------|----------------------|
| `NEXTN` / `EAGLE` (MTP) | Assistant drafter or built-in heads | google/gemma-4-E2B-it, google/gemma-4-E4B-it | google/gemma-4-E2B-it-assistant, google/gemma-4-E4B-it-assistant |
| `EAGLE3` | EAGLE3 draft head | (no gallery entry yet) | e.g. jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B |
| `DFLASH` | Block-diffusion drafter | (no gallery entry yet) | e.g. z-lab/Qwen3-4B-DFlash-b16 |
| `STANDALONE` | Smaller LLM as drafter | (no gallery entry yet) | any smaller chat-tuned LLM in the same family |
| `NGRAM` | None — uses prefix history | (no gallery entry yet) | n/a |

The Gemma 4 demos use `mem_fraction_static: 0.85` (cookbook default) and the cookbook's `num_steps=5, num_draft_tokens=6, eagle_topk=1` parameters. Other algorithms are reachable from any user YAML via `engine_args:` but don't have shipped demos yet — that's a deliberate gallery scope choice, not a backend limitation.
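
Stitched together, the Gemma 4 demo shape is roughly this sketch (the shipped gallery YAML is authoritative; passing the assistant drafter via `speculative_draft_model_path` is an assumption here, not confirmed above):

```yaml
engine_args:
  speculative_algorithm: NEXTN    # silently rewritten to EAGLE in __post_init__
  speculative_draft_model_path: google/gemma-4-E2B-it-assistant   # assumed key for the drafter
  speculative_num_steps: 5
  speculative_eagle_topk: 1
  speculative_num_draft_tokens: 6
  mem_fraction_static: 0.85       # cookbook default
```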

Gemma 4 support requires sglang built from a commit that includes [PR #21952](https://github.com/sgl-project/sglang/pull/21952). LocalAI's pinned release for cublas12 / cublas13 includes it. The `l4t13` (JetPack 7 / sbsa cu130) build floors at `sglang>=0.5.0` because the `pypi.jetson-ai-lab.io` mirror still ships only `0.5.1.post2` as of 2026-05-06 — Gemma 4 / MTP recipes are therefore not available on l4t13 until that mirror catches up. `backend.py` keeps backward compat with the 0.5.x → 0.5.11 `SamplingParams.seed` → `sampling_seed` rename via runtime detection.

Compatibility caveats per the SGLang docs: DFLASH and NGRAM are incompatible with `enable_dp_attention`; DFLASH requires `pp_size == 1`; STANDALONE is incompatible with `enable_dp_attention`; NGRAM is CUDA-only and disables the overlap scheduler.

### `mem_fraction_static` + quantization + MTP on consumer GPUs

When combining online weight quantization (`engine_args.quantization: fp8` / `awq` / etc.) with built-in-head MTP (`speculative_algorithm: EAGLE`/`NEXTN`) on a tight VRAM budget, sglang's default `mem_fraction_static: 0.85` will OOM during draft-worker init. The reason: sglang quantizes the **target** model's transformer blocks but loads the **MTP draft worker's vocab embedding** at the source dtype (typically bf16). For a 7B-class model with a 150k-token vocab × 4096 hidden, that's another ~1.2 GiB allocated *after* the static pool is reserved. At 0.85 fraction on a 16 GB card there's no room left.

Workaround: drop `mem_fraction_static` to ~0.7 so the post-static heap can absorb the MTP embedding alloc + CUDA graph private pools. Verified end-to-end on MiMo-7B-RL + fp8 + MTP on a 16 GB RTX 5070 Ti (`gallery/sglang-mimo-7b-mtp.yaml`) at ~88 tok/s. Models with larger vocabs or more MTP layers (e.g. DeepSeek-V3-MTP) need an even smaller fraction.
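
As a sketch, the relevant slice of that YAML (see `gallery/sglang-mimo-7b-mtp.yaml` for the shipped version; the MTP numbers are the built-in-heads values from the cheatsheet above):

```yaml
engine_args:
  quantization: fp8
  speculative_algorithm: EAGLE      # MTP via the target's built-in multi-token heads
  speculative_num_steps: 1
  speculative_eagle_topk: 1
  speculative_num_draft_tokens: 2
  mem_fraction_static: 0.7          # not 0.85: leaves post-static headroom for the bf16 draft embedding
```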

This isn't documented anywhere upstream as of 2026-05-06 — the SGLang Gemma 4 cookbook uses 0.85 because their MTP path doesn't go through `eagle_worker_v2.py` for an embedding-bearing draft module. Don't blanket-apply 0.7 across all sglang YAMLs; only when MTP-with-built-in-heads + quantization combine.

## Tool-call and reasoning parsers stay on `Options[]`

ServerArgs has `tool_call_parser` and `reasoning_parser` fields, and the backend does pass them through to `Engine` so SGLang's own HTTP/OAI surface keeps working. But for the **LocalAI** request path the backend constructs fresh per-request parser instances in `_make_parsers` (`backend.py:286`) because the parsers are stateful — the streaming and non-streaming paths each need their own.

So the user-facing knob stays on `Options[]`:

```yaml
options:
- tool_parser:hermes
- reasoning_parser:deepseek_r1
```

Putting these in `engine_args:` will set them on `ServerArgs` but the LocalAI-level streaming `ChatDelta` will not pick them up. Don't recommend that path.
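
Side by side:

```yaml
# correct: per-request parsers wired up in _make_parsers
options:
  - tool_parser:hermes
  - reasoning_parser:deepseek_r1

# wrong for LocalAI: reaches ServerArgs, but the ChatDelta stream ignores it
engine_args:
  tool_call_parser: hermes
  reasoning_parser: deepseek_r1
```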

## What's missing today (out of scope, but worth tracking)

- `core/config/hooks_sglang.go` — there is no SGLang equivalent of `hooks_vllm.go`. The vLLM hook auto-selects parsers for known model families from `parser_defaults.json` and seeds production engine_args defaults. A symmetric hook for SGLang could reuse the same `parser_defaults.json` (the SGLang parser names are different but the family detection is shared) and seed defaults like `enable_metrics: true` or attention-backend choices.
- `core/gallery/importers/sglang.go` — vLLM has an importer that resolves model architecture → parser defaults at gallery-import time. A matching importer for SGLang would let `local-ai install` populate sensible parsers automatically.

These should be a follow-up PR, not a blocker for the engine_args feature.
13 changes: 13 additions & 0 deletions .github/workflows/backend.yml
@@ -959,6 +959,19 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-sglang'
runs-on: 'arc-runner-set'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "sglang"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
1 change: 1 addition & 0 deletions AGENTS.md
@@ -24,6 +24,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
| [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
| [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
| [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
| [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
| [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
| [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
| [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
12 changes: 12 additions & 0 deletions backend/index.yaml
@@ -161,7 +161,7 @@
capabilities:
nvidia: "cuda12-rfdetr"
intel: "intel-rfdetr"
#amd: "rocm-rfdetr"

Check warning on line 164 in backend/index.yaml

View workflow job for this annotation

GitHub Actions / Yamllint

164:6 [comments] missing starting space in comment
nvidia-l4t: "nvidia-l4t-arm64-rfdetr"
metal: "metal-rfdetr"
default: "cpu-rfdetr"
@@ -287,6 +287,7 @@
amd: "rocm-sglang"
intel: "intel-sglang"
nvidia-cuda-12: "cuda12-sglang"
nvidia-cuda-13: "cuda13-sglang"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sglang"
cpu: "cpu-sglang"
- &vllm-omni
@@ -1965,13 +1966,19 @@
amd: "rocm-sglang-development"
intel: "intel-sglang-development"
nvidia-cuda-12: "cuda12-sglang-development"
nvidia-cuda-13: "cuda13-sglang-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sglang-development"
cpu: "cpu-sglang-development"
- !!merge <<: *sglang
name: "cuda12-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-sglang"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-sglang
- !!merge <<: *sglang
name: "cuda13-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-sglang"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-sglang
- !!merge <<: *sglang
name: "cuda13-nvidia-l4t-arm64-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-sglang"
@@ -1997,6 +2004,11 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sglang"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-sglang
- !!merge <<: *sglang
name: "cuda13-sglang-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-sglang"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-sglang
- !!merge <<: *sglang
name: "cuda13-nvidia-l4t-arm64-sglang-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-sglang"
@@ -2072,7 +2084,7 @@
capabilities:
nvidia: "cuda12-rfdetr-development"
intel: "intel-rfdetr-development"
#amd: "rocm-rfdetr-development"

Check warning on line 2087 in backend/index.yaml

View workflow job for this annotation

GitHub Actions / Yamllint

2087:6 [comments] missing starting space in comment
nvidia-l4t: "nvidia-l4t-arm64-rfdetr-development"
metal: "metal-rfdetr-development"
default: "cpu-rfdetr-development"
6 changes: 6 additions & 0 deletions backend/python/sglang/Makefile
@@ -8,6 +8,12 @@ run: sglang
bash run.sh
@echo "sglang run."

.PHONY: test
test: sglang
@echo "Testing sglang..."
bash test.sh
@echo "sglang tested."

.PHONY: protogen-clean
protogen-clean:
$(RM) backend_pb2_grpc.py backend_pb2.py
65 changes: 64 additions & 1 deletion backend/python/sglang/backend.py
@@ -9,10 +9,18 @@
ReasoningParser so tool_calls and reasoning_content are emitted
incrementally inside ChatDelta, which is a capability sglang exposes
natively and vLLM does not.

Like the vLLM backend, this one accepts an arbitrary ``engine_args:``
map in the model YAML; keys are validated against ``ServerArgs`` fields
and forwarded to ``Engine(**kwargs)``. That covers speculative decoding
(EAGLE/EAGLE3/DFLASH/NGRAM/STANDALONE plus MTP via NEXTN), attention
backend selection, MoE knobs, hierarchical cache, and so on.
"""
import asyncio
from concurrent import futures
import argparse
import dataclasses
import difflib
import signal
import sys
import os
@@ -38,6 +46,7 @@
# are wrapped in try/except so older / leaner installs that omit them
# still load the backend for plain text generation.
from sglang.srt.entrypoints.engine import Engine
from sglang.srt.server_args import ServerArgs

try:
from sglang.srt.function_call.function_call_parser import FunctionCallParser
@@ -66,6 +75,19 @@
HAS_TRANSFORMERS = False


# sglang 0.5.11 renamed SamplingParams.seed -> sampling_seed (PR #21952).
# Earlier 0.5.x releases (e.g. 0.5.1.post2 — the wheel still pinned by the
# pypi.jetson-ai-lab.io sbsa/cu130 mirror used by the l4t13 build profile)
# accept only `seed`. Detect the supported keyword once at import time so
# both versions work without a hard pin floor.
try:
import inspect as _inspect
from sglang.srt.sampling.sampling_params import SamplingParams as _SamplingParams
_SEED_KEY = "sampling_seed" if "sampling_seed" in _inspect.signature(_SamplingParams).parameters else "seed"
except Exception:
_SEED_KEY = "sampling_seed"


_ONE_DAY_IN_SECONDS = 60 * 60 * 24
MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))

@@ -82,6 +104,37 @@ def _parse_options(self, options_list) -> Dict[str, str]:
opts[key.strip()] = value.strip()
return opts

def _apply_engine_args(self, engine_kwargs: dict, engine_args_json: str) -> dict:
"""Merge user-supplied engine_args (JSON object) into the kwargs dict
that will be forwarded to ``sglang.Engine`` (which constructs a
``ServerArgs`` from them).

Mirrors ``backend/python/vllm/backend.py::_apply_engine_args`` but
operates on the kwargs dict because sglang's ``Engine.__init__``
accepts ``**kwargs`` directly rather than a pre-built dataclass.
Validation happens against ``ServerArgs`` fields so a typo fails
early with a close-match suggestion instead of producing a confusing
``TypeError`` deep inside engine startup.
"""
if not engine_args_json:
return engine_kwargs
try:
extra = json.loads(engine_args_json)
except json.JSONDecodeError as e:
raise ValueError(f"engine_args is not valid JSON: {e}") from e
if not isinstance(extra, dict):
raise ValueError(
f"engine_args must be a JSON object, got {type(extra).__name__}"
)
valid = {f.name for f in dataclasses.fields(ServerArgs)}
for key in extra:
if key not in valid:
suggestion = difflib.get_close_matches(key, valid, n=1)
hint = f" did you mean {suggestion[0]!r}?" if suggestion else ""
raise ValueError(f"unknown engine_args key {key!r}.{hint}")
engine_kwargs.update(extra)
return engine_kwargs

def _messages_to_dicts(self, messages) -> List[dict]:
result: List[dict] = []
for msg in messages:
@@ -137,6 +190,16 @@ async def LoadModel(self, request, context):
if self.reasoning_parser_name:
engine_kwargs["reasoning_parser"] = self.reasoning_parser_name

# engine_args from YAML overrides typed fields above so operators can
# tune anything ServerArgs exposes (speculative decoding, attention
# backend, MoE, hierarchical cache, …) without waiting on protobuf
# changes.
try:
engine_kwargs = self._apply_engine_args(engine_kwargs, request.EngineArgs)
except ValueError as err:
print(f"engine_args error: {err}", file=sys.stderr)
return backend_pb2.Result(success=False, message=str(err))

try:
self.llm = Engine(**engine_kwargs)
except Exception as err:
@@ -221,7 +284,7 @@ def _build_sampling_params(self, request) -> dict:
"TopP": "top_p",
"TopK": "top_k",
"MinP": "min_p",
"Seed": "seed",
"Seed": _SEED_KEY,
"StopPrompts": "stop",
"StopTokenIds": "stop_token_ids",
"IgnoreEOS": "ignore_eos",
46 changes: 41 additions & 5 deletions backend/python/sglang/install.sh
@@ -23,17 +23,32 @@ if [ "x${BUILD_PROFILE}" == "xcpu" ]; then
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi

# cublas12 needs a cu128 torch index (see requirements-cublas12.txt) — without
# unsafe-best-match uv falls through to default PyPI's cu130 torch wheel and
# the resulting sgl-kernel can't load on our cu12 host libs.
if [ "x${BUILD_PROFILE}" == "xcublas12" ]; then
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi

# sglang 0.5.11 (Gemma 4 support) declares flash-attn-4 as a hard dep, but
# upstream only publishes pre-release wheels (4.0.0b*). uv rejects
# pre-releases by default — opt in for sglang specifically. Drop this once
# flash-attn-4 4.0 stable lands.
EXTRA_PIP_INSTALL_FLAGS+=" --prerelease=allow"

# JetPack 7 / L4T arm64 wheels are built for cp312 and shipped via
# pypi.jetson-ai-lab.io. Bump the venv Python so the prebuilt sglang
# wheel resolves cleanly. unsafe-best-match is required because the
# jetson-ai-lab index lists transitive deps (e.g. decord) at older
# versions only — without it uv refuses to fall through to PyPI for a
# compatible wheel and resolution fails.
# wheel resolves cleanly. The actual install on l4t13 goes through
# pyproject.toml (see the elif branch below) so [tool.uv.sources] can
# pin only torch/torchvision/torchaudio/sglang to the jetson-ai-lab
# index — leaving PyPI as the path for transitive deps like
# markdown-it-py / anthropic / propcache that the L4T mirror's proxy
# 503s on. No --index-strategy flag here: the explicit index keeps the
# scoping clean.
if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
PYTHON_VERSION="3.12"
PYTHON_PATCH="12"
PY_STANDALONE_TAG="20251120"
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi

# sglang's CPU path has no prebuilt wheel on PyPI — upstream publishes
@@ -95,6 +110,27 @@ if [ "x${BUILD_TYPE}" == "x" ] || [ "x${FROM_SOURCE:-}" == "xtrue" ]; then
fi
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} .
popd
# L4T arm64 (JetPack 7): drive the install through pyproject.toml so that
# [tool.uv.sources] can pin torch/torchvision/torchaudio/sglang to the
# jetson-ai-lab index, while everything else (transitive deps and
# PyPI-resolvable packages like transformers / accelerate) comes from
# PyPI. Bypasses installRequirements because uv pip install -r
# requirements.txt does not honor sources — see
# backend/python/sglang/pyproject.toml for the rationale. Mirrors the
# equivalent path in backend/python/vllm/install.sh.
elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
ensureVenv
if [ "x${PORTABLE_PYTHON}" == "xtrue" ]; then
export C_INCLUDE_PATH="${C_INCLUDE_PATH:-}:$(_portable_dir)/include/python${PYTHON_VERSION}"
fi
pushd "${backend_dir}"
# Build deps first (matches installRequirements' requirements-install.txt
# pass — sglang/sgl-kernel sdists need packaging/setuptools-scm in the
# venv before they can build under --no-build-isolation).
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements-install.txt
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --requirement pyproject.toml
popd
runProtogen
else
installRequirements
fi