diff --git a/.agents/sglang-backend.md b/.agents/sglang-backend.md new file mode 100644 index 000000000000..605927a3ef60 --- /dev/null +++ b/.agents/sglang-backend.md @@ -0,0 +1,62 @@ +# Working on the SGLang Backend + +The SGLang backend lives at `backend/python/sglang/backend.py` (async gRPC). It wraps SGLang's `Engine` (`sglang.srt.entrypoints.engine.Engine`) and translates LocalAI's gRPC `PredictOptions` into SGLang sampling params + outputs into `Reply.chat_deltas`. Structurally it mirrors `backend/python/vllm/backend.py` — keep them shaped the same so changes in one have an obvious analog in the other. + +## `engine_args` is the universal escape hatch + +A small fixed set of fields on `ModelOptions` is mapped to typed SGLang kwargs in `LoadModel` (model, quantization, load_format, gpu_memory_utilization → mem_fraction_static, trust_remote_code, enforce_eager → disable_cuda_graph, tensor_parallel_size → tp_size, max_model_len → context_length, dtype). **Everything else** flows through the `engine_args:` YAML map. + +Validation happens in `_apply_engine_args`. Keys are checked against `dataclasses.fields(ServerArgs)` (`sglang.srt.server_args.ServerArgs` is a flat `@dataclass` with ~380 fields). Unknown keys raise `ValueError` at LoadModel time with a `difflib.get_close_matches` suggestion — same shape as the vLLM backend. + +**Precedence:** typed `ModelOptions` fields populate `engine_kwargs` first, then `engine_args` overrides them. So a YAML that sets both `gpu_memory_utilization: 0.9` and `engine_args.mem_fraction_static: 0.5` ends up at `0.5`. Document this when answering "why didn't my YAML field stick?". + +**ServerArgs is flat.** Unlike vLLM, where speculative decoding is nested under `engine_args.speculative_config: {...}`, SGLang exposes flat top-level fields: `speculative_algorithm`, `speculative_draft_model_path`, `speculative_num_steps`, `speculative_eagle_topk`, `speculative_num_draft_tokens`, `speculative_dflash_block_size`, etc. There is no `speculative_config:` dict. Same goes for compilation, kv-transfer, attention — all flat. + +The canonical reference is `python/sglang/srt/server_args.py:ServerArgs` (line ~304). When SGLang adds new flags, no LocalAI code change is needed — they're automatically available via `engine_args:`. The validator picks them up because it introspects the live dataclass. + +## Speculative decoding cheatsheet + +`--speculative-algorithm` accepts `EAGLE`, `EAGLE3`, `NEXTN`, `STANDALONE`, `NGRAM`, `DFLASH`. `NEXTN` is silently rewritten to `EAGLE` in `ServerArgs.__post_init__` (`server_args.py:3286-3287`). MTP (Multi-Token Prediction) is the same EAGLE path with `num_steps=1, eagle_topk=1, num_draft_tokens=2` against a target whose architecture has multi-token heads (e.g. MiMo-7B-RL, DeepSeek-V3-MTP). + +| Algorithm | Drafter requirement | Gallery demo target | Gallery demo drafter | +|-----------|--------------------|---------------------|----------------------| +| `NEXTN` / `EAGLE` (MTP) | Assistant drafter or built-in heads | google/gemma-4-E2B-it, google/gemma-4-E4B-it | google/gemma-4-E2B-it-assistant, google/gemma-4-E4B-it-assistant | +| `EAGLE3` | EAGLE3 draft head | (no gallery entry yet) | e.g. jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B | +| `DFLASH` | Block-diffusion drafter | (no gallery entry yet) | e.g. 
z-lab/Qwen3-4B-DFlash-b16 | +| `STANDALONE` | Smaller LLM as drafter | (no gallery entry yet) | any smaller chat-tuned LLM in the same family | +| `NGRAM` | None — uses prefix history | (no gallery entry yet) | n/a | + +The Gemma 4 demos use `mem_fraction_static: 0.85` (cookbook default) and the cookbook's `num_steps=5, num_draft_tokens=6, eagle_topk=1` parameters. Other algorithms are reachable from any user YAML via `engine_args:` but don't have shipped demos yet — that's a deliberate gallery scope choice, not a backend limitation. + +Gemma 4 support requires sglang built from a commit that includes [PR #21952](https://github.com/sgl-project/sglang/pull/21952). LocalAI's pinned release for cublas12 / cublas13 includes it. The `l4t13` (JetPack 7 / sbsa cu130) build floors at `sglang>=0.5.0` because the `pypi.jetson-ai-lab.io` mirror still ships only `0.5.1.post2` as of 2026-05-06 — Gemma 4 / MTP recipes are therefore not available on l4t13 until that mirror catches up. `backend.py` keeps backward compat with the 0.5.x → 0.5.11 `SamplingParams.seed` → `sampling_seed` rename via runtime detection. + +Compatibility caveats per the SGLang docs: DFLASH and NGRAM are incompatible with `enable_dp_attention`; DFLASH requires `pp_size == 1`; STANDALONE is incompatible with `enable_dp_attention`; NGRAM is CUDA-only and disables the overlap scheduler. + +### `mem_fraction_static` + quantization + MTP on consumer GPUs + +When combining online weight quantization (`engine_args.quantization: fp8` / `awq` / etc.) with built-in-head MTP (`speculative_algorithm: EAGLE`/`NEXTN`) on a tight VRAM budget, sglang's default `mem_fraction_static: 0.85` will OOM during draft-worker init. The reason: sglang quantizes the **target** model's transformer blocks but loads the **MTP draft worker's vocab embedding** at the source dtype (typically bf16). For a 7 B-class model with a 150k-token vocab × 4096 hidden, that's another ~1.2 GiB allocated *after* the static pool is reserved. At 0.85 fraction on a 16 GB card there's no room left. + +Workaround: drop `mem_fraction_static` to ~0.7 so the post-static heap can absorb the MTP embedding alloc + CUDA graph private pools. Verified end-to-end on MiMo-7B-RL + fp8 + MTP on a 16 GB RTX 5070 Ti (`gallery/sglang-mimo-7b-mtp.yaml`) at ~88 tok/s. Models with larger vocabs or more MTP layers (e.g. DeepSeek-V3-MTP) need an even smaller fraction. + +This isn't documented anywhere upstream as of 2026-05-06 — the SGLang Gemma 4 cookbook uses 0.85 because their MTP path doesn't go through `eagle_worker_v2.py` for an embedding-bearing draft module. Don't blanket-apply 0.7 across all sglang YAMLs; only when MTP-with-built-in-heads + quantization combine. + +## Tool-call and reasoning parsers stay on `Options[]` + +ServerArgs has `tool_call_parser` and `reasoning_parser` fields, and the backend does pass them through to `Engine` so SGLang's own HTTP/OAI surface keeps working. But for the **LocalAI** request path the backend constructs fresh per-request parser instances in `_make_parsers` (`backend.py:286`) because the parsers are stateful — the streaming and non-streaming paths each need their own. + +So the user-facing knob stays on `Options[]`: + +```yaml +options: + - tool_parser:hermes + - reasoning_parser:deepseek_r1 +``` + +Putting these in `engine_args:` will set them on `ServerArgs` but the LocalAI-level streaming `ChatDelta` will not pick them up. Don't recommend that path. 
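+
+For reference, the per-request construction looks roughly like the sketch
+below. This is an illustrative sketch, **not** a copy of `_make_parsers`:
+the constructor signatures, keyword names, and the `ReasoningParser`
+import path are assumptions against the pinned sglang 0.5.x release —
+check them there before relying on them.
+
+```python
+from sglang.srt.function_call.function_call_parser import FunctionCallParser
+try:
+    from sglang.srt.parser.reasoning_parser import ReasoningParser
+except ImportError:  # older sglang releases keep it one level up
+    from sglang.srt.reasoning_parser import ReasoningParser
+
+def make_request_parsers(tools, tool_parser_name, reasoning_parser_name):
+    # parse_stream_chunk() accumulates partial text inside the parser
+    # object, so instances must never be shared across requests or
+    # between the streaming and non-streaming code paths.
+    fc = FunctionCallParser(tools, tool_parser_name) if tool_parser_name else None
+    rp = ReasoningParser(model_type=reasoning_parser_name) if reasoning_parser_name else None
+    return fc, rp
+```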
+ +## What's missing today (out of scope, but worth tracking) + +- `core/config/hooks_sglang.go` — there is no SGLang equivalent of `hooks_vllm.go`. The vLLM hook auto-selects parsers for known model families from `parser_defaults.json` and seeds production engine_args defaults. A symmetric hook for SGLang could reuse the same `parser_defaults.json` (the SGLang parser names are different but the family detection is shared) and seed defaults like `enable_metrics: true` or attention-backend choices. +- `core/gallery/importers/sglang.go` — vLLM has an importer that resolves model architecture → parser defaults at gallery-import time. A matching importer for SGLang would let `local-ai install` populate sensible parsers automatically. + +These should be a follow-up PR, not a blocker for the engine_args feature. diff --git a/.github/workflows/backend.yml b/.github/workflows/backend.yml index 2242be0f79ba..cffd1b76e48d 100644 --- a/.github/workflows/backend.yml +++ b/.github/workflows/backend.yml @@ -959,6 +959,19 @@ jobs: dockerfile: "./backend/Dockerfile.python" context: "./" ubuntu-version: '2404' + - build-type: 'cublas' + cuda-major-version: "13" + cuda-minor-version: "0" + platforms: 'linux/amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-nvidia-cuda-13-sglang' + runs-on: 'arc-runner-set' + base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "sglang" + dockerfile: "./backend/Dockerfile.python" + context: "./" + ubuntu-version: '2404' - build-type: 'cublas' cuda-major-version: "13" cuda-minor-version: "0" diff --git a/AGENTS.md b/AGENTS.md index 163304e526cd..da9d8d48ec50 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -24,6 +24,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants] | [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions | | [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing | | [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks | +| [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling | | [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI | | [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control | | [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends | diff --git a/backend/index.yaml b/backend/index.yaml index cf5fe657aa7b..6c8c57b3e674 100644 --- a/backend/index.yaml +++ b/backend/index.yaml @@ -287,6 +287,7 @@ amd: "rocm-sglang" intel: "intel-sglang" nvidia-cuda-12: "cuda12-sglang" + nvidia-cuda-13: "cuda13-sglang" nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sglang" cpu: "cpu-sglang" - &vllm-omni @@ -1965,6 +1966,7 @@ amd: "rocm-sglang-development" intel: "intel-sglang-development" nvidia-cuda-12: "cuda12-sglang-development" + nvidia-cuda-13: "cuda13-sglang-development" nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sglang-development" cpu: "cpu-sglang-development" - !!merge <<: *sglang @@ -1972,6 +1974,11 @@ uri: 
"quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-sglang" mirrors: - localai/localai-backends:latest-gpu-nvidia-cuda-12-sglang +- !!merge <<: *sglang + name: "cuda13-sglang" + uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-sglang" + mirrors: + - localai/localai-backends:latest-gpu-nvidia-cuda-13-sglang - !!merge <<: *sglang name: "cuda13-nvidia-l4t-arm64-sglang" uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-sglang" @@ -1997,6 +2004,11 @@ uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sglang" mirrors: - localai/localai-backends:master-gpu-nvidia-cuda-12-sglang +- !!merge <<: *sglang + name: "cuda13-sglang-development" + uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-sglang" + mirrors: + - localai/localai-backends:master-gpu-nvidia-cuda-13-sglang - !!merge <<: *sglang name: "cuda13-nvidia-l4t-arm64-sglang-development" uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-sglang" diff --git a/backend/python/sglang/Makefile b/backend/python/sglang/Makefile index e1933f41a2a0..1f214afeda9a 100644 --- a/backend/python/sglang/Makefile +++ b/backend/python/sglang/Makefile @@ -8,6 +8,12 @@ run: sglang bash run.sh @echo "sglang run." +.PHONY: test +test: sglang + @echo "Testing sglang..." + bash test.sh + @echo "sglang tested." + .PHONY: protogen-clean protogen-clean: $(RM) backend_pb2_grpc.py backend_pb2.py diff --git a/backend/python/sglang/backend.py b/backend/python/sglang/backend.py index 8def22a4cb35..8b48d23233dc 100644 --- a/backend/python/sglang/backend.py +++ b/backend/python/sglang/backend.py @@ -9,10 +9,18 @@ ReasoningParser so tool_calls and reasoning_content are emitted incrementally inside ChatDelta, which is a capability sglang exposes natively and vLLM does not. + +Like the vLLM backend, this one accepts an arbitrary ``engine_args:`` +map in the model YAML; keys are validated against ``ServerArgs`` fields +and forwarded to ``Engine(**kwargs)``. That covers speculative decoding +(EAGLE/EAGLE3/DFLASH/NGRAM/STANDALONE plus MTP via NEXTN), attention +backend selection, MoE knobs, hierarchical cache, and so on. """ import asyncio from concurrent import futures import argparse +import dataclasses +import difflib import signal import sys import os @@ -38,6 +46,7 @@ # are wrapped in try/except so older / leaner installs that omit them # still load the backend for plain text generation. from sglang.srt.entrypoints.engine import Engine +from sglang.srt.server_args import ServerArgs try: from sglang.srt.function_call.function_call_parser import FunctionCallParser @@ -66,6 +75,19 @@ HAS_TRANSFORMERS = False +# sglang 0.5.11 renamed SamplingParams.seed -> sampling_seed (PR #21952). +# Earlier 0.5.x releases (e.g. 0.5.1.post2 — the wheel still pinned by the +# pypi.jetson-ai-lab.io sbsa/cu130 mirror used by the l4t13 build profile) +# accept only `seed`. Detect the supported keyword once at import time so +# both versions work without a hard pin floor. 
+try: + import inspect as _inspect + from sglang.srt.sampling.sampling_params import SamplingParams as _SamplingParams + _SEED_KEY = "sampling_seed" if "sampling_seed" in _inspect.signature(_SamplingParams).parameters else "seed" +except Exception: + _SEED_KEY = "sampling_seed" + + _ONE_DAY_IN_SECONDS = 60 * 60 * 24 MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1')) @@ -82,6 +104,37 @@ def _parse_options(self, options_list) -> Dict[str, str]: opts[key.strip()] = value.strip() return opts + def _apply_engine_args(self, engine_kwargs: dict, engine_args_json: str) -> dict: + """Merge user-supplied engine_args (JSON object) into the kwargs dict + that will be forwarded to ``sglang.Engine`` (which constructs a + ``ServerArgs`` from them). + + Mirrors ``backend/python/vllm/backend.py::_apply_engine_args`` but + operates on the kwargs dict because sglang's ``Engine.__init__`` + accepts ``**kwargs`` directly rather than a pre-built dataclass. + Validation happens against ``ServerArgs`` fields so a typo fails + early with a close-match suggestion instead of producing a confusing + ``TypeError`` deep inside engine startup. + """ + if not engine_args_json: + return engine_kwargs + try: + extra = json.loads(engine_args_json) + except json.JSONDecodeError as e: + raise ValueError(f"engine_args is not valid JSON: {e}") from e + if not isinstance(extra, dict): + raise ValueError( + f"engine_args must be a JSON object, got {type(extra).__name__}" + ) + valid = {f.name for f in dataclasses.fields(ServerArgs)} + for key in extra: + if key not in valid: + suggestion = difflib.get_close_matches(key, valid, n=1) + hint = f" did you mean {suggestion[0]!r}?" if suggestion else "" + raise ValueError(f"unknown engine_args key {key!r}.{hint}") + engine_kwargs.update(extra) + return engine_kwargs + def _messages_to_dicts(self, messages) -> List[dict]: result: List[dict] = [] for msg in messages: @@ -137,6 +190,16 @@ async def LoadModel(self, request, context): if self.reasoning_parser_name: engine_kwargs["reasoning_parser"] = self.reasoning_parser_name + # engine_args from YAML overrides typed fields above so operators can + # tune anything ServerArgs exposes (speculative decoding, attention + # backend, MoE, hierarchical cache, …) without waiting on protobuf + # changes. + try: + engine_kwargs = self._apply_engine_args(engine_kwargs, request.EngineArgs) + except ValueError as err: + print(f"engine_args error: {err}", file=sys.stderr) + return backend_pb2.Result(success=False, message=str(err)) + try: self.llm = Engine(**engine_kwargs) except Exception as err: @@ -221,7 +284,7 @@ def _build_sampling_params(self, request) -> dict: "TopP": "top_p", "TopK": "top_k", "MinP": "min_p", - "Seed": "seed", + "Seed": _SEED_KEY, "StopPrompts": "stop", "StopTokenIds": "stop_token_ids", "IgnoreEOS": "ignore_eos", diff --git a/backend/python/sglang/install.sh b/backend/python/sglang/install.sh index f0acc08e64b7..d7108d85ff55 100755 --- a/backend/python/sglang/install.sh +++ b/backend/python/sglang/install.sh @@ -23,17 +23,32 @@ if [ "x${BUILD_PROFILE}" == "xcpu" ]; then EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match" fi +# cublas12 needs a cu128 torch index (see requirements-cublas12.txt) — without +# unsafe-best-match uv falls through to default PyPI's cu130 torch wheel and +# the resulting sgl-kernel can't load on our cu12 host libs. 
+if [ "x${BUILD_PROFILE}" == "xcublas12" ]; then + EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match" +fi + +# sglang 0.5.11 (Gemma 4 support) declares flash-attn-4 as a hard dep, but +# upstream only publishes pre-release wheels (4.0.0b*). uv rejects +# pre-releases by default — opt in for sglang specifically. Drop this once +# flash-attn-4 4.0 stable lands. +EXTRA_PIP_INSTALL_FLAGS+=" --prerelease=allow" + # JetPack 7 / L4T arm64 wheels are built for cp312 and shipped via # pypi.jetson-ai-lab.io. Bump the venv Python so the prebuilt sglang -# wheel resolves cleanly. unsafe-best-match is required because the -# jetson-ai-lab index lists transitive deps (e.g. decord) at older -# versions only — without it uv refuses to fall through to PyPI for a -# compatible wheel and resolution fails. +# wheel resolves cleanly. The actual install on l4t13 goes through +# pyproject.toml (see the elif branch below) so [tool.uv.sources] can +# pin only torch/torchvision/torchaudio/sglang to the jetson-ai-lab +# index — leaving PyPI as the path for transitive deps like +# markdown-it-py / anthropic / propcache that the L4T mirror's proxy +# 503s on. No --index-strategy flag here: the explicit index keeps the +# scoping clean. if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then PYTHON_VERSION="3.12" PYTHON_PATCH="12" PY_STANDALONE_TAG="20251120" - EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match" fi # sglang's CPU path has no prebuilt wheel on PyPI — upstream publishes @@ -95,6 +110,27 @@ if [ "x${BUILD_TYPE}" == "x" ] || [ "x${FROM_SOURCE:-}" == "xtrue" ]; then fi uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} . popd +# L4T arm64 (JetPack 7): drive the install through pyproject.toml so that +# [tool.uv.sources] can pin torch/torchvision/torchaudio/sglang to the +# jetson-ai-lab index, while everything else (transitive deps and +# PyPI-resolvable packages like transformers / accelerate) comes from +# PyPI. Bypasses installRequirements because uv pip install -r +# requirements.txt does not honor sources — see +# backend/python/sglang/pyproject.toml for the rationale. Mirrors the +# equivalent path in backend/python/vllm/install.sh. +elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then + ensureVenv + if [ "x${PORTABLE_PYTHON}" == "xtrue" ]; then + export C_INCLUDE_PATH="${C_INCLUDE_PATH:-}:$(_portable_dir)/include/python${PYTHON_VERSION}" + fi + pushd "${backend_dir}" + # Build deps first (matches installRequirements' requirements-install.txt + # pass — sglang/sgl-kernel sdists need packaging/setuptools-scm in the + # venv before they can build under --no-build-isolation). + uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements-install.txt + uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --requirement pyproject.toml + popd + runProtogen else installRequirements fi diff --git a/backend/python/sglang/pyproject.toml b/backend/python/sglang/pyproject.toml new file mode 100644 index 000000000000..9f061f2b8499 --- /dev/null +++ b/backend/python/sglang/pyproject.toml @@ -0,0 +1,68 @@ +# L4T arm64 (JetPack 7 / sbsa cu130) install spec for the sglang backend. +# +# Why this file exists, and why only the l4t13 BUILD_PROFILE consumes it: +# +# pypi.jetson-ai-lab.io hosts the L4T-specific torch / sglang / sgl-kernel +# wheels we need on aarch64 + cuda13, but it ALSO transparently proxies the +# rest of PyPI through `/+f//` URLs that 503 frequently. 
+# With `--extra-index-url` + `--index-strategy=unsafe-best-match` (the +# historical fix in install.sh) uv would pick those proxy URLs for ordinary +# PyPI packages — markdown-it-py, anthropic, propcache, etc. — and trip on +# the 503s. See e.g. CI run 25439791228 (markdown-it-py-4.0.0). +# +# `explicit = true` on the index makes uv consult the L4T mirror ONLY for +# packages mapped under [tool.uv.sources]. Everything else goes to PyPI. +# This breaks the historical 503 path without losing access to the L4T +# wheels we actually need from there. Mirrors the equivalent fix already +# in backend/python/vllm/pyproject.toml. +# +# `uv pip install -r requirements.txt` does NOT honor [tool.uv.sources] +# (sources are project-mode only, not pip-compat mode), so install.sh's +# l4t13 branch invokes `uv pip install --requirement pyproject.toml` +# directly. Other BUILD_PROFILEs continue to use the requirements-*.txt +# pipeline through libbackend.sh's installRequirements and never read +# this file. +[project] +name = "localai-sglang-l4t13" +version = "0.0.0" +requires-python = ">=3.12,<3.13" +dependencies = [ + # Mirror of requirements.txt — kept in sync manually for now since the + # l4t13 path bypasses installRequirements (see install.sh). + "grpcio==1.80.0", + "protobuf", + "certifi", + "setuptools", + "pillow", + # L4T-specific accelerator stack (sourced from jetson-ai-lab below). + "torch", + "torchvision", + "torchaudio", + # sglang on jetson — the [all] extra is deliberately omitted because it + # pulls outlines/decord, and decord has no aarch64 cp312 wheel anywhere + # (PyPI nor the jetson-ai-lab index ships only legacy cp35-cp37). With + # [all] uv backtracks through versions trying to satisfy decord and + # lands on sglang==0.1.16. The 0.5.0 floor matches the only major + # series the jetson-ai-lab sbsa/cu130 mirror currently publishes + # (sglang==0.5.1.post2 as of 2026-05-06). Bumping to >=0.5.11 here + # would make the build unsatisfiable until the mirror catches up. + # Gemma 4 / MTP recipes are therefore not supported on l4t13 — those + # features land on cublas12/cublas13 hosts that pull the newer wheel + # from PyPI. backend.py keeps backward compat with the 0.5.x SamplingParams + # field rename via runtime detection. + "sglang>=0.5.0", + # PyPI-resolvable packages that complete the runtime. + "accelerate", + "transformers", +] + +[[tool.uv.index]] +name = "jetson-ai-lab" +url = "https://pypi.jetson-ai-lab.io/sbsa/cu130" +explicit = true + +[tool.uv.sources] +torch = { index = "jetson-ai-lab" } +torchvision = { index = "jetson-ai-lab" } +torchaudio = { index = "jetson-ai-lab" } +sglang = { index = "jetson-ai-lab" } diff --git a/backend/python/sglang/requirements-cublas12-after.txt b/backend/python/sglang/requirements-cublas12-after.txt index 57203f125941..435075ecb5e7 100644 --- a/backend/python/sglang/requirements-cublas12-after.txt +++ b/backend/python/sglang/requirements-cublas12-after.txt @@ -1,3 +1,4 @@ # Bump this pin deliberately — sglang releases weekly and API surfaces # (FunctionCallParser, ReasoningParser) move between releases. -sglang[all]>=0.4.0 +# 0.5.11 is the floor for Gemma 4 support (PR sgl-project/sglang#21952). +sglang[all]>=0.5.11 diff --git a/backend/python/sglang/requirements-cublas12.txt b/backend/python/sglang/requirements-cublas12.txt index 6f94fc995203..edd9558209dd 100644 --- a/backend/python/sglang/requirements-cublas12.txt +++ b/backend/python/sglang/requirements-cublas12.txt @@ -1,5 +1,12 @@ +# sglang 0.5.11 hard-pins torch==2.9.1. 
PyPI's default torch 2.9.1 wheel is +# now the cu130 build, which drags in cu130-flavoured sgl-kernel/sglang-kernel +# binaries that need libnvrtc.so.13 — incompatible with our cu12 host libs. +# Pin the cu128 PyTorch index so uv pulls cu12-flavoured torch (and the +# matching sgl-kernel cu12 wheels). install.sh adds --index-strategy=unsafe-best-match +# for cublas12 so uv consults this index alongside PyPI. +--extra-index-url https://download.pytorch.org/whl/cu128 accelerate -torch==2.7.1 +torch==2.9.1 torchvision -torchaudio==2.7.1 +torchaudio transformers diff --git a/backend/python/sglang/requirements-cublas13-after.txt b/backend/python/sglang/requirements-cublas13-after.txt new file mode 100644 index 000000000000..435075ecb5e7 --- /dev/null +++ b/backend/python/sglang/requirements-cublas13-after.txt @@ -0,0 +1,4 @@ +# Bump this pin deliberately — sglang releases weekly and API surfaces +# (FunctionCallParser, ReasoningParser) move between releases. +# 0.5.11 is the floor for Gemma 4 support (PR sgl-project/sglang#21952). +sglang[all]>=0.5.11 diff --git a/backend/python/sglang/requirements-cublas13.txt b/backend/python/sglang/requirements-cublas13.txt new file mode 100644 index 000000000000..f2771656968c --- /dev/null +++ b/backend/python/sglang/requirements-cublas13.txt @@ -0,0 +1,6 @@ +--extra-index-url https://download.pytorch.org/whl/cu130 +accelerate +torch +torchvision +torchaudio +transformers diff --git a/backend/python/sglang/requirements-l4t13.txt b/backend/python/sglang/requirements-l4t13.txt deleted file mode 100644 index 81de3f13d342..000000000000 --- a/backend/python/sglang/requirements-l4t13.txt +++ /dev/null @@ -1,12 +0,0 @@ ---extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130 -accelerate -torch -torchvision -torchaudio -transformers -# Drop the [all] extra: it pulls outlines/decord, and decord has no -# aarch64 cp312 wheel anywhere (PyPI nor the jetson-ai-lab index ships -# only legacy cp35-cp37). With [all] uv backtracks through versions -# trying to satisfy decord and lands on sglang==0.1.16. Floor at 0.5.0 -# so uv can't silently downgrade if a future resolution misfires. -sglang>=0.5.0 diff --git a/backend/python/sglang/test.py b/backend/python/sglang/test.py new file mode 100644 index 000000000000..92688f4440a5 --- /dev/null +++ b/backend/python/sglang/test.py @@ -0,0 +1,101 @@ +"""Unit tests for the sglang backend. + +Helper-level tests run without launching the gRPC server or loading model +weights — they only exercise the pure-Python helpers on +``BackendServicer``. They do still require ``sglang`` to be importable +because ``_apply_engine_args`` validates keys against +``ServerArgs``'s dataclass fields. 
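+
+Run them with ``bash test.sh`` (wired to ``make test``), which sources the
+shared ``libbackend.sh`` and calls ``runUnittests``.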
+""" +import unittest + + +class TestSglangHelpers(unittest.TestCase): + """Tests for the pure helpers on BackendServicer (no gRPC, no engine).""" + + def _servicer(self): + import sys + import os + sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) + from backend import BackendServicer # noqa: E402 + return BackendServicer() + + def test_parse_options(self): + servicer = self._servicer() + opts = servicer._parse_options([ + "tool_parser:hermes", + "reasoning_parser:deepseek_r1", + "invalid_no_colon", + "key_with_colons:a:b:c", + ]) + self.assertEqual(opts["tool_parser"], "hermes") + self.assertEqual(opts["reasoning_parser"], "deepseek_r1") + self.assertEqual(opts["key_with_colons"], "a:b:c") + self.assertNotIn("invalid_no_colon", opts) + + def test_apply_engine_args_known_keys(self): + """User-supplied JSON merges into the kwargs dict; pre-set typed + fields stay put when not overridden.""" + import json as _json + servicer = self._servicer() + base = { + "model_path": "facebook/opt-125m", + "mem_fraction_static": 0.7, + } + extras = _json.dumps({ + "trust_remote_code": True, + "speculative_algorithm": "EAGLE", + "speculative_num_steps": 1, + }) + out = servicer._apply_engine_args(base, extras) + self.assertIs(out, base) # in-place merge — same dict back + self.assertTrue(out["trust_remote_code"]) + self.assertEqual(out["speculative_algorithm"], "EAGLE") + self.assertEqual(out["speculative_num_steps"], 1) + self.assertEqual(out["model_path"], "facebook/opt-125m") + self.assertEqual(out["mem_fraction_static"], 0.7) + + def test_apply_engine_args_engine_args_overrides_typed_fields(self): + """engine_args wins over previously-set typed kwargs (vLLM precedence).""" + import json as _json + servicer = self._servicer() + base = {"model_path": "facebook/opt-125m", "mem_fraction_static": 0.7} + out = servicer._apply_engine_args( + base, _json.dumps({"mem_fraction_static": 0.5}), + ) + self.assertEqual(out["mem_fraction_static"], 0.5) + + def test_apply_engine_args_unknown_key_raises(self): + """Typo'd key raises ValueError with a close-match suggestion.""" + import json as _json + servicer = self._servicer() + base = {"model_path": "facebook/opt-125m"} + with self.assertRaises(ValueError) as ctx: + servicer._apply_engine_args( + base, _json.dumps({"trust_remotecode": True}), + ) + msg = str(ctx.exception) + self.assertIn("trust_remotecode", msg) + self.assertIn("trust_remote_code", msg) + + def test_apply_engine_args_empty_passthrough(self): + """Empty / None engine_args returns the kwargs dict untouched.""" + servicer = self._servicer() + base = {"model_path": "facebook/opt-125m"} + self.assertIs(servicer._apply_engine_args(base, ""), base) + self.assertIs(servicer._apply_engine_args(base, None), base) + + def test_apply_engine_args_invalid_json_raises(self): + servicer = self._servicer() + with self.assertRaises(ValueError) as ctx: + servicer._apply_engine_args({}, "not-json") + self.assertIn("not valid JSON", str(ctx.exception)) + + def test_apply_engine_args_non_object_raises(self): + servicer = self._servicer() + with self.assertRaises(ValueError) as ctx: + servicer._apply_engine_args({}, "[1,2,3]") + self.assertIn("must be a JSON object", str(ctx.exception)) + + +if __name__ == "__main__": + unittest.main() diff --git a/backend/python/sglang/test.sh b/backend/python/sglang/test.sh new file mode 100755 index 000000000000..f31ae54e47dc --- /dev/null +++ b/backend/python/sglang/test.sh @@ -0,0 +1,12 @@ +#!/bin/bash +set -e + +backend_dir=$(dirname $0) + +if [ -d $backend_dir/common 
]; then + source $backend_dir/common/libbackend.sh +else + source $backend_dir/../common/libbackend.sh +fi + +runUnittests diff --git a/docs/content/features/text-generation.md b/docs/content/features/text-generation.md index b6e073b567bf..5ef0ba489d94 100644 --- a/docs/content/features/text-generation.md +++ b/docs/content/features/text-generation.md @@ -721,6 +721,121 @@ GPU nodes. See [vLLM Multi-Node (Data-Parallel)]({{% relref "features/distributed-mode#vllm-multi-node-data-parallel" %}}) for the head/follower configuration and a worked Kimi-K2.6 example. +### SGLang + +[SGLang](https://github.com/sgl-project/sglang) is a fast serving +framework for LLMs and VLMs with a focus on prefix caching, speculative +decoding, and multi-modal generation. LocalAI ships a gRPC backend that +wraps SGLang's async `Engine`, including its native function-call and +reasoning parsers. + +#### Setup + +```yaml +name: sglang +backend: sglang +parameters: + model: "Qwen/Qwen3-4B" +template: + use_tokenizer_template: true +``` + +The backend will pull the model from HuggingFace on first load. + +#### Passing arbitrary SGLang options with `engine_args` + +The same `engine_args:` map that the vLLM backend accepts is also +honoured by the SGLang backend. Keys are validated against +[`ServerArgs`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py) +— SGLang's central configuration dataclass — and forwarded verbatim to +`Engine(**kwargs)`. Unknown keys fail at load time with the closest +valid name as a hint. Unlike vLLM, `ServerArgs` is flat: speculative +decoding fields are top-level (`speculative_algorithm`, +`speculative_draft_model_path`, etc.) rather than nested under a +`speculative_config:` dict. + +The typed YAML fields shared with vLLM are mapped to their SGLang +equivalents (`gpu_memory_utilization` → `mem_fraction_static`, +`enforce_eager` → `disable_cuda_graph`, `tensor_parallel_size` → +`tp_size`, `max_model_len` → `context_length`). Anything else, +including all speculative-decoding flags, goes under `engine_args:`. + +##### Speculative decoding: Gemma 4 with Multi-Token Prediction + +Google publishes paired "assistant" drafters for every Gemma 4 size. +The drafters use Multi-Token Prediction (MTP) to propose several +candidate tokens per target step, which SGLang then verifies in +parallel. Flags below are transcribed verbatim from the +[SGLang Gemma 4 cookbook](https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#speculative-decoding-mtp-server-commands). + +For consumer GPUs in the 16–24 GB range, use **E4B** (8 B total / +4 B effective parameters): + +```yaml +name: gemma-4-e4b-mtp +backend: sglang +parameters: + model: google/gemma-4-E4B-it +context_size: 4096 +template: + use_tokenizer_template: true +options: + - tool_parser:gemma4 + - reasoning_parser:gemma4 +engine_args: + mem_fraction_static: 0.85 + speculative_algorithm: NEXTN + speculative_draft_model_path: google/gemma-4-E4B-it-assistant + speculative_num_steps: 5 + speculative_num_draft_tokens: 6 + speculative_eagle_topk: 1 +``` + +For smaller cards (8–12 GB), drop to **E2B** (5 B total / 2 B effective) +by swapping the model paths to `google/gemma-4-E2B-it` and +`google/gemma-4-E2B-it-assistant`; the rest of the flags stay the same. + +`NEXTN` is normalised to `EAGLE` inside `ServerArgs.__post_init__`, so +either value works — the cookbook uses `NEXTN`. 
`mem_fraction_static` +is the share of GPU memory SGLang reserves for the model + KV pool; +0.85 is the cookbook's default and adapts to whatever single GPU the +backend is running on. + +The 31 B dense and 26 B-A4B MoE Gemma 4 variants exist in the same +cookbook but require `--tp-size 2`, so they're not in the gallery as +single-GPU recipes. + +> **SGLang version requirement.** Gemma 4 support landed in SGLang via +> [PR #21952](https://github.com/sgl-project/sglang/pull/21952). The +> LocalAI sglang backend pins a release that includes it; if you've +> overridden the pin to an older version, this recipe will fail with a +> "model architecture not recognised" error at load time. + +##### Other speculative algorithms + +`speculative_algorithm:` also accepts `EAGLE`/`EAGLE3` (paired with an +EAGLE-style draft head), `DFLASH` (block-diffusion drafters from +[z-lab](https://huggingface.co/z-lab) for the Qwen3 family), `STANDALONE` +(a smaller draft LLM verifying a larger target), and `NGRAM` (no draft +model — pure prefix-history speculation). See SGLang's +[speculative-decoding docs](https://docs.sglang.io/advanced_features/speculative_decoding.html) +for the full algorithm matrix. + +#### Tool calling and reasoning parsers + +SGLang's native parsers stream `tool_calls` and `reasoning_content` +inside `ChatDelta` — the LocalAI Python backend wires them up +per-request rather than via `engine_args:`. Pick a parser by name: + +```yaml +options: + - tool_parser:hermes + - reasoning_parser:deepseek_r1 +``` + +The full list of registered parsers lives in `sglang.srt.function_call` +and `sglang.srt.parser.reasoning_parser`. + ### Transformers [Transformers](https://huggingface.co/docs/transformers/index) is a State-of-the-art Machine Learning library for PyTorch, TensorFlow, and JAX. diff --git a/gallery/index.yaml b/gallery/index.yaml index 4efbe5813304..9cbc9e0514a5 100644 --- a/gallery/index.yaml +++ b/gallery/index.yaml @@ -24108,6 +24108,73 @@ overrides: parameters: model: NousResearch/Hermes-3-Llama-3.1-405B +- &gemma4-sglang-mtp + name: "gemma-4-e2b-it:sglang-mtp" + url: "github:mudler/LocalAI/gallery/sglang-gemma-4-e2b-mtp.yaml@master" + icon: https://ai.google.dev/static/gemma/images/gemma3.png + license: gemma + urls: + - https://huggingface.co/google/gemma-4-E2B-it + - https://huggingface.co/google/gemma-4-E2B-it-assistant + - https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4 + description: | + Google Gemma 4 E2B-IT served by SGLang with Multi-Token Prediction + (MTP) speculative decoding. The companion drafter + google/gemma-4-E2B-it-assistant lets the target accept several + tokens per step. Flags are a 1:1 transcription of the SGLang + cookbook's MTP command (NEXTN algorithm, num_steps=5, + num_draft_tokens=6, eagle_topk=1, mem_fraction_static=0.85). The + E2B variant has 5B total / 2B effective parameters and targets the + smaller end of consumer GPUs. + tags: + - llm + - sglang + - gpu + - speculative-decoding + - mtp + - gemma + - gemma4 + - gemma-4 +- !!merge <<: *gemma4-sglang-mtp + name: "gemma-4-e4b-it:sglang-mtp" + url: "github:mudler/LocalAI/gallery/sglang-gemma-4-e4b-mtp.yaml@master" + urls: + - https://huggingface.co/google/gemma-4-E4B-it + - https://huggingface.co/google/gemma-4-E4B-it-assistant + - https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4 + description: | + Google Gemma 4 E4B-IT served by SGLang with Multi-Token Prediction + (MTP) speculative decoding. 
The companion drafter + google/gemma-4-E4B-it-assistant lets the target accept several + tokens per step. Flags are a 1:1 transcription of the SGLang + cookbook's MTP command (NEXTN algorithm, num_steps=5, + num_draft_tokens=6, eagle_topk=1, mem_fraction_static=0.85). The + E4B variant has 8B total / 4B effective parameters — the natural + pick for consumer GPUs in the 16–24 GB range. +- name: "mimo-7b-mtp:sglang" + url: "github:mudler/LocalAI/gallery/sglang-mimo-7b-mtp.yaml@master" + icon: https://github.com/XiaomiMiMo/MiMo/raw/main/figures/Xiaomi_MiMo.png + license: mit + urls: + - https://huggingface.co/XiaomiMiMo/MiMo-7B-RL + - https://github.com/XiaomiMiMo/MiMo + description: | + Xiaomi MiMo-7B-RL served by SGLang with built-in Multi-Token + Prediction (MTP) heads (no separate drafter needed) plus online fp8 + weight quantization to fit on a 16 GB consumer GPU. ~90% acceptance + per the model card. Verified end-to-end at ~88 tok/s on an RTX 5070 + Ti (16 GB). Note: mem_fraction_static is dropped to 0.7 (vs sglang's + 0.85 default) because the MTP draft worker's vocab embedding is + loaded unquantised (~1.2 GiB) and OOMs the static reservation + otherwise. + tags: + - llm + - sglang + - gpu + - speculative-decoding + - mtp + - reasoning + - fp8 - name: codellama-7b url: github:mudler/LocalAI/gallery/codellama.yaml@master urls: diff --git a/gallery/sglang-gemma-4-e2b-mtp.yaml b/gallery/sglang-gemma-4-e2b-mtp.yaml new file mode 100644 index 000000000000..fc74eef43ccc --- /dev/null +++ b/gallery/sglang-gemma-4-e2b-mtp.yaml @@ -0,0 +1,36 @@ +--- +name: "sglang-gemma-4-e2b-mtp" + +config_file: | + backend: sglang + parameters: + model: google/gemma-4-E2B-it + max_tokens: 4096 + context_size: 4096 + function: + disable_no_action: true + grammar: + disable: true + parallel_calls: true + expect_strings_after_json: true + template: + use_tokenizer_template: true + options: + - tool_parser:gemma4 + - reasoning_parser:gemma4 + # Gemma 4 E2B-it served by SGLang with Multi-Token Prediction (MTP). + # Flags transcribed verbatim from the SGLang cookbook: + # https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#speculative-decoding-mtp-server-commands + # NEXTN is normalised to EAGLE inside ServerArgs.__post_init__. + # mem_fraction_static=0.85 adapts to the available GPU; E2B is the + # smaller variant of the Gemma 4 lineup and the natural fit for + # consumer GPUs (notably 8–12 GB cards). Requires sglang built with + # PR #21952 (Gemma 4 model support); LocalAI's pinned release + # carries it. + engine_args: + mem_fraction_static: 0.85 + speculative_algorithm: NEXTN + speculative_draft_model_path: google/gemma-4-E2B-it-assistant + speculative_num_steps: 5 + speculative_num_draft_tokens: 6 + speculative_eagle_topk: 1 diff --git a/gallery/sglang-gemma-4-e4b-mtp.yaml b/gallery/sglang-gemma-4-e4b-mtp.yaml new file mode 100644 index 000000000000..4c5cecfc7590 --- /dev/null +++ b/gallery/sglang-gemma-4-e4b-mtp.yaml @@ -0,0 +1,36 @@ +--- +name: "sglang-gemma-4-e4b-mtp" + +config_file: | + backend: sglang + parameters: + model: google/gemma-4-E4B-it + max_tokens: 4096 + context_size: 4096 + function: + disable_no_action: true + grammar: + disable: true + parallel_calls: true + expect_strings_after_json: true + template: + use_tokenizer_template: true + options: + - tool_parser:gemma4 + - reasoning_parser:gemma4 + # Gemma 4 E4B-it served by SGLang with Multi-Token Prediction (MTP). 
+ # Flags transcribed verbatim from the SGLang cookbook: + # https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#speculative-decoding-mtp-server-commands + # NEXTN is normalised to EAGLE inside ServerArgs.__post_init__. + # mem_fraction_static=0.85 adapts to the available GPU; E4B is the + # mid-size variant (8B total / 4B effective parameters) and targets + # consumer GPUs in the 16–24 GB range. Requires sglang built with + # PR #21952 (Gemma 4 model support); LocalAI's pinned release + # carries it. + engine_args: + mem_fraction_static: 0.85 + speculative_algorithm: NEXTN + speculative_draft_model_path: google/gemma-4-E4B-it-assistant + speculative_num_steps: 5 + speculative_num_draft_tokens: 6 + speculative_eagle_topk: 1 diff --git a/gallery/sglang-mimo-7b-mtp.yaml b/gallery/sglang-mimo-7b-mtp.yaml new file mode 100644 index 000000000000..d120ee835cda --- /dev/null +++ b/gallery/sglang-mimo-7b-mtp.yaml @@ -0,0 +1,34 @@ +--- +name: "sglang-mimo-7b-mtp" + +config_file: | + backend: sglang + parameters: + model: XiaomiMiMo/MiMo-7B-RL + max_tokens: 4096 + context_size: 4096 + trust_remote_code: true + function: + disable_no_action: true + grammar: + disable: true + parallel_calls: true + expect_strings_after_json: true + template: + use_tokenizer_template: true + # Xiaomi MiMo-7B-RL with built-in Multi-Token Prediction (MTP) heads + # served via SGLang's EAGLE-aliased speculative-decoding path. ~90% + # acceptance rate per the model card. Quantised to fp8 at load time + # so the 7 B target fits on a 16 GB consumer GPU; mem_fraction_static + # is reduced from sglang's 0.85 default because the MTP draft worker + # loads its vocab embedding unquantised (bf16, ~1.2 GiB for MiMo's + # 152k vocab × 4096 hidden) and OOMs at 0.85. Verified end-to-end on + # an RTX 5070 Ti (16 GB) at ~88 tok/s. + engine_args: + dtype: bfloat16 + quantization: fp8 + mem_fraction_static: 0.7 + speculative_algorithm: EAGLE + speculative_num_steps: 1 + speculative_eagle_topk: 1 + speculative_num_draft_tokens: 2 diff --git a/gallery/sglang.yaml b/gallery/sglang.yaml new file mode 100644 index 000000000000..eb319cbd4980 --- /dev/null +++ b/gallery/sglang.yaml @@ -0,0 +1,43 @@ +--- +name: "sglang" + +config_file: | + backend: sglang + context_size: 8192 + parameters: + max_tokens: 8192 + function: + disable_no_action: true + grammar: + disable: true + parallel_calls: true + expect_strings_after_json: true + template: + use_tokenizer_template: true + # Uncomment to specify a quantization method (optional) + # quantization: "fp8" + # Uncomment to set dtype: "auto", "half", "float16", "bfloat16", "float", "float32" + # dtype: "bfloat16" + # Uncomment to limit static GPU memory (sglang's mem_fraction_static — analogous to vLLM gpu_memory_utilization) + # gpu_memory_utilization: 0.75 + # Uncomment to trust remote code from huggingface + # trust_remote_code: true + # Uncomment to disable CUDA graph capture (sglang's disable_cuda_graph) + # enforce_eager: true + # Uncomment to specify the maximum length of a sequence (sglang's context_length) + # max_model_len: 32768 + # Uncomment and specify the number of Tensor divisions + # tensor_parallel_size: 2 + # + # Anything ServerArgs exposes (~380 fields including speculative + # decoding, attention backend, MoE/EP, hierarchical cache, …) can be + # passed verbatim under engine_args:. See + # https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py + # for the canonical list. Unknown keys fail at load time with a + # close-match suggestion. 
+ # engine_args: + # speculative_algorithm: EAGLE3 + # speculative_draft_model_path: lmsys/SGLang-EAGLE3-Llama-3.1-8B-Instruct-SpecForge + # speculative_num_steps: 3 + # speculative_eagle_topk: 4 + # speculative_num_draft_tokens: 16