
bugfix: init ssm_cache by config ssm_cache_type and fix dimension order of g & beta for qwen3.5 #1259

Merged
yingxudeng merged 8 commits into jd-opensource:main from JC-ut0:qwen3_5_ssmcache
Apr 16, 2026

Conversation

@JC-ut0
Contributor

@JC-ut0 JC-ut0 commented Apr 10, 2026

This PR makes the following changes to the xLLM inference engine:

  1. Add mamba_ssm_dtype config field — A new mamba_ssm_dtype string property is added to ModelArgs, allowing the SSM (State Space Model) cache dtype to be specified independently of the model's primary dtype.
  2. Use mamba_ssm_dtype for KV cache capacity estimation — In llm_engine.cpp, the estimate_kv_cache_capacity() function now uses the mamba_ssm_dtype-derived byte size when calculating the SSM slot size, instead of always using the model dtype size.
  3. Fix g/beta tensor layout in GDN gating — In qwen3_gated_delta_net_base.cpp, after calling fused_gdn_gating(), the g and beta tensors are permuted from [seq_len, batch, heads] to [batch, seq_len, heads] layout via .permute({1, 0, 2}).contiguous().
  4. Load mamba_ssm_dtype from model config — The qwen3_5.h model registration macro now loads mamba_ssm_dtype from the JSON config (with text_config.mamba_ssm_dtype fallback) using LOAD_ARG_TEXT_OR_ROOT.
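The slot-size fix (point 2) and the layout fix (point 3) can be sketched in plain Python. This is an illustrative sketch only: the actual xLLM implementation is C++, and the names below are hypothetical, not xLLM identifiers.

```python
# Point 2: derive the SSM slot size from the configured mamba_ssm_dtype
# rather than the model's primary dtype (illustrative mapping).
DTYPE_SIZE = {"float32": 4, "bfloat16": 2, "float16": 2}

def ssm_slot_size_bytes(state_elems: int, mamba_ssm_dtype: str, model_dtype: str) -> int:
    # Before the fix, the model dtype's byte size was always used; after the
    # fix, the configured SSM cache dtype wins, falling back to the model dtype.
    dtype = mamba_ssm_dtype or model_dtype
    return state_elems * DTYPE_SIZE[dtype]

# Point 3: reorder g/beta from [seq_len, batch, heads] to [batch, seq_len, heads].
def permute_102(t):
    # Pure-Python equivalent of tensor.permute({1, 0, 2}) on a nested list.
    seq_len, batch = len(t), len(t[0])
    return [[t[s][b] for s in range(seq_len)] for b in range(batch)]

g = [[[1, 2], [3, 4]],   # seq position 0: batch 0, batch 1
     [[5, 6], [7, 8]]]   # seq position 1: batch 0, batch 1

print(ssm_slot_size_bytes(8, "float32", "bfloat16"))  # 32, not 16
print(permute_102(g))  # [[[1, 2], [5, 6]], [[3, 4], [7, 8]]]
```

With a bfloat16 model but a float32 SSM cache, the old logic would have undersized each slot by half; the permute matters because downstream kernels index g and beta batch-first.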

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements the LLMEngine and WorkerImpl components, providing the core infrastructure for distributed inference, KV cache management, and model lifecycle operations. It adds support for advanced architectures like DeepSeek v3 and Qwen3, along with optimizations such as XTensor memory management and rolling weight loading. The review feedback suggests enhancing type safety by using enums instead of string literals for model arguments. Additionally, the implementation should be refined to adhere to the project's style guide, particularly concerning fixed-width integer usage, constant naming conventions, and the appropriate use of the auto keyword.
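The reviewer's enum-over-string suggestion could look roughly like this. A sketch with illustrative names, not the actual xLLM ModelArgs API:

```python
# Sketch only: SsmCacheDtype and parse_ssm_dtype are hypothetical names,
# not part of xLLM.
from enum import Enum

class SsmCacheDtype(Enum):
    FLOAT32 = "float32"
    BFLOAT16 = "bfloat16"
    FLOAT16 = "float16"

def parse_ssm_dtype(s: str) -> SsmCacheDtype:
    # A config typo fails loudly once at parse time, rather than being
    # silently mismatched against string literals scattered through the engine.
    return SsmCacheDtype(s)  # raises ValueError for unknown values
```

Validating at the config boundary keeps the rest of the engine comparing enum values instead of repeating raw string literals.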

Comment thread xllm/models/llm/qwen3_5.h Outdated
@JC-ut0 JC-ut0 force-pushed the qwen3_5_ssmcache branch from d76d019 to e42f91b on April 10, 2026 13:44
@JC-ut0 JC-ut0 force-pushed the qwen3_5_ssmcache branch from e42f91b to 25ef04b on April 10, 2026 13:47
@yingxudeng yingxudeng changed the title fix qwen3.5 ssm_cache dtype to fp32 feat: init ssm_cache by config ssm_cache_type and unify compute precision to fp32. Apr 11, 2026
Comment thread xllm/core/distributed_runtime/llm_engine.cpp Outdated
@XuZhang99 XuZhang99 changed the title feat: init ssm_cache by config ssm_cache_type and unify compute precision to fp32. bugfix: init ssm_cache by config ssm_cache_type and unify compute precision to fp32. Apr 11, 2026
Comment thread xllm/core/layers/npu_torch/qwen3_gated_delta_net_base.cpp Outdated
@JC-ut0
Contributor Author

JC-ut0 commented Apr 14, 2026

/gemini-review

@JC-ut0
Contributor Author

JC-ut0 commented Apr 14, 2026

/gemini-review

@JC-ut0 JC-ut0 changed the title bugfix: init ssm_cache by config ssm_cache_type and unify compute precision to fp32. bugfix: init ssm_cache by config ssm_cache_type and fix dimension order of g & beta Apr 14, 2026
JimHsiung
JimHsiung previously approved these changes Apr 14, 2026
yingxudeng
yingxudeng previously approved these changes Apr 14, 2026
Collaborator

@yingxudeng yingxudeng left a comment


Great catch.

@yingxudeng yingxudeng changed the title bugfix: init ssm_cache by config ssm_cache_type and fix dimension order of g & beta bugfix: init ssm_cache by config ssm_cache_type and fix dimension order of g & beta. Apr 14, 2026
XuZhang99
XuZhang99 previously approved these changes Apr 14, 2026
@JC-ut0 JC-ut0 changed the title bugfix: init ssm_cache by config ssm_cache_type and fix dimension order of g & beta. bugfix: init ssm_cache by config ssm_cache_type and fix dimension order of g & beta for qwen3.5 Apr 14, 2026
@yingxudeng yingxudeng changed the title bugfix: init ssm_cache by config ssm_cache_type and fix dimension order of g & beta for qwen3.5 bugfix: init ssm_cache by config ssm_cache_type and fix dimension order of g & beta for qwen3.5. Apr 14, 2026
@JC-ut0 JC-ut0 dismissed stale reviews from XuZhang99, yingxudeng, and JimHsiung via 9423f68 April 15, 2026 02:18
@yingxudeng
Collaborator

CUDA hasn't finished compiling yet, but many of the non-NPU checks have already passed. This PR has been queued for two days, so let's merge it first; the CUDA build should compile fine.

@yingxudeng yingxudeng merged commit 32412da into jd-opensource:main Apr 16, 2026
16 of 33 checks passed


7 participants