GPT-QModel v6.0.3
Notable Changes:
Quantization and inference
- Major ParoQuant improvements to quantization speed, inference, and accuracy.
- Added Paro inference support and a new layer optimizer.
- Auto-enables AMP for the fast Paro implementation to better match reference behavior.
- Added Paro rotation autotuning and fixed BF16 rotation support for the fused CUDA kernel.
- Improved Paro stability with seeding fixes, cleanup, learned channel scale clamping, and contiguous tensor handling fixes.
- Fixed a layer output replay/re-capture regression.
- Added FOEM (First-Order Error Matters) for more accurate quantized LLM compensation, plus follow-up fixes to its data processing pipeline.
- Replaced the old marlin_fp16 backend behavior with environment-flag control for FP32 reduction.
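The environment-flag control mentioned above can be illustrated with a generic sketch. The flag name `GPTQMODEL_MARLIN_FP32` below is hypothetical, chosen only for illustration; check the project docs for the actual variable.

```python
import os

def use_fp32_reduction() -> bool:
    # Hypothetical flag name, shown only to illustrate the pattern.
    # "1" (the default here) selects FP32 accumulation for accuracy;
    # "0" keeps FP16 reduction for speed.
    return os.environ.get("GPTQMODEL_MARLIN_FP32", "1") == "1"

# A kernel wrapper would pick its accumulation dtype from the flag:
reduce_dtype = "float32" if use_fp32_reduction() else "float16"
```

Compared with a dedicated `marlin_fp16` backend, a flag keeps one backend code path and makes the accuracy/speed trade-off a runtime choice.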
Model and backend support
- Added support for Gemma4, MiniCPMO, MiniCPMV, and GLM4-MoE-Lite.
- Added PrismAI/Bonsai model support for inference.
- Fixed Qwen3_5QModel definition issues.
- Fixed Qwen 3.5 rotary embedding behavior.
- Fixed AWQ layer grouping for qwen3_5_moe, llama4, qwen2_moe, and qwen3_next.
- Fixed awq_processor.dynamic so skipped layers are handled correctly.
- Improved dtype compatibility.
- Hugging Face kernels are now gated off on Python no-GIL builds until upstream wheel support is fixed.
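Detecting a no-GIL (free-threaded) build can be done from the standard library alone; the sketch below shows the general technique, not GPT-QModel's actual gating code, and the helper names are illustrative.

```python
import sysconfig

def is_free_threaded_build() -> bool:
    # Py_GIL_DISABLED is 1 on free-threaded ("no-GIL") CPython builds
    # (3.13+) and 0 or unset on regular builds.
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

def hf_kernels_enabled() -> bool:
    # Mirrors the release note: skip Hugging Face kernels on no-GIL
    # builds until upstream ships compatible wheels.
    return not is_free_threaded_build()
```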
Evaluation, calibration, and usability
- Integrated Evalution into the workflow.
- Added evalution.VLLM and evalution.SGLang backends.
- Fixed SGLang evaluation engine initialization.
- MODEL_COMPAT_FAST_LAYER_COUNT is now determined automatically.
- Improved calibration data device handling.
- Updated tokenizer handling, and collation now respects tokenizer padding_size.
- Improved import performance by lazy-loading _DEVICE_THREAD_POOL.
- Cleaned up warning behavior and added an option to suppress warnings.
- Removed forced random seed overrides.
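The lazy-loading change follows the standard deferred-initialization pattern: neither import nor construct the expensive object until first use. The helper below is an illustrative sketch, not GPT-QModel's actual implementation.

```python
_DEVICE_THREAD_POOL = None  # neither imported nor created at import time

def get_device_thread_pool():
    # Defer both the concurrent.futures import and pool construction
    # to first use, so importing the package itself stays fast.
    # Repeated calls return the same cached instance.
    global _DEVICE_THREAD_POOL
    if _DEVICE_THREAD_POOL is None:
        from concurrent.futures import ThreadPoolExecutor
        _DEVICE_THREAD_POOL = ThreadPoolExecutor(max_workers=4)
    return _DEVICE_THREAD_POOL
```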
Dependency and compatibility updates
- Updated pypcre to 0.2.14.
- Pinned logbar to >=0.4.1.
- Updated transformers and defuser package versions.
- Fixed SAVE_PATH handling and import path resolution issues.
Breaking and removed
- Removed GPTQModel.upload_to_hub().
- Removed MLX export support.
What's Changed
- [CI] fix pkgs' order & fix flashinfer version being overridden by @CSY-ModelCloud in #2575
- allow to disable warning by @CSY-ModelCloud in #2576
- lazy load _DEVICE_THREAD_POOL to speed up import by @CSY-ModelCloud in #2577
- remove disable env check by @CSY-ModelCloud in #2578
- [CI] no need to set MAX_JOBS by @CSY-ModelCloud in #2579
- Update pypcre version to 0.2.14 by @Qubitium in #2581
- Nothing to see here... by @Qubitium in #2456
- dtype compat by @Qubitium in #2582
- fix test_moe_config by @ZX-ModelCloud in #2583
- fix new format test by @ZX-ModelCloud in #2586
- [CI] add test config by @CSY-ModelCloud in #2587
- fix Qwen3_5QModel definition by @ZX-ModelCloud in #2588
- speed up paroquant quantization and resolve accuracy issues by @Qubitium in #2590
- append last commit to version by @CSY-ModelCloud in #2591
- speedup paroquant test by @ZX-ModelCloud in #2592
- [CI] generate release matrix from torch registry by @CSY-ModelCloud in #2593
- Evalution integration by @Qubitium in #2585
- move eval.sh to tests by @Qubitium in #2594
- remove warning by @Qubitium in #2595
- [CI] use new docker image by @CSY-ModelCloud in #2596
- [CI] install required pkg by @CSY-ModelCloud in #2597
- Automatically Determine MODEL_COMPAT_FAST_LAYER_COUNT by @ZX-ModelCloud in #2598
- [CI] no need to set MAX_JOBS by @CSY-ModelCloud in #2599
- Fix: Paroquant impl accuracy by @Qubitium in #2601
- remove forced random seed override in cls proper by @Qubitium in #2603
- Paro test by @Qubitium in #2604
- [FIX] incorrect SAVE_PATH by @ZX-ModelCloud in #2605
- pin logbar to >= 0.4.1 by @Qubitium in #2606
- Update the evalution scores by @ZX-ModelCloud in #2600
- Paro: auto enable amp for fast impl to sync with reference by @Qubitium in #2607
- paro: fix seeding and cleanup by @Qubitium in #2609
- gate hf kernel to non-nogil builds of python until upstream fixes wheels by @Qubitium in #2610
- [CI] use Ubuntu 24.04 docker image by @CSY-ModelCloud in #2612
- Fix layer output re-capture (replay) regression by @Qubitium in #2611
- remove legacy ppl codes by @Qubitium in #2613
- replace marlin_fp16 backend with env flag control for fp32 reduction … by @Qubitium in #2614
- [CI] default py 3.14t & install latest Evalution by @CSY-ModelCloud in #2616
- [CI] fix Evalution is private by @CSY-ModelCloud in #2617
- update tokenicer by @Qubitium in #2618
- make collate respect tokenizer padding_size by @Qubitium in #2620
- paro: clamp learned channel scales to avoid collapse by @Qubitium in #2622
- Calibration data device by @avtc in #2608
- [FIX] qwen3_5 rotary_embedding by @ZX-ModelCloud in #2624
- Temporarily disable gptqmodel spit_by feature by @ZX-ModelCloud in #2625
- use evalution.VLLM by @CSY-ModelCloud in #2615
- use evalution.SGLang by @ZX-ModelCloud in #2626
- paro: enter the dragon by @Qubitium in #2623
- [CI] use torch 2.11 by @CSY-ModelCloud in #2627
- [FIX] sglang evaluation engine initialization error. by @ZX-ModelCloud in #2629
- [MODEL] Add minicpmo support by @ZX-ModelCloud in #2630
- [CI] update CI path by @CSY-ModelCloud in #2633
- [FIX] qwen3_5_moe / llama4 / qwen2_moe / qwen3_next awq layer grouping by @ZX-ModelCloud in #2634
- Remove GPTQModel.upload_to_hub() api by @ZX-ModelCloud in #2635
- remove export to mlx option by @ZX-ModelCloud in #2636
- [MODEL] supports minicpmv by @ZX-ModelCloud in #2637
- Paro: layer optimizer by @Qubitium in #2628
- Paro inference by @Qubitium in #2638
- PrismAI/Bonsai Model Support (inference only) by @Qubitium in #2640
- Update README.md by @Qubitium in #2641
- Update transformers and defuser package versions by @Qubitium in #2642
- [CI] install gguf for test_local_model_paths by @CSY-ModelCloud in #2645
- fix imported path not found by @CSY-ModelCloud in #2646
- [MODEL] support glm4_moe_lite by @ZX-ModelCloud in #2644
- [FEATURE] Add FOEM: First-Order Error Matters; Accurate Compensation for Quantized LLM by @Xingyu-Zheng in #2639
- Revise README with latest news and article references by @Qubitium in #2647
- FIX paroquant bf16 rotation support for fused cuda kernel by @Qubitium in #2648
- paroquant rotation autotune by @Qubitium in #2649
- [FIX] In awq_processor, dynamic did not correctly skip layers. by @ZX-ModelCloud in #2650
- ruff fix by @Qubitium in #2651
- Ruff fix by @Qubitium in #2652
- update readme by @Qubitium in #2653
- fix: ensure contiguous tensors by @Qubitium in #2655
- fix failed test by @ZX-ModelCloud in #2654
- [CI] move complex sh logics into py by @CSY-ModelCloud in #2656
- ci: direct post-quantized eval by @Qubitium in #2657
- fix failed test by @ZX-ModelCloud in #2658
- fix problems in the FOEM data processing pipeline by @Xingyu-Zheng in #2659
- Fix typo in latest news section of README by @Qubitium in #2660
- fix sdist version by @Qubitium in #2661
- Update README.md by @Qubitium in #2662
- add gemma4 by @Qubitium in #2663
- fix cpu fallback device restore by @Qubitium in #2664
- Update README.md by @Qubitium in #2665
New Contributors
- @Xingyu-Zheng made their first contribution in #2639
Full Changelog: v5.8.0...v6.0.3