GPT-QModel v6.0.3
Notable Changes:
Quantization and inference
- Major ParoQuant improvements to quantization speed, inference, and accuracy.
- Added Paro inference support and a new layer optimizer.
- Auto-enables AMP for the fast Paro implementation to better match reference behavior.
- Added Paro rotation autotuning and fixed BF16 rotation support for the fused CUDA kernel.
- Improved Paro stability with seeding fixes, cleanup, learned channel scale clamping, and contiguous tensor handling fixes.
- Fixed a layer output replay/re-capture regression.
- Added FOEM (First-Order Error Matters) for more accurate quantized LLM compensation, plus follow-up fixes to its data processing pipeline.
- Replaced the old marlin_fp16 backend behavior with environment-flag control for FP32 reduction.
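The environment-flag control mentioned above can be illustrated with a generic sketch. The flag name `GPTQMODEL_MARLIN_FP32` below is hypothetical, chosen only for illustration; check the project docs for the actual variable.

```python
import os

def use_fp32_reduction() -> bool:
    # Hypothetical flag name, shown only to illustrate the pattern.
    # "1" (the default here) selects FP32 accumulation for accuracy;
    # "0" keeps FP16 reduction for speed.
    return os.environ.get("GPTQMODEL_MARLIN_FP32", "1") == "1"

# A kernel wrapper would pick its accumulation dtype from the flag:
reduce_dtype = "float32" if use_fp32_reduction() else "float16"
```

Compared with a dedicated `marlin_fp16` backend, a flag keeps one backend code path and makes the accuracy/speed trade-off a runtime choice.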
Model and backend support
- Added support for Gemma4, MiniCPMO, MiniCPMV, and GLM4-MoE-Lite.
- Added PrismAI/Bonsai model support for inference.
- Fixed Qwen3_5QModel definition issues.
- Fixed Qwen 3.5 rotary embedding behavior.
- Fixed AWQ layer grouping for qwen3_5_moe, llama4, qwen2_moe, and qwen3_next.
- Fixed awq_processor.dynamic so skipped layers are handled correctly.
- Improved dtype compatibility.
- Hugging Face kernels are now gated off on Python no-GIL builds until upstream wheel support is fixed.
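Detecting a no-GIL (free-threaded) build can be done from the standard library alone; the sketch below shows the general technique, not GPT-QModel's actual gating code, and the helper names are illustrative.

```python
import sysconfig

def is_free_threaded_build() -> bool:
    # Py_GIL_DISABLED is 1 on free-threaded ("no-GIL") CPython builds
    # (3.13+) and 0 or unset on regular builds.
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

def hf_kernels_enabled() -> bool:
    # Mirrors the release note: skip Hugging Face kernels on no-GIL
    # builds until upstream ships compatible wheels.
    return not is_free_threaded_build()
```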
Evaluation, calibration, and usability
- Integrated Evalution into the workflow.
- Added evalution.VLLM and evalution.SGLang backends.
- Fixed SGLang evaluation engine initialization.
- MODEL_COMPAT_FAST_LAYER_COUNT is now determined automatically.
- Improved calibration data device handling.
- Updated tokenizer handling, and collation now respects tokenizer padding_size.
- Improved import performance by lazy-loading _DEVICE_THREAD_POOL.
- Cleaned up warning behavior and added an option to suppress warnings.
- Removed forced random seed overrides.
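The lazy-loading change follows the standard deferred-initialization pattern: neither import nor construct the expensive object until first use. The helper below is an illustrative sketch, not GPT-QModel's actual implementation.

```python
_DEVICE_THREAD_POOL = None  # neither imported nor created at import time

def get_device_thread_pool():
    # Defer both the concurrent.futures import and pool construction
    # to first use, so importing the package itself stays fast.
    # Repeated calls return the same cached instance.
    global _DEVICE_THREAD_POOL
    if _DEVICE_THREAD_POOL is None:
        from concurrent.futures import ThreadPoolExecutor
        _DEVICE_THREAD_POOL = ThreadPoolExecutor(max_workers=4)
    return _DEVICE_THREAD_POOL
```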
Dependency and compatibility updates
- Updated pypcre to 0.2.14.
- Pinned logbar to >=0.4.1.
- Updated transformers and defuser package versions.
- Fixed SAVE_PATH handling and import path resolution issues.
Breaking and removed
- Removed GPTQModel.upload_to_hub().
- Removed MLX export support.
What's Changed
- [CI] fix pkgs' order & fix flashinfer version being overridden by @CSY-ModelCloud in #2575
- allow to disable warning by @CSY-ModelCloud in #2576
- lazy load _DEVICE_THREAD_POOL to speed up import by @CSY-ModelCloud in #2577
- remove disable env check by @CSY-ModelCloud in #2578
- [CI] no need to set MAX_JOBS by @CSY-ModelCloud in #2579
- Update pypcre version to 0.2.14 by @Qubitium in #2581
- Nothing to see here... by @Qubitium in #2456
- dtype compat by @Qubitium in #2582
- fix test_moe_config by @ZX-ModelCloud in #2583
- fix new format test by @ZX-ModelCloud in #2586
- [CI] add test config by @CSY-ModelCloud in #2587
- fix Qwen3_5QModel definition by @ZX-ModelCloud in #2588
- speed up paroquant quantization and resolve accuracy issues by @Qubitium in #2590
- append last commit to version by @CSY-ModelCloud in #2591
- speedup paroquant test by @ZX-ModelCloud in #2592
- [CI] generate release matrix from torch registry by @CSY-ModelCloud in #2593
- Evalution integration by @Qubitium in #2585
- move eval.sh to tests by @Qubitium in #2594
- remove warning by @Qubitium in #2595
- [CI] use new docker image by @CSY-ModelCloud in #2596
- [CI] install required pkg by @CSY-ModelCloud in #2597
- Automatically Determine MODEL_COMPAT_FAST_LAYER_COUNT by @ZX-ModelCloud in #2598
- [CI] no need to set MAX_JOBS by @CSY-ModelCloud in #2599
- Fix: Paroquant impl accuracy by @Qubitium in #2601
- remove forced random seed override in cls proper by @Qubitium in #2603
- Paro test by @Qubitium in #2604
- [FIX] incorrect SAVE_PATH by @ZX-ModelCloud in #2605
- pin logbar to >= 0.4.1 by @Qubitium in #2606
- Update the evalution scores by @ZX-ModelCloud in #2600
- Paro: auto enable amp for fast impl to sync with reference by @Qubitium in #2607
- paro: fix seeding and cleanup by @Qubitium in #2609
- gate hf kernel to non-nogil builds of python until upstream fixes wheels by @Qubitium in #2610
- [CI] use Ubuntu 24.04 docker image by @CSY-ModelCloud in #2612
- Fix layer output re-capture (replay) regression by @Qubitium in #2611
- remove legacy ppl codes by @Qubitium in #2613
- replace marlin_fp16 backend with env flag control for fp32 reduction … by @Qubitium in #2614
- [CI] default py 3.14t & install latest Evalution by @CSY-ModelCloud in #2616
- [CI] fix Evalution is private by @CSY-ModelCloud in #2617
- update tokenicer by @Qubitium in #2618
- make collate respect tokenizer padding_size by @Qubitium in #2620
- paro: clamp learned channel scales to avoid collapse by @Qubitium in #2622
- Calibration data device by @avtc in #2608
- [FIX] qwen3_5 rotary_embedding by @ZX-ModelCloud in #2624
- Temporarily disable gptqmodel spit_by feature by @ZX-ModelCloud in #2625
- use evalution.VLLM by @CSY-ModelCloud in #2615
- use evalution.SGLang by @ZX-ModelCloud in #2626
- paro: enter the dragon by @Qubitium in #2623
- [CI] use torch 2.11 by @CSY-ModelCloud in #2627
- [FIX] sglang evaluation engine initialization error. by @ZX-ModelCloud in #2629
- [MODEL] Add minicpmo support by @ZX-ModelCloud in #2630
- [CI] update CI path by @CSY-ModelCloud in #2633
- [FIX] qwen3_5_moe / llama4 / qwen2_moe / qwen3_next awq layer grouping by @ZX-ModelCloud in #2634
- Remove GPTQModel.upload_to_hub() api by @ZX-ModelCloud in #2635
- remove export to mlx option by @ZX-ModelCloud in #2636
- [MODEL] supports minicpmv by @ZX-ModelCloud in #2637
- Paro: layer optimizer by @Qubitium in #2628
- Paro inference by @Qubitium in #2638
- PrismAI/Bonsai Model Support (inference only) by @Qubitium in #2640
- Update README.md by @Qubitium in #2641
- Update transformers and defuser package versions by @Qubitium in #2642
- [CI] install gguf for test_local_model_paths by @CSY-ModelCloud in #2645
- fix imported path not found by @CSY-ModelCloud in #2646
- [MODEL] support glm4_moe_lite by @ZX-ModelCloud in #2644
- [FEATURE] Add FOEM: First-Order Error Matters; Accurate Compensation for Quantized LLM by @Xingyu-Zheng in #2639
- Revise README with latest news and article references by @Qubitium in #2647
- FIX paroquant bf16 rotation support for fused cuda kernel by @Qubitium in #2648
- paroquant rotation autotune by @Qubitium in #2649
- [FIX] In awq_processor, dynamic did not correctly skip layers. by @ZX-ModelCloud in #2650
- ruff fix by @Qubitium in #2651
- Ruff fix by @Qubitium in #2652
- update readme by @Qubitium in #2653
- fix: ensure contiguous tensors by @Qubitium in #2655
- fix failed test by @ZX-ModelCloud in #2654
- [CI] move complex sh logics into py by @CSY-ModelCloud in #2656
- ci: direct post-quantized eval by @Qubitium in #2657
- fix failed test by @ZX-ModelCloud in #2658
- fix problems in the FOEM data processing pipeline by @Xingyu-Zheng in #2659
- Fix typo in latest news section of README by @Qubitium in #2660
- fix sdist version by @Qubitium in #2661
- Update README.md by @Qubitium in #2662
- add gemma4 by @Qubitium in #2663
- fix cpu fallback device restore by @Qubitium in #2664
- Update README.md by @Qubitium in #2665
New Contributors
- @Xingyu-Zheng made their first contribution in #2639
Full Changelog: v5.8.0...v6.0.3