
GPT-QModel v6.0.3


@Qubitium Qubitium released this 02 Apr 23:56
6a65d69

Notable Changes:

Quantization and inference

  • Major ParoQuant improvements across quantization speed, inference performance, and accuracy.
  • Added Paro inference support and a new layer optimizer.
  • Auto-enables AMP for the fast Paro implementation to better match reference behavior.
  • Added Paro rotation autotuning and fixed BF16 rotation support for the fused CUDA kernel.
  • Improved Paro stability with seeding fixes, cleanup, learned channel scale clamping, and contiguous tensor handling fixes.
  • Fixed a layer output replay/re-capture regression.
  • Added FOEM (First-Order Error Matters) for more accurate quantized LLM compensation, plus follow-up fixes to its data processing pipeline.
  • Replaced the old marlin_fp16 backend behavior with environment-flag control for FP32 reduction.
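The environment-flag control for FP32 reduction can be sketched as below. The flag name `GPTQMODEL_MARLIN_FP32` and the helper function are hypothetical stand-ins for illustration only; check the project documentation for the actual variable.

```python
import os

def marlin_use_fp32_reduction(default: bool = True) -> bool:
    """Read an environment flag deciding whether the Marlin kernel
    accumulates in FP32 (safer) or FP16 (faster).

    NOTE: "GPTQMODEL_MARLIN_FP32" is an assumed name for this sketch,
    not necessarily the library's real flag.
    """
    raw = os.environ.get("GPTQMODEL_MARLIN_FP32")
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")
```

This pattern replaces a dedicated `marlin_fp16` backend with a single runtime switch, so the reduction precision can change without selecting a different backend.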

Model and backend support

  • Added support for Gemma4, MiniCPMO, MiniCPMV, and GLM4-MoE-Lite.
  • Added PrismML/Bonsai model support for inference.
  • Fixed Qwen3_5QModel definition issues.
  • Fixed Qwen 3.5 rotary embedding behavior.
  • Fixed AWQ layer grouping for qwen3_5_moe, llama4, qwen2_moe, and qwen3_next.
  • Fixed awq_processor.dynamic so skipped layers are handled correctly.
  • Improved dtype compatibility.
  • Hugging Face kernels are now gated off on Python no-GIL builds until upstream wheel support is fixed.
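The `awq_processor.dynamic` fix above concerns matching per-layer overrides, including layers marked as skipped. A minimal sketch of the semantics, assuming regex keys where a `-:` prefix marks a skipped layer and `+:` (or no prefix) supplies overrides; the helper name and prefix convention here are illustrative, not the library's exact implementation:

```python
import re
from typing import Optional

def resolve_dynamic(layer_name: str, dynamic: dict) -> Optional[dict]:
    """Return per-layer quantization overrides, or None if the layer is skipped.

    Keys are regex patterns; a '-:' prefix means 'skip this layer',
    '+:' (or no prefix) means 'apply these overrides'. First match wins.
    """
    for pattern, overrides in dynamic.items():
        skip = pattern.startswith("-:")
        body = pattern[2:] if pattern[:2] in ("-:", "+:") else pattern
        if re.search(body, layer_name):
            return None if skip else overrides
    return {}  # no rule matched: fall back to the global config

# Example rules (illustrative layer names):
dynamic = {
    r"-:.*\.gate$": {},                  # skip MoE router gates entirely
    r"+:.*\.self_attn\..*": {"bits": 8}, # quantize attention at 8 bits
}
```

With rules like these, a skipped layer resolves to `None` and must be excluded from quantization rather than silently given empty overrides, which is the class of bug the release fixes.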

Evaluation, calibration, and usability

  • Integrated Evaluation into the workflow.
  • Added evaluation.VLLM and evaluation.SGLang backends.
  • Fixed SGLang evaluation engine initialization.
  • Automatically determines MODEL_COMPAT_FAST_LAYER_COUNT.
  • Improved calibration data device handling.
  • Updated tokenizer handling, and collation now respects the tokenizer's padding_side.
  • Improved import performance by lazy-loading _DEVICE_THREAD_POOL.
  • Cleaned up warning behavior and added an option to suppress warnings.
  • Removed forced random seed overrides.
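The `padding_side` change matters because decoder-only models are typically left-padded for batched generation. A minimal collation sketch over plain lists of token ids (the function name is illustrative, not the library's API):

```python
def collate(batch, pad_id: int = 0, padding_side: str = "left"):
    """Pad variable-length token-id sequences to a rectangle,
    honoring the tokenizer's padding_side setting."""
    width = max(len(seq) for seq in batch)
    padded = []
    for seq in batch:
        pad = [pad_id] * (width - len(seq))
        padded.append(pad + seq if padding_side == "left" else seq + pad)
    return padded

# left:  [[1, 2, 3], [0, 0, 4]]   right: [[1, 2, 3], [4, 0, 0]]
```

Padding on the wrong side shifts real tokens relative to the position ids the model expects, which degrades calibration and evaluation quality silently.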

Dependency and compatibility updates

  • Updated pypcre to 0.2.14.
  • Pinned logbar to >=0.4.1.
  • Updated transformers and defuser package versions.
  • Fixed SAVE_PATH handling and import path resolution issues.

Breaking and removed

  • Removed GPTQModel.upload_to_hub().
  • Removed MLX export support.

Full Changelog: v5.8.0...v6.0.3