docs: Add FP8 quantization section to Qwen3 notebook
Add a new section demonstrating how to load and use FP8-quantized
Qwen3 models with the `preserve_source_types: true` option. Update the
introduction and summary to reflect the new capability.
In this notebook we explore the [Qwen3](https://qwenlm.github.io/blog/qwen3/) model family from Alibaba Cloud. Qwen3 is a series of large language models that includes:
- **Text Generation** - Instruction-tuned models for conversational AI
- **FP8 Quantization** - Memory-efficient 8-bit floating point models
- **Embeddings** - Dense vector representations for semantic search
- **Rerankers** - Models to rerank search results for better relevance
Qwen3 models are also available in an FP8 (8-bit floating point) quantized format, which significantly reduces memory usage while maintaining good quality: FP8 checkpoints use approximately half the memory of their BF16 counterparts.
The key option here is `preserve_source_types: true`, which keeps the FP8 weights in their native format instead of converting them to the model's default type. The model will automatically dequantize the weights during inference.
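As a rough sketch of what that looks like in practice, the cell below loads an FP8 checkpoint with the option enabled. It assumes a Bumblebee-style loading flow and the `Qwen/Qwen3-8B-FP8` Hugging Face repository; only `preserve_source_types: true` itself comes from this section, so treat the surrounding calls as illustrative rather than the notebook's exact code.

```elixir
# Illustrative sketch (assumed Bumblebee-style API): load an FP8-quantized
# Qwen3 checkpoint while keeping the weights in their source FP8 type
# instead of upcasting them to the model's default parameter type.
repo = {:hf, "Qwen/Qwen3-8B-FP8"}

{:ok, model_info} = Bumblebee.load_model(repo, preserve_source_types: true)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

# Build a text-generation serving. The weights are dequantized on the fly
# during inference, so the call site looks the same as for a BF16 model.
serving = Bumblebee.Text.generation(model_info, tokenizer, generation_config)

Nx.Serving.run(serving, "Explain FP8 quantization in one sentence.")
```

Without `preserve_source_types: true`, the loader would convert the FP8 weights to the model's default type on load, roughly doubling memory use and defeating the purpose of the quantized checkpoint.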