Commit cb36413

docs: Add FP8 quantization section to Qwen3 notebook
Add a new section demonstrating how to load and use FP8 quantized Qwen3 models with the `preserve_source_types: true` option. Update the introduction and summary to reflect the new capability.
1 parent 61bccb0 commit cb36413

1 file changed: 67 additions & 4 deletions

notebooks/qwen3.livemd
@@ -16,6 +16,7 @@ Nx.global_default_backend(EXLA.Backend)
 In this notebook we explore the [Qwen3](https://qwenlm.github.io/blog/qwen3/) model family from Alibaba Cloud. Qwen3 is a series of large language models that includes:
 
 * **Text Generation** - Instruction-tuned models for conversational AI
+* **FP8 Quantization** - Memory-efficient 8-bit floating point models
 * **Embeddings** - Dense vector representations for semantic search
 * **Rerankers** - Models to rerank search results for better relevance
 
@@ -79,6 +80,67 @@ Nx.Serving.batched_run(Qwen3, prompt) |> Enum.each(&IO.write/1)
 
 <!-- livebook:{"branch_parent_index":0} -->
 
+## Text Generation with FP8 Quantization
+
+Qwen3 models are also available in FP8 (8-bit floating point) quantized format, which significantly reduces memory usage while maintaining good quality. FP8 models use approximately half the memory of BF16 models.
+
+```elixir
+repo = {:hf, "Qwen/Qwen3-4B-Instruct-2507-FP8"}
+
+{:ok, model_info} = Bumblebee.load_model(repo,
+  backend: EXLA.Backend,
+  preserve_source_types: true
+)
+{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
+{:ok, generation_config} = Bumblebee.load_generation_config(repo)
+
+:ok
+```
+
+The key option here is `preserve_source_types: true`, which keeps the FP8 weights in their native format instead of converting them to the model's default type. The model will automatically dequantize the weights during inference.
+
+Configure generation and create a serving:
+
+```elixir
+generation_config =
+  Bumblebee.configure(generation_config,
+    max_new_tokens: 256,
+    temperature: 0.7,
+    strategy: %{type: :multinomial_sampling, top_p: 0.8, top_k: 20}
+  )
+
+serving =
+  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
+    compile: [batch_size: 1, sequence_length: 1024],
+    stream: true,
+    defn_options: [compiler: EXLA]
+  )
+
+Kino.start_child({Nx.Serving, name: Qwen3FP8, serving: serving})
+```
+
+Test the FP8 model with the same chat template:
+
+```elixir
+user_input_fp8 = Kino.Input.textarea("User prompt (FP8)", default: "What are the benefits of quantized models?")
+```
+
+```elixir
+user = Kino.Input.read(user_input_fp8)
+
+prompt = """
+<|im_start|>system
+You are a helpful assistant.<|im_end|>
+<|im_start|>user
+#{user}<|im_end|>
+<|im_start|>assistant
+"""
+
+Nx.Serving.batched_run(Qwen3FP8, prompt) |> Enum.each(&IO.write/1)
+```
+
+<!-- livebook:{"branch_parent_index":0} -->
+
 ## Embeddings
 
 Qwen3 embedding models convert text into dense vector representations, useful for semantic search and similarity tasks.
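
An aside on the `preserve_source_types: true` option described in the hunk above (illustrative, not part of the commit): Nx reports a tensor's element type via `Nx.type/1` as a `{kind, bits}` tuple, which is how you can confirm what type loaded weights ended up in. The sketch below uses BF16, since the exact FP8 type atom depends on the Nx/EXLA version in use.

```elixir
# Illustrative sketch: Nx.type/1 returns the element type of a tensor.
# With preserve_source_types: true, loaded FP8 weights keep their 8-bit
# float type instead of being upcast to the model's default type.
t = Nx.tensor([1.0, 2.0], type: :bf16)
IO.inspect(Nx.type(t))
# => {:bf, 16}
```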
@@ -214,10 +276,11 @@ The reranker correctly identifies that the document directly answering "What is
 
 ## Summary
 
-This notebook demonstrated three key capabilities of the Qwen3 model family:
+This notebook demonstrated four key capabilities of the Qwen3 model family:
 
 1. **Text Generation** - Conversational AI using instruction-tuned models
-2. **Embeddings** - Creating semantic vector representations for similarity search
-3. **Reranking** - Scoring and ranking documents by relevance to a query
+2. **FP8 Quantization** - Memory-efficient inference using 8-bit floating point weights
+3. **Embeddings** - Creating semantic vector representations for similarity search
+4. **Reranking** - Scoring and ranking documents by relevance to a query
 
-All three models work seamlessly with Bumblebee and can be used for various NLP applications.
+All models work seamlessly with Bumblebee and can be used for various NLP applications.
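
The new FP8 section claims roughly half the memory of BF16. A quick back-of-envelope sketch supports this, under the assumptions that the 4B-parameter model is used and that weights dominate memory usage:

```elixir
# Rough estimate: bytes per weight times parameter count.
params = 4.0e9
bf16_gb = params * 2 / 1.0e9  # BF16: 2 bytes per weight
fp8_gb = params * 1 / 1.0e9   # FP8: 1 byte per weight
IO.puts("BF16 ~ #{bf16_gb} GB, FP8 ~ #{fp8_gb} GB")
# BF16 ~ 8.0 GB, FP8 ~ 4.0 GB
```

Actual memory at runtime is higher (KV cache, activations, compilation buffers), so the 2x saving applies to the weights themselves, not the whole process footprint.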
