> **Note:** This tutorial uses the modern **LLM API / PyTorch backend**, which works directly with HuggingFace model checkpoints and does not require building TensorRT engines. If you are looking for the legacy TRT engine-based approach, see the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/README-engine-backend-archive.md).

According to [Spec-Bench](https://sites.google.com/view/spec-bench), EAGLE is currently the top-performing approach for speeding up LLM inference across different tasks.
In this tutorial, we'll focus on [EAGLE-3](#eagle-3) and demonstrate how to make it work with Triton Inference Server. However, we'll also cover [Draft Model-Based Speculative Decoding](#draft-model-based-speculative-decoding) for those interested in exploring alternative methods. This way, you can choose the best fit for your needs.

## EAGLE-3

EAGLE-3 ([paper](https://arxiv.org/pdf/2503.01840) | [github](https://github.com/SafeAILab/EAGLE) | [blog](https://sites.google.com/view/eagle-llm)) is the latest generation of the EAGLE speculative decoding technique that accelerates Large Language Model (LLM) inference by predicting future tokens based on contextual features. It employs a lightweight draft head to predict the next feature vector, which is then used to generate tokens through the LLM's frozen classification head, achieving significant speedups (2x-3x faster than vanilla decoding) while maintaining output quality. Compared to EAGLE (v1/v2), EAGLE-3 further improves acceptance rates through training-time test enhancements.
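
As a mental model of what happens under the hood, the toy PyTorch sketch below mimics the flow described above: a small draft head extrapolates the target model's hidden features, and the target's frozen classification head turns each predicted feature into a draft token. This is an illustration only (the module shapes and drafting depth are invented for the example), not the actual EAGLE-3 architecture or the TensorRT-LLM implementation.

```python
import torch
import torch.nn as nn

class ToyDraftHead(nn.Module):
    """Toy stand-in for an EAGLE-style draft head: predicts the next hidden
    feature from the current feature plus the embedding of the last sampled
    token. Not the real EAGLE-3 architecture."""
    def __init__(self, hidden_size: int, num_heads: int = 8):
        super().__init__()
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)
        self.block = nn.TransformerEncoderLayer(hidden_size, num_heads, batch_first=True)

    def forward(self, feature: torch.Tensor, token_emb: torch.Tensor) -> torch.Tensor:
        return self.block(self.fuse(torch.cat([feature, token_emb], dim=-1)))

hidden_size, vocab_size, num_draft = 512, 32000, 4
draft_head = ToyDraftHead(hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)  # target model's frozen classifier
embedding = nn.Embedding(vocab_size, hidden_size)         # target model's frozen embeddings

feature = torch.randn(1, 1, hidden_size)       # last hidden state from the target model
token = torch.randint(0, vocab_size, (1, 1))   # last accepted token
draft_tokens = []
for _ in range(num_draft):
    feature = draft_head(feature, embedding(token))   # extrapolate the next feature
    token = lm_head(feature).argmax(dim=-1)           # frozen head maps feature -> draft token
    draft_tokens.append(token.item())

# In real speculative decoding, the target model now verifies all draft tokens
# in a single forward pass and keeps the longest accepted prefix.
print(draft_tokens)
```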
### Download the Target and Draft Models (Optional)

More EAGLE-3 compatible draft model checkpoints can be found in the [Speculative Decoding Modules](https://huggingface.co/collections/nvidia/speculative-decoding-modules) collection from NVIDIA.
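
If you would rather fetch the checkpoints ahead of time instead of letting the server download them at startup, you can do so with `huggingface_hub`. In the sketch below, the draft-model repo ID is a placeholder to be replaced with a checkpoint from the collection linked above, and the gated Llama target requires an authenticated Hugging Face token.

```python
from huggingface_hub import snapshot_download

# Target model (gated on Hugging Face; run `huggingface-cli login` first or
# set the HF_TOKEN environment variable).
snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    local_dir="/models/Llama-3.1-8B-Instruct",
)

# EAGLE-3 draft checkpoint -- replace the placeholder with a repo ID from
# NVIDIA's Speculative Decoding Modules collection linked above.
snapshot_download(
    repo_id="<eagle3-draft-model-repo-id>",
    local_dir="/models/eagle3-draft",
)
```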
### Launch Triton TensorRT-LLM Container

Edit `/opt/tritonserver/llmapi_repo/tensorrt_llm/1/model.yaml` to configure EAGLE-3:

```yaml
model: meta-llama/Llama-3.1-8B-Instruct
```
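
The fields in `model.yaml` map onto TensorRT-LLM's Python LLM API, so one way to sanity-check an EAGLE-3 configuration before wiring it into Triton is a short standalone script. Treat the sketch below as an assumption about recent TensorRT-LLM releases: the `EagleDecodingConfig` class and its field names may differ in your version, and the draft-model path is a placeholder.

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import EagleDecodingConfig  # name assumed; check your TRT-LLM version

# EAGLE-3 speculative decoding settings (illustrative values only).
spec_config = EagleDecodingConfig(
    max_draft_len=4,                               # draft tokens proposed per step
    speculative_model_dir="/models/eagle3-draft",  # placeholder path to the draft checkpoint
    eagle3_one_model=False,
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config=spec_config,
)

outputs = llm.generate(
    ["What is speculative decoding?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```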
### Evaluating Performance with Gen-AI Perf

Gen-AI Perf is a command line tool for measuring the throughput and latency of generative AI models as served through an inference server.
You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html). We will use Gen-AI Perf to evaluate the performance gain of EAGLE-3 over the base model.

*NOTE: The experiment below was run on a single node with one GPU, an RTX 5880 (48 GB GPU memory). The numbers below are for reference only; actual results may vary with different hardware and environments.*
4. Run Gen-AI Perf on Base Model

To compare performance between EAGLE-3 and the base model (i.e. vanilla LLM without speculative decoding), restart Triton Server with a `model.yaml` that omits the `speculative_config` block:

```yaml
model: meta-llama/Llama-3.1-8B-Instruct
```
5. Compare Performance

From sample runs, EAGLE-3 typically delivers 2x or greater token throughput improvement over the base model at low concurrency. The exact speedup varies by hardware, model, and dataset.

As stated above, these numbers were gathered on a single node with one GPU, an RTX 5880 (48 GB GPU memory); actual results may vary with different hardware and environments.

## MEDUSA
> **Important:** MEDUSA is **not supported** in the modern LLM API / PyTorch backend. It only works with the legacy TRT engine backend.
>
> For new deployments, we recommend using [EAGLE-3](#eagle-3) instead, which is fully supported on the LLM API / PyTorch backend and achieves higher draft accuracy.
>
> If you specifically need MEDUSA with the TRT engine backend, refer to the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/README-engine-backend-archive.md) for the legacy instructions.

## Draft Model-Based Speculative Decoding
Draft Model-Based Speculative Decoding ([paper](https://arxiv.org/pdf/2302.01318)) is another approach to accelerate LLM inference that uses a smaller, faster LLM as a draft model to predict multiple tokens ahead. This approach is distinct from EAGLE-3 and is supported in the modern LLM API / PyTorch backend. Here are the key differences compared to EAGLE-3:

- Draft Generation: it uses a separate, independent LLM as a draft model to predict multiple tokens ahead. This contrasts with EAGLE-3's feature-level extrapolation using a lightweight draft head embedded into the target model.

- Verification Process: it employs a chain-like (linear) structure for draft generation and verification, unlike EAGLE-3 which uses tree-based attention mechanisms.

- Consistency: it maintains distribution consistency with the target LLM in both greedy and non-greedy settings, similar to EAGLE-3.

- Efficiency: while effective, it is generally slower than EAGLE-3.

- Implementation: it requires a separate draft model that shares the same tokenizer as the target model. The draft model can be any HuggingFace-compatible LLM (see the sketch after this list).
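
To make the chain-style draft-and-verify flow concrete, here is a minimal greedy sketch built on Hugging Face Transformers. It is illustrative only: the target/draft pair is just an example of two models that share a tokenizer, and the production implementation in TensorRT-LLM is considerably more sophisticated.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example target/draft pair sharing the Llama 3 tokenizer (any such pair works).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.float16, device_map="auto")

prompt_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids.to(target.device)
num_draft = 4  # draft tokens proposed per verification step

# 1) Draft: the small model proposes a linear chain of tokens autoregressively.
proposal = draft.generate(prompt_ids, max_new_tokens=num_draft, do_sample=False)
draft_tokens = proposal[0, prompt_ids.shape[1]:]

# 2) Verify: one target forward pass scores every proposed position at once.
candidate = torch.cat([prompt_ids, draft_tokens.unsqueeze(0)], dim=1)
logits = target(candidate).logits
target_tokens = logits[0, prompt_ids.shape[1] - 1 : -1].argmax(dim=-1)

# 3) Accept the longest prefix where draft and target agree (greedy variant;
#    the sampled variant preserves the target's output distribution).
accepted = 0
while accepted < len(draft_tokens) and draft_tokens[accepted] == target_tokens[accepted]:
    accepted += 1
print(tokenizer.decode(candidate[0, : prompt_ids.shape[1] + accepted]))
```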
To use Draft Model-Based Speculative Decoding with Triton via the LLM API, follow the same container setup and model repository preparation steps as in the [EAGLE-3](#eagle-3) section above, but configure `model.yaml` as follows: