Commit 9f1886b

fix: standardize EAGLE 3 spelling to EAGLE-3 throughout the guide
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
1 parent: 6a55bdd

1 file changed: Feature_Guide/Speculative_Decoding/TRT-LLM/README.md (16 additions, 16 deletions)
@@ -29,7 +29,7 @@
 # Speculative Decoding with TensorRT-LLM

 - [About Speculative Decoding](#about-speculative-decoding)
-- [EAGLE 3](#eagle-3)
+- [EAGLE-3](#eagle-3)
 - [MEDUSA](#medusa)
 - [Draft Model-Based Speculative Decoding](#draft-model-based-speculative-decoding)

@@ -40,11 +40,11 @@ This tutorial shows how to build and serve speculative decoding models in Triton
 > **Note:** This tutorial uses the modern **LLM API / PyTorch backend**, which works directly with HuggingFace model checkpoints and does not require building TensorRT engines. If you are looking for the legacy TRT engine-based approach, see the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/README-engine-backend-archive.md).

 According to [Spec-Bench](https://sites.google.com/view/spec-bench), EAGLE is currently the top-performing approach for speeding up LLM inference across different tasks.
-In this tutorial, we'll focus on [EAGLE 3](#eagle-3) and demonstrate how to make it work with Triton Inference Server. However, we'll also cover [Draft Model-Based Speculative Decoding](#draft-model-based-speculative-decoding) for those interested in exploring alternative methods. This way, you can choose the best fit for your needs.
+In this tutorial, we'll focus on [EAGLE-3](#eagle-3) and demonstrate how to make it work with Triton Inference Server. However, we'll also cover [Draft Model-Based Speculative Decoding](#draft-model-based-speculative-decoding) for those interested in exploring alternative methods. This way, you can choose the best fit for your needs.

-## EAGLE 3
+## EAGLE-3

-EAGLE-3 ([paper](https://arxiv.org/pdf/2503.01840) | [github](https://github.com/SafeAILab/EAGLE) | [blog](https://sites.google.com/view/eagle-llm)) is the latest generation of the EAGLE speculative decoding technique that accelerates Large Language Model (LLM) inference by predicting future tokens based on contextual features. It employs a lightweight draft head to predict the next feature vector, which is then used to generate tokens through the LLM's frozen classification head, achieving significant speedups (2x-3x faster than vanilla decoding) while maintaining output quality. Compared to EAGLE (v1/v2), EAGLE 3 further improves acceptance rates through training-time test enhancements.
+EAGLE-3 ([paper](https://arxiv.org/pdf/2503.01840) | [github](https://github.com/SafeAILab/EAGLE) | [blog](https://sites.google.com/view/eagle-llm)) is the latest generation of the EAGLE speculative decoding technique that accelerates Large Language Model (LLM) inference by predicting future tokens based on contextual features. It employs a lightweight draft head to predict the next feature vector, which is then used to generate tokens through the LLM's frozen classification head, achieving significant speedups (2x-3x faster than vanilla decoding) while maintaining output quality. Compared to EAGLE (v1/v2), EAGLE-3 further improves acceptance rates through training-time test enhancements.

 ### Download the Target and Draft Models (Optional)

@@ -57,7 +57,7 @@ huggingface-cli download meta-llama/Llama-3.1-8B-Instruct
 huggingface-cli download yuhuili/EAGLE3-LLaMA3.1-Instruct-8B
 ```

-More EAGLE 3 compatible draft model checkpoints can be found in the [Speculative Decoding Modules](https://huggingface.co/collections/nvidia/speculative-decoding-modules) collection from NVIDIA.
+More EAGLE-3 compatible draft model checkpoints can be found in the [Speculative Decoding Modules](https://huggingface.co/collections/nvidia/speculative-decoding-modules) collection from NVIDIA.

 ### Launch Triton TensorRT-LLM Container

@@ -78,7 +78,7 @@ Copy the LLM API model template inside the container:
 cp -R /app/all_models/llmapi/ /opt/tritonserver/llmapi_repo/
 ```

-Edit `/opt/tritonserver/llmapi_repo/tensorrt_llm/1/model.yaml` to configure EAGLE 3:
+Edit `/opt/tritonserver/llmapi_repo/tensorrt_llm/1/model.yaml` to configure EAGLE-3:

 ```yaml
 model: meta-llama/Llama-3.1-8B-Instruct
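The diff truncates the `model.yaml` hunk after its first line. For orientation, a fuller EAGLE-3 configuration might look like the sketch below; the `speculative_config` key names are assumptions drawn from TensorRT-LLM's LLM API (`EagleDecodingConfig`) and should be verified against the template shipped in the container.

```yaml
model: meta-llama/Llama-3.1-8B-Instruct
# Assumed keys; confirm against /opt/tritonserver/llmapi_repo/tensorrt_llm/1/model.yaml
speculative_config:
  decoding_type: Eagle
  max_draft_len: 4                                             # draft tokens proposed per step
  speculative_model_dir: yuhuili/EAGLE3-LLaMA3.1-Instruct-8B   # EAGLE-3 draft head checkpoint
  eagle3_one_model: false                                      # whether to fuse draft and target into one model
```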
@@ -146,7 +146,7 @@ This adds fields like `acceptance_rate`, `total_accepted_draft_tokens`, and `tot
 ### Evaluating Performance with Gen-AI Perf

 Gen-AI Perf is a command line tool for measuring the throughput and latency of generative AI models as served through an inference server.
-You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html). We will use Gen-AI Perf to evaluate the performance gain of EAGLE 3 over the base model.
+You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html). We will use Gen-AI Perf to evaluate the performance gain of EAGLE-3 over the base model.

 *NOTE: below experiment is done on a single node with one GPU - RTX 5880 (48GB GPU memory). The number below is only for reference. The actual number may vary due to the different hardware and environment.*

@@ -187,7 +187,7 @@ You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/

 4. Run Gen-AI Perf on Base Model

-To compare performance between EAGLE 3 and the base model (i.e. vanilla LLM without speculative decoding), restart Triton Server with a `model.yaml` that omits the `speculative_config` block:
+To compare performance between EAGLE-3 and the base model (i.e. vanilla LLM without speculative decoding), restart Triton Server with a `model.yaml` that omits the `speculative_config` block:

 ```yaml
 model: meta-llama/Llama-3.1-8B-Instruct
@@ -205,33 +205,33 @@ You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/

 5. Compare Performance

-From sample runs, EAGLE 3 typically delivers 2x or greater token throughput improvement over the base model at low concurrency. The exact speedup varies by hardware, model, and dataset.
+From sample runs, EAGLE-3 typically delivers 2x or greater token throughput improvement over the base model at low concurrency. The exact speedup varies by hardware, model, and dataset.

 As stated above, the number above is gathered from a single node with one GPU - RTX 5880 (48GB GPU memory). The actual number may vary due to the different hardware and environment.

 ## MEDUSA

 > **Important:** MEDUSA is **not supported** in the modern LLM API / PyTorch backend. It only works with the legacy TRT engine backend.
 >
-> For new deployments, we recommend using [EAGLE 3](#eagle-3) instead, which is fully supported on the LLM API / PyTorch backend and achieves higher draft accuracy.
+> For new deployments, we recommend using [EAGLE-3](#eagle-3) instead, which is fully supported on the LLM API / PyTorch backend and achieves higher draft accuracy.
 >
 > If you specifically need MEDUSA with the TRT engine backend, refer to the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/README-engine-backend-archive.md) for the legacy instructions.

 ## Draft Model-Based Speculative Decoding

-Draft Model-Based Speculative Decoding ([paper](https://arxiv.org/pdf/2302.01318)) is another approach to accelerate LLM inference that uses a smaller, faster LLM as a draft model to predict multiple tokens ahead. This approach is distinct from EAGLE 3 and is supported in the modern LLM API / PyTorch backend. Here are the key differences compared to EAGLE 3:
+Draft Model-Based Speculative Decoding ([paper](https://arxiv.org/pdf/2302.01318)) is another approach to accelerate LLM inference that uses a smaller, faster LLM as a draft model to predict multiple tokens ahead. This approach is distinct from EAGLE-3 and is supported in the modern LLM API / PyTorch backend. Here are the key differences compared to EAGLE-3:

-- Draft Generation: it uses a separate, independent LLM as a draft model to predict multiple tokens ahead. This contrasts with EAGLE 3's feature-level extrapolation using a lightweight draft head embedded into the target model.
+- Draft Generation: it uses a separate, independent LLM as a draft model to predict multiple tokens ahead. This contrasts with EAGLE-3's feature-level extrapolation using a lightweight draft head embedded into the target model.

-- Verification Process: it employs a chain-like (linear) structure for draft generation and verification, unlike EAGLE 3 which uses tree-based attention mechanisms.
+- Verification Process: it employs a chain-like (linear) structure for draft generation and verification, unlike EAGLE-3 which uses tree-based attention mechanisms.

-- Consistency: it maintains distribution consistency with the target LLM in both greedy and non-greedy settings, similar to EAGLE 3.
+- Consistency: it maintains distribution consistency with the target LLM in both greedy and non-greedy settings, similar to EAGLE-3.

-- Efficiency: While effective, it is generally slower than EAGLE 3.
+- Efficiency: While effective, it is generally slower than EAGLE-3.

 - Implementation: it requires a separate draft model that shares the same tokenizer as the target model. The draft model can be any HuggingFace-compatible LLM.

-To use Draft Model-Based Speculative Decoding with Triton via the LLM API, follow the same container setup and model repository preparation steps as in the [EAGLE 3](#eagle-3) section above, but configure `model.yaml` as follows:
+To use Draft Model-Based Speculative Decoding with Triton via the LLM API, follow the same container setup and model repository preparation steps as in the [EAGLE-3](#eagle-3) section above, but configure `model.yaml` as follows:

 ```yaml
 model: meta-llama/Llama-3.1-8B-Instruct
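As with the EAGLE-3 hunk, the draft-model `model.yaml` is truncated here. A hedged sketch follows, with key names assumed from TensorRT-LLM's `DraftTargetDecodingConfig` and `meta-llama/Llama-3.2-1B-Instruct` chosen purely as an illustrative draft model that shares the Llama 3 tokenizer with the target; verify both against the template and your model choice.

```yaml
model: meta-llama/Llama-3.1-8B-Instruct
# Assumed keys; verify against the template in the container.
speculative_config:
  decoding_type: DraftTarget
  max_draft_len: 4                                            # draft tokens proposed per step
  speculative_model_dir: meta-llama/Llama-3.2-1B-Instruct     # illustrative draft model, same tokenizer as target
```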
