
Commit 47a5e03

docs: address whoisj and yinggeh review comments
- Standardize EAGLE-3 naming (hyphen) throughout; was inconsistent
- Indent numbered list content 3 spaces so items render as a single list
- Fix launch_triton_server.py URL: was tensorrtllm_backend/scripts (404), now NVIDIA/TensorRT-LLM/triton_backend/scripts (correct repo)
- Fix engine backend archive links: point to tensorrtllm_backend#tensorrt-engine-backend instead of deleted archive file

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
1 parent 9f1886b commit 47a5e03

1 file changed

Lines changed: 3 additions & 3 deletions

File tree

  • Feature_Guide/Speculative_Decoding/TRT-LLM

Feature_Guide/Speculative_Decoding/TRT-LLM/README.md

@@ -37,7 +37,7 @@

This tutorial shows how to build and serve speculative decoding models in Triton Inference Server with [TensorRT-LLM LLM API / PyTorch backend](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llmapi.md) on a single node with one GPU. Please go to [Speculative Decoding](../README.md) main page to learn more about other supported backends.

-> **Note:** This tutorial uses the modern **LLM API / PyTorch backend**, which works directly with HuggingFace model checkpoints and does not require building TensorRT engines. If you are looking for the legacy TRT engine-based approach, see the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/README-engine-backend-archive.md).
+> **Note:** This tutorial uses the modern **LLM API / PyTorch backend**, which works directly with HuggingFace model checkpoints and does not require building TensorRT engines. If you are looking for the legacy TRT engine-based approach, see the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend#tensorrt-engine-backend).

According to [Spec-Bench](https://sites.google.com/view/spec-bench), EAGLE is currently the top-performing approach for speeding up LLM inference across different tasks.
In this tutorial, we'll focus on [EAGLE-3](#eagle-3) and demonstrate how to make it work with Triton Inference Server. However, we'll also cover [Draft Model-Based Speculative Decoding](#draft-model-based-speculative-decoding) for those interested in exploring alternative methods. This way, you can choose the best fit for your needs.
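For context on the note changed above: the LLM API loads a HuggingFace checkpoint directly, with no engine-build step. A minimal sketch of that API outside of Triton is shown below; the checkpoint name is only an illustrative assumption, and any HF-format model path works the same way.

```python
# Minimal LLM API sketch: load a HuggingFace checkpoint directly, no engine build step.
# The model name is an illustrative assumption; substitute any HF-format checkpoint.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=64)

for output in llm.generate(["What is speculative decoding?"], params):
    print(output.outputs[0].text)
```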
@@ -101,7 +101,7 @@ triton_config:

### Serving with Triton

-Launch Triton Server with the [launch_triton_server.py](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/scripts/launch_triton_server.py) script:
+Launch Triton Server with the [launch_triton_server.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/triton_backend/scripts/launch_triton_server.py) script:

```bash
python3 /app/scripts/launch_triton_server.py --model_repo=/opt/tritonserver/llmapi_repo/
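Once the server from the command above is up, it can be smoke-tested through Triton's HTTP generate endpoint. The sketch below assumes the default HTTP port (8000), a model named `tensorrt_llm` inside the llmapi_repo, and the `text_input`/`max_tokens` request fields used by the TensorRT-LLM backend; adjust all of these to match the actual model repository and config.

```python
# Smoke-test the launched server via Triton's HTTP generate endpoint.
# Assumptions: default HTTP port 8000, a model named "tensorrt_llm" in llmapi_repo,
# and the text_input/max_tokens field names of the TensorRT-LLM backend generate
# schema; adjust these to your repository before running.
import requests

resp = requests.post(
    "http://localhost:8000/v2/models/tensorrt_llm/generate",
    json={"text_input": "What is speculative decoding?", "max_tokens": 64},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```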
@@ -215,7 +215,7 @@ You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/
>
> For new deployments, we recommend using [EAGLE-3](#eagle-3) instead, which is fully supported on the LLM API / PyTorch backend and achieves higher draft accuracy.
>
-> If you specifically need MEDUSA with the TRT engine backend, refer to the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/README-engine-backend-archive.md) for the legacy instructions.
+> If you specifically need MEDUSA with the TRT engine backend, refer to the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend#tensorrt-engine-backend) for the legacy instructions.

## Draft Model-Based Speculative Decoding

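The section touched by this hunk covers draft model-based speculative decoding, where a small draft model proposes several tokens and the larger target model verifies them. As a conceptual aid only, and not the TensorRT-LLM implementation, a toy greedy accept/verify loop looks roughly like this:

```python
# Toy illustration of draft-model speculative decoding (greedy variant), independent
# of TensorRT-LLM: a cheap draft model proposes K tokens, the target model verifies
# them, and generation keeps every proposal up to the first disagreement plus one
# corrected token from the target. Both "models" here are stand-in functions.
from typing import List

def draft_model(prefix: List[int], k: int) -> List[int]:
    # Stand-in draft model: cheap, sometimes wrong.
    return [(prefix[-1] + i + 1) % 50 for i in range(k)]

def target_model_next(prefix: List[int]) -> int:
    # Stand-in target model: the token we actually trust.
    return (prefix[-1] * 3 + 1) % 50

def speculative_decode(prompt: List[int], steps: int, k: int = 4) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < steps:
        proposal = draft_model(tokens, k)
        accepted = []
        for tok in proposal:
            # In a real system this verification is one batched target-model pass.
            expected = target_model_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)       # draft agreed with target: keep it
            else:
                accepted.append(expected)  # first disagreement: take target's token
                break
        else:
            # All K proposals accepted; the target still contributes one bonus token.
            accepted.append(target_model_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:len(prompt) + steps]

print(speculative_decode([7], steps=12))
```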
