This tutorial shows how to build and serve speculative decoding models in Triton Inference Server with the [TensorRT-LLM LLM API / PyTorch backend](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llmapi.md) on a single node with one GPU. Please see the [Speculative Decoding](../README.md) main page to learn more about other supported backends.

> **Note:** This tutorial uses the modern **LLM API / PyTorch backend**, which works directly with HuggingFace model checkpoints and does not require building TensorRT engines. If you are looking for the legacy TRT engine-based approach, see the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend#tensorrt-engine-backend).

According to [Spec-Bench](https://sites.google.com/view/spec-bench), EAGLE is currently the top-performing approach for speeding up LLM inference across different tasks.
In this tutorial, we'll focus on [EAGLE-3](#eagle-3) and demonstrate how to make it work with Triton Inference Server. We'll also cover [Draft Model-Based Speculative Decoding](#draft-model-based-speculative-decoding) for those interested in exploring alternative methods, so you can choose the best fit for your needs.
### Serving with Triton

Launch Triton Server with the [launch_triton_server.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/triton_backend/scripts/launch_triton_server.py) script:

```bash
python3 /app/scripts/launch_triton_server.py --model_repo=/opt/tritonserver/llmapi_repo/
```
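Once the server starts, you can run a quick sanity check before sending real traffic. The sketch below uses Triton's standard HTTP endpoints and assumes the default HTTP port (8000) on the same machine; adjust the host and port if your deployment differs:

```bash
# Readiness probe: HTTP 200 means the server and its loaded models are ready to serve requests
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready

# Inspect which models were loaded from the repository and their current state
curl -s -X POST localhost:8000/v2/repository/index | python3 -m json.tool
```

If the readiness probe does not return 200, check the Triton server output for model-loading errors before proceeding.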
> **Note:** For new deployments, we recommend using [EAGLE-3](#eagle-3) rather than MEDUSA, since EAGLE-3 is fully supported on the LLM API / PyTorch backend and achieves higher draft accuracy.
>
> If you specifically need MEDUSA with the TRT engine backend, refer to the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend#tensorrt-engine-backend) for the legacy instructions.

## Draft Model-Based Speculative Decoding