This tutorial shows how to build and serve speculative decoding models in Triton Inference Server with the [TensorRT-LLM LLM API / PyTorch backend](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llmapi.md) on a single node with one GPU. Please see the [Speculative Decoding](../README.md) main page to learn more about other supported backends.

> **Note:** This tutorial uses the modern **LLM API / PyTorch backend**, which works directly with HuggingFace model checkpoints and does not require building TensorRT engines. If you are looking for the legacy TRT engine-based approach, see the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend#tensorrt-engine-backend).

According to [Spec-Bench](https://sites.google.com/view/spec-bench), EAGLE is currently the top-performing approach for speeding up LLM inference across different tasks.
In this tutorial, we'll focus on [EAGLE-3](#eagle-3) and demonstrate how to make it work with Triton Inference Server. We'll also cover [Draft Model-Based Speculative Decoding](#draft-model-based-speculative-decoding) for those interested in exploring alternative methods, so you can choose the best fit for your needs.
### Serving with Triton

Launch Triton Server with the [launch_triton_server.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/triton_backend/scripts/launch_triton_server.py) script:

```bash
python3 /app/scripts/launch_triton_server.py --model_repo=/opt/tritonserver/llmapi_repo/
```
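Once the server starts, you can run a quick sanity check before sending real traffic. The sketch below uses Triton's standard HTTP endpoints and assumes the default HTTP port (8000) on the same machine; adjust the host and port if your deployment differs:

```bash
# Readiness probe: HTTP 200 means the server and its loaded models are ready to serve requests
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready

# Inspect which models were loaded from the repository and their current state
curl -s -X POST localhost:8000/v2/repository/index | python3 -m json.tool
```

If the readiness probe does not return 200, check the Triton server output for model-loading errors before proceeding.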
> **Note:** For new deployments, we recommend using [EAGLE-3](#eagle-3) rather than MEDUSA, since EAGLE-3 is fully supported on the LLM API / PyTorch backend and achieves higher draft accuracy.
>
> If you specifically need MEDUSA with the TRT engine backend, refer to the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend#tensorrt-engine-backend) for the legacy instructions.

## Draft Model-Based Speculative Decoding