From 31efda7b245fc721c5f27b701bc6c684c552ae19 Mon Sep 17 00:00:00 2001 From: Faradawn Yang <73060648+faradawn@users.noreply.github.com> Date: Fri, 10 Apr 2026 16:50:28 +0000 Subject: [PATCH 1/4] docs: update speculative decoding guide to use LLM API / PyTorch backend Replace the legacy TRT engine backend approach (trtllm-build, inflight_batcher_llm, fill_template.py) with the modern LLM API / PyTorch backend workflow. Update EAGLE section to use EAGLE 3 with Llama-3.1-8B-Instruct, add deprecation notice for MEDUSA (unsupported on PyTorch backend), and update Draft Model section to use DraftTargetDecodingConfig via model.yaml. Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com> --- .../Speculative_Decoding/TRT-LLM/README.md | 382 +++++------------- 1 file changed, 95 insertions(+), 287 deletions(-) diff --git a/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md b/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md index 768867cf..d2d4f2bd 100644 --- a/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md +++ b/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md @@ -29,144 +29,86 @@ # Speculative Decoding with TensorRT-LLM - [About Speculative Decoding](#about-speculative-decoding) -- [EAGLE](#eagle) +- [EAGLE 3](#eagle-3) - [MEDUSA](#medusa) - [Draft Model-Based Speculative Decoding](#draft-model-based-speculative-decoding) ## About Speculative Decoding -This tutorial shows how to build and serve speculative decoding models in Triton Inference Server with [TensorRT-LLM Backend](https://github.com/triton-inference-server/tensorrtllm_backend) on a single node with one GPU. Please go to [Speculative Decoding](../README.md) main page to learn more about other supported backends. +This tutorial shows how to build and serve speculative decoding models in Triton Inference Server with [TensorRT-LLM LLM API / PyTorch backend](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llmapi.md) on a single node with one GPU. Please go to [Speculative Decoding](../README.md) main page to learn more about other supported backends. -According to [Spec-Bench](https://sites.google.com/view/spec-bench), EAGLE is currently the top-performing approach for speeding up LLM inference across different tasks. -In this tutorial, we'll focus on [EAGLE](#eagle) and demonstrate how to make it work with Triton Inference Server. However, we'll also cover [MEDUSA](#medusa) and [Draft Model-Based Speculative Decoding](#draft-model-based-speculative-decoding) for those interested in exploring alternative methods. This way, you can choose the best fit for your needs. +> **Note:** This tutorial uses the modern **LLM API / PyTorch backend**, which works directly with HuggingFace model checkpoints and does not require building TensorRT engines. If you are looking for the legacy TRT engine-based approach, see the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/README-engine-backend-archive.md). -## EAGLE +According to [Spec-Bench](https://sites.google.com/view/spec-bench), EAGLE is currently the top-performing approach for speeding up LLM inference across different tasks. +In this tutorial, we'll focus on [EAGLE 3](#eagle-3) and demonstrate how to make it work with Triton Inference Server. However, we'll also cover [Draft Model-Based Speculative Decoding](#draft-model-based-speculative-decoding) for those interested in exploring alternative methods. This way, you can choose the best fit for your needs. 
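Before moving into the EAGLE-specific setup, it may help to see the draft-and-verify loop that all of these approaches share. The sketch below is an illustrative, self-contained toy in plain Python: `draft_next` and `target_next` are invented stand-ins rather than TensorRT-LLM or Triton APIs, but the accept/reject logic mirrors the generic greedy speculative decoding scheme described above.

```python
# Toy illustration of speculative decoding (draft, verify, accept).
# draft_next / target_next are hypothetical stand-ins, not TensorRT-LLM APIs.

def target_next(tokens):
    # Pretend "large" target model: next token is a deterministic function of the context.
    return sum(tokens) % 100

def draft_next(tokens):
    # Pretend "small" draft model: usually agrees with the target, but drifts sometimes,
    # so some drafted tokens get rejected during verification.
    return sum(tokens) % 100 if len(tokens) % 7 else (sum(tokens) + 1) % 100

def speculative_decode(prompt, max_new_tokens=16, draft_len=3):
    tokens = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # 1. Draft: the cheap model proposes `draft_len` tokens autoregressively.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # 2. Verify: the target model predicts a token at every draft position,
        #    plus one position after the last draft token (draft_len + 1 predictions).
        verify = [target_next(tokens + draft[:i]) for i in range(draft_len + 1)]
        # 3. Accept the longest agreeing prefix, then take one guaranteed token from
        #    the target, so each iteration emits between 1 and draft_len + 1 tokens.
        n_accept = 0
        while n_accept < draft_len and draft[n_accept] == verify[n_accept]:
            n_accept += 1
        new_tokens = draft[:n_accept] + [verify[n_accept]]
        tokens += new_tokens
        produced += len(new_tokens)
    return tokens

print(speculative_decode([1, 2, 3]))
```

In a real backend the verification step is a single batched forward pass of the target model, which is where the latency win over plain autoregressive decoding comes from.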
-EAGLE ([paper](https://arxiv.org/pdf/2401.15077) | [github](https://github.com/SafeAILab/EAGLE) | [blog](https://sites.google.com/view/eagle-llm)) is a speculative decoding technique that accelerates Large Language Model (LLM) inference by predicting future tokens based on contextual features extracted from the LLM's second-top layer. It employs a lightweight Auto-regression Head to predict the next feature vector, which is then used to generate tokens through the LLM's frozen classification head, achieving significant speedups (2x-3x faster than vanilla decoding) while maintaining output quality and distribution consistency. EAGLE-2, an improved version, further enhances performance by using confidence scores from the draft model to dynamically adjust the draft tree structure, resulting in even faster inference speeds. +## EAGLE 3 -*NOTE: EAGLE-2 is not supported via Triton Inference Server using TensorRT-LLM backend yet.* +EAGLE-3 ([paper](https://arxiv.org/pdf/2503.01840) | [github](https://github.com/SafeAILab/EAGLE) | [blog](https://sites.google.com/view/eagle-llm)) is the latest generation of the EAGLE speculative decoding technique that accelerates Large Language Model (LLM) inference by predicting future tokens based on contextual features. It employs a lightweight draft head to predict the next feature vector, which is then used to generate tokens through the LLM's frozen classification head, achieving significant speedups (2x-3x faster than vanilla decoding) while maintaining output quality. Compared to EAGLE (v1/v2), EAGLE 3 further improves acceptance rates through training-time test enhancements. -### Acquiring EAGLE Model and its Base Model +### Download the Target and Draft Models (Optional) -In this example, we will be using the [EAGLE-Vicuna-7B-v1.3](https://huggingface.co/yuhuili/EAGLE-Vicuna-7B-v1.3) model. -More types of EAGLE models can be found [here](https://huggingface.co/yuhuili). The base model [Vicuna-7B-v1.3](https://huggingface.co/lmsys/vicuna-7b-v1.3) is also needed for EAGLE to work. +In this example, we use [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) as the target model and [yuhuili/EAGLE3-LLaMA3.1-Instruct-8B](https://huggingface.co/yuhuili/EAGLE3-LLaMA3.1-Instruct-8B) as the draft model. Both models can be auto-downloaded from HuggingFace at server startup by mounting your HuggingFace cache into the container. Alternatively, you can pre-download them: -To download both models, run the following command: ```bash -# Install git-lfs if needed -apt-get update && apt-get install git-lfs -y --no-install-recommends -git lfs install -git clone https://huggingface.co/lmsys/vicuna-7b-v1.3 -git clone https://huggingface.co/yuhuili/EAGLE-Vicuna-7B-v1.3 +# Authenticate first if needed (Llama-3.1 requires accepting the license on HuggingFace) +huggingface-cli login +huggingface-cli download meta-llama/Llama-3.1-8B-Instruct +huggingface-cli download yuhuili/EAGLE3-LLaMA3.1-Instruct-8B ``` -### Launch Triton TensorRT-LLM container +More EAGLE 3 compatible draft model checkpoints can be found in the [Speculative Decoding Modules](https://huggingface.co/collections/nvidia/speculative-decoding-modules) collection from NVIDIA. + +### Launch Triton TensorRT-LLM Container -Launch Triton docker container with TensorRT-LLM backend. -Note that we're mounting the downloaded EAGLE and base models to `/hf-models` in the docker container. -Make an `engines` folder outside docker to reuse engines for future runs. 
-Please, make sure to replace with the version of Triton that you want -to use (must be >= 25.01). The latest Triton Server container is recommended and can be found [here](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags). +Launch the Triton container with TensorRT-LLM backend. Mount your HuggingFace cache so models can be auto-downloaded. Replace `` with the version of Triton you want to use (must be >= 25.01). The latest Triton Server container is recommended and can be found [here](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags). ```bash docker run --rm -it --net host --shm-size=2g \ --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \ - -v :/hf-models \ - -v :/engines \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ nvcr.io/nvidia/tritonserver:-trtllm-python-py3 ``` -### Create Engines for Each Model [skip this step if you already have an engine] +### Prepare the Model Repository -TensorRT-LLM requires each model to be compiled for the configuration -you need before running. To do so, before you run your model for the first time -on Triton Server you will need to create a TensorRT-LLM engine. - -Starting with [24.04 release](https://github.com/triton-inference-server/server/releases/tag/v2.45.0), -Triton Server TensrRT-LLM container comes with -pre-installed TensorRT-LLM package, which allows users to build engines inside -the Triton container. Simply follow the next steps in the container: +Copy the LLM API model template inside the container: ```bash -BASE_MODEL=/hf-models/vicuna-7b-v1.3 -EAGLE_MODEL=/hf-models/EAGLE-Vicuna-7B-v1.3 -CKPT_PATH=/tmp/ckpt/vicuna/7b/ -ENGINE_DIR=/engines/eagle-vicuna-7b/1-gpu/ -CONVERT_CHKPT_SCRIPT=/app/examples/eagle/convert_checkpoint.py -python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${BASE_MODEL} \ - --eagle_model_dir ${EAGLE_MODEL} \ - --output_dir ${CKPT_PATH} \ - --dtype float16 \ - --max_draft_len 63 \ - --num_eagle_layers 4 \ - --max_non_leaves_per_layer 10 -trtllm-build --checkpoint_dir ${CKPT_PATH} \ - --output_dir ${ENGINE_DIR} \ - --gemm_plugin float16 \ - --use_paged_context_fmha enable \ - --speculative_decoding_mode eagle \ - --max_batch_size 4 +cp -R /app/all_models/llmapi/ /opt/tritonserver/llmapi_repo/ ``` -To verify that the engine is built correctly, run the following command: -```bash -python3 /app/examples/run.py --engine_dir ${ENGINE_DIR} \ - --tokenizer_dir ${BASE_MODEL} \ - --max_output_len=100 \ - --input_text "Once upon" -``` -Sample output: -``` -> Input [Text 0]: " Once upon" -> Output [Text 0 Beam 0]: "a time, there was a young girl who loved to read. She would spend hours in the library, devouring books of all genres. She had a special love for fairy tales, and would often dream of living in a magical world where she could meet princes and princesses, and have adventures with talking animals. -> One day, while she was reading a book, she came across a passage that spoke to her heart. It said, "You are the author of" -> [TensorRT-LLM][INFO] Refreshed the MPI local session -``` +Edit `/opt/tritonserver/llmapi_repo/tensorrt_llm/1/model.yaml` to configure EAGLE 3: -### Serving with Triton +```yaml +model: meta-llama/Llama-3.1-8B-Instruct +backend: pytorch -The last step is to create a Triton readable model and serve it. You can find a template of a model that uses inflight batching in -[tensorrtllm_backend/all_models/inflight_batcher_llm](https://github.com/NVIDIA/TensorRT-LLM/tree/main/triton_backend/all_models/inflight_batcher_llm). 
To run EAGLE model, you will need to: +tensor_parallel_size: 1 +pipeline_parallel_size: 1 -1. Copy over the inflight batcher models repository -```bash -cp -R /app/all_models/inflight_batcher_llm /opt/tritonserver/. -``` +speculative_config: + decoding_type: Eagle3 + max_draft_len: 3 + speculative_model: yuhuili/EAGLE3-LLaMA3.1-Instruct-8B -2. Modify config.pbtxt for the preprocessing, postprocessing and processing steps. - -```bash -TOKENIZER_DIR=/hf-models/vicuna-7b-v1.3 -TOKENIZER_TYPE=auto -ENGINE_DIR=/engines/eagle-vicuna-7b/1-gpu/ -DECOUPLED_MODE=false -MODEL_FOLDER=/opt/tritonserver/inflight_batcher_llm -MAX_BATCH_SIZE=4 -INSTANCE_COUNT=1 -MAX_QUEUE_DELAY_MS=10000 -TRITON_BACKEND=tensorrtllm -LOGITS_DATATYPE="TYPE_FP32" -FILL_TEMPLATE_SCRIPT=/app/tools/fill_template.py -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE} +triton_config: + max_batch_size: 0 + decoupled: False ``` -*NOTE: you can specify `eagle_choices` by manually changing tensorrt_llm/config.pbtxt. If you do not specify any choices, the default, [mc_sim_7b_63](https://github.com/FasterDecoding/Medusa/blob/main/medusa/model/medusa_choices.py#L1) choices are used. For more information regarding choices tree, refer to [Medusa Tree](https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html#medusa-tree).* +*NOTE: You can also specify a local filesystem path for `model` and `speculative_model` if you have pre-downloaded the models.* -3. Launch Tritonserver +### Serving with Triton -Launch Tritonserver with the [launch_triton_server.py](https://github.com/triton-inference-server/tensorrtllm_backend/blob/release/0.5.0/scripts/launch_triton_server.py) script. Here, we launch a single instance of `tritonserver` with MPI by setting `--world_size=1`. +Launch Triton Server with the [launch_triton_server.py](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/scripts/launch_triton_server.py) script: ```bash -python3 /app/scripts/launch_triton_server.py --world_size=1 --model_repo=/opt/tritonserver/inflight_batcher_llm +python3 /app/scripts/launch_triton_server.py --model_repo=/opt/tritonserver/llmapi_repo/ ``` -> You should expect the following response: +> You should expect the following response once the server is ready: > ``` -> ... 
> I0503 22:01:25.210518 1175 grpc_server.cc:2463] Started GRPCInferenceService at 0.0.0.0:8001
> I0503 22:01:25.211612 1175 http_server.cc:4692] Started HTTPService at 0.0.0.0:8000
> I0503 22:01:25.254914 1175 http_server.cc:362] Started Metrics Service at 0.0.0.0:8002
> ```

@@ -176,53 +118,42 @@ To stop Triton Server inside the container, run:
```bash
pkill tritonserver
```
-*NOTE: do not forget to run above command to stop Triton Server if launch Tritionserver failed due to various reasons. Otherwise, it could cause OOM or MPI issues.*
+*NOTE: if Triton Server fails to launch for any reason, be sure to run the above command before retrying; leftover server processes can otherwise cause OOM or MPI issues.*

### Send Inference Requests

-You can test the results of the run with:
-1. The [inflight_batcher_llm_client.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/triton_backend/inflight_batcher_llm/client/inflight_batcher_llm_client.py) script. Run below in another terminal:
+You can test the results of the run with the [generate endpoint](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/protocol/extension_generate.html):

```bash
-# Using the SDK container as an example. is the version of Triton Server you are using.
-docker run --rm -it --net host --shm-size=2g \
-    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-    -v :/hf-models \
-    nvcr.io/nvidia/tritonserver:-py3-sdk
-# Install extra dependencies for the script
-pip3 install transformers sentencepiece
-python3 /tensorrtllm_client/inflight_batcher_llm_client.py --request-output-len 50 --tokenizer-dir /hf-models/vicuna-7b-v1.3 --text "What is ML?"
+curl -X POST localhost:8000/v2/models/tensorrt_llm/generate \
+  -d '{"text_input": "What is ML?", "sampling_param_max_tokens": 50}'
```
+
> You should expect the following response:
-> ```
-> ...
-> Input: What is ML?
-> Output beam 0:
-> ML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation.
-> ...
+> ```json
+> {"model_name":"tensorrt_llm","model_version":"1","text_output":"What is ML?\nML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. 
It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation."} -> ``` + +This adds fields like `acceptance_rate`, `total_accepted_draft_tokens`, and `total_draft_tokens` to the response, which are useful for evaluating speculative decoding effectiveness. ### Evaluating Performance with Gen-AI Perf Gen-AI Perf is a command line tool for measuring the throughput and latency of generative AI models as served through an inference server. -You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html). We will use Gen-AI Perf to evaluate the performance gain of EAGLE model over the base model. +You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html). We will use Gen-AI Perf to evaluate the performance gain of EAGLE 3 over the base model. *NOTE: below experiment is done on a single node with one GPU - RTX 5880 (48GB GPU memory). The number below is only for reference. The actual number may vary due to the different hardware and environment.* 1. Prepare Dataset -We will be using the HumanEval dataset for our evaluation, which is used in the original EAGLE paper. The HumanEval dataset has been converted to the format required by EAGLE and is available [here](https://github.com/SafeAILab/EAGLE/blob/main/eagle/data/humaneval/question.jsonl). To make it compatible for Gen-AI Perf, we need to do another conversion. You may use other datasets besides HumanEval as well, as long as it could be converted to the -format required by Gen-AI Perf. Note that MT-bench could not be used since Gen-AI Perf does not support multiturn dataset as input yet. Follow the steps below to download and convert the dataset. +We will be using the HumanEval dataset for our evaluation, which is used in the original EAGLE paper. The HumanEval dataset has been converted to the format required by EAGLE and is available [here](https://github.com/SafeAILab/EAGLE/blob/main/eagle/data/humaneval/question.jsonl). To make it compatible for Gen-AI Perf, we need to do another conversion. You may use other datasets besides HumanEval as well, as long as it could be converted to the format required by Gen-AI Perf. Note that MT-bench could not be used since Gen-AI Perf does not support multiturn dataset as input yet. Follow the steps below to download and convert the dataset. + ```bash wget https://raw.githubusercontent.com/SafeAILab/EAGLE/main/eagle/data/humaneval/question.jsonl @@ -243,203 +174,80 @@ Run the following command in the SDK container: ```bash genai-perf \ profile \ - -m ensemble \ + -m tensorrt_llm \ --service-kind triton \ --backend tensorrtllm \ --input-file /path/to/converted/dataset/converted_humaneval.jsonl \ - --tokenizer /path/to/hf-models/vicuna-7b-v1.3/ \ + --tokenizer meta-llama/Llama-3.1-8B-Instruct \ --profile-export-file my_profile_export.json \ --url localhost:8001 \ --concurrency 1 ``` *NOTE: When benchmarking the speedup of speculative decoding versus the base model, use `--concurrency 1`. This setting is crucial because speculative decoding is designed to trade extra computation for reduced token generation latency. By limiting concurrency, we avoid saturating hardware resources with multiple requests, allowing for a more accurate assessment of the technique's latency benefits. 
This approach ensures that the benchmark reflects the true performance gains of speculative decoding in real-world, low-concurrency scenarios.* -A sample output that looks like this: -``` - NVIDIA GenAI-Perf | LLM Metrics -┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓ -┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ -┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩ -│ Request Latency (ms) │ 1,355.35 │ 387.84 │ 2,002.81 │ 2,000.44 │ 1,868.83 │ 1,756.85 │ -│ Output Sequence Length (tokens) │ 348.27 │ 153.00 │ 534.00 │ 517.25 │ 444.50 │ 426.75 │ -│ Input Sequence Length (tokens) │ 156.54 │ 63.00 │ 278.00 │ 265.75 │ 203.00 │ 185.75 │ -│ Output Token Throughput (per sec) │ 256.94 │ N/A │ N/A │ N/A │ N/A │ N/A │ -│ Request Throughput (per sec) │ 0.74 │ N/A │ N/A │ N/A │ N/A │ N/A │ -│ Request Count (count) │ 26.00 │ N/A │ N/A │ N/A │ N/A │ N/A │ -└───────────────────────────────────┴──────────┴────────┴──────────┴──────────┴──────────┴──────────┘ -``` - 4. Run Gen-AI Perf on Base Model -To compare performance between EAGLE and base model (i.e. vanilla LLM w/o speculative decoding), we need to run Gen-AI Perf Tool on the base model as well. To do so, we need to repeat the steps above for the base model with minor changes. +To compare performance between EAGLE 3 and the base model (i.e. vanilla LLM without speculative decoding), restart Triton Server with a `model.yaml` that omits the `speculative_config` block: -Kill the existing Triton Server and run the following command in the Triton Server container: -```bash -pkill tritonserver -``` +```yaml +model: meta-llama/Llama-3.1-8B-Instruct +backend: pytorch -Build the TRT-LLM engine for the base model: -```bash -BASE_MODEL=/hf-models/vicuna-7b-v1.3 -CKPT_PATH=/tmp/ckpt/vicuna-base/7b/ -ENGINE_DIR=/engines/vicuna-7b/1-gpu/ -CONVERT_CHKPT_SCRIPT=/app/examples/llama/convert_checkpoint.py -python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${BASE_MODEL} \ - --output_dir ${CKPT_PATH} \ - --dtype float16 -trtllm-build --checkpoint_dir ${CKPT_PATH} \ - --output_dir ${ENGINE_DIR} \ - --remove_input_padding enable \ - --gpt_attention_plugin float16 \ - --context_fmha enable \ - --gemm_plugin float16 \ - --paged_kv_cache enable \ - --max_batch_size 4 -``` +tensor_parallel_size: 1 +pipeline_parallel_size: 1 -Create a Triton readable model for the base model: -```bash -mkdir -p /opt/tritonserver/vicuna_base -cp -R /app/all_models/inflight_batcher_llm /opt/tritonserver/vicuna_base/. 
- -TOKENIZER_DIR=/hf-models/vicuna-7b-v1.3 -TOKENIZER_TYPE=auto -ENGINE_DIR=/engines/vicuna-7b/1-gpu/ -DECOUPLED_MODE=false -MODEL_FOLDER=/opt/tritonserver/vicuna_base/inflight_batcher_llm -MAX_BATCH_SIZE=4 -INSTANCE_COUNT=1 -MAX_QUEUE_DELAY_MS=10000 -TRITON_BACKEND=tensorrtllm -LOGITS_DATATYPE="TYPE_FP32" -FILL_TEMPLATE_SCRIPT=/app/tools/fill_template.py -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE} +triton_config: + max_batch_size: 0 + decoupled: False ``` -Launch Triton Server with the base model: -```bash -python3 /app/scripts/launch_triton_server.py --world_size=1 --model_repo=/opt/tritonserver/vicuna_base/inflight_batcher_llm -``` - -Run Gen-AI Perf Tool on Base Model: -```bash -genai-perf \ - profile \ - -m ensemble \ - --service-kind triton \ - --backend tensorrtllm \ - --input-file /path/to/converted/dataset/converted_humaneval.jsonl \ - --tokenizer /path/to/hf-models/vicuna-7b-v1.3/ \ - --profile-export-file my_profile_export.json \ - --url localhost:8001 \ - --concurrency 1 -``` - -Sample performance output for base model: -``` - NVIDIA GenAI-Perf | LLM Metrics -┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓ -┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ -┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩ -│ Request Latency (ms) │ 2,663.13 │ 1,017.15 │ 4,197.72 │ 4,186.59 │ 4,096.25 │ 4,090.93 │ -│ Output Sequence Length (tokens) │ 310.75 │ 153.00 │ 441.00 │ 440.12 │ 431.70 │ 415.50 │ -│ Input Sequence Length (tokens) │ 145.67 │ 63.00 │ 195.00 │ 194.12 │ 186.90 │ 185.25 │ -│ Output Token Throughput (per sec) │ 116.68 │ N/A │ N/A │ N/A │ N/A │ N/A │ -│ Request Throughput (per sec) │ 0.38 │ N/A │ N/A │ N/A │ N/A │ N/A │ -│ Request Count (count) │ 12.00 │ N/A │ N/A │ N/A │ N/A │ N/A │ -└───────────────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘ -``` +Then re-run the Gen-AI Perf command above. 5. Compare Performance -From the sample runs above, we can see that the EAGLE model has a lower latency and higher throughput than the base model. Specifically, the EAGLE model can generate 256.94 tokens per second, while the base model can only generate 116.68 tokens per second with a speed up of 2.2x. 
+From sample runs, EAGLE 3 typically delivers 2x or greater token throughput improvement over the base model at low concurrency. The exact speedup varies by hardware, model, and dataset. As stated above, the number above is gathered from a single node with one GPU - RTX 5880 (48GB GPU memory). The actual number may vary due to the different hardware and environment. -## Medusa +## MEDUSA -MEDUSA ([paper](https://arxiv.org/pdf/2401.10774) | [github](https://github.com/FasterDecoding/Medusa) | [blog](https://sites.google.com/view/medusa-llm)) is a speculative decoding framework that, like EAGLE, aims to accelerate LLM inference. However, there are several key differences between the two approaches: +> **Important:** MEDUSA is **not supported** in the modern LLM API / PyTorch backend. It only works with the legacy TRT engine backend. +> +> For new deployments, we recommend using [EAGLE 3](#eagle-3) instead, which is fully supported on the LLM API / PyTorch backend and achieves higher draft accuracy. +> +> If you specifically need MEDUSA with the TRT engine backend, refer to the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/README-engine-backend-archive.md) for the legacy instructions. - - Architecture: MEDUSA adds extra decoding heads to LLMs to predict multiple subsequent tokens in parallel, while EAGLE extrapolates second-top-layer contextual feature vectors of LLMs. - - - Generation structure: MEDUSA generates a fully connected tree across adjacent layers through the Cartesian product, often resulting in nonsensical combinations. In contrast, EAGLE creates a sparser, more selective tree structure that is more context-aware1. - - - Consistency: MEDUSA's non-greedy generation does not guarantee lossless performance, while EAGLE provably maintains consistency with vanilla decoding in the distribution of generated texts. - - - Accuracy: MEDUSA achieves an accuracy of about 0.6 in generating drafts, whereas EAGLE attains a higher accuracy of approximately 0.8 as claimed in the EAGLE paper. - - - Speed: EAGLE is reported to be 1.6x faster than MEDUSA for certained models as claimed in the EAGLE paper. - -To run MEDUSA with Triton Inference Server, it is very similar to the steps above for EAGLE with only a few simple configuration changes. We only list the changes below. The rest steps not listed below are the same as the steps for EAGLE above, e.g. launch docker, launch triton server, send requests, evalaution. - -### Download the MEDUSA model - -We will be using [medusa-vicuna-7b-v1.3](https://huggingface.co/FasterDecoding/medusa-vicuna-7b-v1.3), same model family as what we used for EAGLE above: +## Draft Model-Based Speculative Decoding -```bash -git clone https://huggingface.co/FasterDecoding/medusa-vicuna-7b-v1.3 -``` +Draft Model-Based Speculative Decoding ([paper](https://arxiv.org/pdf/2302.01318)) is another approach to accelerate LLM inference that uses a smaller, faster LLM as a draft model to predict multiple tokens ahead. This approach is distinct from EAGLE 3 and is supported in the modern LLM API / PyTorch backend. 
Here are the key differences compared to EAGLE 3: -### Build the TRT-LLM engine for MEDUSA: -```bash -BASE_MODEL=/hf-models/vicuna-7b-v1.3 -MEDUSA_MODEL=/hf-models/medusa-vicuna-7b-v1.3 -CKPT_PATH=/tmp/ckpt/vicuna-medusa/7b/ -ENGINE_DIR=/engines/medusa-vicuna-7b/1-gpu/ -CONVERT_CHKPT_SCRIPT=/app/examples/medusa/convert_checkpoint.py -python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${BASE_MODEL} \ - --medusa_model_dir ${MEDUSA_MODEL} \ - --output_dir ${CKPT_PATH} \ - --dtype float16 \ - --num_medusa_heads 4 -trtllm-build --checkpoint_dir ${CKPT_PATH} \ - --output_dir ${ENGINE_DIR} \ - --gemm_plugin float16 \ - --speculative_decoding_mode medusa \ - --max_batch_size 4 -``` + - Draft Generation: it uses a separate, independent LLM as a draft model to predict multiple tokens ahead. This contrasts with EAGLE 3's feature-level extrapolation using a lightweight draft head embedded into the target model. -### Create a Triton readable model for MEDUSA: -```bash -mkdir -p /opt/tritonserver/vicuna_medusa -cp -R /app/all_models/inflight_batcher_llm /opt/tritonserver/vicuna_medusa/. - -TOKENIZER_DIR=/hf-models/vicuna-7b-v1.3 -TOKENIZER_TYPE=auto -ENGINE_DIR=/engines/medusa-vicuna-7b/1-gpu/ -DECOUPLED_MODE=false -MODEL_FOLDER=/opt/tritonserver/vicuna_medusa/inflight_batcher_llm -MAX_BATCH_SIZE=4 -INSTANCE_COUNT=1 -MAX_QUEUE_DELAY_MS=10000 -TRITON_BACKEND=tensorrtllm -LOGITS_DATATYPE="TYPE_FP32" -FILL_TEMPLATE_SCRIPT=/app/tools/fill_template.py -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE} -``` + - Verification Process: it employs a chain-like (linear) structure for draft generation and verification, unlike EAGLE 3 which uses tree-based attention mechanisms. -## Draft Model-Based Speculative Decoding + - Consistency: it maintains distribution consistency with the target LLM in both greedy and non-greedy settings, similar to EAGLE 3. -Draft Model-Based Speculative Decoding ([paper](https://arxiv.org/pdf/2302.01318)) is another (and earlier) approach to accelerate LLM inference, distinct from both EAGLE and MEDUSA. Here are the key differences: + - Efficiency: While effective, it is generally slower than EAGLE 3. - - Draft Generation: it uses a smaller, faster LLM as a draft model to predict multiple tokens ahead. This contrasts with EAGLE's feature-level extrapolation and MEDUSA's additional decoding heads. 
+ - Implementation: it requires a separate draft model that shares the same tokenizer as the target model. The draft model can be any HuggingFace-compatible LLM. - - Verification Process: it employs a chain-like structure for draft generation and verification, unlike EAGLE and MEDUSA which use tree-based attention mechanisms. +To use Draft Model-Based Speculative Decoding with Triton via the LLM API, follow the same container setup and model repository preparation steps as in the [EAGLE 3](#eagle-3) section above, but configure `model.yaml` as follows: - - Consistency: it maintains distribution consistency with the target LLM in both greedy and non-greedy settings, similar to EAGLE but different from MEDUSA. +```yaml +model: meta-llama/Llama-3.1-8B-Instruct +backend: pytorch - - Efficiency: While effective, it is generally slower than both EAGLE and MEDUSA. +tensor_parallel_size: 1 +pipeline_parallel_size: 1 - - Implementation: it requires a separate draft model, which can be challenging to implement effectively for smaller target models. EAGLE and MEDUSA, in contrast, modify the existing model architecture. +speculative_config: + decoding_type: Draft_Target + max_draft_len: 3 + speculative_model: /path/to/draft_model # Must share the same tokenizer as the target model - - Accuracy: its draft accuracy can vary depending on the draft model used, while EAGLE achieves a higher draft accuracy (about 0.8) compared to MEDUSA (about 0.6). +triton_config: + max_batch_size: 0 + decoupled: False +``` - Please follow the steps [here](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/advanced/speculative-decoding.md#draft-target-model) to run Draft Model-Based Speculative Decoding with Triton Inference Server. \ No newline at end of file +*NOTE: The draft and target models must be trained with the same tokenizer. If they are not compatible, the acceptance rate will be extremely low and performance will regress rather than improve.* From 6a55bdde04ac8e3d904affebd09c2663f15bd105 Mon Sep 17 00:00:00 2001 From: Faradawn Yang <73060648+faradawn@users.noreply.github.com> Date: Wed, 15 Apr 2026 23:32:27 +0000 Subject: [PATCH 2/4] fix: indent numbered list content by 3 spaces so nested paragraphs and code blocks render correctly in GFM Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com> --- .../Speculative_Decoding/TRT-LLM/README.md | 74 +++++++++---------- 1 file changed, 37 insertions(+), 37 deletions(-) diff --git a/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md b/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md index d2d4f2bd..915cb56a 100644 --- a/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md +++ b/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md @@ -152,62 +152,62 @@ You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/ 1. Prepare Dataset -We will be using the HumanEval dataset for our evaluation, which is used in the original EAGLE paper. The HumanEval dataset has been converted to the format required by EAGLE and is available [here](https://github.com/SafeAILab/EAGLE/blob/main/eagle/data/humaneval/question.jsonl). To make it compatible for Gen-AI Perf, we need to do another conversion. You may use other datasets besides HumanEval as well, as long as it could be converted to the format required by Gen-AI Perf. Note that MT-bench could not be used since Gen-AI Perf does not support multiturn dataset as input yet. Follow the steps below to download and convert the dataset. 
+ We will be using the HumanEval dataset for our evaluation, which is used in the original EAGLE paper. The HumanEval dataset has been converted to the format required by EAGLE and is available [here](https://github.com/SafeAILab/EAGLE/blob/main/eagle/data/humaneval/question.jsonl). To make it compatible for Gen-AI Perf, we need to do another conversion. You may use other datasets besides HumanEval as well, as long as it could be converted to the format required by Gen-AI Perf. Note that MT-bench could not be used since Gen-AI Perf does not support multiturn dataset as input yet. Follow the steps below to download and convert the dataset. -```bash -wget https://raw.githubusercontent.com/SafeAILab/EAGLE/main/eagle/data/humaneval/question.jsonl + ```bash + wget https://raw.githubusercontent.com/SafeAILab/EAGLE/main/eagle/data/humaneval/question.jsonl -# dataset-converter.py file can be found in the parent folder of this README. -python3 dataset-converter.py --input_file question.jsonl --output_file converted_humaneval.jsonl -``` + # dataset-converter.py file can be found in the parent folder of this README. + python3 dataset-converter.py --input_file question.jsonl --output_file converted_humaneval.jsonl + ``` 2. Install GenAI-Perf (Ubuntu 24.04, Python 3.10+) -```bash -pip install genai-perf -``` -*NOTE: you must already have CUDA 12 installed.* + ```bash + pip install genai-perf + ``` + *NOTE: you must already have CUDA 12 installed.* 3. Run Gen-AI Perf -Run the following command in the SDK container: -```bash -genai-perf \ - profile \ - -m tensorrt_llm \ - --service-kind triton \ - --backend tensorrtllm \ - --input-file /path/to/converted/dataset/converted_humaneval.jsonl \ - --tokenizer meta-llama/Llama-3.1-8B-Instruct \ - --profile-export-file my_profile_export.json \ - --url localhost:8001 \ - --concurrency 1 -``` -*NOTE: When benchmarking the speedup of speculative decoding versus the base model, use `--concurrency 1`. This setting is crucial because speculative decoding is designed to trade extra computation for reduced token generation latency. By limiting concurrency, we avoid saturating hardware resources with multiple requests, allowing for a more accurate assessment of the technique's latency benefits. This approach ensures that the benchmark reflects the true performance gains of speculative decoding in real-world, low-concurrency scenarios.* + Run the following command in the SDK container: + ```bash + genai-perf \ + profile \ + -m tensorrt_llm \ + --service-kind triton \ + --backend tensorrtllm \ + --input-file /path/to/converted/dataset/converted_humaneval.jsonl \ + --tokenizer meta-llama/Llama-3.1-8B-Instruct \ + --profile-export-file my_profile_export.json \ + --url localhost:8001 \ + --concurrency 1 + ``` + *NOTE: When benchmarking the speedup of speculative decoding versus the base model, use `--concurrency 1`. This setting is crucial because speculative decoding is designed to trade extra computation for reduced token generation latency. By limiting concurrency, we avoid saturating hardware resources with multiple requests, allowing for a more accurate assessment of the technique's latency benefits. This approach ensures that the benchmark reflects the true performance gains of speculative decoding in real-world, low-concurrency scenarios.* 4. Run Gen-AI Perf on Base Model -To compare performance between EAGLE 3 and the base model (i.e. 
vanilla LLM without speculative decoding), restart Triton Server with a `model.yaml` that omits the `speculative_config` block: + To compare performance between EAGLE 3 and the base model (i.e. vanilla LLM without speculative decoding), restart Triton Server with a `model.yaml` that omits the `speculative_config` block: -```yaml -model: meta-llama/Llama-3.1-8B-Instruct -backend: pytorch + ```yaml + model: meta-llama/Llama-3.1-8B-Instruct + backend: pytorch -tensor_parallel_size: 1 -pipeline_parallel_size: 1 + tensor_parallel_size: 1 + pipeline_parallel_size: 1 -triton_config: - max_batch_size: 0 - decoupled: False -``` + triton_config: + max_batch_size: 0 + decoupled: False + ``` -Then re-run the Gen-AI Perf command above. + Then re-run the Gen-AI Perf command above. 5. Compare Performance -From sample runs, EAGLE 3 typically delivers 2x or greater token throughput improvement over the base model at low concurrency. The exact speedup varies by hardware, model, and dataset. + From sample runs, EAGLE 3 typically delivers 2x or greater token throughput improvement over the base model at low concurrency. The exact speedup varies by hardware, model, and dataset. -As stated above, the number above is gathered from a single node with one GPU - RTX 5880 (48GB GPU memory). The actual number may vary due to the different hardware and environment. + As stated above, the number above is gathered from a single node with one GPU - RTX 5880 (48GB GPU memory). The actual number may vary due to the different hardware and environment. ## MEDUSA From 9f1886b63a089506cfc39968820eb4bf08271813 Mon Sep 17 00:00:00 2001 From: Faradawn Yang <73060648+faradawn@users.noreply.github.com> Date: Wed, 15 Apr 2026 23:39:46 +0000 Subject: [PATCH 3/4] fix: standardize EAGLE 3 spelling to EAGLE-3 throughout the guide Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com> --- .../Speculative_Decoding/TRT-LLM/README.md | 32 +++++++++---------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md b/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md index 915cb56a..cea770b5 100644 --- a/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md +++ b/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md @@ -29,7 +29,7 @@ # Speculative Decoding with TensorRT-LLM - [About Speculative Decoding](#about-speculative-decoding) -- [EAGLE 3](#eagle-3) +- [EAGLE-3](#eagle-3) - [MEDUSA](#medusa) - [Draft Model-Based Speculative Decoding](#draft-model-based-speculative-decoding) @@ -40,11 +40,11 @@ This tutorial shows how to build and serve speculative decoding models in Triton > **Note:** This tutorial uses the modern **LLM API / PyTorch backend**, which works directly with HuggingFace model checkpoints and does not require building TensorRT engines. If you are looking for the legacy TRT engine-based approach, see the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/README-engine-backend-archive.md). According to [Spec-Bench](https://sites.google.com/view/spec-bench), EAGLE is currently the top-performing approach for speeding up LLM inference across different tasks. -In this tutorial, we'll focus on [EAGLE 3](#eagle-3) and demonstrate how to make it work with Triton Inference Server. However, we'll also cover [Draft Model-Based Speculative Decoding](#draft-model-based-speculative-decoding) for those interested in exploring alternative methods. This way, you can choose the best fit for your needs. 
+In this tutorial, we'll focus on [EAGLE-3](#eagle-3) and demonstrate how to make it work with Triton Inference Server. However, we'll also cover [Draft Model-Based Speculative Decoding](#draft-model-based-speculative-decoding) for those interested in exploring alternative methods. This way, you can choose the best fit for your needs. -## EAGLE 3 +## EAGLE-3 -EAGLE-3 ([paper](https://arxiv.org/pdf/2503.01840) | [github](https://github.com/SafeAILab/EAGLE) | [blog](https://sites.google.com/view/eagle-llm)) is the latest generation of the EAGLE speculative decoding technique that accelerates Large Language Model (LLM) inference by predicting future tokens based on contextual features. It employs a lightweight draft head to predict the next feature vector, which is then used to generate tokens through the LLM's frozen classification head, achieving significant speedups (2x-3x faster than vanilla decoding) while maintaining output quality. Compared to EAGLE (v1/v2), EAGLE 3 further improves acceptance rates through training-time test enhancements. +EAGLE-3 ([paper](https://arxiv.org/pdf/2503.01840) | [github](https://github.com/SafeAILab/EAGLE) | [blog](https://sites.google.com/view/eagle-llm)) is the latest generation of the EAGLE speculative decoding technique that accelerates Large Language Model (LLM) inference by predicting future tokens based on contextual features. It employs a lightweight draft head to predict the next feature vector, which is then used to generate tokens through the LLM's frozen classification head, achieving significant speedups (2x-3x faster than vanilla decoding) while maintaining output quality. Compared to EAGLE (v1/v2), EAGLE-3 further improves acceptance rates through training-time test enhancements. ### Download the Target and Draft Models (Optional) @@ -57,7 +57,7 @@ huggingface-cli download meta-llama/Llama-3.1-8B-Instruct huggingface-cli download yuhuili/EAGLE3-LLaMA3.1-Instruct-8B ``` -More EAGLE 3 compatible draft model checkpoints can be found in the [Speculative Decoding Modules](https://huggingface.co/collections/nvidia/speculative-decoding-modules) collection from NVIDIA. +More EAGLE-3 compatible draft model checkpoints can be found in the [Speculative Decoding Modules](https://huggingface.co/collections/nvidia/speculative-decoding-modules) collection from NVIDIA. ### Launch Triton TensorRT-LLM Container @@ -78,7 +78,7 @@ Copy the LLM API model template inside the container: cp -R /app/all_models/llmapi/ /opt/tritonserver/llmapi_repo/ ``` -Edit `/opt/tritonserver/llmapi_repo/tensorrt_llm/1/model.yaml` to configure EAGLE 3: +Edit `/opt/tritonserver/llmapi_repo/tensorrt_llm/1/model.yaml` to configure EAGLE-3: ```yaml model: meta-llama/Llama-3.1-8B-Instruct @@ -146,7 +146,7 @@ This adds fields like `acceptance_rate`, `total_accepted_draft_tokens`, and `tot ### Evaluating Performance with Gen-AI Perf Gen-AI Perf is a command line tool for measuring the throughput and latency of generative AI models as served through an inference server. -You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html). We will use Gen-AI Perf to evaluate the performance gain of EAGLE 3 over the base model. +You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html). We will use Gen-AI Perf to evaluate the performance gain of EAGLE-3 over the base model. 
*NOTE: below experiment is done on a single node with one GPU - RTX 5880 (48GB GPU memory). The number below is only for reference. The actual number may vary due to the different hardware and environment.* @@ -187,7 +187,7 @@ You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/ 4. Run Gen-AI Perf on Base Model - To compare performance between EAGLE 3 and the base model (i.e. vanilla LLM without speculative decoding), restart Triton Server with a `model.yaml` that omits the `speculative_config` block: + To compare performance between EAGLE-3 and the base model (i.e. vanilla LLM without speculative decoding), restart Triton Server with a `model.yaml` that omits the `speculative_config` block: ```yaml model: meta-llama/Llama-3.1-8B-Instruct @@ -205,7 +205,7 @@ You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/ 5. Compare Performance - From sample runs, EAGLE 3 typically delivers 2x or greater token throughput improvement over the base model at low concurrency. The exact speedup varies by hardware, model, and dataset. + From sample runs, EAGLE-3 typically delivers 2x or greater token throughput improvement over the base model at low concurrency. The exact speedup varies by hardware, model, and dataset. As stated above, the number above is gathered from a single node with one GPU - RTX 5880 (48GB GPU memory). The actual number may vary due to the different hardware and environment. @@ -213,25 +213,25 @@ You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/ > **Important:** MEDUSA is **not supported** in the modern LLM API / PyTorch backend. It only works with the legacy TRT engine backend. > -> For new deployments, we recommend using [EAGLE 3](#eagle-3) instead, which is fully supported on the LLM API / PyTorch backend and achieves higher draft accuracy. +> For new deployments, we recommend using [EAGLE-3](#eagle-3) instead, which is fully supported on the LLM API / PyTorch backend and achieves higher draft accuracy. > > If you specifically need MEDUSA with the TRT engine backend, refer to the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/README-engine-backend-archive.md) for the legacy instructions. ## Draft Model-Based Speculative Decoding -Draft Model-Based Speculative Decoding ([paper](https://arxiv.org/pdf/2302.01318)) is another approach to accelerate LLM inference that uses a smaller, faster LLM as a draft model to predict multiple tokens ahead. This approach is distinct from EAGLE 3 and is supported in the modern LLM API / PyTorch backend. Here are the key differences compared to EAGLE 3: +Draft Model-Based Speculative Decoding ([paper](https://arxiv.org/pdf/2302.01318)) is another approach to accelerate LLM inference that uses a smaller, faster LLM as a draft model to predict multiple tokens ahead. This approach is distinct from EAGLE-3 and is supported in the modern LLM API / PyTorch backend. Here are the key differences compared to EAGLE-3: - - Draft Generation: it uses a separate, independent LLM as a draft model to predict multiple tokens ahead. This contrasts with EAGLE 3's feature-level extrapolation using a lightweight draft head embedded into the target model. + - Draft Generation: it uses a separate, independent LLM as a draft model to predict multiple tokens ahead. This contrasts with EAGLE-3's feature-level extrapolation using a lightweight draft head embedded into the target model. 
- - Verification Process: it employs a chain-like (linear) structure for draft generation and verification, unlike EAGLE 3 which uses tree-based attention mechanisms. + - Verification Process: it employs a chain-like (linear) structure for draft generation and verification, unlike EAGLE-3 which uses tree-based attention mechanisms. - - Consistency: it maintains distribution consistency with the target LLM in both greedy and non-greedy settings, similar to EAGLE 3. + - Consistency: it maintains distribution consistency with the target LLM in both greedy and non-greedy settings, similar to EAGLE-3. - - Efficiency: While effective, it is generally slower than EAGLE 3. + - Efficiency: While effective, it is generally slower than EAGLE-3. - Implementation: it requires a separate draft model that shares the same tokenizer as the target model. The draft model can be any HuggingFace-compatible LLM. -To use Draft Model-Based Speculative Decoding with Triton via the LLM API, follow the same container setup and model repository preparation steps as in the [EAGLE 3](#eagle-3) section above, but configure `model.yaml` as follows: +To use Draft Model-Based Speculative Decoding with Triton via the LLM API, follow the same container setup and model repository preparation steps as in the [EAGLE-3](#eagle-3) section above, but configure `model.yaml` as follows: ```yaml model: meta-llama/Llama-3.1-8B-Instruct From 47a5e035cf993e0520ff7ea978025d975fa3947b Mon Sep 17 00:00:00 2001 From: Faradawn Yang <73060648+faradawn@users.noreply.github.com> Date: Mon, 20 Apr 2026 16:51:08 +0000 Subject: [PATCH 4/4] docs: address whoisj and yinggeh review comments MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Standardize EAGLE-3 naming (hyphen) throughout — was inconsistent - Indent numbered list content 3 spaces so items render as a single list - Fix launch_triton_server.py URL: was tensorrtllm_backend/scripts (404), now NVIDIA/TensorRT-LLM/triton_backend/scripts (correct repo) - Fix engine backend archive links: point to tensorrtllm_backend#tensorrt-engine-backend instead of deleted archive file Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com> --- Feature_Guide/Speculative_Decoding/TRT-LLM/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md b/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md index cea770b5..92f90b69 100644 --- a/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md +++ b/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md @@ -37,7 +37,7 @@ This tutorial shows how to build and serve speculative decoding models in Triton Inference Server with [TensorRT-LLM LLM API / PyTorch backend](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llmapi.md) on a single node with one GPU. Please go to [Speculative Decoding](../README.md) main page to learn more about other supported backends. -> **Note:** This tutorial uses the modern **LLM API / PyTorch backend**, which works directly with HuggingFace model checkpoints and does not require building TensorRT engines. If you are looking for the legacy TRT engine-based approach, see the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/README-engine-backend-archive.md). 
+> **Note:** This tutorial uses the modern **LLM API / PyTorch backend**, which works directly with HuggingFace model checkpoints and does not require building TensorRT engines. If you are looking for the legacy TRT engine-based approach, see the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend#tensorrt-engine-backend). According to [Spec-Bench](https://sites.google.com/view/spec-bench), EAGLE is currently the top-performing approach for speeding up LLM inference across different tasks. In this tutorial, we'll focus on [EAGLE-3](#eagle-3) and demonstrate how to make it work with Triton Inference Server. However, we'll also cover [Draft Model-Based Speculative Decoding](#draft-model-based-speculative-decoding) for those interested in exploring alternative methods. This way, you can choose the best fit for your needs. @@ -101,7 +101,7 @@ triton_config: ### Serving with Triton -Launch Triton Server with the [launch_triton_server.py](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/scripts/launch_triton_server.py) script: +Launch Triton Server with the [launch_triton_server.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/triton_backend/scripts/launch_triton_server.py) script: ```bash python3 /app/scripts/launch_triton_server.py --model_repo=/opt/tritonserver/llmapi_repo/ @@ -215,7 +215,7 @@ You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/ > > For new deployments, we recommend using [EAGLE-3](#eagle-3) instead, which is fully supported on the LLM API / PyTorch backend and achieves higher draft accuracy. > -> If you specifically need MEDUSA with the TRT engine backend, refer to the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/README-engine-backend-archive.md) for the legacy instructions. +> If you specifically need MEDUSA with the TRT engine backend, refer to the [engine backend archive](https://github.com/triton-inference-server/tensorrtllm_backend#tensorrt-engine-backend) for the legacy instructions. ## Draft Model-Based Speculative Decoding