
Commit 6a55bdd

fix: indent numbered list content by 3 spaces so nested paragraphs and code blocks render correctly in GFM

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
1 parent: 31efda7

1 file changed: 37 additions & 37 deletions

File changed: Feature_Guide/Speculative_Decoding/TRT-LLM/README.md

@@ -152,62 +152,62 @@ You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/

1. Prepare Dataset

   We will be using the HumanEval dataset for our evaluation, which is used in the original EAGLE paper. The HumanEval dataset has been converted to the format required by EAGLE and is available [here](https://github.com/SafeAILab/EAGLE/blob/main/eagle/data/humaneval/question.jsonl). To make it compatible with Gen-AI Perf, we need to do one more conversion. You may use datasets other than HumanEval as well, as long as they can be converted to the format required by Gen-AI Perf. Note that MT-Bench cannot be used, since Gen-AI Perf does not yet support multi-turn datasets as input. Follow the steps below to download and convert the dataset.

   ```bash
   wget https://raw.githubusercontent.com/SafeAILab/EAGLE/main/eagle/data/humaneval/question.jsonl

   # dataset-converter.py can be found in the parent folder of this README.
   python3 dataset-converter.py --input_file question.jsonl --output_file converted_humaneval.jsonl
   ```
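
   For reference, here is a minimal sketch of the kind of per-record conversion `dataset-converter.py` performs. It assumes EAGLE's `question.jsonl` stores each prompt in a `turns` list and that Gen-AI Perf accepts single-turn JSONL records with a `text` field; the script in this repo is the authoritative implementation, so treat the field names here as assumptions.

   ```python
   import json

   # Hypothetical sketch of the conversion; the field names ("turns", "text")
   # are assumptions. See dataset-converter.py for the real logic.
   with open("question.jsonl") as src, open("converted_humaneval.jsonl", "w") as dst:
       for line in src:
           record = json.loads(line)
           # Keep only the first turn: Gen-AI Perf does not support multi-turn input.
           dst.write(json.dumps({"text": record["turns"][0]}) + "\n")
   ```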

2. Install GenAI-Perf (Ubuntu 24.04, Python 3.10+)

   ```bash
   pip install genai-perf
   ```

   *NOTE: you must already have CUDA 12 installed.*
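
   To confirm the installation, you can query the installed package's version from Python; this is a generic packaging check rather than a GenAI-Perf-specific API:

   ```python
   # Sanity check: raises PackageNotFoundError if genai-perf is not installed.
   from importlib.metadata import version

   print("genai-perf", version("genai-perf"))
   ```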

3. Run Gen-AI Perf

   Run the following command in the SDK container:

   ```bash
   genai-perf \
       profile \
       -m tensorrt_llm \
       --service-kind triton \
       --backend tensorrtllm \
       --input-file /path/to/converted/dataset/converted_humaneval.jsonl \
       --tokenizer meta-llama/Llama-3.1-8B-Instruct \
       --profile-export-file my_profile_export.json \
       --url localhost:8001 \
       --concurrency 1
   ```

   *NOTE: When benchmarking the speedup of speculative decoding versus the base model, use `--concurrency 1`. This setting is crucial because speculative decoding is designed to trade extra computation for reduced token generation latency. By limiting concurrency, we avoid saturating hardware resources with multiple requests, allowing for a more accurate assessment of the technique's latency benefits. This approach ensures that the benchmark reflects the true performance gains of speculative decoding in real-world, low-concurrency scenarios.*

4. Run Gen-AI Perf on Base Model

   To compare performance between EAGLE 3 and the base model (i.e., the vanilla LLM without speculative decoding), restart Triton Server with a `model.yaml` that omits the `speculative_config` block:

   ```yaml
   model: meta-llama/Llama-3.1-8B-Instruct
   backend: pytorch

   tensor_parallel_size: 1
   pipeline_parallel_size: 1

   triton_config:
     max_batch_size: 0
     decoupled: False
   ```

   Then re-run the Gen-AI Perf command above.

5. Compare Performance

   From sample runs, EAGLE 3 typically delivers a 2x or greater token throughput improvement over the base model at low concurrency. The exact speedup varies by hardware, model, and dataset.

   These numbers were gathered on a single node with one RTX 5880 GPU (48 GB of GPU memory); actual results may vary with different hardware and environments.
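
   If you want to compute the speedup programmatically instead of comparing the two reports by eye, a minimal sketch is shown below. It assumes each run's aggregate JSON export (the `*_genai_perf.json` file written alongside the raw profile export) contains an `output_token_throughput` metric with an `avg` field; the file names and schema here are assumptions, so verify them against the output of your genai-perf version.

   ```python
   import json

   # Hypothetical export file names from the EAGLE 3 and base-model runs;
   # the "output_token_throughput" -> "avg" schema is an assumption about
   # genai-perf's aggregate JSON export.
   def avg_throughput(path: str) -> float:
       with open(path) as f:
           return json.load(f)["output_token_throughput"]["avg"]

   eagle = avg_throughput("eagle_profile_export_genai_perf.json")
   base = avg_throughput("base_profile_export_genai_perf.json")
   print(f"EAGLE 3: {eagle:.1f} tok/s  base: {base:.1f} tok/s  speedup: {eagle / base:.2f}x")
   ```
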
## MEDUSA