Feature_Guide/Speculative_Decoding/README.md
may prove simpler than generating a summary for an article. Spec-Bench shows the performance of different speculative decoding approaches on different tasks.
## Speculative Decoding with Triton Inference Server

Triton Inference Server supports speculative decoding on different types of Triton backends. See [here](https://github.com/triton-inference-server/tensorrtllm_backend) for an explanation of what a Triton backend is.
- Follow [here](TRT-LLM/README.md) to learn how Triton Inference Server supports speculative decoding with the [TensorRT-LLM Backend](https://github.com/triton-inference-server/tensorrtllm_backend).
- Follow [here](vLLM/README.md) to learn how Triton Inference Server supports speculative decoding with the [vLLM Backend](https://github.com/triton-inference-server/vllm_backend).
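Whichever backend is used, the underlying control flow is the same draft-then-verify loop. The following is a minimal illustrative sketch, using toy deterministic integer "models" rather than real LLMs (this is not Triton, EAGLE, or vLLM code):

```python
def draft_next(ctx):
    # Cheap draft model: a deterministic toy over integer "tokens".
    return (ctx[-1] * 3) % 7

def target_next(ctx):
    # Expensive target model: agrees with the draft except at every
    # 5th position, so most draft proposals get accepted.
    t = (ctx[-1] * 3) % 7
    return t if len(ctx) % 5 else (t + 1) % 7

def greedy_decode(prompt, n_new):
    # Baseline: one target call per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def speculative_decode(prompt, n_new, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        # 1) The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2) The target model verifies the proposals. In a real system all k
        #    proposals are scored in ONE batched target forward pass, which is
        #    where the speedup comes from; here we call target_next per token
        #    for clarity. Accept until the first mismatch, then emit the
        #    target's own token instead.
        for t in draft:
            want = target_next(out)
            if t == want:
                out.append(t)
            else:
                out.append(want)
                break
    return out[: len(prompt) + n_new]
```

Because every appended token is exactly what the target model would have produced, the output matches plain greedy decoding with the target model; the draft only changes how many target calls are needed, never the result.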
In this example, we will be using the [EAGLE-Vicuna-7B-v1.3](https://huggingface.co/yuhuili/EAGLE-Vicuna-7B-v1.3) model. More types of EAGLE models can be found [here](https://huggingface.co/yuhuili). The base model [Vicuna-7B-v1.3](https://huggingface.co/lmsys/vicuna-7b-v1.3) is also needed for EAGLE to work.

To download both models, run the following command:
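The command itself is elided in this excerpt; a plausible sketch using the Hugging Face CLI would look like the following, where the `hf-models/...` target directories are illustrative assumptions, not paths taken from the guide:

```shell
# Hypothetical sketch -- not the guide's exact command.
pip install -U "huggingface_hub[cli]"
huggingface-cli download yuhuili/EAGLE-Vicuna-7B-v1.3 --local-dir hf-models/EAGLE-Vicuna-7B-v1.3
huggingface-cli download lmsys/vicuna-7b-v1.3 --local-dir hf-models/vicuna-7b-v1.3
```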
Note that we're mounting the downloaded EAGLE and base models to `/hf-models` in the docker container. Make an `engines` folder outside docker to reuse engines for future runs. Please make sure to replace <xx.yy> with the version of Triton that you want to use (must be >= 25.01). The latest Triton Server container is recommended and can be found [here](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags).
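For instance, substituting 25.01 for `<xx.yy>` might give the following; the `-trtllm-python-py3` image variant is an assumption about which container this guide uses:

```shell
# Hypothetical substitution example for <xx.yy> = 25.01.
docker pull nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
```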
```bash
docker run --rm -it --net host --shm-size=2g \
```
format required by Gen-AI Perf. Note that MT-bench could not be used since Gen-A
In this example, we will be using the [EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model. More types of EAGLE models can be found [here](https://huggingface.co/yuhuili). The base model [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) is also needed for EAGLE to work.

To download both models, run the following command: