Commit 001f1d9 ("address comments")
1 parent 29d3075

12 files changed: 10 additions & 53 deletions

Feature_Guide/Speculative_Decoding/README.md (1 addition & 1 deletion)

@@ -54,6 +54,6 @@ may prove simpler than generating a summary for an article. [Spec-Bench](https:/
 shows the performance of different speculative decoding approaches on different tasks.

 ## Speculative Decoding with Triton Inference Server
-Triton Inference Server supports speculative decoding on different types of Triton backends. See what a Triton backend is [here](https://github.com/triton-inference-server/tensorrtllm_backend).
+Triton Inference Server supports speculative decoding on different types of Triton backends. See what a Triton backend is [here](https://github.com/triton-inference-server/backend).
 - Follow [here](TRT-LLM/README.md) to learn how Triton Inference Server supports speculative decoding with [TensorRT-LLM Backend](https://github.com/triton-inference-server/tensorrtllm_backend).
 - Follow [here](vLLM/README.md) to learn how Triton Inference Server supports speculative decoding with [vLLM Backend](https://github.com/triton-inference-server/vllm_backend).

Feature_Guide/Speculative_Decoding/TRT-LLM/README.md (1 addition & 1 deletion)

@@ -202,7 +202,7 @@ python3 /tensorrtllm_client/inflight_batcher_llm_client.py --request-output-len
 > ...
 > ```

-2. The [generate endpoint](https://github.com/triton-inference-server/tensorrtllm_backend/tree/release/0.5.0#query-the-server-with-the-triton-generate-endpoint).
+2. The [generate endpoint](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/protocol/extension_generate.html).

 ```bash
 curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is ML?", "max_tokens": 50, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
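The generate-endpoint call in the hunk above can also be issued from Python. Below is a minimal sketch using only the standard library; the server address, model name, and payload follow the curl example, and a running Triton server is assumed for the commented-out call:

```python
import json
import urllib.request

# Same payload as the curl example above (TRT-LLM ensemble model).
payload = {
    "text_input": "What is ML?",
    "max_tokens": 50,
    "bad_words": "",
    "stop_words": "",
    "pad_id": 2,
    "end_id": 2,
}

req = urllib.request.Request(
    "http://localhost:8000/v2/models/ensemble/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once a server is running on localhost:8000:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["text_output"])
```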

Feature_Guide/Speculative_Decoding/vLLM/README.md (1 addition & 1 deletion)

@@ -88,7 +88,7 @@ docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G \

 ### Send Inference Requests

-Let's send an inference request to the [generate endpoint](https://github.com/triton-inference-server/tensorrtllm_backend/tree/release/0.5.0#query-the-server-with-the-triton-generate-endpoint).
+Let's send an inference request to the [generate endpoint](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/protocol/extension_generate.html).

 ```bash
 curl -X POST localhost:8000/v2/models/eagle_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}' | jq
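The same vLLM request can be built in Python. A minimal standard-library sketch mirroring the curl call above; note that here sampling options sit under a nested "parameters" object rather than as flat top-level fields as in the TRT-LLM example:

```python
import json
import urllib.request

# Payload from the curl example above; sampling options go under "parameters".
payload = {
    "text_input": "What is Triton Inference Server?",
    "parameters": {"stream": False, "temperature": 0},
}
req = urllib.request.Request(
    "http://localhost:8000/v2/models/eagle_model/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Requires the server started in the docker run step above:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["text_output"])
```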
Lines changed: 1 addition & 1 deletion

@@ -1,3 +1,3 @@
 {
 "model": "/hf-models/Meta-Llama-3-8B-Instruct"
-}
+}

Feature_Guide/Speculative_Decoding/vLLM/model_repository/base_model/config.pbtxt (1 addition & 1 deletion)

@@ -34,4 +34,4 @@ instance_group [
 count: 1
 kind: KIND_MODEL
 }
-]
+]

Feature_Guide/Speculative_Decoding/vLLM/model_repository/eagle_model copy/1/model.json (0 additions & 6 deletions)

This file was deleted.

Feature_Guide/Speculative_Decoding/vLLM/model_repository/eagle_model copy/config.pbtxt (0 additions & 37 deletions)

This file was deleted.

Feature_Guide/Speculative_Decoding/vLLM/model_repository/eagle_model/1/model.json (1 addition & 1 deletion)

@@ -3,4 +3,4 @@
 "speculative_model": "/hf-models/EAGLE-LLaMA3-Instruct-8B",
 "speculative_draft_tensor_parallel_size": 1,
 "num_speculative_tokens": 5
-}
+}
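The keys in this model.json are forwarded to vLLM as speculative-decoding engine options (EAGLE draft model, 5 draft tokens per verification step). A quick sanity check of the fragment shown in the hunk; the diff only shows the tail of the file, so the snippet below reproduces just the visible keys as a hypothetical standalone fragment:

```python
import json

# Hypothetical fragment: only these keys are visible in the diff above;
# the file's leading lines (including the base "model" entry) are not shown.
fragment = """{
    "speculative_model": "/hf-models/EAGLE-LLaMA3-Instruct-8B",
    "speculative_draft_tensor_parallel_size": 1,
    "num_speculative_tokens": 5
}"""
config = json.loads(fragment)
# num_speculative_tokens controls how many draft tokens are proposed
# per verification step.
assert config["num_speculative_tokens"] == 5
assert config["speculative_draft_tensor_parallel_size"] == 1
```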

Feature_Guide/Speculative_Decoding/vLLM/model_repository/eagle_model/config.pbtxt (1 addition & 1 deletion)

@@ -34,4 +34,4 @@ instance_group [
 count: 1
 kind: KIND_MODEL
 }
-]
+]

Feature_Guide/Speculative_Decoding/vLLM/model_repository/opt_model/1/model.json (1 addition & 1 deletion)

@@ -3,4 +3,4 @@
 "speculative_model": "facebook/opt-125m",
 "tensor_parallel_size": 1,
 "num_speculative_tokens": 5
-}
+}
