**Conceptual_Guide/Part_2-improving_resource_utilization/README.md** (4 additions, 4 deletions)
@@ -1,5 +1,5 @@
 <!--
-# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2023-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -59,7 +59,7 @@ Using Dynamic batching in this case leads to more efficient packing of requests

 **Note:** The above is an extreme version of an ideal case scenario. In practice, not all elements of execution can be perfectly parallelized, resulting in longer execution time for larger batches.

-As observed from the above, the use of Dynamic Batching can lead to improvements in both latency and throughput while serving models. This batching feature is mainly focused on providing a solution for stateless models(models which do not maintain a state between execution, like object detection models). Triton's [sequence batcher](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#sequence-batcher) can be used to manage multiple inference requests for stateful models. For more information and configurations regarding dynamic batching, refer to the Triton Inference Server [documentation](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#scheduling-and-batching).
+As observed from the above, the use of Dynamic Batching can lead to improvements in both latency and throughput while serving models. This batching feature is mainly focused on providing a solution for stateless models (models which do not maintain a state between executions, like object detection models). Triton's [sequence batcher](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/batcher.md#sequence-batcher) can be used to manage multiple inference requests for stateful models. For more information and configurations regarding dynamic batching, refer to the Triton Inference Server [documentation](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/batcher.md#dynamic-batcher).

 ## Concurrent model execution
@@ -77,7 +77,7 @@ instance_group [

 Let's take the previous example and discuss the effect of adding multiple models for parallel execution. In this example, instead of having a single model process five queries, two models are spawned.

-For a "no dynamic batching" case, as there are two models to execute, the queries are distributed equally. Users can also add [priorities](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#priority) to prioritize or de-prioritize any specific instance group.
+For a "no dynamic batching" case, as there are two models to execute, the queries are distributed equally. Users can also add [priorities](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#priority) to prioritize or de-prioritize any specific instance group.

 When considering the case of multiple instances with dynamic batches enabled, the following happens. Owing to the availability of another instance, query `B` which arrives with some delay can be executed using the second instance. With some delay allocated, instance 1 gets filled and launched by time `T = X/2` and since queries `D` and `E` stack up to fill up to the maximum batch size, the second model can start inference without any delay.
@@ -139,7 +139,7 @@ dynamic_batching { }
 ```

 With `instance_group` users can primarily tweak two things. First, the number of instances of that model deployed on each GPU. The above example will deploy `2` instances of the model `per GPU`. Secondly, the target GPUs for this group can be specified with `gpus: [ <device number>, ... <device number> ]`.
-Adding `dynamic_batching {}` will enable the use of dynamic batches. Users can also add `preferred_batch_size` and `max_queue_delay_microseconds` in the body of dynamic batching to manage more efficient batching per their use case. Explore the [model configuration](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#model-configuration) documentation for more information.
+Adding `dynamic_batching {}` will enable the use of dynamic batches. Users can also add `preferred_batch_size` and `max_queue_delay_microseconds` in the body of dynamic batching to manage more efficient batching per their use case. Explore the [model configuration](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#model-configuration) documentation for more information.

 With the model repository set up, the Triton Inference Server can be launched.
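For readers following along, a minimal `config.pbtxt` sketch of the settings these hunks discuss is shown below — the instance count, GPU list, and batching values are illustrative placeholders, not values taken from the tutorial:

```
instance_group [
  {
    count: 2          # two execution instances of the model on each listed GPU
    kind: KIND_GPU
    gpus: [ 0 ]       # target GPU(s) for this instance group
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]        # batch sizes the scheduler tries to form
  max_queue_delay_microseconds: 100     # how long to wait for a batch to fill
}
```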
**Conceptual_Guide/Part_3-optimizing_triton_configuration/README.md** (2 additions, 2 deletions)
@@ -1,5 +1,5 @@
 <!--
-# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2023-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -82,7 +82,7 @@ Before diving into details with an example, a discussion about the overall funct

 - **objectives**: Users can choose to order the results on the basis of their deployment goals, throughput, latency, or tailoring to specific resource constraints. [Learn more](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config.md#objective).

-Model Analyzer has two modes, Online and Offline. In online mode users can specify latency budgets for their deployments to cater to their requirements. For Offline mode a similar specification can be mode for minimum throughput. [Learn more](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/cli.md#model-analyze-modes)
+Model Analyzer has two modes, Online and Offline. In online mode users can specify latency budgets for their deployments to cater to their requirements. For Offline mode a similar specification can be made for minimum throughput. [Learn more](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/cli.md#model-analyzer-modes)

 - **constraints**: Users can also choose to constrain the selection of sweeps to specific requirements for throughput, latency or gpu memory utilization. [Learn more](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config.md#constraint)
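As a rough illustration of how the `objectives` and `constraints` fields described above fit together, here is a hedged Model Analyzer config sketch — the model name, repository path, and limit values are assumptions, and the exact field names should be checked against the Model Analyzer config documentation linked above:

```yaml
# config.yaml passed to model-analyzer profile (illustrative values only)
model_repository: /workspace/model_repository
profile_models:
  - text_recognition          # hypothetical model name
objectives:
  - perf_throughput           # rank sweep results by throughput
constraints:
  perf_latency_p99:
    max: 100                  # p99 latency budget in milliseconds
```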
**Conceptual_Guide/Part_4-inference_acceleration/README.md** (2 additions, 2 deletions)
@@ -1,5 +1,5 @@
 <!--
-# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2023-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -122,7 +122,7 @@ optimization {
 }
 ```
 While OpenVINO provides software level optimizations, it is also important to consider the CPU hardware being used. CPUs comprise multiple cores, memory resources, and interconnects. With multiple CPUs these resources can be shared with NUMA (Non uniform memory access).
-Refer this [section of the Triton Documentation](https://github.com/triton-inference-server/server/blob/main/docs/optimization.md#numa-optimization) for more.
+Refer to this [section of the Triton Documentation](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/optimization.md#numa-optimization) for more.

 ## Accelerating Shallow models
 Shallow models like Gradient Boosted Decision Trees are often used in many pipelines. These models are typically built with libraries like [XGBoost](https://xgboost.readthedocs.io/en/stable/), [LightGBM](https://lightgbm.readthedocs.io/en/stable/), [Scikit-learn](https://scikit-learn.org/stable/), [cuML](https://github.com/rapidsai/cuml) and more. These models can be deployed on the Triton Inference Server via the Forest Inference Library backend. Check out [these examples](https://github.com/triton-inference-server/fil_backend/tree/main/notebooks) for more information.
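As a sketch of what such a FIL-backend deployment can look like, the hedged `config.pbtxt` below follows the pattern of the FIL backend examples linked above — the tensor dimensions, batch size, and parameter values are placeholders to adapt to your model:

```
backend: "fil"
max_batch_size: 32768
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 32 ]                           # number of input features (placeholder)
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
parameters [
  {
    key: "model_type"
    value: { string_value: "xgboost" }     # library/format the model was trained and saved with
  },
  {
    key: "output_class"
    value: { string_value: "true" }        # emit class predictions for a classifier
  }
]
```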
## Deploy Pre/Post Processing Scripts with the Python Backend

@@ -93,15 +93,11 @@
-In previous parts of this this tutorial, we've created client scripts that perform various pre and post processing steps within the client process. For example, in [Part 1](../Part_1-model_deployment/README.md), we created a script [`client.py`](../Part_1-model_deployment/clients/client.py) which
+In previous parts of this tutorial, we've created client scripts that perform various pre and post processing steps within the client process. For example, in [Part 1](../Part_1-model_deployment/README.md), we created a script [`client.py`](../Part_1-model_deployment/client.py) which
 1. Read in images
 2. Performed scaling and normalization on the image
 3. Sent the images to the Triton server
 4. Cropped the images based on the bounding boxes returned by the text detection model
-5. Saved the cropped images back to disk
-
-Then, we had a second client, [`client2.py`](../Part_1-model_deployment/clients/client2.py), which
-1. Read in the cropped images from `client.py`
-2. Performed scaling and normalization on the images
-3. Sent the cropped images to the Triton server
-4. Decoded the tensor returned by the text recognition model into text
-5. Printed the decoded text
+5. Performed scaling and normalization on the images
+6. Sent the cropped images to the Triton server
+7. Decoded the tensor returned by the text recognition model into text
+8. Printed the decoded text

 In order to move many of these steps to the Triton server, we can create a set of scripts that will run in the [Python Backend for Triton](https://github.com/triton-inference-server/python_backend). The Python backend can be used to execute any Python code, so we can port our client code directly over to Triton with only a few changes.
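A minimal sketch of what such a Python backend `model.py` can look like is shown below, assuming a hypothetical preprocessing model with an input tensor named `raw_image` and an output named `preprocessed_image` (the tutorial's actual models define their own tensor names and steps):

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Hypothetical preprocessing model running in Triton's Python backend."""

    def execute(self, requests):
        responses = []
        for request in requests:
            # Pull the input tensor the client sent and convert it to numpy.
            image = pb_utils.get_input_tensor_by_name(request, "raw_image").as_numpy()

            # The scaling/normalization that previously ran inside client.py.
            processed = image.astype(np.float32) / 255.0

            # Wrap the result in an output tensor and return it as the response.
            out = pb_utils.Tensor("preprocessed_image", processed)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```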
**Conceptual_Guide/Part_6-building_complex_pipelines/README.md** (2 additions, 2 deletions)
@@ -1,5 +1,5 @@
 <!--
-# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -28,7 +28,7 @@

 # Building Complex Pipelines: Stable Diffusion

-| Navigate to |[Part 5: Building Model Ensembles](../Part_5-Model_Ensembles/)|[Part 7: Iterative Scheduling Tutorial](./Part_7-iterative_scheduling)|[Documentation: BLS](https://github.com/triton-inference-server/python_backend#business-logic-scripting)|
+| Navigate to |[Part 5: Building Model Ensembles](../Part_5-Model_Ensembles/)|[Part 7: Iterative Scheduling Tutorial](../Part_7-iterative_scheduling)|[Documentation: BLS](https://github.com/triton-inference-server/python_backend#business-logic-scripting)|

 **Watch [this explainer video](https://youtu.be/JgP2WgNIq_w) which discusses the pipeline, before proceeding with the example**. This example focuses on showcasing two of Triton Inference Server's features:
-From the output above, the current metric value is 0 and the target value is 1. Note that in this example, our metric is a custom metric defined in Prometheus Rule. You can find more details in the [Install Prometheus rule for Triton metrics](Cluster_Setup_Steps.md#8-install-prometheus-rule-for-triton-metrics) step. When the current value exceed 1, the HPA will start to create a new replica. We can either increase traffic by sending a large amount of requests to the LoadBalancer or manually increase minimum number of replicas to let the HPA create the second replica. In this example, we are going to choose the latter and run the following command:
+From the output above, the current metric value is 0 and the target value is 1. Note that in this example, our metric is a custom metric defined in Prometheus Rule. You can find more details in the [Install Prometheus rule for Triton metrics](./2.%20Configure_EKS_Cluster.md#8-install-prometheus-rule-for-triton-metrics) step. When the current value exceeds 1, the HPA will start to create a new replica. We can either increase traffic by sending a large number of requests to the LoadBalancer or manually increase the minimum number of replicas to let the HPA create the second replica. In this example, we are going to choose the latter and run the following command:
 Normal  TriggeredScaleUp  15s  cluster-autoscaler  pod triggered scale-up: [{eks-efa-compute-ng-2-7ac8948c-e79a-9ad8-f27f-70bf073a9bfa 2->4 (max: 4)}]
 ```

-The first event means that there are no available nodes to schedule any pods. This explains why the second 2 pods are in `Pending` status. The second event states that the Cluster Autoscaler detects that this pod is `unschedulable`, so it is going to increase number of nodes in our cluster until maximum is reached. You can find more details in the [Install Cluster Autoscaler](Cluster_Setup_Steps.md#10-install-cluster-autoscaler) step. This process can take some time depending on whether AWS have enough nodes available to add to your cluster. Eventually, the Cluster Autoscaler will add 2 more nodes in your node group so that the 2 `Pending` pods can be scheduled on them. Your `kubectl get nodes` and `kubectl get pods` commands should output something similar to below:
+The first event means that there are no available nodes to schedule any pods. This explains why the second 2 pods are in `Pending` status. The second event states that the Cluster Autoscaler detects that this pod is `unschedulable`, so it is going to increase the number of nodes in our cluster until the maximum is reached. You can find more details in the [Install Cluster Autoscaler](./2.%20Configure_EKS_Cluster.md#10-install-cluster-autoscaler) step. This process can take some time depending on whether AWS has enough nodes available to add to your cluster. Eventually, the Cluster Autoscaler will add 2 more nodes to your node group so that the 2 `Pending` pods can be scheduled on them. Your `kubectl get nodes` and `kubectl get pods` commands should output something similar to below:
**Deployment/Kubernetes/EKS_Multinode_Triton_TRTLLM/README.md** (2 additions, 2 deletions)
@@ -1,5 +1,5 @@
 <!--
-# Copyright 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2024-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -34,7 +34,7 @@ We have 1 pod per node, so the main challenge in deploying models that require m

 1. **LeaderWorkerSet for launching Triton+TRT-LLM on groups of pods:** To launch Triton and TRT-LLM across nodes you use MPI to have one node launch TRT-LLM processes on all the nodes (including itself) that will make up one instance of the model. Doing this requires knowing the hostnames of all involved nodes. Consequently we need to spawn groups of pods and know which model instance group they belong to. To achieve this we use [LeaderWorkerSet](https://github.com/kubernetes-sigs/lws/tree/main), which lets us create "megapods" that consist of a group of pods - one leader pod and a specified number of worker pods - and provides pod labels identifying group membership. We configure the LeaderWorkerSet and launch Triton+TRT-LLM via MPI in [`deployment.yaml`](multinode_helm_chart/chart/templates/deployment.yaml) and [server.py](multinode_helm_chart/containers/server.py).
 2. **Gang Scheduling:** Gang scheduling simply means ensuring all pods that make up a model instance are ready before Triton+TRT-LLM is launched. We show how to use `kubessh` to achieve this in the `wait_for_workers` function of [server.py](multinode_helm_chart/containers/server.py).
-3. **Autoscaling:** By default the Horizontal Pod Autoscaler (HPA) scales individual pods, but LeaderWorkerSet makes it possible to scale each "megapod". However, since these are GPU workloads we don't want to use cpu and host memory usage for autoscaling. We show how to leverage the metrics Triton Server exposes through Prometheus and set up GPU utilization recording rules in [`triton-metrics_prometheus-rule.yaml`](multinode_helm_chart/triton-metrics_prometheus-rule.yaml). We also demonstrate how to properly set up PodMonitors and an HPA in [`pod-monitor.yaml`](multinode_helm_chart/chart/templates/pod-monitor.yaml) and [`hpa.yaml`](multinode_helm_chart/chart/templates/hpa.yaml) (the key is to only scrape metrics from the leader pods). Instructions for properly setting up Prometheus and exposing GPU metrics are found in [Configure EKS Cluster and Install Dependencies](https://github.com/Wenhan-Tan/EKS_Multinode_Triton_TRTLLM/blob/main/Cluster_Setup_Steps.md). To enable deployment to dynamically add more nodes in response to HPA, we also setup [Cluster Autoscaler](https://github.com/Wenhan-Tan/EKS_Multinode_Triton_TRTLLM/blob/main/Cluster_Setup_Steps.md#10-install-cluster-autoscaler)
+3. **Autoscaling:** By default the Horizontal Pod Autoscaler (HPA) scales individual pods, but LeaderWorkerSet makes it possible to scale each "megapod". However, since these are GPU workloads we don't want to use CPU and host memory usage for autoscaling. We show how to leverage the metrics Triton Server exposes through Prometheus and set up GPU utilization recording rules in [`triton-metrics_prometheus-rule.yaml`](multinode_helm_chart/triton-metrics_prometheus-rule.yaml). We also demonstrate how to properly set up PodMonitors and an HPA in [`pod-monitor.yaml`](multinode_helm_chart/chart/templates/pod-monitor.yaml) and [`hpa.yaml`](multinode_helm_chart/chart/templates/hpa.yaml) (the key is to only scrape metrics from the leader pods). Instructions for properly setting up Prometheus and exposing GPU metrics are found in [Configure EKS Cluster and Install Dependencies](./2.%20Configure_EKS_Cluster.md). To enable the deployment to dynamically add more nodes in response to the HPA, we also set up the [Cluster Autoscaler](./2.%20Configure_EKS_Cluster.md#10-install-cluster-autoscaler).
 4. **LoadBalancer Setup:** Although there are multiple pods in each instance of the model, only one pod within each group accepts requests. We show how to correctly set up a LoadBalancer Service to allow external clients to submit requests in [`service.yaml`](multinode_helm_chart/chart/templates/service.yaml)
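To make the autoscaling point (item 3 above) concrete, here is a hedged sketch of an HPA driven by a Prometheus-derived custom metric — the metric name, target value, replica bounds, and scale-target name are assumptions, and the chart's `hpa.yaml` remains the authoritative version:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1   # scale whole "megapods", not individual pods
    kind: LeaderWorkerSet
    name: triton                              # placeholder LeaderWorkerSet name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: triton_gpu_utilization        # placeholder recording-rule metric, scraped from leader pods
        target:
          type: AverageValue
          averageValue: "1"
```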