**Conceptual_Guide/Part_2-improving_resource_utilization/README.md** (4 additions, 4 deletions)
@@ -1,5 +1,5 @@
 <!--
-# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2023-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -59,7 +59,7 @@ Using Dynamic batching in this case leads to more efficient packing of requests

 **Note:** The above is an extreme version of an ideal case scenario. In practice, not all elements of execution can be perfectly parallelized, resulting in longer execution time for larger batches.

-As observed from the above, the use of Dynamic Batching can lead to improvements in both latency and throughput while serving models. This batching feature is mainly focused on providing a solution for stateless models(models which do not maintain a state between execution, like object detection models). Triton's [sequence batcher](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#sequence-batcher) can be used to manage multiple inference requests for stateful models. For more information and configurations regarding dynamic batching, refer to the Triton Inference Server [documentation](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#scheduling-and-batching).
+As observed from the above, the use of Dynamic Batching can lead to improvements in both latency and throughput while serving models. This batching feature is mainly focused on providing a solution for stateless models (models which do not maintain a state between executions, like object detection models). Triton's [sequence batcher](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/batcher.md#sequence-batcher) can be used to manage multiple inference requests for stateful models. For more information and configurations regarding dynamic batching, refer to the Triton Inference Server [documentation](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/batcher.md#dynamic-batcher).

 ## Concurrent model execution
@@ -77,7 +77,7 @@ instance_group [

 Let's take the previous example and discuss the effect of adding multiple models for parallel execution. In this example, instead of having a single model process five queries, two models are spawned.

-For a "no dynamic batching" case, as there are two models to execute, the queries are distributed equally. Users can also add [priorities](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#priority) to prioritize or de-prioritize any specific instance group.
+For a "no dynamic batching" case, as there are two models to execute, the queries are distributed equally. Users can also add [priorities](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#priority) to prioritize or de-prioritize any specific instance group.

 When considering the case of multiple instances with dynamic batches enabled, the following happens. Owing to the availability of another instance, query `B` which arrives with some delay can be executed using the second instance. With some delay allocated, instance 1 gets filled and launched by time `T = X/2` and since queries `D` and `E` stack up to fill up to the maximum batch size, the second model can start inference without any delay.
@@ -139,7 +139,7 @@ dynamic_batching { }
 ```

 With `instance_group` users can primarily tweak two things. First, the number of instances of that model deployed on each GPU. The above example will deploy `2` instances of the model `per GPU`. Secondly, the target GPUs for this group can be specified with `gpus: [ <device number>, ... <device number> ]`.
-Adding `dynamic_batching {}` will enable the use of dynamic batches. Users can also add `preferred_batch_size` and `max_queue_delay_microseconds` in the body of dynamic batching to manage more efficient batching per their use case. Explore the [model configuration](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#model-configuration) documentation for more information.
+Adding `dynamic_batching {}` will enable the use of dynamic batches. Users can also add `preferred_batch_size` and `max_queue_delay_microseconds` in the body of dynamic batching to manage more efficient batching per their use case. Explore the [model configuration](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#model-configuration) documentation for more information.

 With the model repository set up, the Triton Inference Server can be launched.
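For readers following along, a minimal `config.pbtxt` sketch of the settings these hunks discuss is shown below — the instance count, GPU list, and batching values are illustrative placeholders, not values taken from the tutorial:

```
instance_group [
  {
    count: 2          # two execution instances of the model on each listed GPU
    kind: KIND_GPU
    gpus: [ 0 ]       # target GPU(s) for this instance group
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]        # batch sizes the scheduler tries to form
  max_queue_delay_microseconds: 100     # how long to wait for a batch to fill
}
```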
**Conceptual_Guide/Part_3-optimizing_triton_configuration/README.md** (2 additions, 2 deletions)
@@ -1,5 +1,5 @@
 <!--
-# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2023-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -82,7 +82,7 @@ Before diving into details with an example, a discussion about the overall funct

 - **objectives**: Users can choose to order the results on the basis of their deployment goals, throughput, latency, or tailoring to specific resource constraints. [Learn more](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config.md#objective).

-Model Analyzer has two modes, Online and Offline. In online mode users can specify latency budgets for their deployments to cater to their requirements. For Offline mode a similar specification can be mode for minimum throughput. [Learn more](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/cli.md#model-analyze-modes)
+Model Analyzer has two modes, Online and Offline. In online mode users can specify latency budgets for their deployments to cater to their requirements. For Offline mode a similar specification can be made for minimum throughput. [Learn more](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/cli.md#model-analyzer-modes)

 - **constraints**: Users can also choose to constrain the selection of sweeps to specific requirements for throughput, latency or gpu memory utilization. [Learn more](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config.md#constraint)
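As a rough illustration of how the `objectives` and `constraints` fields described above fit together, here is a hedged Model Analyzer config sketch — the model name, repository path, and limit values are assumptions, and the exact field names should be checked against the Model Analyzer config documentation linked above:

```yaml
# config.yaml passed to model-analyzer profile (illustrative values only)
model_repository: /workspace/model_repository
profile_models:
  - text_recognition          # hypothetical model name
objectives:
  - perf_throughput           # rank sweep results by throughput
constraints:
  perf_latency_p99:
    max: 100                  # p99 latency budget in milliseconds
```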
**Conceptual_Guide/Part_4-inference_acceleration/README.md** (2 additions, 2 deletions)
@@ -1,5 +1,5 @@
 <!--
-# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2023-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -122,7 +122,7 @@ optimization {
 }
 ```
 While OpenVINO provides software level optimizations, it is also important to consider the CPU hardware being used. CPUs comprise multiple cores, memory resources, and interconnects. With multiple CPUs these resources can be shared with NUMA (Non uniform memory access).
-Refer this [section of the Triton Documentation](https://github.com/triton-inference-server/server/blob/main/docs/optimization.md#numa-optimization) for more.
+Refer to this [section of the Triton Documentation](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/optimization.md#numa-optimization) for more.

 ## Accelerating Shallow models
 Shallow models like Gradient Boosted Decision Trees are often used in many pipelines. These models are typically built with libraries like [XGBoost](https://xgboost.readthedocs.io/en/stable/), [LightGBM](https://lightgbm.readthedocs.io/en/stable/), [Scikit-learn](https://scikit-learn.org/stable/), [cuML](https://github.com/rapidsai/cuml) and more. These models can be deployed on the Triton Inference Server via the Forest Inference Library backend. Check out [these examples](https://github.com/triton-inference-server/fil_backend/tree/main/notebooks) for more information.
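As a sketch of what such a FIL-backend deployment can look like, the hedged `config.pbtxt` below follows the pattern of the FIL backend examples linked above — the tensor dimensions, batch size, and parameter values are placeholders to adapt to your model:

```
backend: "fil"
max_batch_size: 32768
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 32 ]                           # number of input features (placeholder)
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
parameters [
  {
    key: "model_type"
    value: { string_value: "xgboost" }     # library/format the model was trained and saved with
  },
  {
    key: "output_class"
    value: { string_value: "true" }        # emit class predictions for a classifier
  }
]
```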
## Deploy Pre/Post Processing Scripts with the Python Backend

@@ -93,15 +93,11 @@
-In previous parts of this this tutorial, we've created client scripts that perform various pre and post processing steps within the client process. For example, in [Part 1](../Part_1-model_deployment/README.md), we created a script [`client.py`](../Part_1-model_deployment/clients/client.py) which
+In previous parts of this tutorial, we've created client scripts that perform various pre and post processing steps within the client process. For example, in [Part 1](../Part_1-model_deployment/README.md), we created a script [`client.py`](../Part_1-model_deployment/client.py) which
 1. Read in images
 2. Performed scaling and normalization on the image
 3. Sent the images to the Triton server
 4. Cropped the images based on the bounding boxes returned by the text detection model
-5. Saved the cropped images back to disk
-
-Then, we had a second client, [`client2.py`](../Part_1-model_deployment/clients/client2.py), which
-1. Read in the cropped images from `client.py`
-2. Performed scaling and normalization on the images
-3. Sent the cropped images to the Triton server
-4. Decoded the tensor returned by the text recognition model into text
-5. Printed the decoded text
+5. Performed scaling and normalization on the images
+6. Sent the cropped images to the Triton server
+7. Decoded the tensor returned by the text recognition model into text
+8. Printed the decoded text

 In order to move many of these steps to the Triton server, we can create a set of scripts that will run in the [Python Backend for Triton](https://github.com/triton-inference-server/python_backend). The Python backend can be used to execute any Python code, so we can port our client code directly over to Triton with only a few changes.
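A minimal sketch of what such a Python backend `model.py` can look like is shown below, assuming a hypothetical preprocessing model with an input tensor named `raw_image` and an output named `preprocessed_image` (the tutorial's actual models define their own tensor names and steps):

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Hypothetical preprocessing model running in Triton's Python backend."""

    def execute(self, requests):
        responses = []
        for request in requests:
            # Pull the input tensor the client sent and convert it to numpy.
            image = pb_utils.get_input_tensor_by_name(request, "raw_image").as_numpy()

            # The scaling/normalization that previously ran inside client.py.
            processed = image.astype(np.float32) / 255.0

            # Wrap the result in an output tensor and return it as the response.
            out = pb_utils.Tensor("preprocessed_image", processed)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```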
**Conceptual_Guide/Part_6-building_complex_pipelines/README.md** (2 additions, 2 deletions)
@@ -1,5 +1,5 @@
 <!--
-# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -28,7 +28,7 @@

 # Building Complex Pipelines: Stable Diffusion

-| Navigate to |[Part 5: Building Model Ensembles](../Part_5-Model_Ensembles/)|[Part 7: Iterative Scheduling Tutorial](./Part_7-iterative_scheduling)|[Documentation: BLS](https://github.com/triton-inference-server/python_backend#business-logic-scripting)|
+| Navigate to |[Part 5: Building Model Ensembles](../Part_5-Model_Ensembles/)|[Part 7: Iterative Scheduling Tutorial](../Part_7-iterative_scheduling)|[Documentation: BLS](https://github.com/triton-inference-server/python_backend#business-logic-scripting)|

 **Watch [this explainer video](https://youtu.be/JgP2WgNIq_w) which discusses the pipeline, before proceeding with the example**. This example focuses on showcasing two of Triton Inference Server's features:
-From the output above, the current metric value is 0 and the target value is 1. Note that in this example, our metric is a custom metric defined in Prometheus Rule. You can find more details in the [Install Prometheus rule for Triton metrics](Cluster_Setup_Steps.md#8-install-prometheus-rule-for-triton-metrics) step. When the current value exceed 1, the HPA will start to create a new replica. We can either increase traffic by sending a large amount of requests to the LoadBalancer or manually increase minimum number of replicas to let the HPA create the second replica. In this example, we are going to choose the latter and run the following command:
+From the output above, the current metric value is 0 and the target value is 1. Note that in this example, our metric is a custom metric defined in Prometheus Rule. You can find more details in the [Install Prometheus rule for Triton metrics](./2.%20Configure_EKS_Cluster.md#8-install-prometheus-rule-for-triton-metrics) step. When the current value exceeds 1, the HPA will start to create a new replica. We can either increase traffic by sending a large number of requests to the LoadBalancer or manually increase the minimum number of replicas to let the HPA create the second replica. In this example, we are going to choose the latter and run the following command:
 Normal  TriggeredScaleUp  15s  cluster-autoscaler  pod triggered scale-up: [{eks-efa-compute-ng-2-7ac8948c-e79a-9ad8-f27f-70bf073a9bfa 2->4 (max: 4)}]
 ```

-The first event means that there are no available nodes to schedule any pods. This explains why the second 2 pods are in `Pending` status. The second event states that the Cluster Autoscaler detects that this pod is `unschedulable`, so it is going to increase number of nodes in our cluster until maximum is reached. You can find more details in the [Install Cluster Autoscaler](Cluster_Setup_Steps.md#10-install-cluster-autoscaler) step. This process can take some time depending on whether AWS have enough nodes available to add to your cluster. Eventually, the Cluster Autoscaler will add 2 more nodes in your node group so that the 2 `Pending` pods can be scheduled on them. Your `kubectl get nodes` and `kubectl get pods` commands should output something similar to below:
+The first event means that there are no available nodes to schedule any pods. This explains why the second 2 pods are in `Pending` status. The second event states that the Cluster Autoscaler detects that this pod is `unschedulable`, so it is going to increase the number of nodes in our cluster until the maximum is reached. You can find more details in the [Install Cluster Autoscaler](./2.%20Configure_EKS_Cluster.md#10-install-cluster-autoscaler) step. This process can take some time depending on whether AWS has enough nodes available to add to your cluster. Eventually, the Cluster Autoscaler will add 2 more nodes to your node group so that the 2 `Pending` pods can be scheduled on them. Your `kubectl get nodes` and `kubectl get pods` commands should output something similar to below:
**Deployment/Kubernetes/EKS_Multinode_Triton_TRTLLM/README.md** (2 additions, 2 deletions)
@@ -1,5 +1,5 @@
 <!--
-# Copyright 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2024-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -34,7 +34,7 @@ We have 1 pod per node, so the main challenge in deploying models that require m

 1. **LeaderWorkerSet for launching Triton+TRT-LLM on groups of pods:** To launch Triton and TRT-LLM across nodes you use MPI to have one node launch TRT-LLM processes on all the nodes (including itself) that will make up one instance of the model. Doing this requires knowing the hostnames of all involved nodes. Consequently we need to spawn groups of pods and know which model instance group they belong to. To achieve this we use [LeaderWorkerSet](https://github.com/kubernetes-sigs/lws/tree/main), which lets us create "megapods" that consist of a group of pods - one leader pod and a specified number of worker pods - and provides pod labels identifying group membership. We configure the LeaderWorkerSet and launch Triton+TRT-LLM via MPI in [`deployment.yaml`](multinode_helm_chart/chart/templates/deployment.yaml) and [server.py](multinode_helm_chart/containers/server.py).
 2. **Gang Scheduling:** Gang scheduling simply means ensuring all pods that make up a model instance are ready before Triton+TRT-LLM is launched. We show how to use `kubessh` to achieve this in the `wait_for_workers` function of [server.py](multinode_helm_chart/containers/server.py).
-3. **Autoscaling:** By default the Horizontal Pod Autoscaler (HPA) scales individual pods, but LeaderWorkerSet makes it possible to scale each "megapod". However, since these are GPU workloads we don't want to use cpu and host memory usage for autoscaling. We show how to leverage the metrics Triton Server exposes through Prometheus and set up GPU utilization recording rules in [`triton-metrics_prometheus-rule.yaml`](multinode_helm_chart/triton-metrics_prometheus-rule.yaml). We also demonstrate how to properly set up PodMonitors and an HPA in [`pod-monitor.yaml`](multinode_helm_chart/chart/templates/pod-monitor.yaml) and [`hpa.yaml`](multinode_helm_chart/chart/templates/hpa.yaml) (the key is to only scrape metrics from the leader pods). Instructions for properly setting up Prometheus and exposing GPU metrics are found in [Configure EKS Cluster and Install Dependencies](https://github.com/Wenhan-Tan/EKS_Multinode_Triton_TRTLLM/blob/main/Cluster_Setup_Steps.md). To enable deployment to dynamically add more nodes in response to HPA, we also setup [Cluster Autoscaler](https://github.com/Wenhan-Tan/EKS_Multinode_Triton_TRTLLM/blob/main/Cluster_Setup_Steps.md#10-install-cluster-autoscaler)
+3. **Autoscaling:** By default the Horizontal Pod Autoscaler (HPA) scales individual pods, but LeaderWorkerSet makes it possible to scale each "megapod". However, since these are GPU workloads we don't want to use CPU and host memory usage for autoscaling. We show how to leverage the metrics Triton Server exposes through Prometheus and set up GPU utilization recording rules in [`triton-metrics_prometheus-rule.yaml`](multinode_helm_chart/triton-metrics_prometheus-rule.yaml). We also demonstrate how to properly set up PodMonitors and an HPA in [`pod-monitor.yaml`](multinode_helm_chart/chart/templates/pod-monitor.yaml) and [`hpa.yaml`](multinode_helm_chart/chart/templates/hpa.yaml) (the key is to only scrape metrics from the leader pods). Instructions for properly setting up Prometheus and exposing GPU metrics are found in [Configure EKS Cluster and Install Dependencies](./2.%20Configure_EKS_Cluster.md). To enable the deployment to dynamically add more nodes in response to the HPA, we also set up the [Cluster Autoscaler](./2.%20Configure_EKS_Cluster.md#10-install-cluster-autoscaler).
 4. **LoadBalancer Setup:** Although there are multiple pods in each instance of the model, only one pod within each group accepts requests. We show how to correctly set up a LoadBalancer Service to allow external clients to submit requests in [`service.yaml`](multinode_helm_chart/chart/templates/service.yaml)
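To make the autoscaling point (item 3 above) concrete, here is a hedged sketch of an HPA driven by a Prometheus-derived custom metric — the metric name, target value, replica bounds, and scale-target name are assumptions, and the chart's `hpa.yaml` remains the authoritative version:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1   # scale whole "megapods", not individual pods
    kind: LeaderWorkerSet
    name: triton                              # placeholder LeaderWorkerSet name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: triton_gpu_utilization        # placeholder recording-rule metric, scraped from leader pods
        target:
          type: AverageValue
          averageValue: "1"
```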