Commit d09394f

Add tutorial for model ensembles (#5)
* Add tutorial for model ensembles
* Reformat code and remove commented out lines
* Expand introduction and add next steps
1 parent d2b8a83 commit d09394f

17 files changed

Lines changed: 1361 additions & 0 deletions

File tree

.gitignore

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
# Pretrained Models
*.onnx

# Python Stuff
__pycache__

# Downloaded Assets
downloads
Lines changed: 331 additions & 0 deletions
@@ -0,0 +1,331 @@
# Executing Multiple Models with Model Ensembles

Modern machine learning systems often involve the execution of several models, whether that is because of pre- and post-processing steps, aggregating the predictions of multiple models, or having different models execute different tasks. In this example, we'll be exploring the use of Model Ensembles for executing multiple models server-side with only a single network call. This offers the benefit of reducing the number of times we need to copy data between the client and the server, and eliminating some of the latency inherent to network calls.

To illustrate the process of creating a model ensemble, we'll be reusing the model pipeline first introduced in [Part 1](../Part_1-model_deployment/README.md). In the previous examples, we've executed the text detection and recognition models separately, with our client making two different network calls and performing various processing steps -- such as cropping and resizing images, or decoding tensors into text -- in between. Below is a simplified diagram of the pipeline, with some steps occurring on the client and some on the server.

```mermaid
sequenceDiagram
    Client ->> Triton: Full Image
    activate Triton
    Note right of Triton: Text Detection
    Triton ->> Client: Text Bounding Boxes
    deactivate Triton
    activate Client
    Note left of Client: Image Cropping
    Client ->> Triton: Cropped Images
    deactivate Client
    activate Triton
    Note right of Triton: Text Recognition
    Triton ->> Client: Parsed Text
    deactivate Triton
```

In order to reduce the number of network calls and data copying necessary (and also take advantage of the potentially more powerful server to perform pre/post processing), we can use Triton's [Model Ensemble](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/architecture.html#ensemble-models) feature to execute multiple models with one network call.

```mermaid
sequenceDiagram
    Client ->> Triton: Full Image
    activate Triton
    activate Triton
    Note right of Triton: Text Detection
    deactivate Triton
    activate Triton
    Note right of Triton: Image Cropping (Serverside)
    Note left of Triton: Ensemble Model
    deactivate Triton
    activate Triton
    Note right of Triton: Text Recognition
    Triton ->> Client: Parsed Text
    deactivate Triton
    deactivate Triton
```
Let's go over how to create a Triton model ensemble.

## Deploy Base Models
The first step is to deploy the text detection and text recognition models as regular Triton models, just as we've done in the past. For a detailed overview of deploying models to Triton, see [Part 1](../Part_1-model_deployment/README.md) of this tutorial. For convenience, we've included two shell scripts for exporting these models.

>Note: We recommend executing the following step within the NGC TensorFlow container environment, which you can launch with `docker run -it --gpus all -v ${PWD}:/workspace nvcr.io/nvidia/tensorflow:<yy.mm>-tf2-py3`
```bash
bash utils/export_text_detection.sh
```

>Note: We recommend executing the following step within the NGC PyTorch container environment, which you can launch with `docker run -it --gpus all -v ${PWD}:/workspace nvcr.io/nvidia/pytorch:<yy.mm>-py3`
```bash
bash utils/export_text_recognition.sh
```
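
The ensemble configuration we build later refers to each model's exact tensor names (for example, `input_images:0` and `feature_fusion/Conv_7/Sigmoid:0` for text detection), so it can be worth double-checking them after export. Below is a small, optional sketch (not part of the tutorial scripts) that prints the input and output names of the exported detection model; the `detection.onnx` path is an assumption, so point it at wherever your export script placed the ONNX file.

```python
# check_onnx_io.py -- optional helper; assumes the exported detection model
# is available at "detection.onnx" (adjust the path to match your export).
import onnx

model = onnx.load("detection.onnx")

# Graph inputs that are not initializers are the true runtime inputs.
initializers = {init.name for init in model.graph.initializer}
inputs = [i for i in model.graph.input if i.name not in initializers]

print("Inputs:")
for i in inputs:
    dims = [d.dim_value if d.dim_value > 0 else -1 for d in i.type.tensor_type.shape.dim]
    print(f"  {i.name}: {dims}")

print("Outputs:")
for o in model.graph.output:
    dims = [d.dim_value if d.dim_value > 0 else -1 for d in o.type.tensor_type.shape.dim]
    print(f"  {o.name}: {dims}")
```

The names printed here are the values that go into the `input_map`/`output_map` keys of the ensemble config later in this example.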

## Deploy Pre/Post Processing Scripts with the Python Backend
In previous parts of this tutorial, we've created client scripts that perform various pre- and post-processing steps within the client process. For example, in [Part 1](../Part_1-model_deployment/README.md), we created a script [`client.py`](../Part_1-model_deployment/clients/client.py) which
1. Read in images
2. Performed scaling and normalization on the images
3. Sent the images to the Triton server
4. Cropped the images based on the bounding boxes returned by the text detection model
5. Saved the cropped images back to disk

Then, we had a second client, [`client2.py`](../Part_1-model_deployment/clients/client2.py), which
1. Read in the cropped images saved by `client.py`
2. Performed scaling and normalization on the images
3. Sent the cropped images to the Triton server
4. Decoded the tensor returned by the text recognition model into text
5. Printed the decoded text

In order to move many of these steps to the Triton server, we can create a set of scripts that will run in the [Python Backend for Triton](https://github.com/triton-inference-server/python_backend). The Python backend can be used to execute any Python code, so we can port our client code directly over to Triton with only a few changes.

To deploy a model for the Python Backend, we can create a directory in our model repository as below (where `my_python_model` can be any name):

```
my_python_model/
├── 1
│   └── model.py
└── config.pbtxt
```

In total, we'll create 3 different Python backend models to go with our existing ONNX models to serve with Triton:
1. `detection_preprocessing`
2. `detection_postprocessing`
3. `recognition_postprocessing`

You can find the complete `model.py` scripts for each of these in the `model_repository` folder in this directory.

Let's go through an example. Within `model.py`, we create a class definition for `TritonPythonModel` with the following methods:

```python
class TritonPythonModel:
    def initialize(self, args):
        ...
    def execute(self, requests):
        ...
    def finalize(self):
        ...
```

The `initialize` and `finalize` methods are optional, and are called when the model is loaded and unloaded respectively. The bulk of the logic goes into the `execute` method, which takes in a _list_ of request objects and must return a list of response objects.
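
To make that request/response contract concrete, here is a minimal, hypothetical pass-through model that simply echoes its input tensor as its output. The tensor names `INPUT0` and `OUTPUT0` are illustrative and would need to match whatever the model's `config.pbtxt` declares.

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # Triton may hand us several requests at once; we must return
        # exactly one response per request, in the same order.
        responses = []
        for request in requests:
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out_0 = pb_utils.Tensor("OUTPUT0", in_0.as_numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_0]))
        return responses
```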

In our original client, we had the following code to read in an image and perform some simple transformations to it:

```python
### client.py

image = cv2.imread("./img1.jpg")
image_height, image_width, image_channels = image.shape

# Pre-process image (inpWidth and inpHeight are the detection model's expected input size)
blob = cv2.dnn.blobFromImage(image, 1.0, (inpWidth, inpHeight), (123.68, 116.78, 103.94), True, False)
blob = np.transpose(blob, (0, 2, 3, 1))

# Create input object
input_tensors = [
    httpclient.InferInput('input_images:0', blob.shape, "FP32")
]
input_tensors[0].set_data_from_numpy(blob, binary_data=True)
```

When executing in the Python backend, we need to make sure that our code can handle a list of inputs. In addition, we won't be reading in the images from disk -- instead, we'll retrieve them directly from the input tensor that's provided by the Triton server.

```python
### model.py

responses = []
for request in requests:
    # Read input tensor from Triton
    in_0 = pb_utils.get_input_tensor_by_name(request, "detection_preprocessing_input")
    img = in_0.as_numpy()
    image = Image.open(io.BytesIO(img.tobytes()))

    # Pre-process image (image_loader and output0_dtype are set up elsewhere in the full model.py)
    img_out = image_loader(image)
    img_out = np.array(img_out) * 255.0

    # Create object to send to next model
    out_tensor_0 = pb_utils.Tensor("detection_preprocessing_output", img_out.astype(output0_dtype))
    inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor_0])
    responses.append(inference_response)
return responses
```
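
The `recognition_postprocessing` model follows the same pattern, converting the raw output tensor of the text recognition model back into strings. The exact decoding depends on the recognition model, so the sketch below is illustrative rather than a copy of the script in `model_repository`: it assumes a CTC-style greedy decode over a hypothetical character set and an output tensor of shape `[sequence, batch, num_classes]`. The tensor names come from the ensemble config shown later.

```python
import numpy as np
import triton_python_backend_utils as pb_utils

# Hypothetical character set; the real script's alphabet may differ.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            in_0 = pb_utils.get_input_tensor_by_name(
                request, "recognition_postprocessing_input"
            )
            scores = in_0.as_numpy()  # assumed shape: [sequence, batch, num_classes]

            texts = []
            for b in range(scores.shape[1]):
                best = scores[:, b, :].argmax(axis=1)
                chars = []
                prev = -1
                for idx in best:
                    # CTC-style collapse: skip repeats and the blank class (index 0 here)
                    if idx != prev and idx != 0:
                        chars.append(ALPHABET[idx - 1])
                    prev = idx
                texts.append("".join(chars))

            # TYPE_STRING outputs are sent as numpy object arrays
            out_0 = pb_utils.Tensor(
                "recognition_postprocessing_output",
                np.array(texts, dtype=np.object_),
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_0]))
        return responses
```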


## Tying the models together with Model Ensembles
Now that we have each individual part of our pipeline ready to deploy, we can create an ensemble "model" that can execute each model in order, and pass the various inputs and outputs between each model.

To do this, we'll create another entry in our model repository
```
ensemble_model/
├── 1
└── config.pbtxt
```
This time, we only need the configuration file to describe our ensemble along with an empty version folder (which you will need to create with `mkdir -p model_repository/ensemble_model/1`). Within the config file, we'll define the execution graph of our ensemble. This graph describes what the overall inputs and outputs of the ensemble will be, as well as how the data will flow through the models in the form of a Directed Acyclic Graph. Below is a graphical representation of our model pipeline. The diamonds represent the final input and output of the ensemble, which is all the client will interact with. The circles are the different deployed models, and the rectangles are the tensors that get passed between models.

```mermaid
flowchart LR
    in{input image} --> m1((detection_preprocessing))
    m1((detection_preprocessing)) --> t1(preprocessed_image)
    t1(preprocessed_image) --> m2((text_detection))
    m2((text_detection)) --> t2(Sigmoid:0)
    m2((text_detection)) --> t3(concat_3:0)
    t2(Sigmoid:0) --> m3((detection_postprocessing))
    t3(concat_3:0) --> m3((detection_postprocessing))
    t1(preprocessed_image) --> m3((detection_postprocessing))
    m3((detection_postprocessing)) --> t4(cropped_images)
    t4(cropped_images) --> m4((text_recognition))
    m4((text_recognition)) --> t5(recognition_output)
    t5(recognition_output) --> m5((recognition_postprocessing))
    m5((recognition_postprocessing)) --> out{recognized_text}
```

To represent this graph to Triton, we'll create the below config file. Notice how we define the platform as `"ensemble"` and specify the inputs and outputs of the ensemble itself. Then, in the `ensemble_scheduling` block, we create an entry for each `step` of the ensemble that includes the name of the model to be executed, and how that model's inputs and outputs map to the inputs and outputs of either the full ensemble or the other models.

<details>
<summary> Expand for ensemble config file </summary>

```text proto
name: "ensemble_model"
platform: "ensemble"
max_batch_size: 256
input [
  {
    name: "input_image"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "recognized_text"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]

ensemble_scheduling {
  step [
    {
      model_name: "detection_preprocessing"
      model_version: -1
      input_map {
        key: "detection_preprocessing_input"
        value: "input_image"
      }
      output_map {
        key: "detection_preprocessing_output"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "text_detection"
      model_version: -1
      input_map {
        key: "input_images:0"
        value: "preprocessed_image"
      }
      output_map {
        key: "feature_fusion/Conv_7/Sigmoid:0"
        value: "Sigmoid:0"
      },
      output_map {
        key: "feature_fusion/concat_3:0"
        value: "concat_3:0"
      }
    },
    {
      model_name: "detection_postprocessing"
      model_version: -1
      input_map {
        key: "detection_postprocessing_input_1"
        value: "Sigmoid:0"
      }
      input_map {
        key: "detection_postprocessing_input_2"
        value: "concat_3:0"
      }
      input_map {
        key: "detection_postprocessing_input_3"
        value: "preprocessed_image"
      }
      output_map {
        key: "detection_postprocessing_output"
        value: "cropped_images"
      }
    },
    {
      model_name: "text_recognition"
      model_version: -1
      input_map {
        key: "INPUT__0"
        value: "cropped_images"
      }
      output_map {
        key: "OUTPUT__0"
        value: "recognition_output"
      }
    },
    {
      model_name: "recognition_postprocessing"
      model_version: -1
      input_map {
        key: "recognition_postprocessing_input"
        value: "recognition_output"
      }
      output_map {
        key: "recognition_postprocessing_output"
        value: "recognized_text"
      }
    }
  ]
}
```

</details>

## Launching Triton
We'll again be launching Triton using Docker containers. This time, we'll start an interactive session within the container instead of directly launching the Triton server.

```bash
docker run --gpus=all -it --shm-size=256m --rm \
  -p8000:8000 -p8001:8001 -p8002:8002 \
  -v ${PWD}:/workspace/ -v ${PWD}/model_repository:/models \
  nvcr.io/nvidia/tritonserver:22.12-py3
```

We'll need to install a couple of dependencies for our Python backend scripts.

```bash
pip install torchvision opencv-python-headless
```

Then, we can launch Triton:
```bash
tritonserver --model-repository=/models
```
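
Once the server reports that the models have loaded, you can optionally confirm from another terminal that the ensemble and all of its constituent models are ready. This is a small, optional sanity check (not part of the tutorial's scripts) that uses the same gRPC client the final client below relies on, and assumes Triton's gRPC port (8001) is reachable on localhost:

```python
# check_ready.py -- optional sanity check against a locally running Triton server
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

print("Server ready:", client.is_server_ready())
for model in [
    "detection_preprocessing",
    "text_detection",
    "detection_postprocessing",
    "text_recognition",
    "recognition_postprocessing",
    "ensemble_model",
]:
    print(f"{model} ready:", client.is_model_ready(model))
```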

## Creating a new client

Now that we've moved much of the complexity of our previous clients into different Triton backend scripts, we can create a much simpler client to communicate with Triton.

```python
## client.py

import tritonclient.grpc as grpcclient
import numpy as np

client = grpcclient.InferenceServerClient(url="localhost:8001")

image_data = np.fromfile("img1.jpg", dtype="uint8")
image_data = np.expand_dims(image_data, axis=0)

input_tensors = [grpcclient.InferInput("input_image", image_data.shape, "UINT8")]
input_tensors[0].set_data_from_numpy(image_data)
results = client.infer(model_name="ensemble_model", inputs=input_tensors)
output_data = results.as_numpy("recognized_text").astype(str)
print(output_data)
```

Now, run the full inference pipeline by executing the following command:
```bash
python client.py
```
You should see the parsed text printed out to your console.
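
The same request can also be made over HTTP instead of gRPC if that fits your setup better. Below is a minimal sketch, assuming the server's HTTP port (8000) is exposed as in the `docker run` command above; it is an alternative to `client.py`, not part of the tutorial's files.

```python
# Hypothetical HTTP variant of the ensemble client
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Send the raw encoded image bytes with a leading batch dimension
image_data = np.fromfile("img1.jpg", dtype="uint8")
image_data = np.expand_dims(image_data, axis=0)

input_tensor = httpclient.InferInput("input_image", image_data.shape, "UINT8")
input_tensor.set_data_from_numpy(image_data, binary_data=True)

results = client.infer(model_name="ensemble_model", inputs=[input_tensor])
print(results.as_numpy("recognized_text").astype(str))
```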

## What's Next
In this example, we showed how you can use Model Ensembles to execute multiple models on Triton with a single network call. Model Ensembles are a great solution when your model pipelines are in the form of a Directed Acyclic Graph. However, not all pipelines can be expressed this way. For example, if your pipeline logic requires conditional branching or looped execution, you might need a more expressive way to define your pipeline. In the [next example](../Part_6-building_complex_pipelines/), we'll explore how you can define more complex pipelines in Python using [Business Logic Scripting](https://github.com/triton-inference-server/python_backend#business-logic-scripting).
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
import tritonclient.grpc as grpcclient
import numpy as np

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Read the raw encoded image bytes and add a batch dimension
image_data = np.fromfile("img1.jpg", dtype="uint8")
image_data = np.expand_dims(image_data, axis=0)

# Send the encoded image to the ensemble and print the recognized text
input_tensors = [grpcclient.InferInput("input_image", image_data.shape, "UINT8")]
input_tensors[0].set_data_from_numpy(image_data)
results = client.infer(model_name="ensemble_model", inputs=input_tensors)
output_data = results.as_numpy("recognized_text").astype(str)
print(output_data)
(binary image asset, 82.5 KB; preview not shown)