Commit f74f861

Added Huggingface example (#6)

* huggingface example wip
* added huggingface example
* updated the repo readme
* worked in feedback, fixed typos & broken links
* updated with suggestions
* Added onnx example (#7)
* added onnx example
* fixed broken link
* changes per feedback
* added feedback

1 parent d09394f commit f74f861

17 files changed

Lines changed: 454 additions & 14 deletions

Conceptual_Guide/Part_5-Model_Ensembles/model_repository/detection_postprocessing/1/model.py

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ def initialize(self, args):
         """

         # You must parse model_config. JSON string is not parsed here
-        self.model_config = model_config = json.loads(args['model_config'])
+        model_config = json.loads(args['model_config'])

         # Get OUTPUT0 configuration
         output0_config = pb_utils.get_output_config_by_name(

Conceptual_Guide/Part_5-Model_Ensembles/model_repository/detection_preprocessing/1/model.py

Lines changed: 1 addition & 1 deletion
@@ -59,7 +59,7 @@ def initialize(self, args):
         """

         # You must parse model_config. JSON string is not parsed here
-        self.model_config = model_config = json.loads(args['model_config'])
+        model_config = json.loads(args['model_config'])

         # Get OUTPUT0 configuration
         output0_config = pb_utils.get_output_config_by_name(

Conceptual_Guide/Part_5-Model_Ensembles/model_repository/recognition_postprocessing/1/model.py

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@ def initialize(self, args):
         """

         # You must parse model_config. JSON string is not parsed here
-        self.model_config = model_config = json.loads(args['model_config'])
+        model_config = json.loads(args['model_config'])

         # Get OUTPUT0 configuration
         output0_config = pb_utils.get_output_config_by_name(

HuggingFace/README.md

Lines changed: 152 additions & 1 deletion
@@ -1 +1,152 @@
- https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPModel.forward.example

# Deploying HuggingFace models

**Note**: If you are new to the Triton Inference Server, it is recommended to review [Part 1 of the Conceptual Guide](../Conceptual_Guide/Part_1-model_deployment/README.md). This tutorial assumes a basic understanding of the Triton Inference Server.

Developers often work with open source models, and HuggingFace is a popular source for many of them. This guide discusses how to deploy almost any model from HuggingFace on the Triton Inference Server. For this example, the [ViT](https://arxiv.org/abs/2010.11929) model available on [HuggingFace](https://huggingface.co/docs/transformers/v4.24.0/en/model_doc/vit#transformers.ViTModel) is used.

There are two primary methods of deploying a model pipeline on the Triton Inference Server:
* **Approach 1:** Deploy the pipeline as a whole, without explicitly separating the model from the rest of the pipeline. The core advantage of this approach is that users can deploy their pipeline quickly. This can be achieved with Triton's ["Python Backend"](https://github.com/triton-inference-server/python_backend). Refer to [this example](https://github.com/triton-inference-server/python_backend#usage) for more information.

* **Approach 2:** Break the pipeline apart, using different backends for pre/post processing and deploying the core model on a framework backend. The advantage here is that running the core network on a dedicated framework backend provides higher performance, and many framework-specific optimizations can be leveraged. See [Part 4](../Conceptual_Guide/Part_4-inference_acceleration/README.md) of the conceptual guide for more information. This is achieved with Triton's Ensembles, which are explained in [Part 5](../Conceptual_Guide/Part_5-Model_Ensembles/README.md) of the Conceptual Guide. Refer to the [documentation](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models) for more information.
![multiple models](./img/Approach.PNG)

## Examples

For the purposes of this explanation, the `ViT` model ([link to HuggingFace](https://huggingface.co/docs/transformers/v4.24.0/en/model_doc/vit#transformers.ViTModel)) is used. This specific ViT model doesn't have an application head (like image classification), but [HuggingFace provides](https://huggingface.co/models?search=google/vit) ViT models with different heads which users can utilize.

A good practice while deploying models is to understand and explore the structure of the model if you are unfamiliar with it. An easy way to view the structure in a graphical interface is with a tool like [Netron](https://netron.app/). While Triton autogenerates configuration files for the models, users may still need the names of the input and output layers to build clients and model ensembles, and this tool reveals them.

![multiple models](./img/netron.PNG)
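The input and output names can also be read programmatically rather than visually. As a minimal sketch (assuming the model has already been exported to `onnx/vit/model.onnx`, as shown in the ensemble section below), the `onnx` Python package can list the graph's inputs and outputs:

```
import onnx

# Load the exported graph and print the tensor names Triton will see.
model = onnx.load("onnx/vit/model.onnx")
print("inputs: ", [i.name for i in model.graph.input])
print("outputs:", [o.name for o in model.graph.output])
```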
### Deploying on the Python Backend

Making use of Triton's Python backend requires users to define up to three functions of the `TritonPythonModel` class:
* `initialize()`: This function runs when Triton loads the model. It is recommended to use this function to initialize/load any models and/or data objects. Defining this function is optional.
```
def initialize(self, args):
    self.feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')
    self.model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
```
* `execute()`: This function is executed upon every request. This can be used to house all the required pipeline logic.
```
def execute(self, requests):
    responses = []
    for request in requests:
        inp = pb_utils.get_input_tensor_by_name(request, "image")
        input_image = np.squeeze(inp.as_numpy()).transpose((2,0,1))
        inputs = self.feature_extractor(images=input_image, return_tensors="pt")

        outputs = self.model(**inputs)

        # Sending results. detach() drops the autograd graph so the
        # tensor can be converted to NumPy; the output name must match
        # the model's config.pbtxt and what the client requests.
        inference_response = pb_utils.InferenceResponse(output_tensors=[
            pb_utils.Tensor(
                "last_hidden_state",
                outputs.last_hidden_state.detach().numpy()
            )
        ])
        responses.append(inference_response)
    return responses
```
* `finalize()`: This function is executed when Triton unloads the model. It can be used to free any memory, or perform any other operations required to safely unload the model. Defining this function is optional.
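The pre-made `python_model_repository` also carries a `config.pbtxt` for this model. A hypothetical minimal sketch of such a configuration (tensor names chosen to match the snippets above and `client.py`) could look like:

```
name: "python_vit"
backend: "python"
max_batch_size: 8

input [
  {
    name: "image"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  }
]
output [
  {
    name: "last_hidden_state"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
```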
To run this example, open two terminals and use the following commands:
* **Terminal 1**: This terminal will be used to launch the Triton Inference Server.
```
# Pick the pre-made model repository
mv python_model_repository model_repository

# Pull and run the Triton container & replace yy.mm
# with year and month of release. Eg. 22.12
docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}:/workspace/ -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:yy.mm-py3 bash

# Install dependencies
pip install torch torchvision
pip install transformers
pip install Image

# Launch the server
tritonserver --model-repository=/models
```
* **Terminal 2**: This terminal will be used to run the client.
```
# Pull & run the container
docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:yy.mm-py3-sdk bash

# Run the client
python3 client.py --model_name "python_vit"
```
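On success, the client prints the shape of the `last_hidden_state` tensor. For `google/vit-base-patch16-224-in21k` with a single image this should be `(1, 197, 768)`: 196 image patches plus the `[CLS]` token, each with a 768-dimensional embedding.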
### Deploying using a Triton Ensemble

Before discussing deployment specifics, the first step is to download and export the model. It is recommended to run the following inside the [PyTorch container available on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). If this is your first try at setting up a model ensemble in Triton, it is highly recommended to review [this guide](../Conceptual_Guide/Part_5-Model_Ensembles/README.md) before proceeding. The key advantages of breaking down the pipeline are improved performance and access to a multitude of acceleration options. Explore [Part 4](../Conceptual_Guide/Part_4-inference_acceleration/README.md) of the conceptual guide for details about model acceleration.

```
# Run the PyTorch container from NGC
docker run -it --gpus=all -v ${PWD}:/workspace nvcr.io/nvidia/pytorch:22.12-py3

# Install dependencies
pip install transformers
pip install transformers[onnx]

# Export the model
python -m transformers.onnx --model=google/vit-base-patch16-224 --atol=1e-3 onnx/vit
```
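Before handing the export to Triton, a quick sanity check can confirm the tensor names that the ensemble configuration below relies on (`pixel_values` in, `last_hidden_state` out). A minimal sketch, assuming `onnxruntime` is installed (`pip install onnxruntime`):

```
import numpy as np
import onnxruntime as ort

# Load the exported model and inspect its tensor names.
sess = ort.InferenceSession("onnx/vit/model.onnx")
print("inputs: ", [i.name for i in sess.get_inputs()])
print("outputs:", [o.name for o in sess.get_outputs()])

# Run a dummy batch through the graph; for ViT-Base the first
# output should have shape (1, 197, 768).
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = sess.run(None, {"pixel_values": dummy})
print(outputs[0].shape)
```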
97+
98+
With the model downloaded, set up the model repository in the structure described below. The basic structure of the model repository along with the required configuration files are available in `ensemble_model_repository`.
99+
```
100+
model_repository/
101+
|-- ensemble_model
102+
| |-- 1
103+
| `-- config.pbtxt
104+
|-- preprocessing
105+
| |-- 1
106+
| | `-- model.py
107+
| `-- config.pbtxt
108+
`-- vit
109+
`-- 1
110+
`-- model.onnx
111+
```
In this approach, there are three points to consider:
* **Preprocessing**: The feature extraction step for ViT is done on the Python backend. The implementation details for this step are the same as the process followed in the [section above](#deploying-on-the-python-backend).
* **The ViT model**: Simply place the model in the repository as described above. The Triton Inference Server will autogenerate the required configuration files. If you wish to see the generated config, append `--log-verbose=1` while launching the server, or query the config endpoint as shown below.
* **Ensemble Configuration**: In this configuration we map the input and output layers of the two pieces of the ensemble: `preprocessing`, which is handled on the Python backend, and the ViT model, which is deployed on the ONNX backend.
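Once the server is running, the generated configuration can also be fetched over Triton's standard HTTP/REST model-configuration endpoint (shown here against the default port):

```
curl localhost:8000/v2/models/vit/config
```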
To run this example, similar to the previous flow, make use of two terminals:
* **Terminal 1**: This terminal will be used to launch the Triton Inference Server.

```
# Pick the pre-made model repository and add the model
mv ensemble_model_repository model_repository
mkdir -p model_repository/vit/1
mv onnx/vit/model.onnx model_repository/vit/1/
mkdir model_repository/ensemble_model/1

# Pull and run the Triton container & replace yy.mm
# with year and month of release. Eg. 22.12
docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}:/workspace/ -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:yy.mm-py3 bash

# Install dependencies
pip install torch torchvision torchaudio
pip install transformers
pip install Image

# Launch the server
tritonserver --model-repository=/models
```
* **Terminal 2**: This terminal will be used to run the client.
```
# Pull & run the container
docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:yy.mm-py3-sdk bash

# Run the client
python3 client.py --model_name "ensemble_model"
```
## Summary

In summary, there are two methods by which most HuggingFace models can be deployed: either deploy the entire pipeline on the Python backend, or break the pipeline apart and construct an ensemble.

HuggingFace/client.py

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
import argparse

import numpy as np
import requests
import tritonclient.http as httpclient
from PIL import Image
from tritonclient.utils import *


def main(model_name):
    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Inputs: fetch a sample image and add a batch dimension
    url = "http://images.cocodataset.org/val2017/000000161642.jpg"
    image = np.asarray(Image.open(requests.get(url, stream=True).raw)).astype(np.float32)
    image = np.expand_dims(image, axis=0)

    # Set Inputs
    input_tensors = [
        httpclient.InferInput("image", image.shape, datatype="FP32")
    ]
    input_tensors[0].set_data_from_numpy(image)

    # Set outputs
    outputs = [
        httpclient.InferRequestedOutput("last_hidden_state")
    ]

    # Query
    query_response = client.infer(model_name=model_name,
                                  inputs=input_tensors,
                                  outputs=outputs)

    # Output
    last_hidden_state = query_response.as_numpy("last_hidden_state")
    print(last_hidden_state.shape)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name",
                        help="Select between ensemble_model and python_vit")
    args = parser.parse_args()
    main(args.model_name)
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
name: "ensemble_model"
platform: "ensemble"
max_batch_size: 4

input [
  {
    name: "image"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  }
]
output [
  {
    name: "last_hidden_state"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    # "1519" is an autogenerated node name from the ONNX export;
    # inspect the graph (e.g. in Netron) to confirm it for your export.
    name: "1519"
    data_type: TYPE_FP32
    dims: [ 768 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: 1
      input_map {
        key: "image"
        value: "image"
      }
      output_map {
        key: "pixel_values"
        value: "pixel_values"
      }
    },
    {
      model_name: "vit"
      model_version: 1
      input_map {
        key: "pixel_values"
        value: "pixel_values"
      }
      output_map {
        key: "last_hidden_state"
        value: "last_hidden_state"
      }
      output_map {
        key: "1519"
        value: "1519"
      }
    }
  ]
}
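Note that the ensemble exposes a second output under the autogenerated ONNX name `1519`. A client can request it alongside `last_hidden_state`; a sketch reusing the structure of `client.py` above:

```
outputs = [
    httpclient.InferRequestedOutput("last_hidden_state"),
    # Second graph output; dims [768] per the ensemble config above.
    httpclient.InferRequestedOutput("1519"),
]
```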
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import ViTFeatureExtractor


class TritonPythonModel:

    def initialize(self, args):
        self.feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')

    def execute(self, requests):
        responses = []
        for request in requests:
            # Drop the batch dimension and convert HWC -> CHW
            inp = pb_utils.get_input_tensor_by_name(request, "image")
            input_image = np.squeeze(inp.as_numpy()).transpose((2,0,1))

            inputs = self.feature_extractor(images=input_image, return_tensors="pt")
            pixel_values = inputs['pixel_values'].numpy()

            inference_response = pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor(
                    "pixel_values",
                    pixel_values,
                )
            ])
            responses.append(inference_response)
        return responses
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
name: "preprocessing"
backend: "python"
max_batch_size: 8

input [
  {
    name: "image"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  }
]
output [
  {
    name: "pixel_values"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  }
]

instance_group [
  {
    kind: KIND_GPU
  }
]
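If preprocessing ever becomes the bottleneck of the ensemble, the instance group can be scaled out so Triton runs several copies of the Python model in parallel. A hypothetical variant of the block above:

```
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
```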

HuggingFace/img/Approach.PNG

832 KB

HuggingFace/img/netron.PNG

61.7 KB
