# Deploying HuggingFace models

**Note**: If you are new to the Triton Inference Server, it is recommended to review [Part 1 of the Conceptual Guide](../Conceptual_Guide/Part_1-model_deployment/README.md). This tutorial assumes a basic understanding of the Triton Inference Server.

Developers often work with open source models, and HuggingFace is a popular source for many of them. This guide discusses how a user can deploy almost any model from HuggingFace with the Triton Inference Server. For this example, the [ViT](https://arxiv.org/abs/2010.11929) model available on [HuggingFace](https://huggingface.co/docs/transformers/v4.24.0/en/model_doc/vit#transformers.ViTModel) is used.

There are two primary methods of deploying a model pipeline on the Triton Inference Server:
* **Approach 1:** Deploy the pipeline without explicitly separating the model from the rest of the pipeline. The core advantage of this approach is that users can quickly deploy their pipeline. This can be achieved with Triton's ["Python Backend"](https://github.com/triton-inference-server/python_backend). Refer to [this example](https://github.com/triton-inference-server/python_backend#usage) for more information.

* **Approach 2:** Break apart the pipeline, using different backends for pre/post processing and deploying the core model on a framework backend. The advantage in this case is that running the core network on a dedicated framework backend provides higher performance; additionally, many framework-specific optimizations can be leveraged. See [Part 4](../Conceptual_Guide/Part_4-inference_acceleration/README.md) of the conceptual guide for more information. This approach is achieved with Triton's Ensembles, which are explained in [Part 5](../Conceptual_Guide/Part_5-Model_Ensembles/README.md) of the Conceptual Guide. Refer to the [ensemble documentation](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models) for more information.

## Examples

For the purposes of this explanation, the `ViT` model ([link to HuggingFace](https://huggingface.co/docs/transformers/v4.24.0/en/model_doc/vit#transformers.ViTModel)) is used. This specific ViT model doesn't have an application head (like image classification), but [HuggingFace provides](https://huggingface.co/models?search=google/vit) ViT models with different heads which users can utilize.

A good practice while deploying models is to understand and explore the structure of the model if you are unfamiliar with it. An easy way to see the structure with a graphical interface is to use a tool like [Netron](https://netron.app/). While Triton autogenerates configuration files for the models, users may still need the names of the input and output layers to build clients or model ensembles, and this tool makes them easy to find.
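
The same information can also be read programmatically with the `onnx` Python package. The sketch below is illustrative; `list_io_names` is a hypothetical helper (not part of this tutorial's code), and the model path assumes a model exported the way described later in this guide.

```python
def list_io_names(model):
    """Return the top-level input and output tensor names of an ONNX model.

    `model` is an onnx.ModelProto (or any object exposing `.graph.input`
    and `.graph.output` sequences of named value infos).
    """
    inputs = [i.name for i in model.graph.input]
    outputs = [o.name for o in model.graph.output]
    return inputs, outputs

# Usage (assumes `pip install onnx` and a previously exported model file):
#   import onnx
#   print(list_io_names(onnx.load("onnx/vit/model.onnx")))
```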
### Deploying on the Python Backend

Making use of Triton's Python backend requires users to define up to three functions of the `TritonPythonModel` class:
* `initialize()`: This function runs when Triton loads the model. It is recommended to use this function to initialize/load any models and/or data objects. Defining this function is optional.
```
def initialize(self, args):
    # Load the feature extractor and model once, at model-load time
    self.feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')
    self.model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
```
* `execute()`: This function is executed upon every inference request. It can house all the required pipeline logic.
```
def execute(self, requests):
    responses = []
    for request in requests:
        inp = pb_utils.get_input_tensor_by_name(request, "image")
        # Drop the batch dimension and convert the HWC image to CHW layout
        input_image = np.squeeze(inp.as_numpy()).transpose((2, 0, 1))
        inputs = self.feature_extractor(images=input_image, return_tensors="pt")

        outputs = self.model(**inputs)

        # Sending results; detach from the autograd graph before
        # converting the output tensor to numpy
        inference_response = pb_utils.InferenceResponse(output_tensors=[
            pb_utils.Tensor(
                "label",
                outputs.last_hidden_state.detach().numpy()
            )
        ])
        responses.append(inference_response)
    return responses
```
* `finalize()`: This function is executed when Triton unloads the model. It can be used to free memory or perform any other operations required to safely unload the model. Defining this function is optional.
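
The `np.squeeze(...).transpose((2, 0, 1))` call in `execute()` strips the batch dimension and reorders the image from HWC (height, width, channels) to the channels-first CHW layout. A minimal numpy illustration (the 224x224 image size is an assumption based on the model name):

```python
import numpy as np

# A batched HWC image as it might arrive from the client: (1, 224, 224, 3)
batched_hwc = np.zeros((1, 224, 224, 3), dtype=np.float32)

# Drop the size-1 batch dimension, then move channels first: (3, 224, 224)
chw = np.squeeze(batched_hwc).transpose((2, 0, 1))
print(chw.shape)  # (3, 224, 224)
```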

To run this example, open two terminals and use the following commands:
* **Terminal 1**: This terminal will be used to launch the Triton Inference Server.
```
# Pick the pre-made model repository
mv python_model_repository model_repository

# Pull and run the Triton container; replace yy.mm
# with the year and month of the release, e.g. 22.12
docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}:/workspace/ -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:yy.mm-py3 bash

# Install dependencies
pip install torch torchvision
pip install transformers
pip install Image

# Launch the server
tritonserver --model-repository=/models
```
* **Terminal 2**: This terminal will be used to run the client.
```
# Pull & run the SDK container
docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:yy.mm-py3-sdk bash

# Run the client
python3 client.py --model_name "python_vit"
```

### Deploying using a Triton Ensemble

Before the specifics of deploying the models can be discussed, the first step is to download and export the model. It is recommended to run the following inside the [PyTorch container available on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). If this is your first try at setting up a model ensemble in Triton, it is highly recommended to review [this guide](../Conceptual_Guide/Part_5-Model_Ensembles/README.md) before proceeding. The key advantages of breaking down the pipeline are improved performance and access to a multitude of acceleration options. Explore [Part 4](../Conceptual_Guide/Part_4-inference_acceleration/README.md) of the conceptual guide for details about model acceleration.

```
# Pull and run the PyTorch container from NGC
docker run -it --gpus=all -v ${PWD}:/workspace nvcr.io/nvidia/pytorch:22.12-py3

# Install dependencies
pip install transformers
pip install transformers[onnx]

# Export the model to ONNX
python -m transformers.onnx --model=google/vit-base-patch16-224 --atol=1e-3 onnx/vit
```

With the model downloaded, set up the model repository in the structure described below. The basic structure of the model repository, along with the required configuration files, is available in `ensemble_model_repository`.
```
model_repository/
|-- ensemble_model
|   |-- 1
|   `-- config.pbtxt
|-- preprocessing
|   |-- 1
|   |   `-- model.py
|   `-- config.pbtxt
`-- vit
    `-- 1
        `-- model.onnx
```

In this approach, there are three points to consider:
* **Preprocessing**: The feature extraction step for ViT is done on the Python backend. The implementation details for this step are the same as the process followed in the [section above](#deploying-on-the-python-backend).
* **The ViT model**: Simply place the model in the repository as described above. The Triton Inference Server will auto-generate the required configuration files. If you wish to see the generated config, append `--log-verbose=1` while launching the server.
* **Ensemble Configuration**: In this configuration, we map the input and output layers of the two pieces of the ensemble: `preprocessing`, which is handled on the Python backend, and the ViT model, which is deployed on the ONNX backend.
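
For reference, the scheduling portion of an ensemble `config.pbtxt` typically looks like the sketch below. The tensor names used here (`image`, `pixel_values`, `last_hidden_state`, `label`, `preprocessed_image`) are assumptions based on the preprocessing model above and the default `transformers.onnx` export; verify them against the files shipped in `ensemble_model_repository`.

```
# Hypothetical sketch of an ensemble scheduling block; tensor names are
# assumptions and must match the actual models in the repository.
platform: "ensemble"
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map { key: "image" value: "image" }
      output_map { key: "pixel_values" value: "preprocessed_image" }
    },
    {
      model_name: "vit"
      model_version: -1
      input_map { key: "pixel_values" value: "preprocessed_image" }
      output_map { key: "last_hidden_state" value: "label" }
    }
  ]
}
```

In each step, `input_map`/`output_map` keys are the step model's own tensor names, and values are the ensemble-level tensor names that stitch the steps together.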

To run this example, similar to the previous flow, make use of two terminals:
* **Terminal 1**: This terminal will be used to launch the Triton Inference Server.

```
# Pick the pre-made model repository and add the exported model
mv ensemble_model_repository model_repository
mkdir -p model_repository/vit/1
mv onnx/vit/model.onnx model_repository/vit/1/
mkdir -p model_repository/ensemble_model/1

# Pull and run the Triton container; replace yy.mm
# with the year and month of the release, e.g. 22.12
docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}:/workspace/ -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:yy.mm-py3 bash

# Install dependencies
pip install torch torchvision torchaudio
pip install transformers
pip install Image

# Launch the server
tritonserver --model-repository=/models
```
* **Terminal 2**: This terminal will be used to run the client.
```
# Pull & run the SDK container
docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:yy.mm-py3-sdk bash

# Run the client
python3 client.py --model_name "ensemble_model"
```

## Summary

In summary, there are two methods by which most HuggingFace models can be deployed: either deploy the entire pipeline on the Python backend, or construct an ensemble.