E5 is a family of text embedding models that can be used for several different purposes, including text retrieval and classification. In this example, we'll be deploying the e5-large-v2 model with Triton Inference Server, using the TensorRT backend. While this example is specific to the e5-large-v2 model, it can be used as a baseline for other embedding models.
To deploy our e5 model, we'll need to create our model engine in a format that can be recognized by Triton, and place it in the proper directory structure.
We'll do this in two steps:
- Exporting the model as an ONNX file
- Compiling the exported ONNX file to a TensorRT plan
Tip
You'll need to have PyTorch and TensorRT installed for this section. We recommend executing the steps in this section inside the NGC PyTorch container, which includes both prerequisites. You can start the container by running the following command:
```bash
docker run -it --rm --gpus all -v $(pwd):/workspace -v /tmp:/tmp nvcr.io/nvidia/pytorch:24.09-py3
```
For exporting, we'll use the Hugging Face optimum package, which has built-in support for exporting Hugging Face models to ONNX.
Note that here we're explicitly setting the batch size to 64. Depending on your use case and hardware capacity, you may want to increase or decrease that number.
```bash
pip install optimum[exporters] sentence_transformers
optimum-cli export onnx --task sentence-similarity --model intfloat/e5-large-v2 /tmp/e5_onnx --batch_size 64
```
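Before compiling, you can optionally double-check the exported graph's input and output names, which the rest of this example relies on. A minimal sketch using the onnx package, with the export path from the command above:
```python
import onnx

# Load the exported graph and list its inputs/outputs. For this export we
# expect inputs named "input_ids" and "attention_mask", and outputs named
# "token_embeddings" and "sentence_embedding".
model = onnx.load("/tmp/e5_onnx/model.onnx")
print("inputs: ", [i.name for i in model.graph.input])
print("outputs:", [o.name for o in model.graph.output])
```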
Once the model is exported to ONNX, we can compile it into a TensorRT engine using trtexec. We'll also create our model_repository directory to save the engine into.
Note that we must explicitly set the minimum and maximum shapes for our model inputs here. The minimum shapes should be 1x1 for both the input_ids and attention_mask inputs, corresponding to a batch size and sequence length of 1. The maximum shapes should be 64x512, where 64 matches the batch size set in the previous step and 512 is the maximum sequence length for the e5-large-v2 model.
```bash
mkdir -p model_repository/e5/1
trtexec \
    --onnx=/tmp/e5_onnx/model.onnx \
    --saveEngine=model_repository/e5/1/model.plan \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --maxShapes=input_ids:64x512,attention_mask:64x512
```
Tip
If you used the NGC PyTorch container for the previous section, exit the container environment before executing the rest of the commands.
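At this point, our model repository should look like this:
```
model_repository/
└── e5/
    └── 1/
        └── model.plan
```
Note that we haven't written a config.pbtxt. For TensorRT engines, Triton can auto-complete the model configuration from the plan file itself, so the engine alone is enough to get started.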
With our model compiled and placed in our model repository, we can deploy our Triton server by mounting the repository into the tritonserver Docker container and starting the server.
```bash
docker run --gpus=1 --rm --net=host -v $(pwd)/model_repository:/models nvcr.io/nvidia/tritonserver:24.09-py3 tritonserver --model-repository=/models
```
It may take some time to load the model and start the server. You should see a log message saying "Started GRPCInferenceService at 0.0.0.0:8001" when the server is ready.
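Rather than watching the logs, you can also poll the server with the tritonclient library, which we'll use for inference below:
```python
import tritonclient.grpc as grpcclient

# Check readiness over gRPC (the same port the client example below uses)
client = grpcclient.InferenceServerClient(url="localhost:8001")
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("e5"))
```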
Once our model is successfully deployed, we can start sending requests to it using the tritonclient library.
You can use the following code snippet to begin using your deployed model. The model in Triton will expect text to be pre-tokenized, so we use the transformers.AutoTokenizer class to create our tokenizer.
Note
For this model, you should prefix each input text with "query: " or "passage: " as appropriate when encoding, for the best retrieval performance.
```python
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype
from transformers import AutoTokenizer


def prepare_tensor(name, input):
    # Wrap a numpy array in a Triton InferInput with a matching dtype
    t = grpcclient.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t


tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-large-v2")

input_texts = [
    "query: are judo throws allowed in wrestling?",
    "passage: Judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
]

# Tokenize on the client side; the deployed model expects token IDs, not raw text
tokenized_text = tokenizer(
    input_texts, max_length=512, padding=True, truncation=True, return_tensors="np"
)

triton_inputs = [
    prepare_tensor("input_ids", tokenized_text["input_ids"]),
    prepare_tensor("attention_mask", tokenized_text["attention_mask"]),
]

with grpcclient.InferenceServerClient(url="localhost:8001") as client:
    out = client.infer("e5", triton_inputs)
    sentence_embedding = out.as_numpy("sentence_embedding")
    token_embeddings = out.as_numpy("token_embeddings")

print(sentence_embedding)
```
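As a quick sanity check on the embeddings, you can score the query against the passage with cosine similarity; a related pair like the one above should score noticeably higher than an unrelated one. A minimal sketch using numpy:
```python
import numpy as np

# Normalize each sentence embedding to unit length, then take the dot
# product, which is then the cosine similarity. Row 0 is the query,
# row 1 is the passage.
normalized = sentence_embedding / np.linalg.norm(sentence_embedding, axis=1, keepdims=True)
print("query/passage similarity:", float(normalized[0] @ normalized[1]))
```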