E5 is a family of text embedding models that can be used for several different purposes, including text retrieval and classification. In this example, we'll be deploying the e5-large-v2 model with Triton Inference Server, using the TensorRT backend. While this example is specific to the e5-large-v2 model, it can be used as a baseline for other embedding models.
To deploy our e5 model, we'll need to create our model engine in a format that can be recognized by Triton, and place it in the proper directory structure.
We'll do this in two steps:
- Exporting the model as an ONNX file
- Compiling the exported ONNX file to a TensorRT plan
Tip
You'll need to have PyTorch and TensorRT installed for this section. We recommend executing the steps in this section inside the NGC PyTorch container, which includes both prerequisites. You can start the container by running the following command:
```bash
docker run -it --rm --gpus all -v $(pwd):/workspace -v /tmp:/tmp nvcr.io/nvidia/pytorch:24.09-py3
```
For exporting, we'll use the Hugging Face optimum package, which has built-in support for exporting Hugging Face models to ONNX.
Note that here we're explicitly setting the batch size to 64. Depending on your use case and hardware capacity, you may want to increase or decrease that number.
```bash
pip install optimum[exporters] sentence_transformers
optimum-cli export onnx --task sentence-similarity --model intfloat/e5-large-v2 /tmp/e5_onnx --batch_size 64
```
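Before compiling, you can optionally double-check the exported graph's input and output names, which the rest of this example relies on. A minimal sketch using the onnx package, with the export path from the command above:
```python
import onnx

# Load the exported graph and list its inputs/outputs. For this export we
# expect inputs named "input_ids" and "attention_mask", and outputs named
# "token_embeddings" and "sentence_embedding".
model = onnx.load("/tmp/e5_onnx/model.onnx")
print("inputs: ", [i.name for i in model.graph.input])
print("outputs:", [o.name for o in model.graph.output])
```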
Once the model is exported to ONNX, we can compile it into a TensorRT engine using trtexec. We'll also create our model_repository directory to save the engine into.
Note that we must explicitly set the minimum and maximum shapes for our model inputs here. The minimum shapes should be 1x1 for both the input_ids and attention_mask inputs, corresponding to a batch size and sequence length of 1. The maximum shapes should be 64x512, where 64 matches the batch size set in the previous step and 512 is the maximum sequence length for the e5-large-v2 model.
```bash
mkdir -p model_repository/e5/1
trtexec \
    --onnx=/tmp/e5_onnx/model.onnx \
    --saveEngine=model_repository/e5/1/model.plan \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --maxShapes=input_ids:64x512,attention_mask:64x512
```
Tip
If you used the NGC PyTorch container for the previous section, exit the container environment before executing the rest of the commands.
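At this point, our model repository should look like this:
```
model_repository/
└── e5/
    └── 1/
        └── model.plan
```
Note that we haven't written a config.pbtxt. For TensorRT engines, Triton can auto-complete the model configuration from the plan file itself, so the engine alone is enough to get started.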
With our model compiled and placed in our model repository, we can deploy our Triton server by mounting the repository into the tritonserver Docker container and starting the server.
```bash
docker run --gpus=1 --rm --net=host -v $(pwd)/model_repository:/models nvcr.io/nvidia/tritonserver:24.09-py3 tritonserver --model-repository=/models
```
It may take some time to load the model and start the server. You should see a log message saying "Started GRPCInferenceService at 0.0.0.0:8001" when the server is ready.
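Rather than watching the logs, you can also poll the server with the tritonclient library, which we'll use for inference below:
```python
import tritonclient.grpc as grpcclient

# Check readiness over gRPC (the same port the client example below uses)
client = grpcclient.InferenceServerClient(url="localhost:8001")
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("e5"))
```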
Once our model is successfully deployed, we can start sending requests to it using the tritonclient library.
You can use the following code snippet to begin using your deployed model. The model in Triton will expect text to be pre-tokenized, so we use the transformers.AutoTokenizer class to create our tokenizer.
Note
For this model, you should prefix each input text with "query: " or "passage: " as appropriate when encoding, for the best retrieval performance.
```python
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype
from transformers import AutoTokenizer


def prepare_tensor(name, input):
    # Wrap a numpy array in a Triton InferInput with a matching dtype
    t = grpcclient.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t


tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-large-v2")

input_texts = [
    "query: are judo throws allowed in wrestling?",
    "passage: Judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
]

# Tokenize on the client side; the deployed model expects token IDs, not raw text
tokenized_text = tokenizer(
    input_texts, max_length=512, padding=True, truncation=True, return_tensors="np"
)

triton_inputs = [
    prepare_tensor("input_ids", tokenized_text["input_ids"]),
    prepare_tensor("attention_mask", tokenized_text["attention_mask"]),
]

with grpcclient.InferenceServerClient(url="localhost:8001") as client:
    out = client.infer("e5", triton_inputs)
    sentence_embedding = out.as_numpy("sentence_embedding")
    token_embeddings = out.as_numpy("token_embeddings")

print(sentence_embedding)
```
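As a quick sanity check on the embeddings, you can score the query against the passage with cosine similarity; a related pair like the one above should score noticeably higher than an unrelated one. A minimal sketch using numpy:
```python
import numpy as np

# Normalize each sentence embedding to unit length, then take the dot
# product, which is then the cosine similarity. Row 0 is the query,
# row 1 is the passage.
normalized = sentence_embedding / np.linalg.norm(sentence_embedding, axis=1, keepdims=True)
print("query/passage similarity:", float(normalized[0] @ normalized[1]))
```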