## Description
Triton Inference Server experiences unbalanced connection distribution across worker threads when using the HTTP endpoint, resulting in sequential request processing despite having multiple worker threads available.
The evhtp library used by Triton has a main thread that blocks on an event base, listening for events on the input socket. When connections arrive, they are dispatched to worker threads through a pipe (a connected socket pair).
Under certain conditions, the following sequence occurs:
- The main thread handles an incoming connection and writes it to the pipe
- Thread 1 reads the connection from the pipe
- The main thread handles the next incoming connection and writes it to the pipe
- Thread 1 reads the connection again (same thread)

When the same thread repeatedly wins the race to read connections from the pipe, requests are processed sequentially on that single thread, even when other worker threads are idle.
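The race above can be illustrated with a small, self-contained Python sketch (not Triton code): a dispatcher thread writes one token per "connection" into a socket pair, and several worker threads block on the shared read end. The kernel wakes whichever worker wins the race, with no fairness guarantee between them, so the per-worker distribution can be heavily skewed.

```python
import socket
import threading
from collections import Counter

NUM_WORKERS = 4
NUM_CONNS = 20

# A connected socket pair stands in for evhtp's internal pipe between
# the main (listener) thread and its worker threads.
writer, reader = socket.socketpair()

handled_by = Counter()
counter_lock = threading.Lock()

def worker(name: str) -> None:
    # All workers block on the same read end; whichever one the kernel
    # wakes "accepts" the connection -- there is no round-robin here.
    while True:
        token = reader.recv(1)
        if token == b"q":          # shutdown sentinel, one per worker
            return
        with counter_lock:
            handled_by[name] += 1

workers = [
    threading.Thread(target=worker, args=(f"worker-{i}",))
    for i in range(NUM_WORKERS)
]
for t in workers:
    t.start()

# The "main thread": one token per incoming connection, then sentinels.
for _ in range(NUM_CONNS):
    writer.send(b"c")
for _ in range(NUM_WORKERS):
    writer.send(b"q")
for t in workers:
    t.join()

# Distribution may be skewed toward a few workers on any given run.
print(dict(handled_by))
```

Every token is handled exactly once, but nothing in this scheme balances the load across the workers, which mirrors the behavior described above.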
## Triton Information

- Image: `nvcr.io/nvidia/tritonserver:25.06-py3`
- Backend: python
## To Reproduce

1. Create a Docker container with a Triton server and the `add_sub` model by following this guide.
2. Add a delay to the `initialize` method of the `add_sub` model to simulate long model loading:

   ```python
   import time

   def initialize(self, args):
       # Sleep for one minute to simulate a slow model load
       time.sleep(60)
   ```

3. Copy the `add_sub` model 6 times: `add_sub1`, `add_sub2`, `add_sub3` ... `add_sub6`.
4. Start a Triton server with the following command:

   ```shell
   tritonserver \
       --model-repository=<path_to_models_directory> \
       --model-control-mode=explicit \
       --load-model=* \
       --model-load-thread-count=10 \
       --http-thread-count=10
   ```

5. Touch each `model.py` to simulate that the models have changed:

   ```shell
   #!/bin/bash
   for x in {1..6}; do
       touch "/models/add_sub${x}/1/model.py"
       echo "Touched models/add_sub${x}/1/model.py"
   done
   ```

6. Make concurrent requests to Triton to load all models:

   ```shell
   #!/bin/bash
   for x in {1..6}; do
       echo "Loading model: add_sub${x}"
       curl -X POST "localhost:8000/v2/repository/models/add_sub${x}/load" &
       echo ""
   done
   wait
   ```
## Expected behavior

All 6 models should begin loading simultaneously, with the "loading" log lines appearing at the same time:

```
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub1:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub2:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub3:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub4:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub5:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub6:1"
```
## Actual Behavior

Some models start loading with a one-minute delay (matching the delay set in the `initialize` method), indicating sequential processing:

```
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub1:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub2:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub3:1"
I0203 09:11:40.435898 666 model_lifecycle.cc:473] "loading: add_sub4:1"
I0203 09:11:40.435898 666 model_lifecycle.cc:473] "loading: add_sub5:1"
I0203 09:11:40.435898 666 model_lifecycle.cc:473] "loading: add_sub6:1"
```
## Possible solution

The core of the problem is that libevhtp was designed for lightweight handlers that execute very quickly; large blocking tasks should be rescheduled to a separate thread pool.
A potential solution is to use the thread pool from the common library with a limit on the queue size (or even rewrite it as a small work-stealing thread pool) and add it as a field in the `HTTPAPIServer` class, where blocking handlers would be executed asynchronously.
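The offloading idea can be sketched in Python (the real fix would live in Triton's C++ HTTP frontend; `BoundedThreadPool` and `load_model` below are illustrative names, not Triton APIs): event-loop handlers stay lightweight and hand blocking work, such as a model-load request, to a pool with a bounded task queue, refusing work when the queue is full instead of blocking the handler thread.

```python
import queue
import threading
import time

class BoundedThreadPool:
    """Fixed-size worker pool with a bounded task queue.

    Mirrors the idea of giving the HTTP server its own pool so that
    blocking handlers never tie up the event-loop worker threads.
    """

    def __init__(self, num_workers: int, max_queue: int) -> None:
        self._tasks: queue.Queue = queue.Queue(maxsize=max_queue)
        self._workers = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(num_workers)
        ]
        for t in self._workers:
            t.start()

    def _run(self) -> None:
        while True:
            fn, args = self._tasks.get()
            if fn is None:  # shutdown sentinel
                return
            fn(*args)

    def submit(self, fn, *args) -> bool:
        # Bounded queue: reject work instead of letting requests pile up,
        # so the caller could return e.g. HTTP 503 to apply back-pressure.
        try:
            self._tasks.put_nowait((fn, args))
            return True
        except queue.Full:
            return False

    def shutdown(self) -> None:
        for _ in self._workers:
            self._tasks.put((None, ()))
        for t in self._workers:
            t.join()

# The handler thread (stand-in for an evhtp worker) only enqueues work:
loaded = []

def load_model(name: str) -> None:
    time.sleep(0.1)  # stand-in for the blocking model load
    loaded.append(name)

pool = BoundedThreadPool(num_workers=4, max_queue=16)
for n in ("add_sub1", "add_sub2", "add_sub3", "add_sub4"):
    assert pool.submit(load_model, n)  # handler returns immediately
pool.shutdown()
print(sorted(loaded))
```

With four pool workers, the four simulated loads run concurrently instead of serializing on the thread that happened to win the connection race.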