## Description
Triton Inference Server experiences unbalanced connection distribution across worker threads when using the HTTP endpoint, resulting in sequential request processing despite having multiple worker threads available.
The evhtp library used by Triton has a main thread that blocks on an event base, listening for events on the input socket. When connections arrive, they are dispatched to worker threads through a pipe (a connected socket pair).
Under certain conditions, the following sequence occurs:
- The main thread handles an incoming connection and writes it to the pipe
- Thread 1 reads the connection from the pipe
- The main thread handles the next incoming connection and writes it to the pipe
- Thread 1 reads the connection again (same thread)

When the same thread repeatedly wins the race to read connections from the pipe, requests are processed sequentially on that single thread, even when other worker threads are idle.
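The race above can be illustrated with a small, self-contained Python sketch (not Triton code): a dispatcher thread writes one token per "connection" into a socket pair, and several worker threads block on the shared read end. The kernel wakes whichever worker wins the race, with no fairness guarantee between them, so the per-worker distribution can be heavily skewed.

```python
import socket
import threading
from collections import Counter

NUM_WORKERS = 4
NUM_CONNS = 20

# A connected socket pair stands in for evhtp's internal pipe between
# the main (listener) thread and its worker threads.
writer, reader = socket.socketpair()

handled_by = Counter()
counter_lock = threading.Lock()

def worker(name: str) -> None:
    # All workers block on the same read end; whichever one the kernel
    # wakes "accepts" the connection -- there is no round-robin here.
    while True:
        token = reader.recv(1)
        if token == b"q":          # shutdown sentinel, one per worker
            return
        with counter_lock:
            handled_by[name] += 1

workers = [
    threading.Thread(target=worker, args=(f"worker-{i}",))
    for i in range(NUM_WORKERS)
]
for t in workers:
    t.start()

# The "main thread": one token per incoming connection, then sentinels.
for _ in range(NUM_CONNS):
    writer.send(b"c")
for _ in range(NUM_WORKERS):
    writer.send(b"q")
for t in workers:
    t.join()

# Distribution may be skewed toward a few workers on any given run.
print(dict(handled_by))
```

Every token is handled exactly once, but nothing in this scheme balances the load across the workers, which mirrors the behavior described above.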
## Triton Information

- Image: `nvcr.io/nvidia/tritonserver:25.06-py3`
- Backend: python
## To Reproduce

1. Create a Docker container with a Triton server and the `add_sub` model by following this guide.
2. Add a delay to the `initialize` method of the `add_sub` model to simulate long model loading:

   ```python
   import time

   def initialize(self, args):
       # Sleep for one minute to simulate a slow model load
       time.sleep(60)
   ```

3. Copy the `add_sub` model 6 times: `add_sub1`, `add_sub2`, `add_sub3` ... `add_sub6`.
4. Start a Triton server with the following command:

   ```shell
   tritonserver \
       --model-repository=<path_to_models_directory> \
       --model-control-mode=explicit \
       --load-model=* \
       --model-load-thread-count=10 \
       --http-thread-count=10
   ```

5. Touch each `model.py` to simulate that the models have changed:

   ```shell
   #!/bin/bash
   for x in {1..6}; do
       touch "/models/add_sub${x}/1/model.py"
       echo "Touched models/add_sub${x}/1/model.py"
   done
   ```

6. Make concurrent requests to Triton to load all models:

   ```shell
   #!/bin/bash
   for x in {1..6}; do
       echo "Loading model: add_sub${x}"
       curl -X POST "localhost:8000/v2/repository/models/add_sub${x}/load" &
       echo ""
   done
   wait
   ```
## Expected behavior

All 6 models should begin loading simultaneously, with the "loading" log lines appearing at the same time:

```
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub1:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub2:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub3:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub4:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub5:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub6:1"
```
## Actual Behavior

Some models start loading with a one-minute delay (matching the delay set in the `initialize` method), indicating sequential processing:

```
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub1:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub2:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub3:1"
I0203 09:11:40.435898 666 model_lifecycle.cc:473] "loading: add_sub4:1"
I0203 09:11:40.435898 666 model_lifecycle.cc:473] "loading: add_sub5:1"
I0203 09:11:40.435898 666 model_lifecycle.cc:473] "loading: add_sub6:1"
```
## Possible solution

The core of the problem is that libevhtp was designed for lightweight handlers that execute very quickly; large blocking tasks should be rescheduled to a separate thread pool.
A potential solution is to use the thread pool from the common library with a limit on the queue size (or even rewrite it as a small work-stealing thread pool) and add it as a field in the `HTTPAPIServer` class, where blocking handlers would be executed asynchronously.
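The offloading idea can be sketched in Python (the real fix would live in Triton's C++ HTTP frontend; `BoundedThreadPool` and `load_model` below are illustrative names, not Triton APIs): event-loop handlers stay lightweight and hand blocking work, such as a model-load request, to a pool with a bounded task queue, refusing work when the queue is full instead of blocking the handler thread.

```python
import queue
import threading
import time

class BoundedThreadPool:
    """Fixed-size worker pool with a bounded task queue.

    Mirrors the idea of giving the HTTP server its own pool so that
    blocking handlers never tie up the event-loop worker threads.
    """

    def __init__(self, num_workers: int, max_queue: int) -> None:
        self._tasks: queue.Queue = queue.Queue(maxsize=max_queue)
        self._workers = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(num_workers)
        ]
        for t in self._workers:
            t.start()

    def _run(self) -> None:
        while True:
            fn, args = self._tasks.get()
            if fn is None:  # shutdown sentinel
                return
            fn(*args)

    def submit(self, fn, *args) -> bool:
        # Bounded queue: reject work instead of letting requests pile up,
        # so the caller could return e.g. HTTP 503 to apply back-pressure.
        try:
            self._tasks.put_nowait((fn, args))
            return True
        except queue.Full:
            return False

    def shutdown(self) -> None:
        for _ in self._workers:
            self._tasks.put((None, ()))
        for t in self._workers:
            t.join()

# The handler thread (stand-in for an evhtp worker) only enqueues work:
loaded = []

def load_model(name: str) -> None:
    time.sleep(0.1)  # stand-in for the blocking model load
    loaded.append(name)

pool = BoundedThreadPool(num_workers=4, max_queue=16)
for n in ("add_sub1", "add_sub2", "add_sub3", "add_sub4"):
    assert pool.submit(load_model, n)  # handler returns immediately
pool.shutdown()
print(sorted(loaded))
```

With four pool workers, the four simulated loads run concurrently instead of serializing on the thread that happened to win the connection race.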