
HTTP Connection Distribution Imbalance in evhtp Causes Sequential Request Processing #8635

@aleksn7

Description
Triton Inference Server experiences unbalanced connection distribution across worker threads when using the HTTP endpoint, resulting in sequential request processing despite having multiple worker threads available.

The evhtp library used by Triton has a main thread that blocks on an event base, listening for events on the input socket. When connections arrive, they are dispatched to worker threads through a pipe (a socket pair).

Under certain conditions, the following sequence occurs:

  1. The main thread accepts an incoming connection and writes it to the pipe
  2. Thread 1 reads the connection from the pipe
  3. The main thread accepts the next incoming connection and writes it to the pipe
  4. Thread 1 reads the connection again (the same thread)

When the same thread repeatedly wins the race to read connections from the pipe, their requests are processed sequentially on that single thread, even while other worker threads sit idle.
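
To make the race concrete, the following standalone sketch (plain C++/POSIX, not evhtp's actual code; all names are illustrative) models the dispatch pattern described above: every worker blocks on the read end of the same socket pair, and whichever thread the kernel wakes first takes the "connection", so a quick burst of dispatches can land mostly on one thread.

// race_demo.cpp: simplified model of evhtp's pipe-based dispatch.
// Build with: g++ -std=c++17 -pthread race_demo.cpp
#include <sys/socket.h>
#include <unistd.h>

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return 1;

    constexpr int kWorkers = 10;  // analogous to --http-thread-count=10
    constexpr int kConns = 6;     // analogous to the six concurrent load requests
    std::vector<std::atomic<int>> counts(kWorkers);

    std::vector<std::thread> workers;
    for (int i = 0; i < kWorkers; ++i) {
        workers.emplace_back([&counts, &fds, i] {
            char token;
            // Every worker blocks on the same read end; whichever thread the
            // kernel wakes first "wins" that connection.
            while (read(fds[1], &token, 1) == 1) {
                counts[i]++;
            }
        });
    }

    // The main thread dispatches connections in a quick burst, as evhtp's
    // main event loop does when several requests arrive back to back.
    for (int i = 0; i < kConns; ++i) {
        char token = 'c';
        if (write(fds[0], &token, 1) != 1) return 1;
    }
    sleep(1);        // let the workers drain the pipe
    close(fds[0]);   // EOF on the write end unblocks all readers
    for (auto& w : workers) w.join();

    for (int i = 0; i < kWorkers; ++i) {
        std::printf("worker %d handled %d connection(s)\n", i, counts[i].load());
    }
    return 0;
}

Depending on kernel scheduling, a worker that has just finished a read is often the first to re-enter read() and win again. In evhtp the effect is the same, except the token is a connection that then stays pinned to the winning thread's event base.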

Triton Information
Image: nvcr.io/nvidia/tritonserver:25.06-py3
Backend: python

To Reproduce

  1. Create a docker container with a triton server and the add_sub model by following this guide
  2. Add a delay to the initialize method of the add_sub model to simulate long model loading:
import time

def initialize(self, args):
    # Sleep for one minute to simulate long model loading
    time.sleep(60)
  3. Copy the add_sub model six times: add_sub1, add_sub2, add_sub3 ... add_sub6
  4. Start the Triton server with the following command:
tritonserver \
    --model-repository=<path_to_models_directory> \
    --model-control-mode=explicit \
    --load-model=* \
    --model-load-thread-count=10 \
    --http-thread-count=10
  5. Touch each model.py to simulate that the models have changed:
#!/bin/bash

for x in {1..6}; do
    touch "/models/add_sub${x}/1/model.py"
    echo "Touched models/add_sub${x}/1/model.py"
done
  6. Make concurrent requests to Triton to load all models:
#!/bin/bash

for x in {1..6}; do
    echo "Loading model: add_sub${x}"
    curl -X POST "localhost:8000/v2/repository/models/add_sub${x}/load" &
    echo ""
done
wait

Expected behavior
All 6 models should begin loading simultaneously, with logs appearing at the same time:

I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub1:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub2:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub3:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub4:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub5:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub6:1"

Actual behavior
Some models start loading with a one-minute delay (matching the delay set in the initialize method), indicating sequential processing:

I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub1:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub2:1"
I0203 09:10:40.123456 666 model_lifecycle.cc:473] "loading: add_sub3:1"
I0203 09:11:40.435898 666 model_lifecycle.cc:473] "loading: add_sub4:1"
I0203 09:11:40.435898 666 model_lifecycle.cc:473] "loading: add_sub5:1"
I0203 09:11:40.435898 666 model_lifecycle.cc:473] "loading: add_sub6:1"

Possible solution
The core of the problem is that libevhtp was designed around lightweight handlers that return quickly; large blocking tasks should instead be rescheduled onto a separate thread pool.

A potential solution is to use the thread pool from the common library with a bounded queue size (or even rewrite it as a small work-stealing pool), add it as a field of the HTTPAPIServer class, and execute blocking handlers on it asynchronously.
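
As a rough sketch of that idea (assuming a simple bounded-queue pool; BlockingTaskPool and its interface are hypothetical, and the actual thread pool in Triton's common library may look different), a blocking handler would be wrapped and enqueued so that the evhtp thread returns to its event loop immediately:

// blocking_pool.h: hypothetical bounded-queue thread pool for offloading
// blocking HTTP handlers off the evhtp worker threads.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class BlockingTaskPool {
 public:
  BlockingTaskPool(size_t threads, size_t max_queue) : max_queue_(max_queue) {
    for (size_t i = 0; i < threads; ++i) {
      workers_.emplace_back([this] {
        for (;;) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(mu_);
            cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
            if (stop_ && tasks_.empty()) return;
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task();  // the blocking work runs here, off the evhtp thread
        }
      });
    }
  }

  // Returns false when the queue is full, so the caller can reject the
  // request (or run it inline) instead of stalling the event loop.
  bool Enqueue(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      if (stop_ || tasks_.size() >= max_queue_) return false;
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
    return true;
  }

  ~BlockingTaskPool() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }

 private:
  const size_t max_queue_;
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  std::vector<std::thread> workers_;
  bool stop_ = false;
};

A blocking handler (for example the one serving POST /v2/repository/models/<name>/load) would then enqueue the actual load call and send the HTTP response when the task completes, while a full queue maps naturally to returning an error instead of blocking the evhtp worker.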
