From fef7cb7667be14b07bc48bc413b6e9c4fd92555c Mon Sep 17 00:00:00 2001
From: Yingge He
Date: Tue, 3 Feb 2026 09:31:47 -0800
Subject: [PATCH 1/2] Fix broken links

---
 README.md                                  |  8 +++----
 docs/README.md                             | 12 +++++-----
 docs/customization_guide/build.md          |  4 +---
 .../inference_protocols.md                 |  6 ++---
 .../README.md                              |  4 ++--
 docs/getting_started/llm.md                | 14 ++++++------
 docs/getting_started/trtllm_user_guide.md  |  8 +++----
 docs/index.md                              |  4 ++--
 docs/introduction/index.md                 |  4 ++--
 docs/protocol/extension_parameters.md      |  6 ++---
 docs/protocol/extension_schedule_policy.md |  6 ++---
 docs/user_guide/architecture.md            | 14 ++++++------
 docs/user_guide/batcher.md                 | 10 ++++-----
 docs/user_guide/debugging_guide.md         | 10 ++++-----
 docs/user_guide/decoupled_models.md        |  6 ++---
 docs/user_guide/faq.md                     |  8 +++----
 docs/user_guide/jetson.md                  |  6 ++---
 docs/user_guide/model_configuration.md     | 12 +++++-----
 docs/user_guide/optimization.md            | 22 +++++++++----------
 docs/user_guide/ragged_batching.md         |  4 ++--
 docs/user_guide/request_cancellation.md    | 12 +++++-----
 docs/user_guide/trace.md                   |  8 +++----
 python/openai/README.md                    |  6 ++---
 23 files changed, 96 insertions(+), 98 deletions(-)

diff --git a/README.md b/README.md
index 522927495c..ae73196b58 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
 inference -> data postprocessing". Using ensemble models for this
-purpose can avoid the overhead of transferring intermediate tensors
-and minimize the number of requests that must be sent to Triton.
-
-The ensemble scheduler must be used for ensemble models, regardless of
-the scheduler used by the models within the ensemble. With respect to
-the ensemble scheduler, an *ensemble* model is not an actual
-model. Instead, it specifies the dataflow between models within the
-ensemble as *ModelEnsembling::Step* entries in the model
-configuration. The scheduler collects the output tensors in each step
-and provides them as input tensors for other steps according to the
-specification.
-Even so, the ensemble model is still viewed externally as a single
-model.
-
-Note that the ensemble models will inherit the characteristics of the
-models involved, so the meta-data in the request header must comply
-with the models within the ensemble. For instance, if one of the
-models is a stateful model, then the inference request for the ensemble
-model should contain the information mentioned in [Stateful
-Models](#stateful-models), which will be provided to the stateful
-model by the scheduler.
-
-As an example, consider an ensemble model for image classification and
-segmentation that has the following model configuration:
-
-```
-name: "ensemble_model"
-platform: "ensemble"
-max_batch_size: 1
-input [
-  {
-    name: "IMAGE"
-    data_type: TYPE_STRING
-    dims: [ 1 ]
-  }
-]
-output [
-  {
-    name: "CLASSIFICATION"
-    data_type: TYPE_FP32
-    dims: [ 1000 ]
-  },
-  {
-    name: "SEGMENTATION"
-    data_type: TYPE_FP32
-    dims: [ 3, 224, 224 ]
-  }
-]
-ensemble_scheduling {
-  step [
-    {
-      model_name: "image_preprocess_model"
-      model_version: -1
-      input_map {
-        key: "RAW_IMAGE"
-        value: "IMAGE"
-      }
-      output_map {
-        key: "PREPROCESSED_OUTPUT"
-        value: "preprocessed_image"
-      }
-    },
-    {
-      model_name: "classification_model"
-      model_version: -1
-      input_map {
-        key: "FORMATTED_IMAGE"
-        value: "preprocessed_image"
-      }
-      output_map {
-        key: "CLASSIFICATION_OUTPUT"
-        value: "CLASSIFICATION"
-      }
-    },
-    {
-      model_name: "segmentation_model"
-      model_version: -1
-      input_map {
-        key: "FORMATTED_IMAGE"
-        value: "preprocessed_image"
-      }
-      output_map {
-        key: "SEGMENTATION_OUTPUT"
-        value: "SEGMENTATION"
-      }
-    }
-  ]
-}
-```
-
-The ensemble\_scheduling section indicates that the ensemble scheduler will
-be used and that the ensemble model consists of three different models.
-Each element in the step section specifies the model to be used and how
-the inputs and outputs of the model are mapped to tensor names recognized
-by the scheduler.
-For example, the first element in step specifies that the latest version of
-image\_preprocess\_model should be used, that the content of its input
-"RAW\_IMAGE" is provided by the "IMAGE" tensor, and that the content of its
-output "PREPROCESSED\_OUTPUT" will be mapped to the "preprocessed\_image"
-tensor for later use. The tensor names recognized by the scheduler are the
-ensemble inputs, the ensemble outputs, and all values in the input\_map and
-the output\_map.
-
-The models composing the ensemble may also have dynamic batching
-enabled. Since ensemble models just route data between the
-composing models, Triton can accept requests for an ensemble model
-without modifying the ensemble's configuration to exploit the dynamic
-batching of the composing models.
-
-Assuming that only the ensemble model, the preprocess model, the classification
-model and the segmentation model are being served, the client applications will
-see them as four different models which can process requests independently.
-However, the ensemble scheduler will view the ensemble model as the following.
-
-![Ensemble Example](images/ensemble_example0.png)
-
-When an inference request for the ensemble model is received, the ensemble
-scheduler will:
-
-1. Recognize that the "IMAGE" tensor in the request is mapped to input
-   "RAW\_IMAGE" in the preprocess model.
-
-2. Check models within the ensemble and send an internal request to the
-   preprocess model because all the input tensors required are ready.
-
-3. Recognize the completion of the internal request, collect the output
-   tensor, and map the content to "preprocessed\_image", which is a unique
-   name known within the ensemble.
-
-4. Map the newly collected tensor to the inputs of the models within the
-   ensemble. In this case, the inputs of "classification\_model" and
-   "segmentation\_model" will be mapped and marked as ready.
-
-5. Check models that require the newly collected tensor and send internal
-   requests to models whose inputs are ready, in this case the classification
-   model and the segmentation model. Note that the responses will
-   be in arbitrary order depending on the load and computation time of
-   individual models.
-
-6. Repeat steps 3-5 until no more internal requests should be sent, and then
-   respond to the inference request with the tensors mapped to the ensemble
-   output names.
-
-Unlike other models, ensemble models do not support the "instance_group" field
-in the model configuration. The reason is that the ensemble scheduler itself
-is mainly an event-driven scheduler with very minimal overhead, so it is
-almost never the bottleneck of the pipeline. The composing models
-within the ensemble can be individually scaled up or down with their
-respective `instance_group` settings. To optimize your model pipeline
-performance, you can use
-[Model Analyzer](https://github.com/triton-inference-server/model_analyzer)
-to find the optimal model configurations.
-
-When crafting the ensemble steps, it is useful to note the distinction between
-*key* and *value* in the `input_map`/`output_map`:
-* *key*: An `input`/`output` tensor name on the composing model.
-* *value*: A tensor name on the ensemble model, which acts as an identifier
-connecting ensemble `input`/`output` to those on the composing model and
-between composing models.
-
-#### Additional Resources
-
-You can find additional end-to-end ensemble examples in the links below:
-* [This guide](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_5-Model_Ensembles)
-explores the concept of ensembles with a running example.
-* [Preprocessing in Python Backend Using
-  Ensemble](https://github.com/triton-inference-server/python_backend#preprocessing)
-* [Accelerating Inference with NVIDIA Triton Inference Server and NVIDIA
-  DALI](https://developer.nvidia.com/blog/accelerating-inference-with-triton-inference-server-and-dali/)
-* [Using RAPIDS AI with NVIDIA Triton Inference
-  Server](https://github.com/rapidsai/rapids-examples/tree/main/rapids_triton_example)
-
+![Triton Architecture Diagram](images/arch.jpg)
\ No newline at end of file
diff --git a/docs/user_guide/ensemble_models.md b/docs/user_guide/ensemble_models.md
index 8c6ebebd1b..3477e33d6c 100644
--- a/docs/user_guide/ensemble_models.md
+++ b/docs/user_guide/ensemble_models.md
@@ -1,5 +1,5 @@
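The six scheduling steps in the documentation excerpted above can be sketched as a toy simulation. This is not part of the patch and not Triton code: the three model functions are hypothetical stand-ins, and `run_ensemble` only illustrates how the scheduler collects each step's outputs and forwards them, under their scheduler-known names, to steps whose inputs become ready.

```python
def image_preprocess_model(raw_image):
    # Stand-in for the real preprocessing backend.
    return f"preprocessed({raw_image})"

def classification_model(formatted_image):
    # Stand-in for the real classification backend.
    return f"classes({formatted_image})"

def segmentation_model(formatted_image):
    # Stand-in for the real segmentation backend.
    return f"mask({formatted_image})"

# Each entry mirrors a ModelEnsembling::Step from the example config:
# (model, input_map, output_map). The map *values* are the tensor names
# known to the scheduler; the *keys* are the composing model's tensors.
STEPS = [
    (image_preprocess_model,
     {"RAW_IMAGE": "IMAGE"},
     {"PREPROCESSED_OUTPUT": "preprocessed_image"}),
    (classification_model,
     {"FORMATTED_IMAGE": "preprocessed_image"},
     {"CLASSIFICATION_OUTPUT": "CLASSIFICATION"}),
    (segmentation_model,
     {"FORMATTED_IMAGE": "preprocessed_image"},
     {"SEGMENTATION_OUTPUT": "SEGMENTATION"}),
]

def run_ensemble(request):
    # "tensors" holds every tensor name the scheduler recognizes:
    # ensemble inputs/outputs plus all input_map/output_map values.
    tensors = dict(request)
    pending = list(STEPS)
    while pending:
        # Send internal requests to every step whose inputs are ready
        # (steps 2 and 5); collect each output and map it back under its
        # scheduler-known name (steps 3 and 4).
        ready = [step for step in pending
                 if all(name in tensors for name in step[1].values())]
        for model, input_map, output_map in ready:
            inputs = [tensors[name] for name in input_map.values()]
            result = model(*inputs)
            for ensemble_name in output_map.values():
                tensors[ensemble_name] = result
            pending.remove((model, input_map, output_map))
    # Step 6: respond with the tensors mapped to the ensemble outputs.
    return {name: tensors[name]
            for name in ("CLASSIFICATION", "SEGMENTATION")}

print(run_ensemble({"IMAGE": "img0"}))
```

In the first pass only the preprocess step is ready; once "preprocessed_image" has been collected, the classification and segmentation steps both become ready, matching the fan-out in the ensemble diagram.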