diff --git a/README.md b/README.md
index 03b498e561..37f7a38251 100644
--- a/README.md
+++ b/README.md
@@ -54,8 +54,8 @@ Major features include:
   frameworks](https://github.com/triton-inference-server/fil_backend)
 - [Concurrent model
   execution](docs/user_guide/architecture.md#concurrent-model-execution)
-- [Dynamic batching](docs/user_guide/model_configuration.md#dynamic-batcher)
-- [Sequence batching](docs/user_guide/model_configuration.md#sequence-batcher) and
+- [Dynamic batching](docs/user_guide/batcher.md#dynamic-batcher)
+- [Sequence batching](docs/user_guide/batcher.md#sequence-batcher) and
   [implicit state management](docs/user_guide/architecture.md#implicit-state-management)
   for stateful models
 - Provides [Backend API](https://github.com/triton-inference-server/backend) that
@@ -70,8 +70,8 @@ Major features include:
   protocols](docs/customization_guide/inference_protocols.md) based on the
   community developed [KServe
   protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2)
-- A [C API](docs/customization_guide/inference_protocols.md#in-process-triton-server-api) and
-  [Java API](docs/customization_guide/inference_protocols.md#java-bindings-for-in-process-triton-server-api)
+- A [C API](docs/customization_guide/inprocess_c_api.md) and
+  [Java API](docs/customization_guide/inprocess_java_api.md)
   allow Triton to link directly into your application for edge and other in-process use cases
 - [Metrics](docs/user_guide/metrics.md) indicating GPU utilization, server
   throughput, server latency, and more
diff --git a/docs/README.md b/docs/README.md
index f37ff2cc6b..bb862ae265 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,5 +1,5 @@
-inference -> data postprocessing". Using ensemble models for this
-purpose can avoid the overhead of transferring intermediate tensors
-and minimize the number of requests that must be sent to Triton.
-
-The ensemble scheduler must be used for ensemble models, regardless of
-the scheduler used by the models within the ensemble. With respect to
-the ensemble scheduler, an *ensemble* model is not an actual
-model. Instead, it specifies the dataflow between models within the
-ensemble as *ModelEnsembling::Step* entries in the model
-configuration. The scheduler collects the output tensors in each step
-and provides them as input tensors for other steps according to the
-specification. Even so, the ensemble model is still viewed as a
-single model from the outside.
-
-Note that ensemble models inherit the characteristics of the
-models involved, so the metadata in the request header must comply
-with the models within the ensemble. For instance, if one of the
-models is a stateful model, then the inference request for the ensemble
-model should contain the information mentioned in [Stateful
-Models](#stateful-models), which will be provided to the stateful
-model by the scheduler.
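To make the stateful-model note above concrete, here is a minimal client-side sketch. It assumes the example `ensemble_model` configuration shown later in this section, that one of its composing models is a sequence (stateful) model, and that the standard Python `tritonclient` gRPC package is used; the correlation ID and input file name are hypothetical.

```python
import numpy as np
import tritonclient.grpc as grpcclient

# An ensemble inherits the characteristics of its composing models, so a
# request to an ensemble that wraps a stateful model must carry the same
# sequence metadata the stateful model expects.
client = grpcclient.InferenceServerClient(url="localhost:8001")

# TYPE_STRING tensors are sent as BYTES; shape is [batch, dims] = [1, 1].
image = grpcclient.InferInput("IMAGE", [1, 1], "BYTES")
image.set_data_from_numpy(
    np.array([[open("input.jpg", "rb").read()]], dtype=np.object_)  # hypothetical file
)

# The ensemble scheduler forwards the correlation ID and start/end flags
# to the stateful composing model.
result = client.infer(
    model_name="ensemble_model",
    inputs=[image],
    sequence_id=1001,      # hypothetical correlation ID
    sequence_start=True,
    sequence_end=False,
)
```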
-
-As an example, consider an ensemble model for image classification and
-segmentation that has the following model configuration:
-
-```
-name: "ensemble_model"
-platform: "ensemble"
-max_batch_size: 1
-input [
-  {
-    name: "IMAGE"
-    data_type: TYPE_STRING
-    dims: [ 1 ]
-  }
-]
-output [
-  {
-    name: "CLASSIFICATION"
-    data_type: TYPE_FP32
-    dims: [ 1000 ]
-  },
-  {
-    name: "SEGMENTATION"
-    data_type: TYPE_FP32
-    dims: [ 3, 224, 224 ]
-  }
-]
-ensemble_scheduling {
-  step [
-    {
-      model_name: "image_preprocess_model"
-      model_version: -1
-      input_map {
-        key: "RAW_IMAGE"
-        value: "IMAGE"
-      }
-      output_map {
-        key: "PREPROCESSED_OUTPUT"
-        value: "preprocessed_image"
-      }
-    },
-    {
-      model_name: "classification_model"
-      model_version: -1
-      input_map {
-        key: "FORMATTED_IMAGE"
-        value: "preprocessed_image"
-      }
-      output_map {
-        key: "CLASSIFICATION_OUTPUT"
-        value: "CLASSIFICATION"
-      }
-    },
-    {
-      model_name: "segmentation_model"
-      model_version: -1
-      input_map {
-        key: "FORMATTED_IMAGE"
-        value: "preprocessed_image"
-      }
-      output_map {
-        key: "SEGMENTATION_OUTPUT"
-        value: "SEGMENTATION"
-      }
-    }
-  ]
-}
-```
-
-The ensemble\_scheduling section indicates that the ensemble scheduler will
-be used and that the ensemble model consists of three different models. Each
-element in the step section specifies the model to be used and how the model's
-inputs and outputs are mapped to tensor names recognized by the scheduler. For
-example, the first element in step specifies that the latest version of
-image\_preprocess\_model should be used, that the content of its input
-"RAW\_IMAGE" is provided by the "IMAGE" tensor, and that the content of its
-output "PREPROCESSED\_OUTPUT" will be mapped to the "preprocessed\_image"
-tensor for later use. The tensor names recognized by the scheduler are the
-ensemble inputs, the ensemble outputs, and all values in the input\_map and
-the output\_map.
-
-The models composing the ensemble may also have dynamic batching
-enabled. Since ensemble models only route data between their
-composing models, Triton can accept requests for an ensemble model
-and exploit the dynamic batching of the composing models without
-any modification to the ensemble's configuration.
-
-Assuming that only the ensemble model, the preprocess model, the classification
-model and the segmentation model are being served, the client applications will
-see them as four different models which can process requests independently.
-However, the ensemble scheduler will view the ensemble model as follows.
-
-![Ensemble Example](images/ensemble_example0.png)
-
-When an inference request for the ensemble model is received, the ensemble
-scheduler will:
-
-1. Recognize that the "IMAGE" tensor in the request is mapped to input
-   "RAW\_IMAGE" in the preprocess model.
-
-2. Check models within the ensemble and send an internal request to the
-   preprocess model because all of its required input tensors are ready.
-
-3. Recognize the completion of the internal request, collect the output
-   tensor, and map the content to "preprocessed\_image", which is a unique
-   name known within the ensemble.
-
-4. Map the newly collected tensor to the inputs of models within the ensemble.
-   In this case, the inputs of "classification\_model" and "segmentation\_model"
-   will be mapped and marked as ready.
-
-5. Check models that require the newly collected tensor and send internal
-   requests to models whose inputs are ready, the classification
-   model and the segmentation model in this case.
-   Note that the responses will be in arbitrary order depending on the
-   load and computation time of the individual models.
-
-6. Repeat steps 3-5 until no more internal requests should be sent, and then
-   respond to the inference request with the tensors mapped to the ensemble
-   output names.
-
-Unlike other models, ensemble models do not support the "instance_group" field
-in the model configuration. The reason is that the ensemble scheduler itself
-is mainly an event-driven scheduler with very minimal overhead, so it is
-almost never the bottleneck of the pipeline. The composing models
-within the ensemble can be individually scaled up or down with their
-respective `instance_group` settings. To optimize your model pipeline
-performance, you can use
-[Model Analyzer](https://github.com/triton-inference-server/model_analyzer)
-to find the optimal model configurations.
-
-When crafting the ensemble steps, it is useful to note the distinction between
-*key* and *value* in the `input_map`/`output_map`:
-* *key*: An `input`/`output` tensor name on the composing model.
-* *value*: A tensor name on the ensemble model, which acts as an identifier
-connecting ensemble `input`/`output` to those on the composing model and between
-composing models.
-
-#### Additional Resources
-
-You can find additional end-to-end ensemble examples in the links below:
-* [This guide](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_5-Model_Ensembles)
-explores the concept of ensembles with a running example.
-* [Preprocessing in Python Backend Using
-  Ensemble](https://github.com/triton-inference-server/python_backend#preprocessing)
-* [Accelerating Inference with NVIDIA Triton Inference Server and NVIDIA
-  DALI](https://developer.nvidia.com/blog/accelerating-inference-with-triton-inference-server-and-dali/)
-* [Using RAPIDS AI with NVIDIA Triton Inference
-  Server](https://github.com/rapidsai/rapids-examples/tree/main/rapids_triton_example)
-
+![Triton Architecture Diagram](images/arch.jpg)
\ No newline at end of file
diff --git a/docs/user_guide/batcher.md b/docs/user_guide/batcher.md
index 556412e455..0cb2589a1f 100644
--- a/docs/user_guide/batcher.md
+++ b/docs/user_guide/batcher.md
@@ -1,5 +1,5 @@
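As a closing usage sketch for the ensemble example above: client applications address the ensemble exactly like a single model and can request both ensemble outputs in one call, while the ensemble scheduler performs the internal routing described in steps 1-6. This is a minimal sketch, assuming the example `ensemble_model` configuration from this section, a Triton server listening on localhost, and the Python `tritonclient` HTTP package; the input file name is hypothetical.

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Ensemble input "IMAGE" (TYPE_STRING -> BYTES); shape [batch, dims] = [1, 1].
image = httpclient.InferInput("IMAGE", [1, 1], "BYTES")
image.set_data_from_numpy(
    np.array([[open("input.jpg", "rb").read()]], dtype=np.object_)  # hypothetical file
)

# Both ensemble outputs can be requested in a single inference call.
outputs = [
    httpclient.InferRequestedOutput("CLASSIFICATION"),
    httpclient.InferRequestedOutput("SEGMENTATION"),
]

result = client.infer("ensemble_model", [image], outputs=outputs)
print(result.as_numpy("CLASSIFICATION").shape)  # expected (1, 1000)
print(result.as_numpy("SEGMENTATION").shape)    # expected (1, 3, 224, 224)
```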