From fef7cb7667be14b07bc48bc413b6e9c4fd92555c Mon Sep 17 00:00:00 2001
From: Yingge He
Date: Tue, 3 Feb 2026 09:31:47 -0800
Subject: [PATCH 1/2] Fix broken links

---
 README.md                                  |  8 +++----
 docs/README.md                             | 12 +++++-----
 docs/customization_guide/build.md          |  4 +---
 .../inference_protocols.md                 |  6 ++---
 .../README.md                              |  4 ++--
 docs/getting_started/llm.md                | 14 ++++++------
 docs/getting_started/trtllm_user_guide.md  |  8 +++----
 docs/index.md                              |  4 ++--
 docs/introduction/index.md                 |  4 ++--
 docs/protocol/extension_parameters.md      |  6 ++---
 docs/protocol/extension_schedule_policy.md |  6 ++---
 docs/user_guide/architecture.md            | 14 ++++++------
 docs/user_guide/batcher.md                 | 10 ++++-----
 docs/user_guide/debugging_guide.md         | 10 ++++-----
 docs/user_guide/decoupled_models.md        |  6 ++---
 docs/user_guide/faq.md                     |  8 +++----
 docs/user_guide/jetson.md                  |  6 ++---
 docs/user_guide/model_configuration.md     | 12 +++++-----
 docs/user_guide/optimization.md            | 22 +++++++++----------
 docs/user_guide/ragged_batching.md         |  4 ++--
 docs/user_guide/request_cancellation.md    | 12 +++++-----
 docs/user_guide/trace.md                   |  8 +++----
 python/openai/README.md                    |  6 ++---
 23 files changed, 96 insertions(+), 98 deletions(-)

diff --git a/README.md b/README.md
index 522927495c..ae73196b58 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
 inference -> data postprocessing". Using ensemble models for this
-purpose can avoid the overhead of transferring intermediate tensors
-and minimize the number of requests that must be sent to Triton.
-
-The ensemble scheduler must be used for ensemble models, regardless of
-the scheduler used by the models within the ensemble. With respect to
-the ensemble scheduler, an *ensemble* model is not an actual
-model. Instead, it specifies the dataflow between models within the
-ensemble as *ModelEnsembling::Step* entries in the model
-configuration. The scheduler collects the output tensors in each step
-and provides them as input tensors for other steps according to the
-specification.
-Even so, the ensemble model is still viewed externally as a single
-model.
-
-Note that the ensemble models will inherit the characteristics of the
-models involved, so the meta-data in the request header must comply
-with the models within the ensemble. For instance, if one of the
-models is a stateful model, then the inference request for the ensemble
-model should contain the information mentioned in [Stateful
-Models](#stateful-models), which will be provided to the stateful
-model by the scheduler.
-
-As an example, consider an ensemble model for image classification and
-segmentation that has the following model configuration:
-
-```
-name: "ensemble_model"
-platform: "ensemble"
-max_batch_size: 1
-input [
-  {
-    name: "IMAGE"
-    data_type: TYPE_STRING
-    dims: [ 1 ]
-  }
-]
-output [
-  {
-    name: "CLASSIFICATION"
-    data_type: TYPE_FP32
-    dims: [ 1000 ]
-  },
-  {
-    name: "SEGMENTATION"
-    data_type: TYPE_FP32
-    dims: [ 3, 224, 224 ]
-  }
-]
-ensemble_scheduling {
-  step [
-    {
-      model_name: "image_preprocess_model"
-      model_version: -1
-      input_map {
-        key: "RAW_IMAGE"
-        value: "IMAGE"
-      }
-      output_map {
-        key: "PREPROCESSED_OUTPUT"
-        value: "preprocessed_image"
-      }
-    },
-    {
-      model_name: "classification_model"
-      model_version: -1
-      input_map {
-        key: "FORMATTED_IMAGE"
-        value: "preprocessed_image"
-      }
-      output_map {
-        key: "CLASSIFICATION_OUTPUT"
-        value: "CLASSIFICATION"
-      }
-    },
-    {
-      model_name: "segmentation_model"
-      model_version: -1
-      input_map {
-        key: "FORMATTED_IMAGE"
-        value: "preprocessed_image"
-      }
-      output_map {
-        key: "SEGMENTATION_OUTPUT"
-        value: "SEGMENTATION"
-      }
-    }
-  ]
-}
-```
-
-The ensemble\_scheduling section indicates that the ensemble scheduler will
-be used and that the ensemble model consists of three different models.
-Each element in the step section specifies the model to be used and how
-the inputs and outputs of the model are mapped to tensor names recognized
-by the scheduler.
-For example, the first element in step specifies that the latest version of
-image\_preprocess\_model should be used, that the content of its input
-"RAW\_IMAGE" is provided by the "IMAGE" tensor, and that the content of its
-output "PREPROCESSED\_OUTPUT" will be mapped to the "preprocessed\_image"
-tensor for later use. The tensor names recognized by the scheduler are the
-ensemble inputs, the ensemble outputs, and all values in the input\_map and
-the output\_map.
-
-The models composing the ensemble may also have dynamic batching
-enabled. Since ensemble models just route data between the
-composing models, Triton can accept requests for an ensemble model
-without modifying the ensemble's configuration to exploit the dynamic
-batching of the composing models.
-
-Assuming that only the ensemble model, the preprocess model, the classification
-model and the segmentation model are being served, the client applications will
-see them as four different models which can process requests independently.
-However, the ensemble scheduler will view the ensemble model as the following.
-
-![Ensemble Example](images/ensemble_example0.png)
-
-When an inference request for the ensemble model is received, the ensemble
-scheduler will:
-
-1. Recognize that the "IMAGE" tensor in the request is mapped to input
-   "RAW\_IMAGE" in the preprocess model.
-
-2. Check models within the ensemble and send an internal request to the
-   preprocess model because all the input tensors required are ready.
-
-3. Recognize the completion of the internal request, collect the output
-   tensor, and map the content to "preprocessed\_image", which is a unique
-   name known within the ensemble.
-
-4. Map the newly collected tensor to the inputs of the models within the
-   ensemble. In this case, the inputs of "classification\_model" and
-   "segmentation\_model" will be mapped and marked as ready.
-
-5. Check models that require the newly collected tensor and send internal
-   requests to models whose inputs are ready, in this case the classification
-   model and the segmentation model. Note that the responses will
-   be in arbitrary order depending on the load and computation time of
-   individual models.
-
-6. Repeat steps 3-5 until no more internal requests should be sent, and then
-   respond to the inference request with the tensors mapped to the ensemble
-   output names.
-
-Unlike other models, ensemble models do not support the "instance_group" field
-in the model configuration. The reason is that the ensemble scheduler itself
-is mainly an event-driven scheduler with very minimal overhead, so it is
-almost never the bottleneck of the pipeline. The composing models
-within the ensemble can be individually scaled up or down with their
-respective `instance_group` settings. To optimize your model pipeline
-performance, you can use
-[Model Analyzer](https://github.com/triton-inference-server/model_analyzer)
-to find the optimal model configurations.
-
-When crafting the ensemble steps, it is useful to note the distinction between
-*key* and *value* in the `input_map`/`output_map`:
-* *key*: An `input`/`output` tensor name on the composing model.
-* *value*: A tensor name on the ensemble model, which acts as an identifier
-connecting ensemble `input`/`output` to those on the composing model and
-between composing models.
-
-#### Additional Resources
-
-You can find additional end-to-end ensemble examples in the links below:
-* [This guide](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_5-Model_Ensembles)
-explores the concept of ensembles with a running example.
-* [Preprocessing in Python Backend Using
-  Ensemble](https://github.com/triton-inference-server/python_backend#preprocessing)
-* [Accelerating Inference with NVIDIA Triton Inference Server and NVIDIA
-  DALI](https://developer.nvidia.com/blog/accelerating-inference-with-triton-inference-server-and-dali/)
-* [Using RAPIDS AI with NVIDIA Triton Inference
-  Server](https://github.com/rapidsai/rapids-examples/tree/main/rapids_triton_example)
-
+![Triton Architecture Diagram](images/arch.jpg)
\ No newline at end of file
diff --git a/docs/user_guide/ensemble_models.md b/docs/user_guide/ensemble_models.md
index 8c6ebebd1b..3477e33d6c 100644
--- a/docs/user_guide/ensemble_models.md
+++ b/docs/user_guide/ensemble_models.md
@@ -1,5 +1,5 @@
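The six scheduling steps in the documentation excerpted above can be sketched as a toy simulation. This is not part of the patch and not Triton code: the three model functions are hypothetical stand-ins, and `run_ensemble` only illustrates how the scheduler collects each step's outputs and forwards them, under their scheduler-known names, to steps whose inputs become ready.

```python
def image_preprocess_model(raw_image):
    # Stand-in for the real preprocessing backend.
    return f"preprocessed({raw_image})"

def classification_model(formatted_image):
    # Stand-in for the real classification backend.
    return f"classes({formatted_image})"

def segmentation_model(formatted_image):
    # Stand-in for the real segmentation backend.
    return f"mask({formatted_image})"

# Each entry mirrors a ModelEnsembling::Step from the example config:
# (model, input_map, output_map). The map *values* are the tensor names
# known to the scheduler; the *keys* are the composing model's tensors.
STEPS = [
    (image_preprocess_model,
     {"RAW_IMAGE": "IMAGE"},
     {"PREPROCESSED_OUTPUT": "preprocessed_image"}),
    (classification_model,
     {"FORMATTED_IMAGE": "preprocessed_image"},
     {"CLASSIFICATION_OUTPUT": "CLASSIFICATION"}),
    (segmentation_model,
     {"FORMATTED_IMAGE": "preprocessed_image"},
     {"SEGMENTATION_OUTPUT": "SEGMENTATION"}),
]

def run_ensemble(request):
    # "tensors" holds every tensor name the scheduler recognizes:
    # ensemble inputs/outputs plus all input_map/output_map values.
    tensors = dict(request)
    pending = list(STEPS)
    while pending:
        # Send internal requests to every step whose inputs are ready
        # (steps 2 and 5); collect each output and map it back under its
        # scheduler-known name (steps 3 and 4).
        ready = [step for step in pending
                 if all(name in tensors for name in step[1].values())]
        for model, input_map, output_map in ready:
            inputs = [tensors[name] for name in input_map.values()]
            result = model(*inputs)
            for ensemble_name in output_map.values():
                tensors[ensemble_name] = result
            pending.remove((model, input_map, output_map))
    # Step 6: respond with the tensors mapped to the ensemble outputs.
    return {name: tensors[name]
            for name in ("CLASSIFICATION", "SEGMENTATION")}

print(run_ensemble({"IMAGE": "img0"}))
```

In the first pass only the preprocess step is ready; once "preprocessed_image" has been collected, the classification and segmentation steps both become ready, matching the fan-out in the ensemble diagram.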