This repository was archived by the owner on Jul 4, 2025. It is now read-only.

Commit 6533c4e

Update documents for release 0.9 (NVIDIA#1461)
1 parent 250d9c2 commit 6533c4e

38 files changed: +1347, -1198 lines

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -32,6 +32,8 @@ cpp/.ccache/
 tensorrt_llm/libs
 tensorrt_llm/bindings.pyi
 tensorrt_llm/bindings/*.pyi
+*docs/cpp_docs*
+*docs/source/_cpp_gen*
 
 # Testing
 .coverage.*

README.md

Lines changed: 15 additions & 445 deletions
Large diffs are not rendered by default.

docs/source/2023-05-17-how-to-add-a-new-model.md

Lines changed: 0 additions & 17 deletions
This file was deleted.

docs/source/batch_manager.md renamed to docs/source/advanced/batch-manager.md

Lines changed: 5 additions & 5 deletions
@@ -1,3 +1,5 @@
+(batch-manager)=
+
 # The Batch Manager in TensorRT-LLM
 
 TensorRT-LLM relies on a component, called the Batch Manager, to support
@@ -17,7 +19,7 @@ how it returns completed requests to the user.
 
 A software component (called the client in the text that follows) can interact
 with the batch manager using two mandatory, and several optional callbacks. Their signatures are defined
-in the [`callbacks.h`](source:cpp/include/tensorrt_llm/batch_manager/callbacks.h) file.
+in the [`callbacks.h`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/batch_manager/callbacks.h) file.
 
 These callbacks are invoked in the generation loop at regular intervals and serve a variety of functions described below.
 
@@ -40,9 +42,7 @@ tensors and a 64-bit unsigned number (`uint64_t`) that will uniquely identify
 the request. That identifier is called the *request ID* in the text that
 follows (and in the code of the batch manager). The input tensors are collected
 in a map (`std::map<std::string, Tensor>`) that associates input names to
-tensor. See
-[`InferenceRequest.h`](source:cpp/include/tensorrt_llm/batch_manager/InferenceRequest.h)
-for more details.
+tensor. Refer to [`InferenceRequest.h`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/batch_manager/inferenceRequest.h) for more information.
 
 Responses are delivered to the client through a callback of type
 `SendResponseCallback`. A conforming callback must accept the 64-bit
@@ -58,7 +58,7 @@ Its signature is:
 using SendResponseCallback = std::function<void(uint64_t, std::list<std::shared_ptr<Tensor>> const&, bool, const std::string&)>;
 ```
 
-Note that the batch manager will reject any request sent using the
+The batch manager will reject any request sent using the
 `GetInferenceRequestsCallback` callback if the request ID passed by the
 client corresponds to the request ID of a request that is being processed
 by the batch manager. A request ID can be reused after it appears in a
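
To make the callback contract above concrete, the following is a minimal, hypothetical Python sketch of the client-side bookkeeping it implies; the actual callbacks are C++ `std::function` objects defined in `callbacks.h`, and every name below is a placeholder for illustration only.

```python
# Hypothetical sketch of the response-handling contract (not TensorRT-LLM API).
in_flight = {}      # request_id -> partial responses received so far
completed = set()   # request IDs whose final response has arrived

def send_response(request_id, output_tensors, is_final, error_msg):
    # Mirrors the SendResponseCallback shape: (uint64_t id, tensors, bool, string).
    if error_msg:
        print(f"request {request_id} failed: {error_msg}")
    in_flight.setdefault(request_id, []).append(output_tensors)
    if is_final:
        # Only after the final response may this request ID be reused; an ID
        # that is still being processed by the batch manager would be rejected.
        completed.add(request_id)
        del in_flight[request_id]
```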
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+(expert-parallelism)=
+
+# Expert Parallelism in TensorRT-LLM
+
+## Mixture of Experts (MoE)
+
+Mixture of Experts (MoE) architectures have recently seen wide use, for example in [Mistral Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/blob/main/config.json). The MoE structure replaces the single FFN layer of a dense model with multiple parallel Feedforward Neural Network (FFN) layers, called experts. When tokens arrive, the router layer selects the TopK experts for each token, and the token's hidden state is dispatched to those selected experts. As a result, each expert receives the hidden states of multiple tokens.
+
+<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/media/moe_structure.png?raw=true" alt="moe_structure" width="500" height="auto">
+
+<sub>The MoE structure in Switch Transformers: [https://arxiv.org/pdf/2101.03961.pdf](https://arxiv.org/pdf/2101.03961.pdf)</sub>
+
+## Tensor Parallel vs Expert Parallel
+
+Multi-GPU parallelism is necessary when the MoE model does not fit in a single GPU's memory. TensorRT-LLM supports two parallel patterns for the MoE structure: Tensor Parallel (the default) and Expert Parallel.
+
+<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/media/tp_ep.png?raw=true" alt="tensor parallel vs expert parallel" width="500" height="auto">
+
+Tensor Parallel evenly splits each expert's weights across GPUs, so each GPU holds a partial copy of every expert's weights, while Expert Parallel distributes whole experts across GPUs, so each GPU holds the full weights of a subset of the experts. As a result, each GPU rank in a Tensor Parallel group receives the hidden states of all tokens for all experts and computes with its partial weights, whereas in Expert Parallel each GPU rank receives only the hidden states of the tokens routed to the experts it hosts and computes with their full weights.
+
+
+## How to Enable
+
+The default parallel pattern is Tensor Parallel. You can enable Expert Parallel by setting `--moe_tp_mode 1` when calling `convert_checkpoint.py`, and `--tp_size` is used to set the Expert Parallel size.
+
+The other parameters related to the MoE structure, such as `num_experts_per_tok` (the TopK mentioned above) and `num_local_experts`, can be found in the model's configuration file, such as the one for the [Mixtral 8x7B model](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/blob/main/config.json).
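
The routing and weight-partitioning behavior described above can be pictured with a short NumPy sketch; it is editorial, not TensorRT-LLM code, and all names and shapes are assumptions for the example only.

```python
import numpy as np

# (1) Top-k routing and (2) Tensor Parallel vs Expert Parallel weight layout.
num_tokens, hidden, num_experts, top_k, world_size = 8, 16, 4, 2, 2

hidden_states = np.random.randn(num_tokens, hidden)
router = np.random.randn(hidden, num_experts)
experts = [np.random.randn(hidden, hidden) for _ in range(num_experts)]

# (1) The router produces one logit per expert; each token's hidden state is
# dispatched to its top-k experts.
topk = np.argsort(-(hidden_states @ router), axis=-1)[:, :top_k]

# (2a) Tensor Parallel: each rank holds a slice of *every* expert's weights,
# so every rank receives all tokens and computes with partial weights.
cols = hidden // world_size
tp_rank0_weights = [w[:, :cols] for w in experts]

# (2b) Expert Parallel: each rank holds the *full* weights of a subset of the
# experts, so a rank only receives the tokens routed to its local experts.
local_experts = list(range(num_experts // world_size))  # experts hosted on rank 0
ep_rank0_tokens = np.unique(np.where(np.isin(topk, local_experts))[0])
```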

docs/source/gpt_attention.md renamed to docs/source/advanced/gpt-attention.md

Lines changed: 30 additions & 29 deletions
@@ -1,16 +1,13 @@
-# Multi-head, Multi-query and Group-query Attention
+(gpt-attention)=
 
-This document details the implementation of multihead attention (MHA),
-multiquery attention (MQA) and group-query attention (GQA) for auto-regressive
-GPT-like models in TensorRT-LLM. As a quick reminder, the multihead attention
+# Multi-Head, Multi-Query, and Group-Query Attention
+
+This document details the implementation of multi-head attention (MHA),
+multi-query attention (MQA) and group-query attention (GQA) for auto-regressive
+GPT-like models in TensorRT-LLM. As a quick reminder, the multi-head attention
 is the sequence of a batched matmul, a softmax and another batched matmul
 described in the
-[Attention Is All You Need](https://arxiv.org/abs/1706.03762) article.
-Multi-query Attention (MQA) [[https://arxiv.org/abs/1911.02150](https://arxiv.org/abs/1911.02150)]
-Group-query Attention (GQA) [[https://arxiv.org/abs/2307.09288](https://arxiv.org/abs/2307.09288)]
-are variants of MHA that use fewer, so-called, K/V head than the number of
-query heads. TensorRT-LLM, MHA, MQA and GQA are implemented by the operator
-[`tensorrt_llm.functional.gpt_attention`](source:tensorrt_llm/functional.py).
+[Attention Is All You Need](https://arxiv.org/abs/1706.03762) article. [Multi-query Attention (MQA)](https://arxiv.org/abs/1911.02150) and [Group-query Attention (GQA)](https://arxiv.org/abs/2307.09288) are variants of MHA that use fewer so-called K/V heads than the number of query heads. In TensorRT-LLM, MHA, MQA and GQA are implemented by the operator [`tensorrt_llm.functional.gpt_attention`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/functional.py).
 
 ## Important Note
 
@@ -24,17 +21,17 @@ future***.
 In TensorRT-LLM, the GPT attention operator supports two different types
 of QKV inputs: Padded and packed (i.e. non padded) inputs. The mode is
 determined by the global configuration parameter `remove_input_padding` defined
-in [`tensorrt_llm.plugin`](source:tensorrt_llm/plugin/plugin.py).
+in [`tensorrt_llm.plugin`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/plugin/plugin.py).
 
-When padding is enabled (i.e. `remove_input_padding` is `False`), the sequences
+When padding is enabled (that is, `remove_input_padding` is `False`), the sequences
 that are shorter than the `max_sequence_length` are padded to that maximum
 length. It may result in excessive memory consumption as well as unneeded
 computations on padding tokens (in the various matrix multiplications that
 surround the MHA block).
 
 To overcome that problem, TensorRT-LLM supports a mode without padding where
 the different tokens are packed together and the user provides the operator
-with a 1D tensor containing the lengths of the different sequences.  It is
+with a 1D tensor containing the lengths of the different sequences. It is
 recommended that users always use packed mode (and support for the padded
 mode may be removed in the future).
 
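
To illustrate the two input modes above, here is a small, editorial NumPy sketch (not the TensorRT-LLM API): padded mode stores every sequence at `max_sequence_length`, while packed mode concatenates the tokens and passes a 1D tensor of sequence lengths alongside.

```python
import numpy as np

# Editorial sketch (not TensorRT-LLM code) of padded vs. packed inputs.
sequences = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]  # token IDs
max_sequence_length = 4
pad_id = 0

# Padded mode: every sequence is padded to max_sequence_length.
padded = np.full((len(sequences), max_sequence_length), pad_id)
for i, seq in enumerate(sequences):
    padded[i, :len(seq)] = seq

# Packed mode (remove_input_padding): tokens are concatenated and a 1D tensor
# of sequence lengths tells the operator where each sequence starts and ends.
packed = np.concatenate([np.asarray(s) for s in sequences])
sequence_lengths = np.asarray([len(s) for s in sequences])

print(padded)            # shape [3, 4], contains padding tokens
print(packed)            # shape [9], no padding
print(sequence_lengths)  # [3 2 4]
```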

@@ -45,8 +42,8 @@ context and generation phases in auto-regressive models like GPT.
 
 ### Context Phase
 
-If the `context_fmha_type` is set to `disabled` (see
-[`tensorrt_llm.plugin`](source:tensorrt_llm/plugin/plugin.py)),
+If the `context_fmha_type` is set to `disabled` (refer to
+[`tensorrt_llm.plugin`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/plugin/plugin.py)),
 the implementation maps to a sequence of GPU kernels that will store the
 intermediate `Q*K^T` tensor in memory before calling the softmax operator. It
 is the slowest method and the memory footprint is significant (quadratically
@@ -58,9 +55,9 @@ FP32), that function will trigger a kernel that performs the MHA/MQA block
 using a single kernel. For short sequences, that kernel uses a vanilla
 implementation of MHA/MQA. For larger sequences, this kernel uses the Flash
 Attention algorithm as described in
-[https://arxiv.org/abs/2205.14135](https://arxiv.org/abs/2205.14135)
+[FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135)
 and
-[https://arxiv.org/abs/2307.08691](https://arxiv.org/abs/2307.08691).
+[FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning](https://arxiv.org/abs/2307.08691).
 
 Currently, the implementation triggers extra kernels that apply pre-processing
 to the elements (like RoPE) and populate the KV cache (see below). In a future
@@ -72,18 +69,16 @@ improve the overall performance.
 When FP8 quantization is activated, the attention can be further accelerated by
 enabling FP8 Context FMHA (`use_fp8_context_fmha = enable`).
 
-Please be aware that this is an experimental feature only supported on Hopper.
-If you notice a significant decrease in accuracy, it is recommended to disable
-it..
+This is an experimental feature only supported on Hopper. If you notice a significant decrease in accuracy, it is recommended to disable it.
 
 ### Generation Phase
 
-The generation phase is implemented using a single kernel, called the masked
-multihead attention in TensorRT-LLM. That kernel is able to apply
-pre-processing on the Q, K and V elements on-the-fly: Add the QKV bias, apply
-RoPE, do dequantization/quantization. TensorRT-LLM will continue to add (or
+The generation phase is implemented using a single kernel called the masked
+multi-head attention in TensorRT-LLM. That kernel applies
+pre-processing to the Q, K, and V elements on-the-fly: it adds the QKV bias, applies
+RoPE, and performs dequantization and quantization. TensorRT-LLM will continue to add (or
 enable) additional features in future releases. For example, enable the support
-for ALiBi or IA3.
+for IA3.
 
 _The masked MHA kernel has a special version that distributes the work across
 multiple CUDA thread-blocks on the GPU for cases where the GPU occupancy is
@@ -131,9 +126,12 @@ of class `DecoderXQARunner` in
 `cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQARunner.h`.
 
 
-## Inflight batching
+(inflight-batching)=
+
+## In-flight Batching
 
-TensorRT-LLM supports a feature called in-flight batching. With that feature,
+TensorRT-LLM supports in-flight batching of requests (also known as continuous
+batching or iteration-level batching) for higher serving throughput. With this feature,
 sequences in context phase can be processed together with sequences in
 generation phase. The purpose of that technique is to better interleave
 requests to reduce latency as well as make better use of the GPUs.
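
The in-flight batching behavior described above can be pictured with a toy Python scheduler; this is an editorial sketch under assumed batch sizes and generation lengths, not the batch manager's actual scheduling logic.

```python
from collections import deque

# Toy model of iteration-level (in-flight) batching: new requests join the
# batch as soon as a slot frees up, and finished requests leave immediately.
waiting = deque(["req0", "req1", "req2", "req3"])  # not yet started
active = {}                                        # request -> remaining generation steps
MAX_BATCH_SIZE = 3                                 # assumed for this example
GENERATION_STEPS = 3                               # assumed for this example

for step in range(8):
    # Admit new requests; they run their context phase in this iteration,
    # batched together with requests already in their generation phase.
    while waiting and len(active) < MAX_BATCH_SIZE:
        active[waiting.popleft()] = GENERATION_STEPS
    for req in list(active):
        active[req] -= 1
        if active[req] == 0:
            del active[req]                        # finished: leaves the batch at once
    print(f"step {step}: active={sorted(active)}")
```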
@@ -150,6 +148,8 @@ be relaxed in a future version.
 _(1) Padding sequences in the generation phase, that contain a single token, to
 the length of the maximum input sequence is inefficient use of resources_.
 
+
+
 ## Chunked Context
 
 In the original state, the common behavior was to process all context tokens at
@@ -158,10 +158,10 @@ context chunks can be batched with more tokens during the generation phase,
 which is expected to increase the total throughput. Chunking contexts also removes
 constraints on input length. To enable this feature, the FMHA paged kv-cache also
 needs to be enabled. Except for the last one, the size of the context chunk needs
-to be an integer multiple of the kv-cache block size. Please refer to
+to be an integer multiple of the kv-cache block size. Refer to
 [the performance best practices](perf_best_practices.md#chunked-context) for usage.
 
-## KV Cache(s)
+## KV Cache
 
 In the generation phase, a common optimization is to provide the MHA kernel
 with a cache containing the values of the past K and V elements that have
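
As a concrete reading of the chunk-size constraint above (every chunk except the last must be an integer multiple of the kv-cache block size), here is a small, editorial Python sketch; the function name and parameters are assumptions, not TensorRT-LLM API.

```python
# Editorial sketch: split a context of `input_len` tokens into chunks whose
# sizes, except for the last one, are multiples of the kv-cache block size.
def split_context(input_len: int, chunk_tokens: int, kv_block_size: int):
    assert chunk_tokens % kv_block_size == 0, \
        "every chunk but the last must be a multiple of the kv-cache block size"
    chunks, remaining = [], input_len
    while remaining > 0:
        size = min(chunk_tokens, remaining)  # the last chunk may be smaller
        chunks.append(size)
        remaining -= size
    return chunks

print(split_context(input_len=1000, chunk_tokens=256, kv_block_size=64))
# [256, 256, 256, 232]
```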
@@ -290,6 +290,7 @@ sequences in generation phase, there are `beam_width` tokens per sequence. The
 beam width can be different for each sequence.
 
 In other words, the pseudo-code to compute the number of tokens is:
+
 ```python
 num_tokens = 0
 
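# Editorial sketch only (the diff hunk above ends here): based on the
# surrounding prose, the count would continue roughly as follows.
# Generation-phase sequences contribute `beam_width` tokens each; the
# context-phase rule and all names below are assumptions.
for seq in context_phase_sequences:
    num_tokens += seq.input_length
for seq in generation_phase_sequences:
    num_tokens += seq.beam_width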
