tensor. Refer to [`InferenceRequest.h`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/batch_manager/inferenceRequest.h) for more information.
Responses are delivered to the client through a callback of type
`SendResponseCallback`. A conforming callback must accept the 64-bit
request ID that identifies the request, the list of output tensors, a boolean flag that marks the final response for that request, and a potentially non-empty error message. Its signature is:
```cpp
using SendResponseCallback = std::function<void(uint64_t, std::list<std::shared_ptr<Tensor>> const&, bool, const std::string&)>;
```
The batch manager will reject any request sent using the
`GetInferenceRequestsCallback` callback if the request ID passed by the
client corresponds to the request ID of a request that is being processed
by the batch manager. A request ID can be reused after it appears in a final response delivered through the `SendResponseCallback` callback.
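
For illustration, here is a minimal sketch of a conforming callback built only from the signature shown above. The `Tensor` forward declaration and the `makeSendResponseCallback` helper are placeholders invented for this example, not part of the batch manager API; a real client would include the batch manager headers and forward the tensors to its own delivery path.

```cpp
#include <cstdint>
#include <functional>
#include <list>
#include <memory>
#include <string>

// Placeholder for the batch manager's tensor type; a real client would pull
// this in from the TensorRT-LLM batch manager headers instead.
class Tensor;

using SendResponseCallback
    = std::function<void(uint64_t, std::list<std::shared_ptr<Tensor>> const&, bool, const std::string&)>;

// A minimal conforming callback: it receives the 64-bit request ID, the list
// of output tensors, a flag set to true on the last response for that request,
// and an error message that is non-empty when the request failed.
SendResponseCallback makeSendResponseCallback()
{
    return [](uint64_t requestId, std::list<std::shared_ptr<Tensor>> const& outputTensors, bool isFinal,
               const std::string& errMsg)
    {
        if (!errMsg.empty())
        {
            // Report the failure to the client that owns requestId.
            return;
        }
        // Deliver outputTensors to that client; once isFinal is true, no
        // further responses will arrive for this request ID.
        (void) requestId;
        (void) outputTensors;
        (void) isFinal;
    };
}
```

Such a callback is passed to the batch manager together with the `GetInferenceRequestsCallback` that feeds new requests in.
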
Mixture of Experts (MoE) architectures have recently been widely adopted, for example in [Mistral Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/blob/main/config.json). Specifically, the MoE structure replaces the single Feedforward Neural Network (FFN) layer of a dense model with multiple parallel FFN layers, called experts. When tokens arrive, a router layer selects the TopK experts for each token, and the token's hidden state is dispatched to each of the selected experts. As a result, each expert receives the hidden states of multiple tokens.
<sub>The MoE structure in Switch Transformer: [https://arxiv.org/pdf/2101.03961.pdf](https://arxiv.org/pdf/2101.03961.pdf)</sub>
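
As a concrete illustration of the routing step described above, the following standalone sketch (not TensorRT-LLM code; the function name and types are made up for this example) selects the TopK experts for a single token from its router logits:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Conceptual sketch of the router's TopK selection for one token: given one
// logit per expert, return the indices of the k highest-scoring experts.
// Assumes k <= routerLogits.size().
std::vector<std::size_t> selectTopKExperts(std::vector<float> const& routerLogits, std::size_t k)
{
    std::vector<std::size_t> expertIds(routerLogits.size());
    std::iota(expertIds.begin(), expertIds.end(), 0);
    std::partial_sort(expertIds.begin(), expertIds.begin() + static_cast<std::ptrdiff_t>(k), expertIds.end(),
        [&routerLogits](std::size_t a, std::size_t b) { return routerLogits[a] > routerLogits[b]; });
    expertIds.resize(k);
    return expertIds;
}
```

In a real MoE layer this selection runs for every token in the batch, and the hidden states are then grouped per expert before the expert FFNs are applied.
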
## Tensor Parallel vs Expert Parallel
Parallelism across multiple GPUs is necessary when the MoE model cannot fit in a single GPU's memory. We support two parallel patterns for the MoE structure: Tensor Parallel (the default pattern) and Expert Parallel.
<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/media/tp_ep.png?raw=true" alt="tensor parallel vs expert parallel" width="500" height="auto">
Tensor Parallel evenly splits the weights of every expert and distributes the slices across GPUs, so each GPU holds a partial weight of all experts. Expert Parallel instead assigns whole experts to different GPUs, so each GPU holds the full weights of only a subset of the experts. As a result, each GPU rank in a Tensor Parallel group receives the hidden states of all tokens for all experts and computes with its partial weights, whereas with Expert Parallel each GPU rank only receives the hidden states of the tokens routed to the experts on that rank and computes with the full weights.
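
To make the difference concrete, here is a toy calculation with made-up, roughly Mixtral-like dimensions (a sketch only, not TensorRT-LLM code). It counts how many elements of one expert FFN weight matrix a single rank stores under each pattern; the per-rank storage comes out the same, and what actually differs is which tokens' hidden states each rank has to receive.

```cpp
#include <cstddef>
#include <cstdio>

// Toy per-rank weight count for one FFN matrix of an MoE layer under
// Tensor Parallel vs Expert Parallel. Dimensions are assumptions for the
// example; a real expert also has additional FFN and gating weights.
int main()
{
    std::size_t const numExperts = 8;     // experts in the MoE layer
    std::size_t const hiddenSize = 4096;  // model hidden dimension
    std::size_t const ffnSize = 14336;    // expert FFN intermediate dimension
    std::size_t const worldSize = 4;      // number of GPU ranks

    // Tensor Parallel: every rank holds a 1/worldSize slice of every expert.
    std::size_t const tpPerRank = numExperts * hiddenSize * (ffnSize / worldSize);

    // Expert Parallel: every rank holds numExperts/worldSize full experts.
    std::size_t const epPerRank = (numExperts / worldSize) * hiddenSize * ffnSize;

    std::printf("TP elements per rank: %zu\n", tpPerRank);
    std::printf("EP elements per rank: %zu\n", epPerRank);
    return 0;
}
```
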
## How to Enable
The default parallel pattern is Tensor Parallel. You can enable Expert Parallel by setting `--moe_tp_mode 1` when calling `convert_checkpoint.py`; `--tp_size` is then used to set the Expert Parallel size.
The other parameters related to the MoE structure, such as `num_experts_per_tok` (the TopK mentioned above) and `num_local_experts`, can be found in the model's configuration file, such as the one for the [Mixtral 8x7B model](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/blob/main/config.json).
[Attention Is All You Need](https://arxiv.org/abs/1706.03762) article. [Multi-query Attention (MQA)](https://arxiv.org/abs/1911.02150) and [Group-query Attention (GQA)](https://arxiv.org/abs/2307.09288) are variants of MHA that use fewer so-called K/V heads than the number of query heads. In TensorRT-LLM, MHA, MQA and GQA are implemented by the operator [`tensorrt_llm.functional.gpt_attention`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/functional.py).
## Important Note
In TensorRT-LLM, the GPT attention operator supports two different types
of QKV inputs: padded and packed (i.e. non-padded) inputs. The mode is
determined by the global configuration parameter `remove_input_padding` defined
in [`tensorrt_llm.plugin`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/plugin/plugin.py).
When padding is enabled (that is, `remove_input_padding` is `False`), the sequences
that are shorter than the `max_sequence_length` are padded to that maximum
length. It may result in excessive memory consumption as well as unneeded
computations on padding tokens (in the various matrix multiplications that
surround the MHA block).
To overcome that problem, TensorRT-LLM supports a mode without padding where
the different tokens are packed together and the user provides the operator
with a 1D tensor containing the lengths of the different sequences. It is
recommended that users always use packed mode (and support for the padded
mode may be removed in the future).
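
As a rough illustration of what packed mode means for the inputs (this is only a sketch, not the actual TensorRT-LLM API; the `PackedInput` type and `packSequences` function are invented for the example), the sequences are concatenated into a single 1D buffer with no padding tokens, and a separate 1D tensor records each sequence's length:

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: concatenate variable-length sequences back to back
// (no padding tokens) and record per-sequence lengths, which is the kind of
// input layout the packed (remove_input_padding) mode expects.
struct PackedInput
{
    std::vector<int32_t> tokenIds; // all sequences concatenated
    std::vector<int32_t> lengths;  // one entry per sequence
};

PackedInput packSequences(std::vector<std::vector<int32_t>> const& sequences)
{
    PackedInput packed;
    for (auto const& seq : sequences)
    {
        packed.tokenIds.insert(packed.tokenIds.end(), seq.begin(), seq.end());
        packed.lengths.push_back(static_cast<int32_t>(seq.size()));
    }
    return packed;
}
```
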
The GPT attention operator supports both the context and generation phases in auto-regressive models like GPT.
### Context Phase
If the `context_fmha_type` is set to `disabled` (see