This repository was archived by the owner on Jul 4, 2025. It is now read-only.

Commit 6533c4e

Update documents for release 0.9 (NVIDIA#1461)
1 parent 250d9c2 commit 6533c4e

38 files changed: +1347, -1198 lines

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -32,6 +32,8 @@ cpp/.ccache/
 tensorrt_llm/libs
 tensorrt_llm/bindings.pyi
 tensorrt_llm/bindings/*.pyi
+*docs/cpp_docs*
+*docs/source/_cpp_gen*
 
 # Testing
 .coverage.*

README.md

Lines changed: 15 additions & 445 deletions
Large diffs are not rendered by default.

docs/source/2023-05-17-how-to-add-a-new-model.md

Lines changed: 0 additions & 17 deletions
This file was deleted.

docs/source/batch_manager.md renamed to docs/source/advanced/batch-manager.md

Lines changed: 5 additions & 5 deletions
@@ -1,3 +1,5 @@
+(batch-manager)=
+
 # The Batch Manager in TensorRT-LLM
 
 TensorRT-LLM relies on a component, called the Batch Manager, to support
@@ -17,7 +19,7 @@ how it returns completed requests to the user.
 
 A software component (called the client in the text that follows) can interact
 with the batch manager using two mandatory, and several optional callbacks. Their signatures are defined
-in the [`callbacks.h`](source:cpp/include/tensorrt_llm/batch_manager/callbacks.h) file.
+in the [`callbacks.h`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/batch_manager/callbacks.h) file.
 
 These callbacks are invoked in the generation loop at regular intervals and serve a variety of functions described below.
 
@@ -40,9 +42,7 @@ tensors and a 64-bit unsigned number (`uint64_t`) that will uniquely identify
 the request. That identifier is called the *request ID* in the text that
 follows (and in the code of the batch manager). The input tensors are collected
 in a map (`std::map<std::string, Tensor>`) that associates input names to
-tensor. See
-[`InferenceRequest.h`](source:cpp/include/tensorrt_llm/batch_manager/InferenceRequest.h)
-for more details.
+tensor. Refer to [`InferenceRequest.h`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/batch_manager/inferenceRequest.h) for more information.
 
 Responses are delivered to the client through a callback of type
 `SendResponseCallback`. A conforming callback must accept the 64-bit
@@ -58,7 +58,7 @@ Its signature is:
 using SendResponseCallback = std::function<void(uint64_t, std::list<std::shared_ptr<Tensor>> const&, bool, const std::string&)>;
 ```
 
-Note that the batch manager will reject any request sent using the
+The batch manager will reject any request sent using the
 `GetInferenceRequestsCallback` callback if the request ID passed by the
 client corresponds to the request ID of a request that is being processed
 by the batch manager. A request ID can be reused after it appears in a
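
To make the callback contract above concrete, the following is a minimal, hypothetical Python sketch of the client-side bookkeeping it implies; the actual callbacks are C++ `std::function` objects defined in `callbacks.h`, and every name below is a placeholder for illustration only.

```python
# Hypothetical sketch of the response-handling contract (not TensorRT-LLM API).
in_flight = {}      # request_id -> partial responses received so far
completed = set()   # request IDs whose final response has arrived

def send_response(request_id, output_tensors, is_final, error_msg):
    # Mirrors the SendResponseCallback shape: (uint64_t id, tensors, bool, string).
    if error_msg:
        print(f"request {request_id} failed: {error_msg}")
    in_flight.setdefault(request_id, []).append(output_tensors)
    if is_final:
        # Only after the final response may this request ID be reused; an ID
        # that is still being processed by the batch manager would be rejected.
        completed.add(request_id)
        del in_flight[request_id]
```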
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+(expert-parallelism)=
+
+# Expert Parallelism in TensorRT-LLM
+
+## Mixture of Experts (MoE)
+
+Mixture of Experts (MoE) architectures have recently seen wide use, for example in [Mistral Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/blob/main/config.json). The MoE structure replaces the single FFN layer of a dense model with multiple parallel Feedforward Neural Network (FFN) layers, called experts. When tokens arrive, the router layer selects the TopK experts for each token, and the token's hidden state is dispatched to those selected experts. As a result, each expert receives the hidden states of multiple tokens.
+
+<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/media/moe_structure.png?raw=true" alt="moe_structure" width="500" height="auto">
+
+<sub>The MoE structure in Switch Transformers: [https://arxiv.org/pdf/2101.03961.pdf](https://arxiv.org/pdf/2101.03961.pdf)</sub>
+
+## Tensor Parallel vs Expert Parallel
+
+Multi-GPU parallelism is necessary when the MoE model does not fit in a single GPU's memory. TensorRT-LLM supports two parallel patterns for the MoE structure: Tensor Parallel (the default) and Expert Parallel.
+
+<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/media/tp_ep.png?raw=true" alt="tensor parallel vs expert parallel" width="500" height="auto">
+
+Tensor Parallel evenly splits each expert's weights across GPUs, so each GPU holds a partial copy of every expert's weights, while Expert Parallel distributes whole experts across GPUs, so each GPU holds the full weights of a subset of the experts. As a result, each GPU rank in a Tensor Parallel group receives the hidden states of all tokens for all experts and computes with its partial weights, whereas in Expert Parallel each GPU rank receives only the hidden states of the tokens routed to the experts it hosts and computes with their full weights.
+
+
+## How to Enable
+
+The default parallel pattern is Tensor Parallel. You can enable Expert Parallel by setting `--moe_tp_mode 1` when calling `convert_checkpoint.py`, and `--tp_size` is used to set the Expert Parallel size.
+
+The other parameters related to the MoE structure, such as `num_experts_per_tok` (the TopK mentioned above) and `num_local_experts`, can be found in the model's configuration file, such as the one for the [Mixtral 8x7B model](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/blob/main/config.json).
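
The routing and weight-partitioning behavior described above can be pictured with a short NumPy sketch; it is editorial, not TensorRT-LLM code, and all names and shapes are assumptions for the example only.

```python
import numpy as np

# (1) Top-k routing and (2) Tensor Parallel vs Expert Parallel weight layout.
num_tokens, hidden, num_experts, top_k, world_size = 8, 16, 4, 2, 2

hidden_states = np.random.randn(num_tokens, hidden)
router = np.random.randn(hidden, num_experts)
experts = [np.random.randn(hidden, hidden) for _ in range(num_experts)]

# (1) The router produces one logit per expert; each token's hidden state is
# dispatched to its top-k experts.
topk = np.argsort(-(hidden_states @ router), axis=-1)[:, :top_k]

# (2a) Tensor Parallel: each rank holds a slice of *every* expert's weights,
# so every rank receives all tokens and computes with partial weights.
cols = hidden // world_size
tp_rank0_weights = [w[:, :cols] for w in experts]

# (2b) Expert Parallel: each rank holds the *full* weights of a subset of the
# experts, so a rank only receives the tokens routed to its local experts.
local_experts = list(range(num_experts // world_size))  # experts hosted on rank 0
ep_rank0_tokens = np.unique(np.where(np.isin(topk, local_experts))[0])
```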

docs/source/gpt_attention.md renamed to docs/source/advanced/gpt-attention.md

Lines changed: 30 additions & 29 deletions
@@ -1,16 +1,13 @@
-# Multi-head, Multi-query and Group-query Attention
+(gpt-attention)=
 
-This document details the implementation of multihead attention (MHA),
-multiquery attention (MQA) and group-query attention (GQA) for auto-regressive
-GPT-like models in TensorRT-LLM. As a quick reminder, the multihead attention
+# Multi-Head, Multi-Query, and Group-Query Attention
+
+This document details the implementation of multi-head attention (MHA),
+multi-query attention (MQA) and group-query attention (GQA) for auto-regressive
+GPT-like models in TensorRT-LLM. As a quick reminder, the multi-head attention
 is the sequence of a batched matmul, a softmax and another batched matmul
 described in the
-[Attention Is All You Need](https://arxiv.org/abs/1706.03762) article.
-Multi-query Attention (MQA) [[https://arxiv.org/abs/1911.02150](https://arxiv.org/abs/1911.02150)]
-Group-query Attention (GQA) [[https://arxiv.org/abs/2307.09288](https://arxiv.org/abs/2307.09288)]
-are variants of MHA that use fewer, so-called, K/V head than the number of
-query heads. TensorRT-LLM, MHA, MQA and GQA are implemented by the operator
-[`tensorrt_llm.functional.gpt_attention`](source:tensorrt_llm/functional.py).
+[Attention Is All You Need](https://arxiv.org/abs/1706.03762) article. [Multi-query Attention (MQA)](https://arxiv.org/abs/1911.02150) and [Group-query Attention (GQA)](https://arxiv.org/abs/2307.09288) are variants of MHA that use fewer so-called K/V heads than the number of query heads. In TensorRT-LLM, MHA, MQA and GQA are implemented by the operator [`tensorrt_llm.functional.gpt_attention`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/functional.py).
 
 ## Important Note
 
@@ -24,17 +21,17 @@ future***.
 In TensorRT-LLM, the GPT attention operator supports two different types
 of QKV inputs: Padded and packed (i.e. non padded) inputs. The mode is
 determined by the global configuration parameter `remove_input_padding` defined
-in [`tensorrt_llm.plugin`](source:tensorrt_llm/plugin/plugin.py).
+in [`tensorrt_llm.plugin`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/plugin/plugin.py).
 
-When padding is enabled (i.e. `remove_input_padding` is `False`), the sequences
+When padding is enabled (that is, `remove_input_padding` is `False`), the sequences
 that are shorter than the `max_sequence_length` are padded to that maximum
 length. It may result in excessive memory consumption as well as unneeded
 computations on padding tokens (in the various matrix multiplications that
 surround the MHA block).
 
 To overcome that problem, TensorRT-LLM supports a mode without padding where
 the different tokens are packed together and the user provides the operator
-with a 1D tensor containing the lengths of the different sequences.  It is
+with a 1D tensor containing the lengths of the different sequences. It is
 recommended that users always use packed mode (and support for the padded
 mode may be removed in the future).
 
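
To illustrate the two input modes above, here is a small, editorial NumPy sketch (not the TensorRT-LLM API): padded mode stores every sequence at `max_sequence_length`, while packed mode concatenates the tokens and passes a 1D tensor of sequence lengths alongside.

```python
import numpy as np

# Editorial sketch (not TensorRT-LLM code) of padded vs. packed inputs.
sequences = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]  # token IDs
max_sequence_length = 4
pad_id = 0

# Padded mode: every sequence is padded to max_sequence_length.
padded = np.full((len(sequences), max_sequence_length), pad_id)
for i, seq in enumerate(sequences):
    padded[i, :len(seq)] = seq

# Packed mode (remove_input_padding): tokens are concatenated and a 1D tensor
# of sequence lengths tells the operator where each sequence starts and ends.
packed = np.concatenate([np.asarray(s) for s in sequences])
sequence_lengths = np.asarray([len(s) for s in sequences])

print(padded)            # shape [3, 4], contains padding tokens
print(packed)            # shape [9], no padding
print(sequence_lengths)  # [3 2 4]
```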

@@ -45,8 +42,8 @@ context and generation phases in auto-regressive models like GPT.
 
 ### Context Phase
 
-If the `context_fmha_type` is set to `disabled` (see
-[`tensorrt_llm.plugin`](source:tensorrt_llm/plugin/plugin.py)),
+If the `context_fmha_type` is set to `disabled` (refer to
+[`tensorrt_llm.plugin`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/plugin/plugin.py)),
 the implementation maps to a sequence of GPU kernels that will store the
 intermediate `Q*K^T` tensor in memory before calling the softmax operator. It
 is the slowest method and the memory footprint is significant (quadratically
@@ -58,9 +55,9 @@ FP32), that function will trigger a kernel that performs the MHA/MQA block
 using a single kernel. For short sequences, that kernel uses a vanilla
 implementation of MHA/MQA. For larger sequences, this kernel uses the Flash
 Attention algorithm as described in
-[https://arxiv.org/abs/2205.14135](https://arxiv.org/abs/2205.14135)
+[FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135)
 and
-[https://arxiv.org/abs/2307.08691](https://arxiv.org/abs/2307.08691).
+[FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning](https://arxiv.org/abs/2307.08691).
 
 Currently, the implementation triggers extra kernels that apply pre-processing
 to the elements (like RoPE) and populate the KV cache (see below). In a future
@@ -72,18 +69,16 @@ improve the overall performance.
 When FP8 quantization is activated, the attention can be further accelerated by
 enabling FP8 Context FMHA (`use_fp8_context_fmha = enable`).
 
-Please be aware that this is an experimental feature only supported on Hopper.
-If you notice a significant decrease in accuracy, it is recommended to disable
-it..
+This is an experimental feature only supported on Hopper. If you notice a significant decrease in accuracy, it is recommended to disable it.
 
 ### Generation Phase
 
-The generation phase is implemented using a single kernel, called the masked
-multihead attention in TensorRT-LLM. That kernel is able to apply
-pre-processing on the Q, K and V elements on-the-fly: Add the QKV bias, apply
-RoPE, do dequantization/quantization. TensorRT-LLM will continue to add (or
+The generation phase is implemented using a single kernel called the masked
+multi-head attention in TensorRT-LLM. That kernel applies
+pre-processing to the Q, K, and V elements on-the-fly: it adds the QKV bias, applies
+RoPE, and performs dequantization and quantization. TensorRT-LLM will continue to add (or
 enable) additional features in future releases. For example, enable the support
-for ALiBi or IA3.
+for IA3.
 
 _The masked MHA kernel has a special version that distributes the work across
 multiple CUDA thread-blocks on the GPU for cases where the GPU occupancy is
@@ -131,9 +126,12 @@ of class `DecoderXQARunner` in
 `cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQARunner.h`.
 
 
-## Inflight batching
+(inflight-batching)=
+
+## In-flight Batching
 
-TensorRT-LLM supports a feature called in-flight batching. With that feature,
+TensorRT-LLM supports in-flight batching of requests (also known as continuous
+batching or iteration-level batching) for higher serving throughput. With this feature,
 sequences in context phase can be processed together with sequences in
 generation phase. The purpose of that technique is to better interleave
 requests to reduce latency as well as make better use of the GPUs.
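
The in-flight batching behavior described above can be pictured with a toy Python scheduler; this is an editorial sketch under assumed batch sizes and generation lengths, not the batch manager's actual scheduling logic.

```python
from collections import deque

# Toy model of iteration-level (in-flight) batching: new requests join the
# batch as soon as a slot frees up, and finished requests leave immediately.
waiting = deque(["req0", "req1", "req2", "req3"])  # not yet started
active = {}                                        # request -> remaining generation steps
MAX_BATCH_SIZE = 3                                 # assumed for this example
GENERATION_STEPS = 3                               # assumed for this example

for step in range(8):
    # Admit new requests; they run their context phase in this iteration,
    # batched together with requests already in their generation phase.
    while waiting and len(active) < MAX_BATCH_SIZE:
        active[waiting.popleft()] = GENERATION_STEPS
    for req in list(active):
        active[req] -= 1
        if active[req] == 0:
            del active[req]                        # finished: leaves the batch at once
    print(f"step {step}: active={sorted(active)}")
```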
@@ -150,6 +148,8 @@ be relaxed in a future version.
 _(1) Padding sequences in the generation phase, that contain a single token, to
 the length of the maximum input sequence is inefficient use of resources_.
 
+
+
 ## Chunked Context
 
 In the original state, the common behavior was to process all context tokens at
@@ -158,10 +158,10 @@ context chunks can be batched with more tokens during the generation phase,
 which is expected to increase the total throughput. Chunking contexts also removes
 constraints on input length. To enable this feature, the FMHA paged kv-cache also
 needs to be enabled. Except for the last one, the size of the context chunk needs
-to be an integer multiple of the kv-cache block size. Please refer to
+to be an integer multiple of the kv-cache block size. Refer to
 [the performance best practices](perf_best_practices.md#chunked-context) for usage.
 
-## KV Cache(s)
+## KV Cache
 
 In the generation phase, a common optimization is to provide the MHA kernel
 with a cache containing the values of the past K and V elements that have
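
As a concrete reading of the chunk-size constraint above (every chunk except the last must be an integer multiple of the kv-cache block size), here is a small, editorial Python sketch; the function name and parameters are assumptions, not TensorRT-LLM API.

```python
# Editorial sketch: split a context of `input_len` tokens into chunks whose
# sizes, except for the last one, are multiples of the kv-cache block size.
def split_context(input_len: int, chunk_tokens: int, kv_block_size: int):
    assert chunk_tokens % kv_block_size == 0, \
        "every chunk but the last must be a multiple of the kv-cache block size"
    chunks, remaining = [], input_len
    while remaining > 0:
        size = min(chunk_tokens, remaining)  # the last chunk may be smaller
        chunks.append(size)
        remaining -= size
    return chunks

print(split_context(input_len=1000, chunk_tokens=256, kv_block_size=64))
# [256, 256, 256, 232]
```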
@@ -290,6 +290,7 @@ sequences in generation phase, there are `beam_width` tokens per sequence. The
 beam width can be different for each sequence.
 
 In other words, the pseudo-code to compute the number of tokens is:
+
 ```python
 num_tokens = 0
 
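# Editorial sketch only (the diff hunk above ends here): based on the
# surrounding prose, the count would continue roughly as follows.
# Generation-phase sequences contribute `beam_width` tokens each; the
# context-phase rule and all names below are assumptions.
for seq in context_phase_sequences:
    num_tokens += seq.input_length
for seq in generation_phase_sequences:
    num_tokens += seq.beam_width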
