Commit f7b49c9

[Doc] Update validated LLM model list (#5746)
* add qwen3 model in llm support list
* update the validated GPUs
* rm wrong word

Co-authored-by: ZhangJianyu <zhang.jianyu@outlook.com>
1 parent 9ada31f commit f7b49c9

File tree

1 file changed: +15 lines, -15 lines

1 file changed

+15
-15
lines changed

docs/tutorials/llm.rst

Lines changed: 15 additions & 15 deletions
@@ -1,8 +1,8 @@
  Large Language Models (LLM) Optimizations Overview
  ==================================================

- In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. LLMs have emerged as the dominant models driving these GenAI applications. Most of LLMs are GPT-like architectures that consist of multiple Decoder layers.
- The MultiHeadAttention and FeedForward layer are two key components of every Decoder layer. The generation task is memory bound because iterative decode and kv_cache require special management to reduce memory overheads. Intel® Extension for PyTorch* provides a lot of specific optimizations for these LLMs.
+ In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. LLMs have emerged as the dominant models driving these GenAI applications. Most of LLMs are GPT-like architectures that consist of multiple Decoder layers.
+ The MultiHeadAttention and FeedForward layer are two key components of every Decoder layer. The generation task is memory bound because iterative decode and kv_cache require special management to reduce memory overheads. Intel® Extension for PyTorch* provides a lot of specific optimizations for these LLMs.
  On the operator level, the extension provides highly efficient GEMM kernel to speed up Linear layer and customized operators to reduce the memory footprint. To better trade-off the performance and accuracy, different low-precision solutions e.g., smoothQuant is enabled. Besides, tensor parallel can also adopt to get lower latency for LLMs.

  These LLM-specific optimizations can be automatically applied with a single frontend API function in Python interface, `ipex.llm.optimize()`. Check `ipex.llm.optimize <./llm/llm_optimize_transformers.md>`_ for more details.
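
The last context line above points readers to `ipex.llm.optimize()`. For reference, a minimal sketch of the typical call path on a Hugging Face checkpoint follows; the model id, dtype, and device are illustrative assumptions, and the exact keyword arguments may differ between releases.

# Minimal sketch (assumes an XPU build of Intel® Extension for PyTorch* and a
# checkpoint from the validated list; arguments shown are illustrative only).
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = model.eval().to("xpu")

# One call applies the LLM-specific optimizations described above
# (efficient GEMM kernels, kv_cache management, fused operators).
model = ipex.llm.optimize(model, dtype=torch.float16, device="xpu")

inputs = tokenizer("What is weight-only quantization?", return_tensors="pt").to("xpu")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))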

@@ -53,8 +53,8 @@ LLM Inference
  - ✅
  - ✅
  - ✅
- * - Qwen
-   - Qwen/Qwen2-VL-7B-Instruct, Qwen/Qwen2.5-7B-Instruct
+ * - Qwen
+   - Qwen/Qwen2-VL-7B-Instruct, Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen3-4B, Qwen/Qwen3-8B, thewimo/Qwen3-4B-AWQ, AlphaGaO/Qwen3-4B-GPTQ, Qwen/Qwen3-8B-AWQ
  - ✅
  - ✅
  - ✅

@@ -65,24 +65,24 @@ LLM Inference
  - ✅
  - ✅
  - ✅
- * - Bloom
+ * - Bloom
  - bigscience/bloom-7b1
  - ✅
  - ✅
  - ✅
- -
+ -
  * - Baichuan2
  - baichuan-inc/Baichuan2-13B-Chat
  - ✅
  - ✅
  - ✅
- -
- * - OPT
+ -
+ * - OPT
  - facebook/opt-6.7b, facebook/opt-30b
  - ✅
- -
+ -
  - ✅
- -
+ -
  * - Mixtral
  - mistralai/Mistral-7B-Instruct-v0.2
  - ✅

@@ -92,8 +92,8 @@ LLM Inference

  Platforms
  ~~~~~~~~~~~~~
- All above workloads are validated on Intel® Data Center Max 1550 GPU.
- The WoQ (Weight Only Quantization) int4 workloads are also partially validated on Intel® Core™ Ultra series (Lunar Lake) with Intel® Arc™ Graphics. Refer to Weight Only Quantization INT4 section.
+ All above workloads are validated on Intel® Data Center Max 1550 GPU.
+ The WoQ (Weight Only Quantization) INT4 workloads are also partially validated on Intel® Core™ Ultra series (Arrow Lake-H, Lunar Lake) with Intel® Arc™ Graphics and Intel® Arc™ B-Series GPUs (code-named Battlemage). Refer to Weight Only Quantization INT4 section.

  *Note*: The above verified models (including other models in the same model family, like "meta-llama/Llama-2-7b-hf" from Llama family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp16). For other LLMs families, we are working in progress to cover those optimizations, which will expand the model list above.

@@ -117,7 +117,7 @@ LLM fine-tuning on Intel® Data Center Max 1550 GPU
  * - Llama2
  - meta-llama/Llama-2-70b-hf
  - ✅
- -
+ -
  - ✅
  * - Llama3
  - meta-llama/Meta-Llama-3-8B

@@ -207,9 +207,9 @@ for inference workloads.
  Weight Only Quantization INT4
  -----------------------------

- Large Language Models (LLMs) have shown remarkable performance in various natural language processing tasks.
+ Large Language Models (LLMs) have shown remarkable performance in various natural language processing tasks.

- However, deploying them on devices with limited resources is challenging due to their high computational and memory requirements.
+ However, deploying them on devices with limited resources is challenging due to their high computational and memory requirements.

  To overcome this issue, we propose quantization methods that reduce the size and complexity of LLMs. Unlike `normal quantization <https://github.com/intel/neural-compressor/blob/master/docs/source/quantization.md>`_, such as w8a8, that quantizes both weights and activations, we focus on Weight-Only Quantization (WoQ), which only quantizes the weights statically. WoQ is a better trade-off between efficiency and accuracy, as the main bottleneck of deploying LLMs is the memory bandwidth and WoQ usually preserves more accuracy. Experiments on Qwen-7B, a large-scale LLM, show that we can obtain accurate quantized models with minimal loss of quality.
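
The WoQ paragraph in this last hunk describes quantizing only the weights, statically, while activations stay in floating point. A conceptual sketch of per-group INT4 weight quantization follows; it is plain PyTorch with an arbitrary group size and an asymmetric scheme, chosen here for illustration, and is not the extension's implementation.

# Illustrative group-wise INT4 weight-only quantization (not the IPEX kernels).
import torch

def quantize_int4(weight: torch.Tensor, group_size: int = 128):
    # Asymmetric per-group quantization of a 2-D weight matrix to 4-bit codes.
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    w_min = w.amin(dim=-1, keepdim=True)
    w_max = w.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min) / 15.0                      # 4-bit levels: 0..15
    zero_point = (-w_min / scale).round()
    codes = (w / scale + zero_point).round().clamp_(0, 15).to(torch.uint8)
    return codes, scale, zero_point

def dequantize_int4(codes, scale, zero_point, shape):
    # Weights are restored per group; activations stay in fp16/fp32.
    return ((codes.float() - zero_point) * scale).reshape(shape)

W = torch.randn(4096, 4096)
codes, scale, zp = quantize_int4(W)
W_hat = dequantize_int4(codes, scale, zp, W.shape)
print("max abs quantization error:", (W - W_hat).abs().max().item())

Real WoQ kernels typically pack two 4-bit codes per byte and dequantize inside the fused GEMM, which is why WoQ mainly relieves the memory-bandwidth bottleneck noted in the paragraph above.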
