In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. LLMs have emerged as the dominant models driving these GenAI applications. Most LLMs are GPT-like architectures that consist of multiple Decoder layers.
The MultiHeadAttention and FeedForward layers are two key components of every Decoder layer. The generation task is memory bound because iterative decoding and the kv_cache require special management to reduce memory overheads. Intel® Extension for PyTorch* provides many optimizations specific to these LLMs.
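
The toy loop below illustrates why the kv_cache needs special management; it is only a sketch with assumed tensor shapes, not the extension's implementation. Each decode step appends one token's key/value to the cache, and naive concatenation re-copies the whole cache every step.

.. code-block:: python

    import torch

    # Illustrative shapes only; not tied to any real model.
    batch, heads, head_dim = 1, 32, 128
    k_cache = torch.empty(batch, heads, 0, head_dim)
    v_cache = torch.empty(batch, heads, 0, head_dim)

    for step in range(8):  # iterative decode steps
        k_new = torch.randn(batch, heads, 1, head_dim)  # key of the newly generated token
        v_new = torch.randn(batch, heads, 1, head_dim)  # value of the newly generated token
        # Naive management: torch.cat reallocates and copies the whole cache each step;
        # optimized runtimes pre-allocate the cache buffers to avoid this.
        k_cache = torch.cat([k_cache, k_new], dim=2)
        v_cache = torch.cat([v_cache, v_new], dim=2)
        print(step, tuple(k_cache.shape))  # the sequence dimension grows by one per step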
On the operator level, the extension provides highly efficient GEMM kernels to speed up Linear layers, as well as customized operators to reduce the memory footprint. To better trade off performance and accuracy, different low-precision solutions, e.g., SmoothQuant, are enabled. In addition, tensor parallelism can be adopted to achieve lower latency for LLMs.
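
As a rough illustration of the SmoothQuant idea, the sketch below uses random stand-in tensors (it is not the extension's internal code): per-input-channel factors migrate quantization difficulty from activations to weights while leaving the matmul result unchanged.

.. code-block:: python

    import torch

    torch.manual_seed(0)
    calib_activations = torch.randn(512, 4096) * 10.0  # stand-in calibration activations
    weight = torch.randn(11008, 4096)                   # stand-in Linear weight (out, in)

    alpha = 0.5
    act_absmax = calib_activations.abs().amax(dim=0)    # per input channel
    w_absmax = weight.abs().amax(dim=0)                 # per input channel
    s = act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)

    smoothed_act = calib_activations / s                # activations become easier to quantize
    smoothed_w = weight * s                             # the factor is folded into the weights
    # smoothed_act @ smoothed_w.T equals calib_activations @ weight.T up to numerics,
    # so accuracy is unchanged before the actual 8-bit quantization is applied.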
These LLM-specific optimizations can be automatically applied with a single frontend API function in the Python interface, `ipex.llm.optimize()`. Check `ipex.llm.optimize <./llm/llm_optimize_transformers.md>`_ for more details.
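
A minimal usage sketch follows; the model ID, prompt, and generation arguments are only illustrative, and the linked documentation remains the authoritative reference for the exact arguments on your platform.

.. code-block:: python

    import torch
    import intel_extension_for_pytorch as ipex
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-hf"  # example model from the validated Llama family
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    model = model.eval().to("xpu")

    # Apply the LLM-specific optimizations with the single frontend call.
    model = ipex.llm.optimize(model, dtype=torch.float16, device="xpu", inplace=True)

    inputs = tokenizer("What is the capital of France?", return_tensors="pt").to("xpu")
    with torch.no_grad():
        output = model.generate(inputs.input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))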
All of the above workloads are validated on the Intel® Data Center Max 1550 GPU.
The WoQ (Weight Only Quantization) INT4 workloads are also partially validated on Intel® Core™ Ultra series (Arrow Lake-H, Lunar Lake) processors with Intel® Arc™ Graphics and on Intel® Arc™ B-Series GPUs (code-named Battlemage). Refer to the Weight Only Quantization INT4 section.
*Note*: The above verified models (including other models in the same model family, like "meta-llama/Llama-2-7b-hf" from the Llama family) are well supported with all optimizations, such as indirect-access KV cache, fused RoPE, and prepacked TPP Linear (fp16). For other LLM families, work is in progress to cover those optimizations, which will expand the model list above.
LLM fine-tuning on Intel® Data Center Max 1550 GPU
----------------------------------------------------

* - Llama2
  - meta-llama/Llama-2-70b-hf
  - ✅
  -
  - ✅
* - Llama3
  - meta-llama/Meta-Llama-3-8B
Weight Only Quantization INT4
-----------------------------
Large Language Models (LLMs) have shown remarkable performance in various natural language processing tasks.
However, deploying them on devices with limited resources is challenging due to their high computational and memory requirements.
To overcome this issue, we propose quantization methods that reduce the size and complexity of LLMs. Unlike `normal quantization <https://github.com/intel/neural-compressor/blob/master/docs/source/quantization.md>`_, such as w8a8, which quantizes both weights and activations, we focus on Weight-Only Quantization (WoQ), which statically quantizes only the weights. WoQ offers a better trade-off between efficiency and accuracy, since the main bottleneck in deploying LLMs is memory bandwidth and WoQ usually preserves more accuracy. Experiments on Qwen-7B, a large-scale LLM, show that we can obtain accurate quantized models with minimal loss of quality.
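
To make the WoQ idea concrete, the snippet below is a self-contained sketch of group-wise asymmetric INT4 quantization of a single weight matrix. It is illustrative only: the group size and shapes are assumptions, and it is not the extension's INT4 kernel; activations are left untouched, which is the essence of weight-only quantization.

.. code-block:: python

    import torch

    def woq_int4_quantize(weight: torch.Tensor, group_size: int = 128):
        """Group-wise asymmetric INT4 quantization of a 2-D weight (illustrative)."""
        out_features, in_features = weight.shape
        w = weight.reshape(out_features, in_features // group_size, group_size)
        w_min = w.amin(dim=-1, keepdim=True)
        w_max = w.amax(dim=-1, keepdim=True)
        scale = (w_max - w_min).clamp(min=1e-8) / 15.0  # 4-bit range is 0..15
        zero_point = (-w_min / scale).round()
        q = (w / scale + zero_point).round().clamp(0, 15).to(torch.uint8)
        return q, scale, zero_point

    def woq_int4_dequantize(q, scale, zero_point, shape):
        # Dequantize back to float for the GEMM; real kernels fuse this step.
        return ((q.float() - zero_point) * scale).reshape(shape)

    weight = torch.randn(4096, 4096)          # stand-in Linear weight
    q, scale, zp = woq_int4_quantize(weight)
    weight_hat = woq_int4_dequantize(q, scale, zp, weight.shape)
    print((weight - weight_hat).abs().max())  # per-group quantization error stays small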