
Commit 72fe418

rogerxfeng8 and zhuyuhua-v authored
update the llm doc (#5372)
* Update llm.rst: update the supported LLMs for inference and fine-tuning
* Update validated platforms: add the Arc B-Series to the validation platforms
* Fix the link
* Add descriptions for new features
* Update llm.rst: update HF/vLLM terms
* Update llm.rst: change the hyperlink back to html for github.io
* Update int4_weight_only_quantization.md

Signed-off-by: zhuyuhua-v <yuhua.zhu@intel.com>
Co-authored-by: zhuyuhua-v <yuhua.zhu@intel.com>
1 parent 37c18fe commit 72fe418

File tree: 2 files changed (+34 / -0 lines changed)


docs/tutorials/llm.rst

Lines changed: 33 additions & 0 deletions
@@ -89,6 +89,18 @@ LLM Inference
      -
      - ✅
      -
+   * - Phi3
+     - microsoft/Phi-3-small-128k-instruct
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+   * - Mistral
+     - mistralai/Mistral-7B-Instruct-v0.2
+     - ✅
+     - ✅
+     - ✅
+     - ✅
 
 Platforms
 ~~~~~~~~~~~~~
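
As an aside for readers of this support matrix, the sketch below shows what FP16 inference with one of the newly listed models typically looks like through `ipex.llm.optimize` on an Intel GPU. It is an illustrative assumption, not part of this commit: the model id, prompt, and generation settings are placeholders.

    import torch
    import intel_extension_for_pytorch as ipex
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative model choice taken from the rows added above.
    model_id = "microsoft/Phi-3-small-128k-instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, trust_remote_code=True
    ).eval().to("xpu")

    # Apply the LLM-specific optimizations (fused kernels, optimized KV-cache handling).
    model = ipex.llm.optimize(model, dtype=torch.float16, device="xpu")

    inputs = tokenizer("What does weight-only quantization do?", return_tensors="pt").to("xpu")
    with torch.no_grad(), torch.xpu.amp.autocast(dtype=torch.float16):
        output = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
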
@@ -109,31 +121,37 @@ LLM fine-tuning on Intel® Data Center Max 1550 GPU
      - Mixed Precision (BF16+FP32)
      - Full fine-tuning
      - LoRA
+     - QLoRA
    * - Llama2
      - meta-llama/Llama-2-7b-hf
      - ✅
      - ✅
      - ✅
+     -
    * - Llama2
      - meta-llama/Llama-2-70b-hf
      - ✅
      -
      - ✅
+     -
    * - Llama3
      - meta-llama/Meta-Llama-3-8B
      - ✅
      - ✅
      - ✅
+     - ✅
    * - Qwen
      - Qwen/Qwen-1.5B
      - ✅
      - ✅
      - ✅
+     -
    * - Phi-3-mini 3.8B
      - Phi-3-mini-4k-instruct
      - ✅
      - ✅
      - ✅
+     -
 
 Check `LLM best known practice <https://github.com/intel/intel-extension-for-pytorch/tree/release/xpu/2.5.10/examples/gpu/llm>`_ for instructions to install/setup environment and example scripts..
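
The table above tracks full fine-tuning, LoRA, and (newly added in this commit) QLoRA support. For orientation only, a LoRA setup with Hugging Face PEFT is sketched below; the model id, target modules, and hyperparameters are assumptions rather than values from the documentation.

    import torch
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Illustrative base model from the table; bfloat16 matches the mixed-precision column.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
    ).to("xpu")

    lora_config = LoraConfig(
        r=8,                                   # rank of the low-rank update matrices
        lora_alpha=16,                         # scaling factor applied to the update
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # attention projections to adapt (assumed)
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()         # only the small LoRA adapters are trainable
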
@@ -181,6 +199,9 @@ cache policy is shown as the following right figure.
    :width: 800
    :alt: The beam idx trace for every step
 
+
+Additionally, Intel® Extension for PyTorch* also supports Hugging Face's native dynamic_cache, ensuring compatibility with the original caching mechanism while providing performance enhancements through its optimized cache management.
+
 Distributed Inference
 ~~~~~~~~~~~~~~~~~~~~~
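
To illustrate the dynamic_cache note added above: recent versions of transformers let a `DynamicCache` object be passed to `generate`, and the new paragraph states that IPEX-optimized models remain compatible with it. The sketch below assumes the `model` and `inputs` from the inference sketch earlier on this page; it is an illustration, not code from the documentation.

    import torch
    from transformers import DynamicCache

    # Hugging Face's native KV cache; IPEX's optimized cache management stays compatible with it.
    past_key_values = DynamicCache()

    with torch.no_grad():
        output = model.generate(
            **inputs,
            past_key_values=past_key_values,  # reuse the standard caching mechanism
            max_new_tokens=32,
        )
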
@@ -202,6 +223,18 @@ heavier computations and places higher requirements to the underlying
 hardware. Given that, quantization becomes a more important methodology
 for inference workloads.
 
+Ecosystem Support
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Intel® Extension for PyTorch* offers extensive support for various ecosystems, including vLLM and TGI, with the goal of enhancing performance and flexibility for LLM workloads.
+
+- **vLLM**: `vLLM documentation <https://docs.vllm.ai/en/v0.5.1/getting_started/xpu-installation.html>`_
+- **TGI**: `TGI documentation <https://github.com/huggingface/text-generation-inference/blob/main/docs/source/installation_intel.md>`_
+
+The extension provides support for a variety of **custom kernels**, which include commonly used kernel fusion techniques, such as `rms_norm` and `rotary_embedding`, as well as attention-related kernels like `paged_attention` and `chunked_prefill`. These optimizations enhance the functionality and efficiency of the ecosystem on Intel® GPU platform by improving the execution of key operations.
+
+Additionally, Intel® Extension for PyTorch* provides support for **WOQ INT4 GEMM kernels**, which enables vLLM and TGI to work with models that have been quantized using GPTQ/AWQ techniques. This support extends the ability to run INT4 models, further optimizing performance and reducing memory consumption while maintaining high inference accuracy.
+
 
 Weight Only Quantization INT4
 -----------------------------
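
As a concrete, purely illustrative companion to the WOQ INT4 point above, running a GPTQ-quantized checkpoint through vLLM's offline API looks roughly like the sketch below. It assumes vLLM was installed with XPU support as described in the linked documentation; the model id and sampling settings are placeholders.

    from vllm import LLM, SamplingParams

    # Illustrative GPTQ (INT4 weight-only) checkpoint from the Hugging Face Hub.
    llm = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ", quantization="gptq", dtype="float16")
    params = SamplingParams(temperature=0.8, max_tokens=64)

    outputs = llm.generate(["Explain weight-only INT4 quantization in one sentence."], params)
    print(outputs[0].outputs[0].text)
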

docs/tutorials/llm/int4_weight_only_quantization.md

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ To overcome this issue, we propose quantization methods that reduce the size and
 Validation Platforms
 * Intel® Data Center GPU Max Series
 * Intel® Arc™ A-Series Graphics
+* Intel® Arc™ B-Series Graphics
 * Intel® Core™ Ultra series
 
 > Note: For algorithms marked as 'stay tuned' are highly recommended to wait for the availability of the INT4 models on the HuggingFace Model Hub, since the LLM quantization procedure is significantly constrained by the machine's host memory and computation capabilities.
