* Update llm.rst
update the supported LLMs for inference and fine-tuning
* Update validated platforms
add the Arc B-Series to the validation platforms.
* fix the link
* add descriptions for new features
Signed-off-by: zhuyuhua-v <yuhua.zhu@intel.com>
* Update llm.rst
update HF/vLLM terms
* Update llm.rst
change the hyperlink back to html for github.io
* Update int4_weight_only_quantization.md
---------
Signed-off-by: zhuyuhua-v <yuhua.zhu@intel.com>
Co-authored-by: zhuyuhua-v <yuhua.zhu@intel.com>
docs/tutorials/llm.rst: 33 additions & 0 deletions
@@ -89,6 +89,18 @@ LLM Inference
       -
       - ✅
       -
+   * - Phi3
+     - microsoft/Phi-3-small-128k-instruct
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+   * - Mistral
+     - mistralai/Mistral-7B-Instruct-v0.2
+     - ✅
+     - ✅
+     - ✅
+     - ✅
 
 Platforms
 ~~~~~~~~~~~~~
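For reference, here is a minimal sketch of how one of the newly listed models could be exercised for inference with the extension. The `ipex.llm.optimize` call mirrors the pattern used in the linked LLM examples; the prompt, dtype, and generation settings are illustrative assumptions rather than a validated recipe.

```python
# Minimal sketch: FP16 inference on an Intel GPU (XPU) with a model from the table above.
# Assumes intel_extension_for_pytorch is installed with XPU support; dtype/prompt are illustrative.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = model.eval().to("xpu")

# Apply the LLM-specific optimizations provided by the extension.
model = ipex.llm.optimize(model, dtype=torch.float16, device="xpu")

inputs = tokenizer("What does weight-only quantization do?", return_tensors="pt").to("xpu")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```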
@@ -109,31 +121,37 @@ LLM fine-tuning on Intel® Data Center Max 1550 GPU
      - Mixed Precision (BF16+FP32)
      - Full fine-tuning
      - LoRA
+     - QLoRA
    * - Llama2
      - meta-llama/Llama-2-7b-hf
      - ✅
      - ✅
      - ✅
+     -
    * - Llama2
      - meta-llama/Llama-2-70b-hf
      - ✅
      -
      - ✅
+     -
    * - Llama3
      - meta-llama/Meta-Llama-3-8B
      - ✅
      - ✅
      - ✅
+     - ✅
    * - Qwen
      - Qwen/Qwen-1.5B
      - ✅
      - ✅
      - ✅
+     -
    * - Phi-3-mini 3.8B
      - Phi-3-mini-4k-instruct
      - ✅
      - ✅
      - ✅
+     -
 
 Check `LLM best known practice <https://github.com/intel/intel-extension-for-pytorch/tree/release/xpu/2.5.10/examples/gpu/llm>`_ for instructions to install/setup environment and example scripts.
 
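To make the LoRA column above concrete, here is a minimal sketch of attaching a LoRA adapter with Hugging Face PEFT before fine-tuning on an XPU device. The rank, alpha, and target modules are illustrative assumptions, not the values used in the validated recipes; the real scripts live in the "LLM best known practice" link referenced in the hunk above.

```python
# Minimal sketch: attach a LoRA adapter with HF PEFT, then train on an Intel GPU (XPU).
# Hyperparameters and target modules are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common LoRA choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable

model = model.to("xpu")  # continue with a standard HF Trainer or custom training loop
```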
@@ -181,6 +199,9 @@ cache policy is shown as the following right figure.
    :width: 800
    :alt: The beam idx trace for every step
 
+
+Additionally, Intel® Extension for PyTorch* supports Hugging Face's native dynamic_cache, ensuring compatibility with the original caching mechanism while providing performance enhancements through its optimized cache management.
+
 
 Distributed Inference
 ~~~~~~~~~~~~~~~~~~~~~
 
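As a small illustration of the dynamic_cache compatibility described in the added paragraph, the sketch below passes Hugging Face's `DynamicCache` explicitly to `generate()` on an optimized model. The model choice and dtype are assumptions, and passing the cache object is optional, since recent transformers versions create one by default.

```python
# Minimal sketch: using Hugging Face's native DynamicCache with an IPEX-optimized model.
# Shown only to illustrate cache compatibility; model/dtype are illustrative assumptions.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.cache_utils import DynamicCache

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = ipex.llm.optimize(model.eval().to("xpu"), dtype=torch.float16, device="xpu")

inputs = tokenizer("Beam search keeps several hypotheses", return_tensors="pt").to("xpu")
past_key_values = DynamicCache()  # HF's default cache object
with torch.no_grad():
    out = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```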
@@ -202,6 +223,18 @@ heavier computations and places higher requirements to the underlying
 hardware. Given that, quantization becomes a more important methodology
 for inference workloads.
 
+Ecosystem Support
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Intel® Extension for PyTorch* offers extensive support for various ecosystems, including vLLM and TGI, with the goal of enhancing performance and flexibility for LLM workloads.
+
+The extension provides a variety of **custom kernels**, including commonly used kernel fusions such as `rms_norm` and `rotary_embedding`, as well as attention-related kernels like `paged_attention` and `chunked_prefill`. These optimizations enhance the functionality and efficiency of the ecosystem on Intel® GPU platforms by improving the execution of key operations.
+
+Additionally, Intel® Extension for PyTorch* provides **WOQ INT4 GEMM kernels**, which enable vLLM and TGI to work with models that have been quantized using GPTQ/AWQ techniques. This support extends the ability to run INT4 models, further optimizing performance and reducing memory consumption while maintaining high inference accuracy.
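A hedged sketch of the kind of workload the WOQ INT4 GEMM kernels target: serving a GPTQ-quantized checkpoint through vLLM. The model id is a hypothetical placeholder for any GPTQ/AWQ INT4 checkpoint, and whether the XPU backend is used depends on how vLLM was built against the extension.

```python
# Minimal sketch: serving a pre-quantized INT4 (GPTQ) checkpoint with vLLM.
# The model id is a hypothetical placeholder; substitute any GPTQ/AWQ checkpoint you have access to.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/Llama-2-7b-chat-GPTQ",  # hypothetical INT4 checkpoint
    quantization="gptq",                    # or "awq" for AWQ checkpoints
    dtype="float16",
)
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Explain weight-only INT4 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```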
docs/tutorials/llm/int4_weight_only_quantization.md: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ To overcome this issue, we propose quantization methods that reduce the size and
 Validation Platforms
 * Intel® Data Center GPU Max Series
 * Intel® Arc™ A-Series Graphics
+* Intel® Arc™ B-Series Graphics
 * Intel® Core™ Ultra series
 
 > Note: For algorithms marked as 'stay tuned', it is highly recommended to wait for the availability of the INT4 models on the HuggingFace Model Hub, since the LLM quantization procedure is significantly constrained by the machine's host memory and computation capabilities.
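Following the note's recommendation, here is a small sketch of checking the Hugging Face Hub for an already-quantized INT4 variant before attempting local quantization; the search terms are illustrative.

```python
# Minimal sketch: look for existing GPTQ-quantized variants on the Hub instead of quantizing
# locally on a memory-constrained machine. Search terms are illustrative.
from huggingface_hub import HfApi

api = HfApi()
for m in api.list_models(search="Llama-2-7B GPTQ", limit=5):
    print(m.id)
```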