Intel® Extension for PyTorch\* provides a new dedicated module, ipex.llm, to host Large Language Model (LLM) specific APIs. With ipex.llm, Intel® Extension for PyTorch\* provides comprehensive LLM optimizations across popular datatypes including FP32/BF16/INT8/INT4. For low precision specifically, both SmoothQuant and Weight-Only quantization are supported for various scenarios. Users can also run Intel® Extension for PyTorch\* with Tensor Parallel in multi-rank or multi-node scenarios for even better performance.
A typical API under this new module is ipex.llm.optimize, which is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides optimizations on both the model side and the content-generation side. ipex.llm.optimize is an upgraded API that replaces the previous ipex.optimize_transformers, bringing a more consistent LLM experience and better performance. Below is a simple example of ipex.llm.optimize for fp32 or bf16 inference:
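(A minimal sketch: the model id, prompt, and generation settings below are illustrative placeholders; substitute your own checkpoint and inputs.)

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; substitute your own model.
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Apply LLM-specific optimizations; pass dtype=torch.float for fp32 inference.
model = ipex.llm.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.cpu.amp.autocast(enabled=True):
    inputs = tokenizer("What is AI?", return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```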
More examples of this API can be found at LLM optimization API.
Besides the new optimization API for LLM inference, Intel® Extension for PyTorch\* also provides a new capability for users to auto-tune a good quantization recipe for running SmoothQuant INT8 with good accuracy. SmoothQuant is a popular method to improve the accuracy of INT8 quantization. The new auto-tune API supports both automatic global alpha tuning and automatic layer-by-layer alpha tuning, provided by Intel® Neural Compressor, for the best INT8 accuracy. More details can be found at SmoothQuant Recipe Tuning API introduction.
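A minimal sketch of how the auto-tune API can be invoked is shown below; `model`, `calib_dataloader`, and `eval_func` are user-provided placeholders, and the keyword arguments reflect typical usage rather than the full signature, so please check the SmoothQuant Recipe Tuning API documentation for the exact interface.

```python
import intel_extension_for_pytorch as ipex

# Assumed to be defined by the user:
#   model            - the float (FP32) model to quantize
#   calib_dataloader - a torch DataLoader yielding calibration batches
#   eval_func        - a callable taking a candidate model and returning an accuracy metric
tuned_model = ipex.quantization.autotune(
    model,
    calib_dataloader,
    eval_func=eval_func,
    sampling_sizes=[100],                   # calibration sample sizes to try (illustrative value)
    accuracy_criterion={"relative": 0.01},  # tolerated relative accuracy loss
)
```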
Intel® Extension for PyTorch\* newly optimized many more LLM models, including more Llama2 variants like llama2-13b/llama2-70b, encoder-decoder models like T5, code generation models like starcoder/codegen, and others like Baichuan, Baichuan2, ChatGLM2, ChatGLM3, mistral, mpt, dolly, etc. A full list of optimized models can be found at LLM optimization.
- Bug fixes and other optimizations
- Further optimized the performance of LLMs [#2349](https://github.com/intel/intel-extension-for-pytorch/commit/d6d591938aefb9020a8a542a160abe4aeb6b238c), [#2412](https://github.com/intel/intel-extension-for-pytorch/commit/e0399108856c826ad609e5f421021945de30a4bf#diff-11f6a633ad677c6a8b6e8e4462afbe836a853a284e362ba794a8fcbceebc9dc5), [#2469](https://github.com/intel/intel-extension-for-pytorch/commit/aeaeba47bc722d9b18f13f8a78e02092c0a6bb5b), [#2476](https://github.com/intel/intel-extension-for-pytorch/commit/c95eb77398fa131e4ef60be65841ca09a284115d)
- Optimized the Flash Attention operator [#2317](https://github.com/intel/intel-extension-for-pytorch/commit/8d0426c1aebc85620fd417fa7fd4e0f1b357fa3d), [#2334](https://github.com/intel/intel-extension-for-pytorch/commit/efab335b427daf76e01836d520b1d7981de59595), [#2392](https://github.com/intel/intel-extension-for-pytorch/commit/5ed3a2413db5f0a5e53bcca0b3e84a814d87bb50), [#2480](https://github.com/intel/intel-extension-for-pytorch/commit/df2387e976461f6c42e0b90b3544ea76d3132694)
- Fixed the static quantization of the ELSER model [#2491](https://github.com/intel/intel-extension-for-pytorch/commit/ac613a73fb395836b210710a6fefdf6d32df3386)
- Switched deepspeed to the public release version on PyPI [#2473](https://github.com/intel/intel-extension-for-pytorch/commit/dba7b8c5fc9bfd8e7aa9431efe63499014acd722), [#2511](https://github.com/intel/intel-extension-for-pytorch/commit/94c31ecb3b6f6e77f595ce94dd6d6cbae1db1210)
- Upgraded oneDNN to v3.3.4 [#2433](https://github.com/intel/intel-extension-for-pytorch/commit/af9b096070e81b46250172174bb9d12e3e1c6acf)