
Commit 211813b

update llm readme doc (#2585)

* update llm overview part per Guobing
* remove redundant period
* correct modelzoo version; add note for Falcon dist. inference availability with ds v0.13.1
* updates for r2.2
* misc updates
* add qconfig download links; correct optimized model scope
* update model optimized scope desc. and baichuan modelID
* update model support status for bloom1b7, baichuan13b, opt30b
* update optimized model tables
* update specific argument changes for individual models in INT8 WOQ deepspeed inf.
* update model scope table in mainpage README.md; update expression and table in llm/README.md
* trial for new model scope table
* update neox dist. support scope and recipe
* add qconfig download link for 2 more models
* revert model table in llm.rst; will update by another pr
* update llm.rst

1 parent 31f5ced commit 211813b

File tree

9 files changed: +334 -163 lines changed

README.md

Lines changed: 28 additions & 23 deletions
@@ -21,36 +21,41 @@ In the current technological landscape, Generative AI (GenAI) workloads and mode

| MODEL FAMILY | MODEL NAME (Huggingface hub) | FP32 | BF16 | Static quantization INT8 | Weight only quantization INT8 | Weight only quantization INT4 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
-|LLAMA| meta-llama/Llama-2-7b-hf ||||| ☑️ |
-|LLAMA| meta-llama/Llama-2-13b-hf ||||| ☑️ |
-|LLAMA| meta-llama/Llama-2-70b-hf ||||| ☑️ |
-|GPT-J| EleutherAI/gpt-j-6b ||||||
-|GPT-NEOX| EleutherAI/gpt-neox-20b ||| ☑️ || ☑️ |
-|DOLLY| databricks/dolly-v2-12b ||| ☑️ | ☑️ | ☑️ |
-|FALCON| tiiuae/falcon-40b ||||||
-|OPT| facebook/opt-30b |||| | ☑️ |
-|OPT| facebook/opt-1.3b ||||| ☑️ |
-|Bloom| bigscience/bloom-1b7 || ☑️ || | ☑️ |
-|CodeGen| Salesforce/codegen-2B-multi ||| ☑️ |||
-|Baichuan| baichuan-inc/Baichuan2-7B-Chat ||||| |
-|Baichuan| baichuan-inc/Baichuan2-13B-Chat ||| || |
-|Baichuan| baichuan-inc/Baichuan-13B-Chat || ☑️ || | |
-|ChatGLM| THUDM/chatglm3-6b ||| ☑️ || |
-|ChatGLM| THUDM/chatglm2-6b || ☑️ | ☑️ | ☑️ | |
-|GPTBigCode| bigcode/starcoder ||| ☑️ || ☑️ |
-|T5| google/flan-t5-xl ||| ☑️ || |
-|Mistral| mistralai/Mistral-7B-v0.1 ||| ☑️ || ☑️ |
-|MPT| mosaicml/mpt-7b ||| ☑️ |||
-
-*Note*: All above models have undergone thorough optimization and verification processes for both performance and accuracy. In the context of the optimized model list table above, the symbol ✅ signifies that the model can achieve an accuracy drop of less than 1% when using a specific data type compared to FP32, whereas the accuracy drop may exceed 1% for ☑️ marked ones. We are working in progress to better support the models in the table with various data types. In addition, more models will be optimized, which will expand the table.
+|LLAMA| meta-llama/Llama-2-7b-hf | 🟩 | 🟩 | 🟩 | 🟩 | 🟨 |
+|LLAMA| meta-llama/Llama-2-13b-hf | 🟩 | 🟩 | 🟩 | 🟩 | 🟨 |
+|LLAMA| meta-llama/Llama-2-70b-hf | 🟩 | 🟩 | 🟩 | 🟩 | 🟨 |
+|GPT-J| EleutherAI/gpt-j-6b | 🟩 | 🟩 | 🟩 | 🟩 | 🟩 |
+|GPT-NEOX| EleutherAI/gpt-neox-20b | 🟩 | 🟨 | 🟨 | 🟩 | 🟨 |
+|DOLLY| databricks/dolly-v2-12b | 🟩 | 🟨 | 🟨 | 🟩 | 🟨 |
+|FALCON| tiiuae/falcon-40b | 🟩 | 🟩 | 🟩 | 🟩 | 🟩 |
+|OPT| facebook/opt-30b | 🟩 | 🟩 | 🟩 | 🟩 | 🟨 |
+|OPT| facebook/opt-1.3b | 🟩 | 🟩 | 🟩 | 🟩 | 🟨 |
+|Bloom| bigscience/bloom-1b7 | 🟩 | 🟨 | 🟩 | 🟩 | 🟨 |
+|CodeGen| Salesforce/codegen-2B-multi | 🟩 | 🟩 | 🟨 | 🟩 | 🟩 |
+|Baichuan| baichuan-inc/Baichuan2-7B-Chat | 🟩 | 🟩 | 🟩 | 🟩 | |
+|Baichuan| baichuan-inc/Baichuan2-13B-Chat | 🟩 | 🟩 | 🟩 | 🟩 | |
+|Baichuan| baichuan-inc/Baichuan-13B-Chat | 🟩 | 🟨 | 🟩 | 🟩 | |
+|ChatGLM| THUDM/chatglm3-6b | 🟩 | 🟩 | 🟨 | 🟩 | |
+|ChatGLM| THUDM/chatglm2-6b | 🟩 | 🟩 | 🟨 | 🟩 | |
+|GPTBigCode| bigcode/starcoder | 🟩 | 🟩 | 🟨 | 🟩 | 🟨 |
+|T5| google/flan-t5-xl | 🟩 | 🟩 | 🟨 | 🟩 | |
+|Mistral| mistralai/Mistral-7B-v0.1 | 🟩 | 🟩 | 🟨 | 🟩 | 🟨 |
+|MPT| mosaicml/mpt-7b | 🟩 | 🟩 | 🟨 | 🟩 | 🟩 |
+
+- 🟩 signifies that the model can perform well and with good accuracy (<1% difference compared with FP32).
+
+- 🟨 signifies that the model can perform well while accuracy may not be in a perfect state (>1% difference compared with FP32).
+
+*Note*: The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from the LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp32/bf16).
+We are continuously working to improve support for the models in the tables with various data types. In addition, more models will be optimized in the future.

## Support

The team tracks bugs and enhancement requests using [GitHub issues](https://github.com/intel/intel-extension-for-pytorch/issues/). Before submitting a suggestion or bug report, search the existing GitHub issues to see if your issue has already been reported.

## Intel® AI Reference Models

-Use cases that had already been optimized by Intel engineers are available at [Intel® AI Reference Models](https://github.com/IntelAI/models/tree/pytorch-r2.2.0-models) (former Model Zoo). A bunch of PyTorch use cases for benchmarking are also available on the [Github page](https://github.com/IntelAI/models/tree/pytorch-r2.2.0-models/benchmarks#pytorch-use-cases). You can get performance benefits out-of-box by simply running the scripts in the Reference Models.
+Use cases that have already been optimized by Intel engineers are available at [Intel® AI Reference Models](https://github.com/IntelAI/models/tree/pytorch-r2.2-models) (former Model Zoo). A number of PyTorch use cases for benchmarking are also available on the [Github page](https://github.com/IntelAI/models/tree/pytorch-r2.2-models/benchmarks#pytorch-use-cases). You can get performance benefits out of the box by simply running the scripts in the Reference Models.

## License
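To make the updated model table concrete, here is a minimal sketch of running one of the verified models with the LLM optimizations the note above mentions (indirect access KV cache, fused ROPE, prepacked TPP Linear). It assumes the `ipex.llm.optimize` entry point shipped with the 2.2 release; the keyword arguments follow the v2.2 examples and should be verified against the installed version.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

# One of the verified models from the table above, loaded in BF16.
model_id = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Apply the LLM-specific optimizations described in the note above.
# `ipex.llm.optimize` and its arguments are taken from the v2.2 examples;
# treat them as assumptions and check the release you have installed.
model = ipex.llm.optimize(model, dtype=torch.bfloat16, inplace=True)

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    inputs = tokenizer("What does IPEX optimize in an LLM?", return_tensors="pt")
    out = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```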
docker/Dockerfile.prebuilt

Lines changed: 4 additions & 4 deletions
@@ -27,10 +27,10 @@ RUN ${PYTHON} -m pip --no-cache-dir install --upgrade \
# Some TF tools expect a "python" binary
RUN ln -s $(which ${PYTHON}) /usr/local/bin/python

-ARG IPEX_VERSION=2.1.100
-ARG PYTORCH_VERSION=2.1.1
-ARG TORCHAUDIO_VERSION=2.1.1
-ARG TORCHVISION_VERSION=0.16.1
+ARG IPEX_VERSION=2.2.0
+ARG PYTORCH_VERSION=2.2.0
+ARG TORCHAUDIO_VERSION=2.2.0
+ARG TORCHVISION_VERSION=0.17.0
ARG TORCH_CPU_URL=https://download.pytorch.org/whl/cpu/torch_stable.html

RUN \

docs/design_doc/cpu/isa_dyndisp.md

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ PyTorch & IPEX CPU ISA support statement:
| IPEX-1.12 |||||||||
| IPEX-1.13 |||||||||
| IPEX-2.1 |||||||||
+| IPEX-2.2 |||||||||

\* Current IPEX DEFAULT level implemented as same as AVX2 level.
docs/tutorials/features/sq_recipe_tuning_api.md

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
-Smooth Quant Recipe Tuning API
-====================================
+Smooth Quant Recipe Tuning API (Experimental)
+=============================================
Smooth Quantization is a popular method to improve the accuracy of int8 quantization. The [autotune API](../api_doc.html#ipex.quantization.autotune) allows automatic global alpha tuning, and automatic layer-by-layer alpha tuning provided by Intel® Neural Compressor for the best INT8 accuracy.

SmoothQuant will introduce alpha to calculate the ratio of input and weight updates to reduce quantization error. SmoothQuant arguments are as below:
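Since the page only names the autotune entry point, a minimal usage sketch may help. The tiny model, calibration set, and eval function are placeholders, and the keyword names follow the v2.2-era SmoothQuant recipe examples; treat all of them as assumptions to be checked against the installed release.

```python
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex

# Tiny stand-in model and calibration set, only to make the sketch runnable.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()
calib_dataset = torch.utils.data.TensorDataset(torch.randn(64, 16))
calib_dataloader = torch.utils.data.DataLoader(calib_dataset, batch_size=8)

def eval_func(m):
    # Return a scalar metric for the tuner to compare candidate alphas.
    # A real eval function would measure task accuracy on a held-out set.
    with torch.no_grad():
        return -m(torch.randn(8, 16)).abs().mean().item()

# Keyword names are assumptions based on the v2.2-era recipe examples.
tuned_model = ipex.quantization.autotune(
    model,
    calib_dataloader,
    eval_func=eval_func,
    sampling_sizes=[64],
    accuracy_criterion={"relative": 0.01},  # tolerate <=1% relative drop
    tuning_time=0,  # 0 means no time limit on the tuning loop
)
```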

docs/tutorials/getting_started.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
# Quick Start

-The following instructions assume you have installed the Intel® Extension for PyTorch\*. For installation instructions, refer to [Installation](../../../index.html#installation?platform=cpu&version=v2.1.0%2Bcpu).
+The following instructions assume you have installed the Intel® Extension for PyTorch\*. For installation instructions, refer to [Installation](../../../index.html#installation?platform=cpu&version=v2.2.0%2Bcpu).

To start using the Intel® Extension for PyTorch\* in your code, you need to make the following changes:

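Concretely, the changes the Quick Start refers to amount to importing the extension and wrapping the model with `ipex.optimize`. A minimal sketch, using a placeholder ResNet-50 purely for illustration:

```python
import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex  # change 1: import the extension

model = models.resnet50(weights=None)  # placeholder model for illustration
model.eval()

# change 2: let IPEX apply operator fusion and weight-prepacking optimizations
model = ipex.optimize(model)

with torch.no_grad():
    out = model(torch.randn(1, 3, 224, 224))
```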
docs/tutorials/installation.md

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
Installation
============

-Select your preferences and follow the installation instructions provided on the [Installation page](../../../index.html#installation?platform=cpu&version=v2.1.0%2Bcpu).
+Select your preferences and follow the installation instructions provided on the [Installation page](../../../index.html#installation?platform=cpu&version=v2.2.0%2Bcpu).

After successful installation, refer to the [Quick Start](getting_started.md) and [Examples](examples.md) sections to start using the extension in your code.

-**NOTE:** For detailed instructions on installing and setting up the environment for Large Language Models (LLM), as well as example scripts, refer to the [LLM best practices](https://github.com/intel/intel-extension-for-pytorch/tree/v2.1.0%2Bcpu/examples/cpu/inference/python/llm).
+**NOTE:** For detailed instructions on installing and setting up the environment for Large Language Models (LLM), as well as example scripts, refer to the [LLM best practices](https://github.com/intel/intel-extension-for-pytorch/tree/v2.2.0%2Bcpu/examples/cpu/inference/python/llm).
