Commit 94f4320

Add an example of INT4 GPT-J running MLPerf task to show accuracy (#2153)

* Add an example of INT4 GPT-J running MLPerf task to show accuracy
* Refine scripts
* Update readme and format with flake8

Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com>

1 parent d6780e0 commit 94f4320

File tree

3 files changed: +494 −1 lines changed


examples/cpu/inference/python/llm/README.md

Lines changed: 9 additions & 1 deletion
@@ -190,7 +190,7 @@ OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python ru
 ## Weight only quantization with low precision checkpoint (Experimental)
 Using INT4 weights can further improve performance by reducing memory bandwidth. However, direct per-channel quantization of weights to INT4 often results in poor accuracy. Some algorithms can modify weights through calibration before quantization to minimize the accuracy drop. GPTQ is one such algorithm. You may generate modified weights and quantization info (scales, zero points) for a given model with a calibration dataset using such an algorithm. The results are saved as a `state_dict` in a `.pt` file. We provide a script here to run GPTQ (Intel(R) Neural Compressor 2.3.1 is required).
 
-Here is an example:
+Here is how to use it:
 ```bash
 # Step 1: Generate modified weights and quantization info
 python utils/run_gptq.py --model <MODEL_ID> --output-dir ./saved_results
@@ -235,6 +235,14 @@ user_model = ipex.optimize_transformers(
     deployment_mode=False,
 )
 ```
+**Example**
+
+Intel(R) Extension for PyTorch* with INT4 weight-only quantization was used in the latest MLPerf submission (August 2023) to fully exploit the power of Intel(R) Xeon(R) processors, and it also shows good accuracy compared with FP32. This example is a simplified version of that MLPerf task. It downloads a fine-tuned FP32 GPT-J model used for the MLPerf submission, quantizes the model to INT4 and runs a text summarization task on the `cnn_dailymail` dataset. The example runs on 1000 samples, which is a good approximation of the results for the entire dataset and saves time.
+```sh
+OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> bash single_instance/run_int4_gpt-j_on_cnndailymail.sh
+```
+Please note that 100 GB of disk space, 100 GB of memory and Internet access are needed to run this example. The example runs for a few hours depending on your hardware and network conditions. It is verified on the 4th generation Intel(R) Xeon(R) Scalable (Sapphire Rapids) platform; you may get different results on older platforms where some new hardware features are unavailable.
+
 **Checkpoint Requirements**
 
 IPEX currently supports only certain cases: weights must be N by K and per-channel asymmetrically quantized (group size = -1) to UINT4, then compressed along the K axis to `torch.int32`.
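
For context, the end-to-end flow this README section describes could look roughly like the sketch below. This is a minimal illustration, not the verbatim README code: the checkpoint file name is hypothetical, and the `get_weight_only_quant_qconfig_mapping` and `low_precision_checkpoint` arguments are assumed from the IPEX weight-only quantization API of this period; consult the full README for the authoritative call.

```python
# Minimal sketch of the flow described above; not verbatim README code.
# Assumptions (verify against the full README): the checkpoint file name, and
# that ipex.optimize_transformers of this release takes the GPTQ state_dict
# through a low_precision_checkpoint argument with a weight-only-quant qconfig.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

user_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")

# Step 1 (utils/run_gptq.py) saved a state_dict under ./saved_results;
# the exact file name below is hypothetical.
low_precision_checkpoint = torch.load("./saved_results/gptq_checkpoint.pt")

qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping()  # assumed API
user_model = ipex.optimize_transformers(
    user_model,
    dtype=torch.float,
    quantization_config=qconfig,
    low_precision_checkpoint=low_precision_checkpoint,
    deployment_mode=False,
)
```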
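
To make the checkpoint layout requirement concrete, here is a small self-contained sketch (an illustration, not IPEX code) of packing an N-by-K UINT4 weight tensor along the K axis into `torch.int32`, eight 4-bit values per 32-bit word. The low-nibble-first ordering is an assumption; the actual order is defined by the GPTQ script's output.

```python
import torch

def pack_uint4_along_k(qweight: torch.Tensor) -> torch.Tensor:
    """Pack an [N, K] tensor of UINT4 values (each in [0, 15]) into an
    [N, K // 8] torch.int32 tensor: 8 nibbles per 32-bit word along K.
    The low-nibble-first order here is an assumption for illustration."""
    n, k = qweight.shape
    assert k % 8 == 0, "K must be a multiple of 8 to pack 8 nibbles per int32"
    q = (qweight.to(torch.int32) & 0xF).reshape(n, k // 8, 8)
    packed = torch.zeros(n, k // 8, dtype=torch.int32)
    for i in range(8):
        packed |= q[:, :, i] << (4 * i)  # place nibble i at bits [4*i, 4*i+4)
    return packed

# Per-channel asymmetric dequantization (group size = -1) then uses one scale
# and one zero point per output channel: w_fp = (q - zero_point) * scale
q = torch.randint(0, 16, (4, 16))   # toy [N=4, K=16] UINT4 weights
print(pack_uint4_along_k(q).shape)  # torch.Size([4, 2])
```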
