Commit 94f4320

Add an example of INT4 GPT-J running MLPerf task to show accuracy (#2153)

* Add an example of INT4 GPT-J running MLPerf task to show accuracy
* Refine scripts
* Update readme and format with flake8

Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com>

1 parent d6780e0 commit 94f4320

File tree

3 files changed: +494 −1 lines changed


examples/cpu/inference/python/llm/README.md

Lines changed: 9 additions & 1 deletion
@@ -190,7 +190,7 @@ OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python ru
 ## Weight only quantization with low precision checkpoint (Experimental)
 Using INT4 weights can further improve performance by reducing memory bandwidth. However, direct per-channel quantization of weights to INT4 often results in poor accuracy. Some algorithms can modify weights through calibration before quantization to minimize the accuracy drop. GPTQ is one such algorithm. You may generate modified weights and quantization info (scales, zero points) for a given model with a calibration dataset using such an algorithm. The results are saved as a `state_dict` in a `.pt` file. We provide a script here to run GPTQ (Intel(R) Neural Compressor 2.3.1 is required).
 
-Here is an example:
+Here is how to use it:
 ```bash
 # Step 1: Generate modified weights and quantization info
 python utils/run_gptq.py --model <MODEL_ID> --output-dir ./saved_results
@@ -235,6 +235,14 @@ user_model = ipex.optimize_transformers(
     deployment_mode=False,
 )
 ```
+**Example**
+
+Intel(R) Extension for PyTorch* with INT4 weight-only quantization was used in the latest MLPerf submission (August 2023) to fully exploit the power of Intel(R) Xeon(R) processors, and it also shows good accuracy compared with FP32. This example is a simplified version of that MLPerf task. It downloads a fine-tuned FP32 GPT-J model used for the MLPerf submission, quantizes the model to INT4 and runs a text summarization task on the `cnn_dailymail` dataset. The example runs on 1000 samples, which is a good approximation of the results for the entire dataset and saves time.
+```sh
+OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> bash single_instance/run_int4_gpt-j_on_cnndailymail.sh
+```
+Please note that 100 GB of disk space, 100 GB of memory and Internet access are needed to run this example. The example runs for a few hours depending on your hardware and network conditions. It is verified on the 4th generation Intel(R) Xeon(R) Scalable (Sapphire Rapids) platform; you may get different results on older platforms where some new hardware features are unavailable.
+
 **Checkpoint Requirements**
 
 IPEX currently supports only certain cases: weights must be N by K and per-channel asymmetrically quantized (group size = -1) to UINT4, then compressed along the K axis to `torch.int32`.
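
For context, the end-to-end flow this README section describes could look roughly like the sketch below. This is a minimal illustration, not the verbatim README code: the checkpoint file name is hypothetical, and the `get_weight_only_quant_qconfig_mapping` and `low_precision_checkpoint` arguments are assumed from the IPEX weight-only quantization API of this period; consult the full README for the authoritative call.

```python
# Minimal sketch of the flow described above; not verbatim README code.
# Assumptions (verify against the full README): the checkpoint file name, and
# that ipex.optimize_transformers of this release takes the GPTQ state_dict
# through a low_precision_checkpoint argument with a weight-only-quant qconfig.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

user_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")

# Step 1 (utils/run_gptq.py) saved a state_dict under ./saved_results;
# the exact file name below is hypothetical.
low_precision_checkpoint = torch.load("./saved_results/gptq_checkpoint.pt")

qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping()  # assumed API
user_model = ipex.optimize_transformers(
    user_model,
    dtype=torch.float,
    quantization_config=qconfig,
    low_precision_checkpoint=low_precision_checkpoint,
    deployment_mode=False,
)
```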
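
To make the checkpoint layout requirement concrete, here is a small self-contained sketch (an illustration, not IPEX code) of packing an N-by-K UINT4 weight tensor along the K axis into `torch.int32`, eight 4-bit values per 32-bit word. The low-nibble-first ordering is an assumption; the actual order is defined by the GPTQ script's output.

```python
import torch

def pack_uint4_along_k(qweight: torch.Tensor) -> torch.Tensor:
    """Pack an [N, K] tensor of UINT4 values (each in [0, 15]) into an
    [N, K // 8] torch.int32 tensor: 8 nibbles per 32-bit word along K.
    The low-nibble-first order here is an assumption for illustration."""
    n, k = qweight.shape
    assert k % 8 == 0, "K must be a multiple of 8 to pack 8 nibbles per int32"
    q = (qweight.to(torch.int32) & 0xF).reshape(n, k // 8, 8)
    packed = torch.zeros(n, k // 8, dtype=torch.int32)
    for i in range(8):
        packed |= q[:, :, i] << (4 * i)  # place nibble i at bits [4*i, 4*i+4)
    return packed

# Per-channel asymmetric dequantization (group size = -1) then uses one scale
# and one zero point per output channel: w_fp = (q - zero_point) * scale
q = torch.randint(0, 16, (4, 16))   # toy [N=4, K=16] UINT4 weights
print(pack_uint4_along_k(q).shape)  # torch.Size([4, 2])
```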
