From baad12095ae26200edf5f98fd6bf2011b08a5828 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Wed, 12 Nov 2025 17:00:01 -0800 Subject: [PATCH 01/17] add init doc Signed-off-by: yiliu30 --- docs/getting-started/compress.md | 1 + examples/autoround/README.md | 143 +++++++++++++++++++++++++++++++ 2 files changed, 144 insertions(+) create mode 100644 examples/autoround/README.md diff --git a/docs/getting-started/compress.md b/docs/getting-started/compress.md index ece0ba851c..418cf1be94 100644 --- a/docs/getting-started/compress.md +++ b/docs/getting-started/compress.md @@ -33,6 +33,7 @@ Compression schemes use quantization methods including the following: | **AWQ** | Uses channelwise scaling to better preserve important outliers in weights and activations | Better accuracy recovery with faster runtime than GPTQ | | **SmoothQuant** | Smooths outliers in activations by folding them into weights, ensuring better accuracy for weight and activation quantized models | Good accuracy recovery with minimal calibration time; composable with other methods | | **Round-To-Nearest (RTN)** | Simple quantization technique that rounds each value to the nearest representable level in the target precision. | Provides moderate accuracy recovery in most scenarios. Computationally cheap and fast to implement, making it suitable for real-time or resource-constrained environments. | +| **AutoRound** | Utilizes xxx. | High accuracy recovery xxx. | For this guide, we'll use `GPTQ` composed with `SmoothQuant` to create an `INT W8A8` quantized model. This combination provides a good balance for performance, accuracy, and compatability across a wide range of hardware. diff --git a/examples/autoround/README.md b/examples/autoround/README.md new file mode 100644 index 0000000000..09e5723bfe --- /dev/null +++ b/examples/autoround/README.md @@ -0,0 +1,143 @@ +# `AutoRound` Quantization + +`llm-compressor` supports quantizing weights to `int4` for memory savings and inference acceleration with `vLLM` + +> `int4` mixed precision computation is supported on Nvidia GPUs with compute capability > 8.0 (Ampere, Ada Lovelace, Hopper). + +## Installation + +To get started, install: + +```bash +git clone https://github.com/vllm-project/llm-compressor.git +cd llm-compressor +pip install -e . +``` + +## Quickstart + +The example includes an end-to-end script for applying the quantization algorithm. + +```bash +python3 llama3_example.py +``` + +The resulting model `Meta-Llama-3-8B-Instruct-W4A16-G128-AutoRound` is ready to be loaded into vLLM. + +## Code Walkthough + +Now, we will step though the code in the example. There are four steps: +1) Load model +2) Prepare calibration data +3) Apply quantization +4) Evaluate accuracy in vLLM + +### 1) Load Model + +Load the model using `AutoModelForCausalLM` for handling quantized saving and loading. + +```python +from transformers import AutoTokenizer, AutoModelForCausalLM + +MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct" +model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto") +tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) +``` + +### 2) Prepare Calibration Data + +Prepare the calibration data. When quantizing weigths of a model to `int4` using GPTQ, we need some sample data to run the GPTQ algorithms. As a result, it is very useful to use calibration data that closely matches the type of data used in deployment. If you have fine-tuned a model, using a sample of your training data is a good idea. 
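For example, a minimal sketch of pointing this calibration step at your own fine-tuning data might look like the following (the `finetune_samples.jsonl` file and its `messages` field are placeholders for whatever your training data actually contains; `tokenizer` is the one loaded in step 1):

```python
from datasets import load_dataset

# Hypothetical local file with one {"messages": [...]} record per line.
ds = load_dataset("json", data_files="finetune_samples.jsonl", split="train")
ds = ds.shuffle(seed=42).select(range(min(512, len(ds))))

def preprocess(example):
    # Re-apply the same chat template the model was fine-tuned with.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)
```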
+ +In our case, we are quantizing an Instruction tuned generic model, so we will use the `ultrachat` dataset. Some best practices include: +* 512 samples is a good place to start (increase if accuracy drops) +* 2048 sequence length is a good place to start +* Use the chat template or instrucion template that the model is trained with + +```python +from datasets import load_dataset + +NUM_CALIBRATION_SAMPLES=512 +MAX_SEQUENCE_LENGTH=2048 + +# Load dataset. +ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]") +ds = ds.shuffle(seed=42) + +# Preprocess the data into the format the model is trained with. +def preprocess(example): + return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False,)} +ds = ds.map(preprocess) + +# Tokenize the data (be careful with bos tokens - we need add_special_tokens=False since the chat_template already added it). +def tokenize(sample): + return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False) +ds = ds.map(tokenize, remove_columns=ds.column_names) +``` + +### 3) Apply Quantization + +With the dataset ready, we will now apply quantization. + +We first select the quantization algorithm. + +In our case, we will apply the default GPTQ recipe for `int4` (which uses static group size 128 scales) to all linear layers. +> See the `Recipes` documentation for more information on making complex recipes + +```python +from llmcompressor import oneshot +from llmcompressor.modifiers.quantization import GPTQModifier + +# Configure the quantization algorithm to run. +recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]) + +# Apply quantization. +oneshot( + model=model, dataset=ds, + recipe=recipe, + max_seq_length=MAX_SEQUENCE_LENGTH, + num_calibration_samples=NUM_CALIBRATION_SAMPLES, +) + +# Save to disk compressed. +SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-W4A16-G128" +model.save_pretrained(SAVE_DIR, save_compressed=True) +tokenizer.save_pretrained(SAVE_DIR) +``` + +We have successfully created an `int4` model! + +### 4) Evaluate Accuracy + +With the model created, we can now load and run in vLLM (after installing). + +```python +from vllm import LLM +model = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128") +``` + +We can evaluate accuracy with `lm_eval` (`pip install lm_eval==v0.4.3`): +> Note: quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations. + +Run the following to test accuracy on GSM-8K: + +```bash +lm_eval --model vllm \ + --model_args pretrained="./Meta-Llama-3-8B-Instruct-W4A16-G128",add_bos_token=true \ + --tasks gsm8k \ + --num_fewshot 5 \ + --limit 250 \ + --batch_size 'auto' +``` + +We can see the resulting scores look good! + +```bash +|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr| +|-----|------:|----------------|-----:|-----------|---|----:|---|-----:| +|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.728|± |0.0282| +| | |strict-match | 5|exact_match|↑ |0.720|± |0.0285| +``` + +### Questions or Feature Request? 
+ +Please open up an issue on `vllm-project/llm-compressor` From e73b98d280ca573a5edfd635836e9c72b5e64767 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Mon, 17 Nov 2025 22:26:40 -0800 Subject: [PATCH 02/17] update docs Signed-off-by: yiliu30 --- examples/autoround/README.md | 143 ----------------------------------- 1 file changed, 143 deletions(-) delete mode 100644 examples/autoround/README.md diff --git a/examples/autoround/README.md b/examples/autoround/README.md deleted file mode 100644 index 09e5723bfe..0000000000 --- a/examples/autoround/README.md +++ /dev/null @@ -1,143 +0,0 @@ -# `AutoRound` Quantization - -`llm-compressor` supports quantizing weights to `int4` for memory savings and inference acceleration with `vLLM` - -> `int4` mixed precision computation is supported on Nvidia GPUs with compute capability > 8.0 (Ampere, Ada Lovelace, Hopper). - -## Installation - -To get started, install: - -```bash -git clone https://github.com/vllm-project/llm-compressor.git -cd llm-compressor -pip install -e . -``` - -## Quickstart - -The example includes an end-to-end script for applying the quantization algorithm. - -```bash -python3 llama3_example.py -``` - -The resulting model `Meta-Llama-3-8B-Instruct-W4A16-G128-AutoRound` is ready to be loaded into vLLM. - -## Code Walkthough - -Now, we will step though the code in the example. There are four steps: -1) Load model -2) Prepare calibration data -3) Apply quantization -4) Evaluate accuracy in vLLM - -### 1) Load Model - -Load the model using `AutoModelForCausalLM` for handling quantized saving and loading. - -```python -from transformers import AutoTokenizer, AutoModelForCausalLM - -MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct" -model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto") -tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) -``` - -### 2) Prepare Calibration Data - -Prepare the calibration data. When quantizing weigths of a model to `int4` using GPTQ, we need some sample data to run the GPTQ algorithms. As a result, it is very useful to use calibration data that closely matches the type of data used in deployment. If you have fine-tuned a model, using a sample of your training data is a good idea. - -In our case, we are quantizing an Instruction tuned generic model, so we will use the `ultrachat` dataset. Some best practices include: -* 512 samples is a good place to start (increase if accuracy drops) -* 2048 sequence length is a good place to start -* Use the chat template or instrucion template that the model is trained with - -```python -from datasets import load_dataset - -NUM_CALIBRATION_SAMPLES=512 -MAX_SEQUENCE_LENGTH=2048 - -# Load dataset. -ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]") -ds = ds.shuffle(seed=42) - -# Preprocess the data into the format the model is trained with. -def preprocess(example): - return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False,)} -ds = ds.map(preprocess) - -# Tokenize the data (be careful with bos tokens - we need add_special_tokens=False since the chat_template already added it). -def tokenize(sample): - return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False) -ds = ds.map(tokenize, remove_columns=ds.column_names) -``` - -### 3) Apply Quantization - -With the dataset ready, we will now apply quantization. - -We first select the quantization algorithm. 
- -In our case, we will apply the default GPTQ recipe for `int4` (which uses static group size 128 scales) to all linear layers. -> See the `Recipes` documentation for more information on making complex recipes - -```python -from llmcompressor import oneshot -from llmcompressor.modifiers.quantization import GPTQModifier - -# Configure the quantization algorithm to run. -recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]) - -# Apply quantization. -oneshot( - model=model, dataset=ds, - recipe=recipe, - max_seq_length=MAX_SEQUENCE_LENGTH, - num_calibration_samples=NUM_CALIBRATION_SAMPLES, -) - -# Save to disk compressed. -SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-W4A16-G128" -model.save_pretrained(SAVE_DIR, save_compressed=True) -tokenizer.save_pretrained(SAVE_DIR) -``` - -We have successfully created an `int4` model! - -### 4) Evaluate Accuracy - -With the model created, we can now load and run in vLLM (after installing). - -```python -from vllm import LLM -model = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128") -``` - -We can evaluate accuracy with `lm_eval` (`pip install lm_eval==v0.4.3`): -> Note: quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations. - -Run the following to test accuracy on GSM-8K: - -```bash -lm_eval --model vllm \ - --model_args pretrained="./Meta-Llama-3-8B-Instruct-W4A16-G128",add_bos_token=true \ - --tasks gsm8k \ - --num_fewshot 5 \ - --limit 250 \ - --batch_size 'auto' -``` - -We can see the resulting scores look good! - -```bash -|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr| -|-----|------:|----------------|-----:|-----------|---|----:|---|-----:| -|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.728|± |0.0282| -| | |strict-match | 5|exact_match|↑ |0.720|± |0.0285| -``` - -### Questions or Feature Request? - -Please open up an issue on `vllm-project/llm-compressor` From 03b420e2c7ffc3fe100e2ba8d1d11b867c705f83 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Mon, 17 Nov 2025 22:26:51 -0800 Subject: [PATCH 03/17] update docs Signed-off-by: yiliu30 --- examples/autoround/README.md | 142 +++++++++++++++++++++++++++++++++++ 1 file changed, 142 insertions(+) create mode 100644 examples/autoround/README.md diff --git a/examples/autoround/README.md b/examples/autoround/README.md new file mode 100644 index 0000000000..748ae2da69 --- /dev/null +++ b/examples/autoround/README.md @@ -0,0 +1,142 @@ +# `AutoRound` Quantization + +`llm-compressor` supports quantizing weights to `int4` using [AutoRound](https://aclanthology.org/2024.findings-emnlp.662.pdf), +an advanced quantization algorithm for achieving high accuracy with low-bit quantization, and inference acceleration with `vLLM` + + +## Installation + +To get started, install: + +```bash +git clone https://github.com/vllm-project/llm-compressor.git +cd llm-compressor +pip install -e . +``` + +## Quickstart + +The example includes an end-to-end script for applying the AutoRound quantization algorithm. + +```bash +python3 llama3_example.py +``` + +The resulting model `Meta-Llama-3-8B-Instruct-W4A16-G128-AutoRound` is ready to be loaded into vLLM. + +## Code Walkthough + +Now, we will step though the code in the example. 
There are four steps: +1) Load model +2) Prepare calibration data +3) Apply quantization +4) Evaluate accuracy in vLLM + +### 1) Load Model + +Load the model using `AutoModelForCausalLM` for handling quantized saving and loading. + +```python +from transformers import AutoTokenizer, AutoModelForCausalLM + +MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct" +model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto") +tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) +``` + +### 2) Prepare Calibration Data + +Prepare the calibration data. When quantizing weigths of a model using AutoRound, we need some sample data to run the AutoRound algorithms. +To quantize a given tensor, Auto-Round introduces three trainable parameters (V, α and β) to adjust the rounding value and clipping range. +For a given model, Auto-Round quantizes the decoder layer one by one, using block-wise output reconstruction error as loss to train these parameters. +More specifically, AutoRound introduces a trainable parameter V to adjust the rounding values, +As a result, it is very useful to use calibration data that closely matches the type of data used in deployment. If you have fine-tuned a model, using a sample of your training data is a good idea. + +In our case, we are using [NeelNanda/pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) as our default calibration dataset. Some best practices include: +* 128 samples is a good place to start (increase if accuracy drops) +* 2048 sequence length is a good place to start +* 200 tuning steps is a good place to start (increase if accuracy drops) + +```python +# Select calibration dataset. +from auto_round.calib_dataset import get_dataset + +NUM_CALIBRATION_SAMPLES = 128 +MAX_SEQUENCE_LENGTH = 2048 +# Get aligned calibration dataset. + +ds = get_dataset( + tokenizer=tokenizer, + seqlen=MAX_SEQUENCE_LENGTH, + nsamples=NUM_CALIBRATION_SAMPLES, +) +``` + +### 3) Apply Quantization + +With the dataset ready, we will now apply AutoRound quantization to the model. + +```python +from llmcompressor import oneshot +from llmcompressor.modifiers.quantization import AutoRoundModifier + +# Configure the quantization algorithm to run. +recipe = AutoRoundModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"], iters=200) + +# Apply quantization. +oneshot( + model=model, dataset=ds, + recipe=recipe, + max_seq_length=MAX_SEQUENCE_LENGTH, + num_calibration_samples=NUM_CALIBRATION_SAMPLES, +) + +# Save to disk compressed. +SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound" +model.save_pretrained(SAVE_DIR, save_compressed=True) +tokenizer.save_pretrained(SAVE_DIR) +``` + +We have successfully created an `int4` model! + +### 4) Evaluate Accuracy + +With the model created, we can now load and run in vLLM (after installing). + +```python +from vllm import LLM +model = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128-AutoRound") +``` + +We can evaluate accuracy with `lm_eval` (`pip install lm_eval==v0.4.9.1`): +> Note: quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations. + +Run the following to test accuracy on GSM-8K: + +```bash +lm_eval --model vllm \ + --model_args pretrained="./Meta-Llama-3-8B-Instruct-W4A16-G128-AutoRound",add_bos_token=true \ + --tasks gsm8k \ + --num_fewshot 5 \ + --limit 1000 \ + --batch_size 'auto' +``` + +We can see the resulting scores look good! 
+ +```bash +| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr | +| ----- | ------: | ---------------- | -----: | ----------- | --- | ----: | --- | -----: | +| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.737 | ± | 0.0139 | +| | | strict-match | 5 | exact_match | ↑ | 0.736 | ± | 0.0139 | +``` +> Note: please be aware that quantized model accuracy may fluctuate due to non-deterministic factors. + + +### Known Issues +Currently, `AutoRound` quantization only supports `wNa16` quantization, more quantization schemes will be supported in the near future. +Please refer to the [RFC](https://github.com/vllm-project/llm-compressor/issues/1968) for the latest updates. + +### Questions or Feature Request? + +Please open up an issue on `vllm-project/llm-compressor` From 37a03ec836138b9aeeb15f9c18323614f5b41d73 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Mon, 17 Nov 2025 22:48:24 -0800 Subject: [PATCH 04/17] update Signed-off-by: yiliu30 --- examples/autoround/README.md | 34 ++++++++++++++-------------------- 1 file changed, 14 insertions(+), 20 deletions(-) diff --git a/examples/autoround/README.md b/examples/autoround/README.md index 748ae2da69..5b9fd5cc25 100644 --- a/examples/autoround/README.md +++ b/examples/autoround/README.md @@ -1,8 +1,7 @@ # `AutoRound` Quantization -`llm-compressor` supports quantizing weights to `int4` using [AutoRound](https://aclanthology.org/2024.findings-emnlp.662.pdf), -an advanced quantization algorithm for achieving high accuracy with low-bit quantization, and inference acceleration with `vLLM` - +`llm-compressor` supports [AutoRound]((https://aclanthology.org/2024.findings-emnlp.662.pdf)), an advanced quantization technique that delivers high-accuracy, low-bit quantization. The quantized results are fully compatible with `compressed-tensor` and can be served directly with vLLM. +AutoRound introduces three trainable parameters (V, α, and β) to optimize rounding values and clipping ranges during quantization. The method processes each decoder layer sequentially, using block-wise output reconstruction error as the training objective to fine-tuning these parameters. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance. ## Installation @@ -46,16 +45,11 @@ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) ### 2) Prepare Calibration Data -Prepare the calibration data. When quantizing weigths of a model using AutoRound, we need some sample data to run the AutoRound algorithms. -To quantize a given tensor, Auto-Round introduces three trainable parameters (V, α and β) to adjust the rounding value and clipping range. -For a given model, Auto-Round quantizes the decoder layer one by one, using block-wise output reconstruction error as loss to train these parameters. -More specifically, AutoRound introduces a trainable parameter V to adjust the rounding values, -As a result, it is very useful to use calibration data that closely matches the type of data used in deployment. If you have fine-tuned a model, using a sample of your training data is a good idea. - -In our case, we are using [NeelNanda/pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) as our default calibration dataset. 
Some best practices include: -* 128 samples is a good place to start (increase if accuracy drops) -* 2048 sequence length is a good place to start -* 200 tuning steps is a good place to start (increase if accuracy drops) +When quantizing model weights with AutoRound, you’ll need a small set of sample data to run the algorithm. By default, we are using [NeelNanda/pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) as our calibration dataset. +Recommended starting points: +- 128 samples — typically sufficient for stable calibration (increase if accuracy degrades). +- 2048 sequence length — a good baseline for most LLMs. +- 200 tuning steps — usually enough to converge (increase if accuracy drops). ```python # Select calibration dataset. @@ -63,8 +57,8 @@ from auto_round.calib_dataset import get_dataset NUM_CALIBRATION_SAMPLES = 128 MAX_SEQUENCE_LENGTH = 2048 -# Get aligned calibration dataset. +# Get aligned calibration dataset. ds = get_dataset( tokenizer=tokenizer, seqlen=MAX_SEQUENCE_LENGTH, @@ -81,7 +75,9 @@ from llmcompressor import oneshot from llmcompressor.modifiers.quantization import AutoRoundModifier # Configure the quantization algorithm to run. -recipe = AutoRoundModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"], iters=200) +recipe = AutoRoundModifier( + targets="Linear", scheme="W4A16", ignore=["lm_head"], iters=200 +) # Apply quantization. oneshot( @@ -130,13 +126,11 @@ We can see the resulting scores look good! | gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.737 | ± | 0.0139 | | | | strict-match | 5 | exact_match | ↑ | 0.736 | ± | 0.0139 | ``` -> Note: please be aware that quantized model accuracy may fluctuate due to non-deterministic factors. - +> Note: quantized model accuracy may vary slightly due to nondeterminism. ### Known Issues -Currently, `AutoRound` quantization only supports `wNa16` quantization, more quantization schemes will be supported in the near future. -Please refer to the [RFC](https://github.com/vllm-project/llm-compressor/issues/1968) for the latest updates. +Currently, `llm-compressor` supports applying AutoRound only on the `wNa16` quantization scheme. Support for additional schemes is planned. You can follow progress in the [RFC](https://github.com/vllm-project/llm-compressor/issues/1968). ### Questions or Feature Request? -Please open up an issue on `vllm-project/llm-compressor` +Please open up an issue on `vllm-project/llm-compressor` or `intel/auto-round`. From f993f80e6c5c56145e570af2b64eee9c1bdb87eb Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Mon, 17 Nov 2025 22:50:06 -0800 Subject: [PATCH 05/17] update Signed-off-by: yiliu30 --- examples/autoround/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/autoround/README.md b/examples/autoround/README.md index 5b9fd5cc25..1fdfbe6fca 100644 --- a/examples/autoround/README.md +++ b/examples/autoround/README.md @@ -1,6 +1,6 @@ # `AutoRound` Quantization -`llm-compressor` supports [AutoRound]((https://aclanthology.org/2024.findings-emnlp.662.pdf)), an advanced quantization technique that delivers high-accuracy, low-bit quantization. The quantized results are fully compatible with `compressed-tensor` and can be served directly with vLLM. +`llm-compressor` supports [AutoRound](https://aclanthology.org/2024.findings-emnlp.662.pdf), an advanced quantization technique that delivers **high-accuracy**, **low-bit quantization**. The quantized results are fully compatible with `compressed-tensors` and can be served directly with vLLM. 
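For instance, once the example below has produced a checkpoint, the saved directory can be handed straight to vLLM's server CLI (the directory name matches the Quickstart output; the Python `LLM` API shown later in this guide works just as well):

```bash
vllm serve ./Meta-Llama-3-8B-Instruct-W4A16-G128-AutoRound
```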
AutoRound introduces three trainable parameters (V, α, and β) to optimize rounding values and clipping ranges during quantization. The method processes each decoder layer sequentially, using block-wise output reconstruction error as the training objective to fine-tuning these parameters. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance. ## Installation From 374d99161d32cf3337a3147e0707deb54774b7f2 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Mon, 17 Nov 2025 22:52:58 -0800 Subject: [PATCH 06/17] update Signed-off-by: yiliu30 --- examples/autoround/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/autoround/README.md b/examples/autoround/README.md index 1fdfbe6fca..49ed7133e8 100644 --- a/examples/autoround/README.md +++ b/examples/autoround/README.md @@ -133,4 +133,4 @@ Currently, `llm-compressor` supports applying AutoRound only on the `wNa16` quan ### Questions or Feature Request? -Please open up an issue on `vllm-project/llm-compressor` or `intel/auto-round`. +Please open up an issue on [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor/issues) or [intel/auto-roun](https://github.com/intel/auto-round/issues). From 73fdcf18524038e197b03346e4e57d550950f12f Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Mon, 17 Nov 2025 22:53:42 -0800 Subject: [PATCH 07/17] fix Signed-off-by: yiliu30 --- examples/autoround/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/autoround/README.md b/examples/autoround/README.md index 49ed7133e8..7ab449f8ca 100644 --- a/examples/autoround/README.md +++ b/examples/autoround/README.md @@ -133,4 +133,4 @@ Currently, `llm-compressor` supports applying AutoRound only on the `wNa16` quan ### Questions or Feature Request? -Please open up an issue on [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor/issues) or [intel/auto-roun](https://github.com/intel/auto-round/issues). +Please open up an issue on [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor/issues) or [intel/auto-round](https://github.com/intel/auto-round/issues). From ead09b3028811806330d1e022001715515b0ff42 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Mon, 17 Nov 2025 22:54:47 -0800 Subject: [PATCH 08/17] update Signed-off-by: yiliu30 --- examples/autoround/README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/autoround/README.md b/examples/autoround/README.md index 7ab449f8ca..b0b4e337e7 100644 --- a/examples/autoround/README.md +++ b/examples/autoround/README.md @@ -1,6 +1,7 @@ # `AutoRound` Quantization `llm-compressor` supports [AutoRound](https://aclanthology.org/2024.findings-emnlp.662.pdf), an advanced quantization technique that delivers **high-accuracy**, **low-bit quantization**. The quantized results are fully compatible with `compressed-tensors` and can be served directly with vLLM. + AutoRound introduces three trainable parameters (V, α, and β) to optimize rounding values and clipping ranges during quantization. The method processes each decoder layer sequentially, using block-wise output reconstruction error as the training objective to fine-tuning these parameters. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance. 
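To make the roles of these parameters concrete, the sketch below shows a simplified, per-tensor version of the weight fake-quantization they control. The real implementation works group-wise (e.g. group size 128) and trains `v`, `alpha`, and `beta` with signed-gradient descent, so treat this as an illustration rather than the exact AutoRound code:

```python
import torch

def autoround_style_fake_quant(w, v, alpha, beta, num_bits=4):
    """Schematic AutoRound-style weight fake-quantization.

    alpha/beta rescale the observed max/min (the clipping range used to derive
    the scale), while v (same shape as w, typically kept in a small range such
    as [-0.5, 0.5]) nudges each element's rounding decision up or down.
    """
    qmax = 2**num_bits - 1
    w_max = w.max() * alpha
    w_min = w.min() * beta
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale + v) + zero_point, 0, qmax)
    return (q - zero_point) * scale  # dequantized weight used in the forward pass

# During tuning, (v, alpha, beta) are optimized one decoder block at a time to
# minimize || block(x; W_fake_quant) - block(x; W_original) ||^2 on calibration data.
w = torch.randn(256, 256)
w_q = autoround_style_fake_quant(
    w, v=torch.zeros_like(w), alpha=torch.tensor(1.0), beta=torch.tensor(1.0)
)
```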
## Installation From 9eedf880944130aadbd54cb004d8a8bb84a0b2df Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Mon, 17 Nov 2025 23:05:52 -0800 Subject: [PATCH 09/17] add more Signed-off-by: yiliu30 --- docs/getting-started/compress.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/getting-started/compress.md b/docs/getting-started/compress.md index 418cf1be94..6a5befe0d8 100644 --- a/docs/getting-started/compress.md +++ b/docs/getting-started/compress.md @@ -33,7 +33,7 @@ Compression schemes use quantization methods including the following: | **AWQ** | Uses channelwise scaling to better preserve important outliers in weights and activations | Better accuracy recovery with faster runtime than GPTQ | | **SmoothQuant** | Smooths outliers in activations by folding them into weights, ensuring better accuracy for weight and activation quantized models | Good accuracy recovery with minimal calibration time; composable with other methods | | **Round-To-Nearest (RTN)** | Simple quantization technique that rounds each value to the nearest representable level in the target precision. | Provides moderate accuracy recovery in most scenarios. Computationally cheap and fast to implement, making it suitable for real-time or resource-constrained environments. | -| **AutoRound** | Utilizes xxx. | High accuracy recovery xxx. | +| **AutoRound** |Introduces lightweight trainable parameters to optimize rounding and clipping ranges using block-wise reconstruction error. | Strong accuracy recovery with moderate tuning time; significantly more accurate than RTN and generally faster than GPTQ. | For this guide, we'll use `GPTQ` composed with `SmoothQuant` to create an `INT W8A8` quantized model. This combination provides a good balance for performance, accuracy, and compatability across a wide range of hardware. From 6e9623ba5fb271d47f441ae5f27874fb7384da0a Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Tue, 18 Nov 2025 04:03:30 -0800 Subject: [PATCH 10/17] fix Signed-off-by: yiliu30 --- docs/getting-started/compress.md | 2 +- examples/autoround/README.md | 14 +++++++++----- 2 files changed, 10 insertions(+), 6 deletions(-) diff --git a/docs/getting-started/compress.md b/docs/getting-started/compress.md index 6a5befe0d8..6e4a56977c 100644 --- a/docs/getting-started/compress.md +++ b/docs/getting-started/compress.md @@ -33,7 +33,7 @@ Compression schemes use quantization methods including the following: | **AWQ** | Uses channelwise scaling to better preserve important outliers in weights and activations | Better accuracy recovery with faster runtime than GPTQ | | **SmoothQuant** | Smooths outliers in activations by folding them into weights, ensuring better accuracy for weight and activation quantized models | Good accuracy recovery with minimal calibration time; composable with other methods | | **Round-To-Nearest (RTN)** | Simple quantization technique that rounds each value to the nearest representable level in the target precision. | Provides moderate accuracy recovery in most scenarios. Computationally cheap and fast to implement, making it suitable for real-time or resource-constrained environments. | -| **AutoRound** |Introduces lightweight trainable parameters to optimize rounding and clipping ranges using block-wise reconstruction error. | Strong accuracy recovery with moderate tuning time; significantly more accurate than RTN and generally faster than GPTQ. 
| +| **AutoRound** | Introduces lightweight trainable parameters to optimize rounding and clipping ranges using block-wise reconstruction error. | Strong accuracy recovery with moderate tuning time; significantly more accurate than RTN and generally faster than GPTQ. | For this guide, we'll use `GPTQ` composed with `SmoothQuant` to create an `INT W8A8` quantized model. This combination provides a good balance for performance, accuracy, and compatability across a wide range of hardware. diff --git a/examples/autoround/README.md b/examples/autoround/README.md index b0b4e337e7..993e01e963 100644 --- a/examples/autoround/README.md +++ b/examples/autoround/README.md @@ -2,7 +2,7 @@ `llm-compressor` supports [AutoRound](https://aclanthology.org/2024.findings-emnlp.662.pdf), an advanced quantization technique that delivers **high-accuracy**, **low-bit quantization**. The quantized results are fully compatible with `compressed-tensors` and can be served directly with vLLM. -AutoRound introduces three trainable parameters (V, α, and β) to optimize rounding values and clipping ranges during quantization. The method processes each decoder layer sequentially, using block-wise output reconstruction error as the training objective to fine-tuning these parameters. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance. +AutoRound introduces three trainable parameters (V, α, and β) to optimize rounding values and clipping ranges during quantization. The method processes each decoder layer sequentially, using block-wise output reconstruction error as the training objective to fine-tune these parameters. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance. ## Installation @@ -24,9 +24,9 @@ python3 llama3_example.py The resulting model `Meta-Llama-3-8B-Instruct-W4A16-G128-AutoRound` is ready to be loaded into vLLM. -## Code Walkthough +## Code Walkthrough -Now, we will step though the code in the example. There are four steps: +Now, we will step through the code in the example. There are four steps: 1) Load model 2) Prepare calibration data 3) Apply quantization @@ -73,7 +73,7 @@ With the dataset ready, we will now apply AutoRound quantization to the model. ```python from llmcompressor import oneshot -from llmcompressor.modifiers.quantization import AutoRoundModifier +from llmcompressor.modifiers.autoround import AutoRoundModifier # Configure the quantization algorithm to run. recipe = AutoRoundModifier( @@ -82,12 +82,16 @@ recipe = AutoRoundModifier( # Apply quantization. oneshot( - model=model, dataset=ds, + model=model, + dataset=ds, recipe=recipe, max_seq_length=MAX_SEQUENCE_LENGTH, num_calibration_samples=NUM_CALIBRATION_SAMPLES, + # disable shuffling to get slightly better mmlu score + shuffle_calibration_samples=False, ) + # Save to disk compressed. 
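# `save_compressed=True` below writes the checkpoint in the compressed-tensors
# format, which is what lets vLLM load the quantized model directly.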
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound" model.save_pretrained(SAVE_DIR, save_compressed=True) From 87d71625670aa35419a76ac66f42d8a8884ba511 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Wed, 19 Nov 2025 17:15:12 -0800 Subject: [PATCH 11/17] update Signed-off-by: yiliu30 --- examples/autoround/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/autoround/README.md b/examples/autoround/README.md index 993e01e963..2b4bdb060f 100644 --- a/examples/autoround/README.md +++ b/examples/autoround/README.md @@ -138,4 +138,4 @@ Currently, `llm-compressor` supports applying AutoRound only on the `wNa16` quan ### Questions or Feature Request? -Please open up an issue on [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor/issues) or [intel/auto-round](https://github.com/intel/auto-round/issues). +Please open up an issue on [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) or [intel/auto-round](https://github.com/intel/auto-round). From 1a76018900bbcdaad11a9815b0c1bec6f2a4d175 Mon Sep 17 00:00:00 2001 From: Yi Liu Date: Thu, 20 Nov 2025 10:46:48 +0800 Subject: [PATCH 12/17] Update examples/autoround/README.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- examples/autoround/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/autoround/README.md b/examples/autoround/README.md index 2b4bdb060f..3c208bd477 100644 --- a/examples/autoround/README.md +++ b/examples/autoround/README.md @@ -109,7 +109,7 @@ from vllm import LLM model = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128-AutoRound") ``` -We can evaluate accuracy with `lm_eval` (`pip install lm_eval==v0.4.9.1`): +We can evaluate accuracy with `lm_eval` (`pip install lm-eval==0.4.9.1`): > Note: quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations. Run the following to test accuracy on GSM-8K: From 5e7df30275d36579a34220d0cf6568b4fd307188 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Wed, 19 Nov 2025 18:49:20 -0800 Subject: [PATCH 13/17] refine Signed-off-by: yiliu30 --- docs/getting-started/compress.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/getting-started/compress.md b/docs/getting-started/compress.md index 6e4a56977c..57b24fef6d 100644 --- a/docs/getting-started/compress.md +++ b/docs/getting-started/compress.md @@ -33,7 +33,7 @@ Compression schemes use quantization methods including the following: | **AWQ** | Uses channelwise scaling to better preserve important outliers in weights and activations | Better accuracy recovery with faster runtime than GPTQ | | **SmoothQuant** | Smooths outliers in activations by folding them into weights, ensuring better accuracy for weight and activation quantized models | Good accuracy recovery with minimal calibration time; composable with other methods | | **Round-To-Nearest (RTN)** | Simple quantization technique that rounds each value to the nearest representable level in the target precision. | Provides moderate accuracy recovery in most scenarios. Computationally cheap and fast to implement, making it suitable for real-time or resource-constrained environments. | -| **AutoRound** | Introduces lightweight trainable parameters to optimize rounding and clipping ranges using block-wise reconstruction error. 
| Strong accuracy recovery with moderate tuning time; significantly more accurate than RTN and generally faster than GPTQ. | +| **AutoRound** | AutoRound optimizes rounding and clipping ranges via sign-gradient descent. | Delivers leading 4-bit and superior sub-4-bit accuracy compared to GPTQ/AWQ, with runtime faster than GPTQ and on par with AWQ. | For this guide, we'll use `GPTQ` composed with `SmoothQuant` to create an `INT W8A8` quantized model. This combination provides a good balance for performance, accuracy, and compatability across a wide range of hardware. From c5b7a1d4cfd8da08de135033a1d49a3dadad7461 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Wed, 19 Nov 2025 18:58:11 -0800 Subject: [PATCH 14/17] update Signed-off-by: yiliu30 --- examples/autoround/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/autoround/README.md b/examples/autoround/README.md index 3c208bd477..3abdd7d17f 100644 --- a/examples/autoround/README.md +++ b/examples/autoround/README.md @@ -134,7 +134,7 @@ We can see the resulting scores look good! > Note: quantized model accuracy may vary slightly due to nondeterminism. ### Known Issues -Currently, `llm-compressor` supports applying AutoRound only on the `wNa16` quantization scheme. Support for additional schemes is planned. You can follow progress in the [RFC](https://github.com/vllm-project/llm-compressor/issues/1968). +Currently, `llm-compressor` supports applying AutoRound only on the `wNa16` quantization schemes. Support for additional schemes is planned. You can follow progress in the [RFC](https://github.com/vllm-project/llm-compressor/issues/1968). ### Questions or Feature Request? From 39915b03ab6ef00dd4a204d1fdb955470631e4f9 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Thu, 20 Nov 2025 16:57:45 -0800 Subject: [PATCH 15/17] update readme Signed-off-by: yiliu30 --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index 1a173b8c07..052a814c18 100644 --- a/README.md +++ b/README.md @@ -37,6 +37,7 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou Some of the exciting new features include: +* **AutoRound Quantization Support**: Added [`AutoRoundModifier`] for quantization using [AutoRound](https://aclanthology.org/2024.findings-emnlp.662.pdf), an advanced post-training algorithm that optimizes rounding and clipping ranges through sign-gradient descent. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance. Check out the [example](examples/autoround/llama3_example.py) to get started. * **Qwen3 Next and Qwen3 VL MoE Quantization Support**: Quantize the Qwen3 Next and Qwen3 VL MoE models and seamlessly run the models in vLLM. Examples for [NVFP4](examples/quantization_w4a4_fp4/qwen3_next_example.py) and [FP8](examples/quantization_w8a8_fp8/qwen3_next_example.py) Quantization have been added for the Qwen3-Next-80B-A3B-Instruct. For the Qwen3 VL MoE, support has been added for the datafree pathway, specifically [FP8 Quantization](examples/quantization_w8a8_fp8/qwen3_vl_moe_fp8_example.py) (e.g channel-wise and block-wise quantization). NOTE: these models are not supported in tranformers<=4.56.2. You may need to install transformers from source. 
* **Quantization with Multiple Modifiers**: Multiple quantization modifiers can now be applied to the same model for mixed-precision quantization, for example applying AWQ W4A16 to a model's `self_attn` layers and GPTQ W8A8 to its `mlp` layers. This is an advanced usage of `llm-compressor` and an active area of research. See the [non-uniform quantization support](examples/quantization_non_uniform) section for more detail and [example usage](examples/quantization_non_uniform/quantization_multiple_modifiers.py). * **QuIP and SpinQuant-style Transforms**: The newly added [`QuIPModifier`](examples/transform/quip_example.py) and [`SpinQuantModifier`](examples/transform/spinquant_example.py) allow users to quantize their models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit weight and activation quantization. @@ -55,6 +56,7 @@ Some of the exciting new features include: * AWQ * SmoothQuant * SparseGPT +* AutoRound ### When to Use Which Optimization From d02c95a2a1b55a2e62c54b93e944d8f2bfc0b0e4 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Thu, 20 Nov 2025 16:59:12 -0800 Subject: [PATCH 16/17] update Signed-off-by: yiliu30 --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 052a814c18..6c9b40981b 100644 --- a/README.md +++ b/README.md @@ -37,7 +37,7 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou Some of the exciting new features include: -* **AutoRound Quantization Support**: Added [`AutoRoundModifier`] for quantization using [AutoRound](https://aclanthology.org/2024.findings-emnlp.662.pdf), an advanced post-training algorithm that optimizes rounding and clipping ranges through sign-gradient descent. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance. Check out the [example](examples/autoround/llama3_example.py) to get started. +* **AutoRound Quantization Support**: Added [`AutoRoundModifier`](examples/autoround/llama3_example.py) for quantization using [AutoRound](https://aclanthology.org/2024.findings-emnlp.662.pdf), an advanced post-training algorithm that optimizes rounding and clipping ranges through sign-gradient descent. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance. * **Qwen3 Next and Qwen3 VL MoE Quantization Support**: Quantize the Qwen3 Next and Qwen3 VL MoE models and seamlessly run the models in vLLM. Examples for [NVFP4](examples/quantization_w4a4_fp4/qwen3_next_example.py) and [FP8](examples/quantization_w8a8_fp8/qwen3_next_example.py) Quantization have been added for the Qwen3-Next-80B-A3B-Instruct. For the Qwen3 VL MoE, support has been added for the datafree pathway, specifically [FP8 Quantization](examples/quantization_w8a8_fp8/qwen3_vl_moe_fp8_example.py) (e.g channel-wise and block-wise quantization). NOTE: these models are not supported in tranformers<=4.56.2. You may need to install transformers from source. * **Quantization with Multiple Modifiers**: Multiple quantization modifiers can now be applied to the same model for mixed-precision quantization, for example applying AWQ W4A16 to a model's `self_attn` layers and GPTQ W8A8 to its `mlp` layers. 
This is an advanced usage of `llm-compressor` and an active area of research. See the [non-uniform quantization support](examples/quantization_non_uniform) section for more detail and [example usage](examples/quantization_non_uniform/quantization_multiple_modifiers.py). * **QuIP and SpinQuant-style Transforms**: The newly added [`QuIPModifier`](examples/transform/quip_example.py) and [`SpinQuantModifier`](examples/transform/spinquant_example.py) allow users to quantize their models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit weight and activation quantization. From 4a54ade0260cdb877617990d67ef2896e5eec4a0 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Thu, 20 Nov 2025 17:00:47 -0800 Subject: [PATCH 17/17] update Signed-off-by: yiliu30 --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 6c9b40981b..1cb34c1aaa 100644 --- a/README.md +++ b/README.md @@ -80,6 +80,7 @@ Applying quantization with `llmcompressor`: * [Weight only quantization to `fp4`](examples/quantization_w4a16_fp4/llama3_example.py) * [Weight only quantization to `int4` using GPTQ](examples/quantization_w4a16/README.md) * [Weight only quantization to `int4` using AWQ](examples/awq/README.md) +* [Weight only quantization to `int4` using AutoRound](examples/autoround/README.md) * [Quantizing MoE LLMs](examples/quantizing_moe/README.md) * [Quantizing Vision-Language Models](examples/multimodal_vision/README.md) * [Quantizing Audio-Language Models](examples/multimodal_audio/README.md)