
Commit 86e8b95

Merge branch 'main' into INFERENG-1867
2 parents 8108e51 + b5c9758 commit 86e8b95

28 files changed: +275 -136 lines changed

.github/workflows/linkcheck.yml

Lines changed: 8 additions & 2 deletions
@@ -13,10 +13,16 @@ jobs:
   markdown-link-check:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
-      - uses: umbrelladocs/action-linkspector@v1
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Run linkspector
+        uses: umbrelladocs/action-linkspector@v1
         with:
           github_token: ${{ secrets.github_token }}
           reporter: github-pr-review
           fail_on_error: true
+          filter_mode: nofilter
           config_file: '.github/workflows/linkspector/linkspector.yml'
+          show_stats: true
+          level: info

.github/workflows/quality-check.yaml

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ jobs:
     steps:
       - uses: actions/setup-python@v5
         with:
-          python-version: '3.9'
+          python-version: '3.10'
       - uses: actions/checkout@v4
       - name: "⚙️ Install dependencies"
         run: pip3 install .[dev]

README.md

Lines changed: 1 addition & 1 deletion
@@ -85,7 +85,7 @@ Applying quantization with `llmcompressor`:
 
 ### User Guides
 Deep dives into advanced usage of `llmcompressor`:
-* [Quantizing with large models with the help of `accelerate`](examples/big_models_with_accelerate/README.md)
+* [Quantizing large models with sequential onloading](examples/big_models_with_sequential_onloading/README.md)
 
 
 ## Quick Tour
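
The sequential-onloading guide now linked from the README covers quantizing models that do not fit in GPU memory. As a rough orientation only (not part of this commit), the flow it describes looks roughly like the sketch below; the model ID, dataset, and sample counts are placeholders.

```python
# Hypothetical sketch of the flow the linked guide describes (not part of this
# commit): keep the model on CPU and let llmcompressor's sequential pipeline
# onload one layer at a time to the GPU during calibration, so models larger
# than a single GPU can still be quantized.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"  # placeholder model ID

# No device_map: weights stay on CPU until each layer is onloaded for calibration.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",  # illustrative calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

SAVE_DIR = "Llama-3.3-70B-Instruct-W4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```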

docs/developer/contributing.md

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ If not, please file a new issue, providing as much relevant information as possi
 
 ### Pull Requests & Code Reviews
 
-Please check the PR checklist in the [PR template](.github/PULL_REQUEST_TEMPLATE.md) for detailed guide for contribution.
+Please check the PR checklist in the [PR template](../../.github/PULL_REQUEST_TEMPLATE.md) for detailed guide for contribution.
 
 ### Thank You
 
docs/getting-started/compress.md

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ LLM Compressor provides a straightforward way to compress your models using vari
 
 Before you begin, ensure that your environment meets the following prerequisites:
 - **Operating System:** Linux (recommended for GPU support)
-- **Python Version:** 3.9 or newer
+- **Python Version:** 3.10 or newer
 - **Available GPU:** For optimal performance, it's recommended to use a GPU. LLM Compressor supports the latest PyTorch and CUDA versions for compatibility with NVIDIA GPUs.
 
 ## Select a Model and Dataset

docs/getting-started/deploy.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ vLLM is a high-performance inference engine designed for large language models,
 
 Before deploying your model, ensure you have the following prerequisites:
 - **Operating System:** Linux (recommended for GPU support)
-- **Python Version:** 3.9 or newer
+- **Python Version:** 3.10 or newer
 - **Available GPU:** For optimal performance, it's recommended to use a GPU. vLLM supports a range of accelerators, including NVIDIA GPUs, AMD GPUs, TPUs, and other accelerators.
 - **vLLM Installed:** Ensure you have vLLM installed. You can install it using pip:
 ```bash
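
Once a model is compressed, the deploy guide touched above serves it with vLLM. For orientation only (not part of this diff), a minimal offline-inference sketch might look like the following; the checkpoint name is a placeholder.

```python
# Illustrative only (not from this diff): offline inference on a quantized
# checkpoint with vLLM's Python API. The model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/your-quantized-model")  # placeholder checkpoint
sampling = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["What is model quantization?"], sampling)
print(outputs[0].outputs[0].text)
```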

docs/getting-started/install.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ LLM Compressor can be installed using several methods depending on your requirem
 Before installing LLM Compressor, ensure you have the following prerequisites:
 
 - **Operating System:** Linux (recommended for GPU support)
-- **Python Version:** 3.9 or newer
+- **Python Version:** 3.10 or newer
 - **Pip Version:** Ensure you have the latest version of pip installed. You can upgrade pip using the following command:
 
 ```bash

docs/guides/compression_schemes.md

Lines changed: 5 additions & 5 deletions
@@ -6,17 +6,17 @@ Post-training quantization is performed to reduce the precision of quantizable w
 
 | Scheme | Description | Data required? | vLLM Hardware Compatibility |
 |--------|-------------|----------------|-----------------------------|
-| **[W8A8-FP8](../examples/quantization_w8a8_fp8.md)** | 8-bit floating point (FP8) quantization for weights and activations, providing ~2X smaller weights with 8-bit arithmetic operations. Uses channel-wise quantization to compress weights to 8 bits, and uses dynamic per-token or static per-tensor quantization to compress activations to 8 bits. Weights scales that are generated on a per-channel basis or generated on a per-tensor basis are also possible. Channel-wise weights quantization with dynamic per-token activations is the the most performant option. W8A8-FP8 does not require a calibration dataset. Activation quantization is carried out during inference on vLLM. Good for general performance and compression, especially for server and batch inference. | No calibration dataset is required, unless you are doing static per-tensor activation quantization. | Latest NVIDIA GPUs (Hopper and later) and latest AMD GPUs. Recommended for NVIDIA GPUs with compute capability >=9.0 (Hopper and Blackwell) |Recommended for NVIDIA GPUs with compute capability >=9.0 (Hopper and Blackwell) |
-| **[W8A8-INT8](../examples/quantization_w8a8_int8.md)** | 8-bit integer (INT8) quantization for weights and activations, providing ~2X smaller weights with 8-bit arithmetic operations. Uses channel-wise quantization to compress weights to 8 bits using GPTQ, and uses dynamic per-token quantization to compress activations to 8 bits. Weight quantization can be both per-tensor or per-channel for INT8. W8A8-INT8 is good for general performance and compression, especially for server and batch inference. Activation quantization is carried out during inference on vLLM. Activations can be static or dynamic. Additionally, INT8 activations can also be asymmetric. W8A8-INT8 helps improve speed in high QPS scenarios or during offline serving with vLLM. W8A8-INT8 is good for general performance and compression, especially for server and batch inference. | Requires calibration dataset for weight quantization and static per-tensor activation quantization. | Supports all NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators. Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older). |
-| **[W4A16](../examples/quantization_w4a16/README.md)** | Quantizes only weights to 4-bit integer (INT4) precision, retaining activations in 16-bit floating point (FP16) precision. W4A16 provides ~3.7X smaller weights but requires 16-bit arithmetic operations. W4A16 also supports asymmetric weight quantization. W4A16 provides maximum compression for latency-sensitive applications with limited memory, and useful speed ups in low QPS regimes with more weight compression. The linked example leverages the GPTQ algorithm to decrease quantization loss, but other algorithms like [AWQ](../examples/awq/awq_one_shot.py) can also be leveraged for W4A16 quantization. Recommended for any GPU types. | Requires a calibration dataset. | All NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators |
+| **[W8A8-FP8](../../examples/quantization_w8a8_fp8/README.md)** | 8-bit floating point (FP8) quantization for weights and activations, providing ~2X smaller weights with 8-bit arithmetic operations. Uses channel-wise quantization to compress weights to 8 bits, and uses dynamic per-token or static per-tensor quantization to compress activations to 8 bits. Weights scales that are generated on a per-channel basis or generated on a per-tensor basis are also possible. Channel-wise weights quantization with dynamic per-token activations is the the most performant option. W8A8-FP8 does not require a calibration dataset. Activation quantization is carried out during inference on vLLM. Good for general performance and compression, especially for server and batch inference. | No calibration dataset is required, unless you are doing static per-tensor activation quantization. | Latest NVIDIA GPUs (Hopper and later) and latest AMD GPUs. Recommended for NVIDIA GPUs with compute capability >=9.0 (Hopper and Blackwell) |
+| **[W8A8-INT8](../../examples/quantization_w8a8_int8/README.md)** | 8-bit integer (INT8) quantization for weights and activations, providing ~2X smaller weights with 8-bit arithmetic operations. Uses channel-wise quantization to compress weights to 8 bits using GPTQ, and uses dynamic per-token quantization to compress activations to 8 bits. Weight quantization can be both per-tensor or per-channel for INT8. W8A8-INT8 is good for general performance and compression, especially for server and batch inference. Activation quantization is carried out during inference on vLLM. Activations can be static or dynamic. Additionally, INT8 activations can also be asymmetric. W8A8-INT8 helps improve speed in high QPS scenarios or during offline serving with vLLM. W8A8-INT8 is good for general performance and compression, especially for server and batch inference. | Requires calibration dataset for weight quantization and static per-tensor activation quantization. | Supports all NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators. Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older). |
+| **[W4A16](../../examples/quantization_w4a16/README.md)** | Quantizes only weights to 4-bit integer (INT4) precision, retaining activations in 16-bit floating point (FP16) precision. W4A16 provides ~3.7X smaller weights but requires 16-bit arithmetic operations. W4A16 also supports asymmetric weight quantization. W4A16 provides maximum compression for latency-sensitive applications with limited memory, and useful speed ups in low QPS regimes with more weight compression. The linked example leverages the GPTQ algorithm to decrease quantization loss, but other algorithms like [AWQ](../../examples/awq/llama_example.py) can also be leveraged for W4A16 quantization. Recommended for any GPU types. | Requires a calibration dataset. | All NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators |
 | **W8A16** | Encodes model weights in 8‑bit integers and activations in 16‑bit integers. W8A16 compression delivers smaller model output size than FP32 and is faster at inferencing on hardware with native 8‑bit integer units. Lower power and memory bandwidth compared to floating‑point.| Requires a calibration dataset. | All NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators |
 | **NVFP4** | 4-bit floating point encoding format introduced with the NVIDIA Blackwell GPU architecture. NVFP4 maintains numerical accuracy across a wide dynamic range of tensor values by using high-precision scale encoding and a two-level micro-block scaling strategy. NVFP4 compression generates a global scale for each tensor, along with local quantization scales for groups of 16 elements. Global scale and local quantization scales are generated for weights and activations. You cannot change the group size. | Requires a calibration dataset. | All NVIDIA Blackwell GPUs or later |
-| [**W8A8-FP8_BLOCK**](../examples/quantization_w8a8_fp8/fp8_block_example.py)| Uses block-wise quantization to compress weights to FP8 in (commonly 128×128 tiles), and dynamic per-token-group (128) quantization for activations. Does not require a calibration dataset. Activation quantization is carried out during inference on vLLM. | Requires a calibration dataset. | All NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators |
+| [**W8A8-FP8_BLOCK**](../../examples/quantization_w8a8_fp8/fp8_block_example.py)| Uses block-wise quantization to compress weights to FP8 in (commonly 128×128 tiles), and dynamic per-token-group (128) quantization for activations. Does not require a calibration dataset. Activation quantization is carried out during inference on vLLM. | Does not require a calibration dataset. | All NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators |
 
 ## Sparsification Compression
 Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include:
 
 | Scheme | Description | Data required? | vLLM Hardware Compatibility |
 |--------|-------------|----------------|-----------------------------|
-| **[2:4 Semi-structured Sparsity](../examples/sparse_2of4_quantization_fp8.md)** | Uses semi-structured sparsity (SparseGPT), where, for every four contiguous weights in a tensor, two are set to zero. Uses channel-wise quantization to compress weights to 8 bits and dynamic per-token quantization to compress activations to 8 bits. Useful for better inference than W8A8-FP8, with almost no drop in its [evaluation score](https://neuralmagic.com/blog/24-sparse-llama-fp8-sota-performance-for-nvidia-hopper-gpus/). Small models may experience accuracy drops when the remaining non-zero weights are insufficient to recapitulate the original distribution. | Requires a calibration dataset. | Supported on all NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators. Recommended for compute capability >=9.0 (Hopper and Blackwell) |
+| **[2:4 Semi-structured Sparsity](../../examples/sparse_2of4_quantization_fp8/README.md)** | Uses semi-structured sparsity (SparseGPT), where, for every four contiguous weights in a tensor, two are set to zero. Uses channel-wise quantization to compress weights to 8 bits and dynamic per-token quantization to compress activations to 8 bits. Useful for better inference than W8A8-FP8, with almost no drop in its [evaluation score](https://neuralmagic.com/blog/24-sparse-llama-fp8-sota-performance-for-nvidia-hopper-gpus/). Small models may experience accuracy drops when the remaining non-zero weights are insufficient to recapitulate the original distribution. | Requires a calibration dataset. | Supported on all NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators. Recommended for compute capability >=9.0 (Hopper and Blackwell) |
 | **Unstructured Sparsity** | Unstructured sparsity quantization zeros out individual weights in the model without enforcing a regular pattern. Unlike block or channel pruning, it removes weights wherever they contribute least, yielding a fine‑grained sparse matrix. | Does not require a calibration dataset. | All NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators |
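
The W8A8-FP8 scheme described in the table above is data-free. As a minimal sketch (not part of this commit) of what applying it with llmcompressor typically looks like, with a placeholder model ID:

```python
# A minimal sketch (not part of this commit) of producing the W8A8-FP8 scheme
# from the table above: channel-wise FP8 weights with dynamic per-token
# activations, which needs no calibration data. Model ID is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model ID

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: per-channel FP8 weights + dynamic per-token FP8 activations.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Data-free: no dataset argument is needed for this scheme.
oneshot(model=model, recipe=recipe)

SAVE_DIR = "Meta-Llama-3-8B-Instruct-FP8-Dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```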

docs/index.md

Lines changed: 4 additions & 4 deletions
@@ -39,16 +39,16 @@ Review the [LLM Compressor v0.8.0 release notes](https://github.com/vllm-project
 ## Recent Updates
 
 !!! info "QuIP and SpinQuant-style Transforms"
-    The newly added [`QuIPModifier`](examples/transform/quip_example.py) and [`SpinQuantModifier`](examples/transform/spinquant_example.py) allow you to quantize models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit-weight and activation quantization.
+    The newly added [`QuIPModifier`](../examples/transform/quip_example.py) and [`SpinQuantModifier`](../examples/transform/spinquant_example.py) allow you to quantize models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit-weight and activation quantization.
 
 !!! info "DeepSeekV3-style Block Quantization Support"
-    Allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](examples/quantization_w8a8_fp8.md).
+    Allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](../examples/quantization_w8a8_fp8/fp8_block_example.py).
 
 !!! info "FP4 Quantization - now with MoE and non-uniform support"
-    Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the [NVFP4 configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [FP4 activation support](examples/quantization_w4a4_fp4/llama3_example.py), [MoE support](examples/quantization_w4a4_fp4/qwen_30b_a3b.py), and [Non-uniform quantization support](examples/quantization_non_uniform) where some layers are selectively quantized to FP8 for better recovery. You can also mix other quantization schemes, such as INT8 and INT4.
+    Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the [NVFP4 configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [FP4 activation support](../examples/quantization_w4a4_fp4/llama3_example.py), [MoE support](../examples/quantization_w4a4_fp4/qwen_30b_a3b.py), and [Non-uniform quantization support](../examples/quantization_non_uniform/README.md) where some layers are selectively quantized to FP8 for better recovery. You can also mix other quantization schemes, such as INT8 and INT4.
 
 !!! info "Llama4 Quantization Support"
-    Quantize a Llama4 model to [W4A16](examples/quantization_w4a16.md) or [NVFP4](examples/quantization_w4a16.md). The checkpoint produced can seamlessly run in vLLM.
+    Quantize a Llama4 model to [W4A16](../examples/quantization_w4a16) or [NVFP4](../examples/quantization_w4a4_fp4/llama4_example.py). The checkpoint produced can seamlessly run in vLLM.
 
 For more information, check out the [latest release on GitHub](https://github.com/vllm-project/llm-compressor/releases/latest).
 
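The transform-plus-quantization flow mentioned in the release notes above can be sketched roughly as follows; this is an assumption-laden illustration (the module path and arguments may differ between llmcompressor versions), not code from this commit.

```python
# Assumption-heavy sketch (not code from this commit) of pairing a
# SpinQuant-style transform with weight quantization, as the release note
# describes; import path and arguments may differ between versions.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import SpinQuantModifier

recipe = [
    # Inject rotations first to smooth outliers, then quantize the weights.
    SpinQuantModifier(rotations=["R1", "R2"], transform_type="hadamard"),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

# `model` would be an already-loaded Hugging Face causal LM:
# oneshot(model=model, dataset="open_platypus", recipe=recipe)
```
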
examples/quantization_w4a4_fp4/qwen3_next_example.py

Lines changed: 1 addition & 2 deletions
@@ -5,8 +5,7 @@
 from llmcompressor.modifiers.quantization import QuantizationModifier
 from llmcompressor.utils import dispatch_for_generation
 
-# NOTE: Qwen3-Next-80B-A3B-Instruct support is not in transformers<=4.56.2
-# you may need to install transformers from source
+# NOTE: Requires a minimum of transformers 4.57.0
 
 MODEL_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct"
 
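
The hunk above shows only the top of the example. For context, a typical NVFP4 one-shot flow looks roughly like the hedged sketch below; this is not the file's actual remainder, and the dataset and sample counts are illustrative.

```python
# Hedged sketch of a typical NVFP4 one-shot flow (not the actual remainder of
# qwen3_next_example.py); dataset and sample counts are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation

MODEL_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NVFP4: FP4 weights and activations with local scales per group of 16 elements.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",  # illustrative calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Spread the quantized model across available GPUs for a quick sanity check.
dispatch_for_generation(model)
sample = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**sample, max_new_tokens=32)[0]))
```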
