[Draft][Quantization][Feature] Add AWQ quantization in vllm-ascend. #4316
base: main
Conversation
Code Review
This pull request introduces support for AWQ quantization in vllm-ascend. The changes are well-structured, adding the necessary configurations and Ascend-specific implementations for AWQ, including linear and MoE layers. A key improvement is the added robustness in AscendRMSNorm to handle different quantization configurations without crashing. My review has identified a couple of redundant function calls within the new npu_fused_experts function that should be removed to improve performance.
vllm_ascend/quantization/awq/awq.py
Outdated
```python
# gmm1: gate_up_proj
hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
if not use_wna16:
    hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
```
vllm_ascend/quantization/awq/awq.py
Outdated
```python
hidden_states = torch_npu.npu_swiglu(hidden_states)
hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
if not use_wna16:
    hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
```
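The review's point in both hunks is the same: `npu_dynamic_quant` is called unconditionally and then called again inside the `if not use_wna16` branch, so the non-WNA16 path quantizes twice. The fix is to keep only the guarded call. A minimal pure-Python sketch of the corrected control flow, with a hypothetical `dynamic_quant` stand-in for `torch_npu.npu_dynamic_quant` (the real op needs Ascend NPU tensors):

```python
# Sketch of the corrected flow. `dynamic_quant` is a hypothetical stand-in
# for torch_npu.npu_dynamic_quant; the call counter makes the redundancy
# (or its absence) observable.
calls = {"count": 0}

def dynamic_quant(x):
    """Stand-in: return (int8-like quantized values, per-token scale)."""
    calls["count"] += 1
    scale = max(abs(v) for v in x) / 127.0
    return [round(v / scale) for v in x], scale

def gate_up_quant(hidden_states, use_wna16):
    # Corrected: quantize once, and only on the path that needs int8 inputs.
    pertoken_scale = None
    if not use_wna16:
        hidden_states, pertoken_scale = dynamic_quant(hidden_states)
    return hidden_states, pertoken_scale

out, scale = gate_up_quant([0.5, -1.0, 0.25], use_wna16=False)
assert calls["count"] == 1  # quantized exactly once, not twice
```

On the WNA16 path the activations stay in floating point and no scale is produced, which is why the unconditional first call in the original hunks was both redundant and wasteful.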
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
@paulyu12 this PR implements AWQ quantization and is currently under testing. Please take a look.
Validation is tracked in issue #4378.
Please rebase to main now.
DeepSeek-V3.1-AWQ. Signed-off-by: menogrey <1299267905@qq.com>
What this PR does / why we need it?
Add AWQ quantization in vllm-ascend. Most of the code refers to the sglang implementation (sgl-project/sglang#10158), and the new quantization adaptation follows the compressed-tensors PR: #4036.
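For background on what the new quantization path has to undo at load time: AWQ stores weights as packed 4-bit integers with per-group scales and zero points, and dequantization is `w = (q - zero) * scale` per group. A small illustrative pure-Python sketch of group-wise INT4 dequantization (not the Ascend kernel path this PR adds; real implementations unpack int32-packed 4-bit weights and run fused NPU kernels):

```python
# Illustrative group-wise AWQ-style dequantization: w = (q - zero) * scale.
# Each contiguous group of `group_size` weights shares one scale and one
# zero point, which is what keeps the 4-bit format accurate.

def dequantize_awq(qweight, scales, zeros, group_size):
    """qweight: flat list of 4-bit ints (0..15); scales/zeros: one per group."""
    out = []
    for i, q in enumerate(qweight):
        g = i // group_size  # index of the quantization group this weight is in
        out.append((q - zeros[g]) * scales[g])
    return out

# Two groups of 4 weights each, with distinct scales and zero points.
w = dequantize_awq(
    qweight=[8, 9, 7, 8, 0, 15, 8, 4],
    scales=[0.1, 0.2],
    zeros=[8, 8],
    group_size=4,
)
```

With zero point 8 (the midpoint of the unsigned 4-bit range), a stored value of 8 dequantizes to exactly 0.0 in either group, while the same stored value maps to different magnitudes depending on the group's scale.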
Does this PR introduce any user-facing change?
How was this patch tested?