
Conversation


@1092626063 1092626063 commented Nov 22, 2025

What this PR does / why we need it?

This PR is cherry-picked from #2958 and #4340.

Past:
npu_moe_gating_top_k only supports the 'group_count=256' pattern.

Now:
1. npu_moe_gating_top_k supports all sizes of group_count
2. the functionality of torch_npu.npu_moe_gating_top_k_softmax is included in torch_npu.npu_moe_gating_top_k

CANN: depends on 8.3.RC1

Performance:

  1. GLM4.5-w8a8: TPS improves by 6%
  2. Qwen3: same as before
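
For orientation, here is a minimal unfused PyTorch sketch of the grouped top-k routing that the fused torch_npu.npu_moe_gating_top_k kernel computes in a single call. The function and variable names are illustrative, and the softmax-then-group-selection order is an assumption drawn from this description, not the exact kernel contract:

```python
import torch


def grouped_topk_reference(router_logits: torch.Tensor,
                           top_k: int,
                           num_expert_group: int,
                           topk_group: int):
    """Unfused reference for grouped top-k gating (illustrative sketch only)."""
    num_tokens, num_experts = router_logits.shape
    experts_per_group = num_experts // num_expert_group

    # Softmax over all experts, roughly the scoring that the softmax variant
    # provided and that the fused operator now also covers.
    scores = torch.softmax(router_logits, dim=-1)

    # Score each expert group by its best expert and keep the topk_group groups.
    group_scores = scores.view(num_tokens, num_expert_group,
                               experts_per_group).max(dim=-1).values
    group_ids = torch.topk(group_scores, k=topk_group, dim=-1).indices

    # Mask out experts that do not belong to a selected group.
    group_mask = torch.zeros_like(group_scores)
    group_mask.scatter_(1, group_ids, 1.0)
    expert_mask = group_mask.repeat_interleave(experts_per_group, dim=-1)

    masked_scores = scores.masked_fill(expert_mask == 0, 0.0)
    topk_weights, topk_ids = torch.topk(masked_scores, k=top_k, dim=-1)
    return topk_weights, topk_ids


# Example: 4 tokens, 256 experts split into 8 groups, route to 8 experts
# chosen from the best 4 groups.
weights, ids = grouped_topk_reference(torch.randn(4, 256),
                                      top_k=8,
                                      num_expert_group=8,
                                      topk_group=4)
```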

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for a fused gatingtopk MoE operator. The changes introduce a check function, _check_npu_moe_gating_top_k, to determine whether the fused operator can be used, and refactor the expert selection logic to use this check. The tests have also been updated to reflect these changes.

My review has identified a critical bug in _check_npu_moe_gating_top_k (and its duplicated version in the test file) that could lead to a ZeroDivisionError. The order of checks is incorrect, using num_expert_group as a divisor before validating it. Please see the detailed comments for the fix.

Comment on lines 99 to 108
if top_k < 1 or \
        top_k > (hidden_states.shape[-1] / (num_expert_group * topk_group)):
    return False
if topk_group < 1 or topk_group > num_expert_group:
    return False
if topk_group * hidden_states.shape[-1] / num_expert_group < top_k:
    return False
if not (num_expert_group > 0 and hidden_states.shape[-1] % num_expert_group
        == 0 and hidden_states.shape[-1] // num_expert_group > 2):
    return False

critical

This function has a potential ZeroDivisionError. The variable num_expert_group is used as a divisor on lines 100 and 104, but it is only checked to be greater than 0 on line 106. If num_expert_group is 0, this will cause a crash. To fix this, the check for num_expert_group > 0 should be performed before any division by it. It would also be better to import and use the _check_npu_moe_gating_top_k function from vllm_ascend.ops.moe.experts_selector to avoid code duplication and ensure consistency.

Suggested change
if top_k < 1 or \
        top_k > (hidden_states.shape[-1] / (num_expert_group * topk_group)):
    return False
if topk_group < 1 or topk_group > num_expert_group:
    return False
if topk_group * hidden_states.shape[-1] / num_expert_group < top_k:
    return False
if not (num_expert_group > 0 and hidden_states.shape[-1] % num_expert_group
        == 0 and hidden_states.shape[-1] // num_expert_group > 2):
    return False
if not (num_expert_group > 0 and hidden_states.shape[-1] % num_expert_group
        == 0 and hidden_states.shape[-1] // num_expert_group > 2):
    return False
if top_k < 1 or \
        top_k > (hidden_states.shape[-1] / (num_expert_group * topk_group)):
    return False
if topk_group < 1 or topk_group > num_expert_group:
    return False
if topk_group * hidden_states.shape[-1] / num_expert_group < top_k:
    return False
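
Following the reviewer's deduplication suggestion, the test could import the shared helper and add a regression case for the zero-group input. This is a sketch: the argument names and order are inferred from the diff, and the real helper may take additional parameters.

```python
import torch

# Reuse the production check instead of keeping a duplicated copy whose
# validation order can drift (module path taken from the review comment).
from vllm_ascend.ops.moe.experts_selector import _check_npu_moe_gating_top_k


def test_check_rejects_zero_expert_group():
    hidden_states = torch.randn(8, 256)  # placeholder shape: 8 tokens, 256 experts
    # With the checks reordered, num_expert_group == 0 should return False
    # rather than raising ZeroDivisionError.
    assert _check_npu_moe_gating_top_k(hidden_states,
                                       top_k=8,
                                       num_expert_group=0,
                                       topk_group=1) is False
```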

Comment on lines 114 to 123
if top_k < 1 or \
        top_k > (hidden_states.shape[-1] / (num_expert_group * topk_group)):
    return False
if topk_group < 1 or topk_group > num_expert_group:
    return False
if topk_group * hidden_states.shape[-1] / num_expert_group < top_k:
    return False
if not (num_expert_group > 0 and hidden_states.shape[-1] % num_expert_group
        == 0 and hidden_states.shape[-1] // num_expert_group > 2):
    return False

critical

There is a potential ZeroDivisionError in this function. The variable num_expert_group is used as a divisor on lines 115 and 119 before it is checked to be greater than 0 on line 121. If num_expert_group is 0, the code will crash. The check for num_expert_group > 0 should be moved before it is used in any division operations to prevent this.

Suggested change
if top_k < 1 or \
        top_k > (hidden_states.shape[-1] / (num_expert_group * topk_group)):
    return False
if topk_group < 1 or topk_group > num_expert_group:
    return False
if topk_group * hidden_states.shape[-1] / num_expert_group < top_k:
    return False
if not (num_expert_group > 0 and hidden_states.shape[-1] % num_expert_group
        == 0 and hidden_states.shape[-1] // num_expert_group > 2):
    return False
if not (num_expert_group > 0 and hidden_states.shape[-1] % num_expert_group
        == 0 and hidden_states.shape[-1] // num_expert_group > 2):
    return False
if top_k < 1 or \
        top_k > (hidden_states.shape[-1] / (num_expert_group * topk_group)):
    return False
if topk_group < 1 or topk_group > num_expert_group:
    return False
if topk_group * hidden_states.shape[-1] / num_expert_group < top_k:
    return False

@1092626063 1092626063 changed the title support gatingtopk [refactor]support gatingtopk operator generalization Nov 22, 2025
@1092626063 1092626063 force-pushed the gatingtopk0110 branch 4 times, most recently from dc446ff to d517c57 Compare November 24, 2025 08:52
@wangxiyuan wangxiyuan added the ready (read for review) and ready-for-test (start test by label for PR) labels Nov 24, 2025
@1092626063 1092626063 force-pushed the gatingtopk0110 branch 3 times, most recently from b3ad464 to 4e6de1e Compare December 3, 2025 01:55
…lm-project#4050)

### What this PR does / why we need it?
pick from: vllm-project#2958
Past:
npu_moe_gating_top_k only supports the 'group_count=256' pattern.

Now:
1. npu_moe_gating_top_k supports all sizes of group_count
2. the functionality of `torch_npu.npu_moe_gating_top_k_softmax` is
included in `torch_npu.npu_moe_gating_top_k`

CANN: depends on 8.3.RC1

Performance:
1. GLM4.5-w8a8: TPS improves by 6%
2. Qwen3: same as before


Signed-off-by: 1092626063 <1092626063@qq.com>
@wangxiyuan wangxiyuan merged commit c4a11a7 into vllm-project:v0.11.0-dev Dec 4, 2025
16 checks passed