[task] Add fused gdn gating triton kernel #4304

OsirisDuan · 2025-11-20T07:50:19Z

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

vLLM version: v0.11.2
vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: Ascendyh <hw7osiris@outlook.com>

github-actions · 2025-11-20T07:50:27Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing; smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

gemini-code-assist

Code Review

This pull request introduces a fused Triton kernel for GDN gating to optimize performance on Ascend hardware. The overall approach is sound, but I've identified a few critical issues that need to be addressed. There is a breaking import in qwen3_next.py that will cause a runtime error. More importantly, the new Triton kernel in fused_gdn_gating.py does not correctly handle non-contiguous tensors, which could lead to incorrect calculations. I've provided detailed comments and code suggestions to resolve these problems. I also included a suggestion to investigate a performance tuning parameter that appears suboptimal.

gemini-code-assist · 2025-11-20T07:51:46Z

vllm_ascend/models/qwen3_next.py

-    fused_gdn_gating)
+    Qwen3NextGatedDeltaNet, Qwen3NextModel, Qwen3NextSparseMoeBlock)

+from vllm_ascend.ops.fla import fused_sigmoid_gating_delta_rule_update


This import will cause a runtime error because fused_sigmoid_gating_delta_rule_update is not defined in vllm_ascend.ops.fla. Furthermore, this function is not used anywhere in the file. This line should be removed.

gemini-code-assist · 2025-11-20T07:51:46Z

vllm_ascend/ops/fused_gdn_gating.py

+def fused_gdn_gating(
+    A_log: torch.Tensor,
+    a: torch.Tensor,
+    dt_bias: torch.Tensor,
+    beta: float = 1.0,
+    threshold: float = 20.0,
+) -> torch.Tensor:
+    batch, num_heads = a.shape
+    seq_len = 1
+    NUM_BATCH_GROUPS = batch
+    BLK_BATCHES = 1
+    if batch > 40:
+        BLK_BATCHES = triton.next_power_of_2(triton.cdiv(batch, 32))
+        NUM_BATCH_GROUPS = triton.cdiv(batch, BLK_BATCHES)
+
+    grid = (NUM_BATCH_GROUPS, seq_len, triton.cdiv(num_heads, 8))
+    g = torch.empty_like(a, dtype=torch.float32)
+    fused_gdn_gating_kernel[grid](g,
+                                  A_log,
+                                  a,
+                                  dt_bias,
+                                  seq_len,
+                                  num_heads,
+                                  batch,
+                                  beta,
+                                  threshold,
+                                  8,
+                                  BLK_BATCHES=BLK_BATCHES,
+                                  num_warps=1)
+    return g
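
The launch configuration in the diff above can be traced by hand. The following is a minimal pure-Python sketch of that arithmetic, not part of the PR: `cdiv` and `next_power_of_2` are local stand-ins for `triton.cdiv` and `triton.next_power_of_2`, `launch_config` is a hypothetical helper name, and `num_heads=32` is an assumed value chosen only for illustration.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for positive integers (stand-in for triton.cdiv)."""
    return (a + b - 1) // b


def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n, for n >= 1 (stand-in for triton.next_power_of_2)."""
    return 1 << (n - 1).bit_length()


def launch_config(batch: int, num_heads: int, blk_heads: int = 8):
    """Mirror the grid computation in the wrapper; returns (grid, BLK_BATCHES)."""
    seq_len = 1
    num_batch_groups = batch
    blk_batches = 1
    if batch > 40:
        # Each program handles a power-of-two number of batch rows.
        blk_batches = next_power_of_2(cdiv(batch, 32))
        num_batch_groups = cdiv(batch, blk_batches)
    grid = (num_batch_groups, seq_len, cdiv(num_heads, blk_heads))
    return grid, blk_batches


# num_heads=32 is an assumption for illustration only.
for batch in (8, 40, 41, 64, 65, 1024):
    print(batch, launch_config(batch, num_heads=32))
```

With this logic the first grid dimension equals `batch` up to 40 and never exceeds 32 beyond that, because `BLK_BATCHES` is at least `batch / 32`.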


The Triton kernel fused_gdn_gating_kernel implicitly assumes that the input tensor a and the output tensor g are contiguous: it performs pointer arithmetic without taking strides into account. However, a may be non-contiguous, depending on the operations that produced it (a transpose or a slice, for example). By default torch.empty_like(a) preserves a's strides when a is dense and non-overlapping, so g can be non-contiguous as well. Either case can lead to incorrect memory accesses and wrong results.

To fix this, make a contiguous before launching the kernel and allocate g as a new contiguous tensor.

def fused_gdn_gating(
    A_log: torch.Tensor,
    a: torch.Tensor,
    dt_bias: torch.Tensor,
    beta: float = 1.0,
    threshold: float = 20.0,
) -> torch.Tensor:
    a = a.contiguous()
    batch, num_heads = a.shape
    seq_len = 1
    NUM_BATCH_GROUPS = batch
    BLK_BATCHES = 1
    if batch > 40:
        BLK_BATCHES = triton.next_power_of_2(triton.cdiv(batch, 32))
        NUM_BATCH_GROUPS = triton.cdiv(batch, BLK_BATCHES)

    grid = (NUM_BATCH_GROUPS, seq_len, triton.cdiv(num_heads, 8))
    g = torch.empty((batch, num_heads), dtype=torch.float32, device=a.device)
    fused_gdn_gating_kernel[grid](g,
                                  A_log,
                                  a,
                                  dt_bias,
                                  seq_len,
                                  num_heads,
                                  batch,
                                  beta,
                                  threshold,
                                  8,
                                  BLK_BATCHES=BLK_BATCHES,
                                  num_warps=1)
    return g
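
To illustrate why ignoring strides matters, here is a small pure-Python model of strided indexing. It is a sketch under assumptions, not the kernel itself: the kernel body is not shown in this PR excerpt, so the offset formula `row * num_heads + col` is assumed from the review's description, and `flat_index` is a hypothetical helper.

```python
def flat_index(row: int, col: int, strides: tuple) -> int:
    """Element offset of (row, col) in the underlying buffer for the given strides."""
    return row * strides[0] + col * strides[1]


# A logical (batch=2, num_heads=3) tensor that is the transpose of a
# contiguous (3, 2) buffer, so its true strides are (1, 2).
storage = [0, 1, 2, 3, 4, 5]
batch, num_heads = 2, 3
true_strides = (1, 2)
# Assumed kernel behaviour: offsets computed as row * num_heads + col.
assumed_strides = (num_heads, 1)


def read(buffer, strides):
    """Read the full (batch, num_heads) tensor from buffer using strides."""
    return [[buffer[flat_index(i, j, strides)] for j in range(num_heads)]
            for i in range(batch)]


correct = read(storage, true_strides)       # what the tensor really contains
misread = read(storage, assumed_strides)    # what a stride-unaware reader sees
print(correct)   # [[0, 2, 4], [1, 3, 5]]
print(misread)   # [[0, 1, 2], [3, 4, 5]]

# Copying into row-major order (the effect of making the tensor contiguous)
# makes the assumed offsets valid again.
contiguous_storage = [v for row in correct for v in row]
assert read(contiguous_storage, assumed_strides) == correct
```

The same reasoning applies to the output: if g inherits non-contiguous strides, values written at row-major offsets land in the wrong logical positions.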

gemini-code-assist · 2025-11-20T07:51:46Z

vllm_ascend/ops/fused_gdn_gating.py

+                                  threshold,
+                                  8,
+                                  BLK_BATCHES=BLK_BATCHES,
+                                  num_warps=1)


Using num_warps=1 is likely to be suboptimal for performance. Triton kernels, especially memory-bound ones, benefit from having multiple warps to hide memory latency (memory-level parallelism). While this kernel has some compute-intensive operations (exp, log), it still involves significant data movement. A single warp might underutilize the hardware's execution units and memory bandwidth.

I recommend increasing num_warps (e.g., to 4 or 8) and benchmarking to find the optimal value for your target hardware.

Suggested change

-                                  num_warps=1)
+                                  num_warps=4)
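
The benchmarking step recommended above could be organised as a small parameter sweep. This is a hedged sketch only: `sweep` and `fake_run` are hypothetical names, and the stand-in workload does not launch any kernel. In a real measurement, `run` would launch the kernel with the given `num_warps` and wait for the device to finish before returning, otherwise the host clock measures only launch overhead.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def sweep(run: Callable[[int], None],
          candidates: Iterable[int],
          warmup: int = 3,
          iters: int = 20) -> Tuple[int, Dict[int, float]]:
    """Time run(value) for each candidate; return (fastest value, mean seconds per call)."""
    timings: Dict[int, float] = {}
    for value in candidates:
        for _ in range(warmup):
            run(value)  # warm-up calls absorb one-time costs such as compilation
        start = time.perf_counter()
        for _ in range(iters):
            run(value)
        timings[value] = (time.perf_counter() - start) / iters
    best = min(timings, key=timings.get)
    return best, timings


def fake_run(num_warps: int) -> None:
    """Stand-in workload (assumption); replace with a synchronised kernel launch."""
    sum(range(10_000))


best, timings = sweep(fake_run, candidates=(1, 2, 4, 8))
print(best, timings)
```

Whether a larger num_warps helps on the target hardware is exactly what such a sweep is meant to settle; the value 4 in the suggestion is a starting point rather than a measured optimum.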

Add fused gdn gating

1d8165f

Signed-off-by: Ascendyh <hw7osiris@outlook.com>

github-actions bot added the module:ops label Nov 20, 2025

gemini-code-assist bot reviewed Nov 20, 2025


OsirisDuan changed the title ~~Add fused gdn gating~~ [task] Add fused gdn gating triton kernel Nov 20, 2025

[fix] UB overflow bugfix

d19d822
