Fix the hang issue of multimodal models when running with DP>1 and cudagraph_mode is FULL_DECODE_ONLY #4361

wujinyuan1 · 2025-11-22T13:04:00Z

What this PR does / why we need it?

When cudagraph_mode is set to FULL_DECODE_ONLY and dp > 1, the dummy-run path is triggered. That path calls update_attn_params, which needs a num_tokens argument, and the code obtained this value from positions.shape[0]. Multimodal models, however, use mRoPE (multimodal rotary position embedding), which makes positions a 2-D tensor of shape (3, num_tokens). As a result, positions.shape[0] returns 3 rather than the token count. This PR fixes the problem by passing num_tokens instead of positions.shape[0].
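To see why positions.shape[0] breaks, compare the two position layouts. This is a minimal, stdlib-only sketch that models tensors as nested Python lists so it runs without torch; the helper names (make_text_positions, make_mrope_positions, leading_dim, token_count) are hypothetical and not part of vLLM or vLLM Ascend. It assumes the vLLM-style mRoPE layout of shape (3, num_tokens).

```python
from typing import List, Union

# Hypothetical stand-in for the positions tensor: nested lists instead of torch tensors.
Positions = Union[List[int], List[List[int]]]


def make_text_positions(num_tokens: int) -> List[int]:
    # Standard RoPE: a 1-D tensor of shape (num_tokens,).
    return list(range(num_tokens))


def make_mrope_positions(num_tokens: int, sections: int = 3) -> List[List[int]]:
    # mRoPE (assumed layout): a 2-D tensor of shape (sections, num_tokens).
    return [list(range(num_tokens)) for _ in range(sections)]


def leading_dim(positions: Positions) -> int:
    # Equivalent of positions.shape[0]; equals the token count only for 1-D positions.
    return len(positions)


def token_count(positions: Positions) -> int:
    # Hypothetical helper: the token axis is the last dimension in both layouts.
    if positions and isinstance(positions[0], list):
        return len(positions[0])
    return len(positions)


num_tokens = 8
text_pos = make_text_positions(num_tokens)
mrope_pos = make_mrope_positions(num_tokens)

print(leading_dim(text_pos), leading_dim(mrope_pos))  # 8 3
print(token_count(text_pos), token_count(mrope_pos))  # 8 8
```

The PR does not derive the count from positions at all; it passes the num_tokens value the runner already holds, which avoids depending on the positions layout entirely.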

Does this PR introduce any user-facing change?

No.

How was this patch tested?

vLLM version: v0.11.0rc3
vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

vLLM version: v0.11.0
vLLM main: vllm-project/vllm@2918c1b

github-actions · 2025-11-22T13:04:08Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing, smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

gemini-code-assist

Code Review

This pull request addresses a hang issue for multimodal models when data parallelism is used. The fix involves using the correct num_tokens variable instead of positions.shape[0], which is incorrect for models with multi-dimensional rotary positional embeddings. The change is correct, but I've pointed out in a review comment that the fix is incomplete as a similar issue exists in another code path within the same function that was not addressed.

gemini-code-assist · 2025-11-22T13:05:05Z

vllm_ascend/worker/model_runner_v1.py

                if self.pcp_size * self.dcp_size > 1:
                    update_attn_dcp_pcp_params(self.update_stream,
                                               forward_context,
-                                               positions.shape[0])
+                                               num_tokens)
                else:
                    update_attn_params(self.update_stream, forward_context,
-                                       positions.shape[0])
+                                       num_tokens)


This change correctly replaces positions.shape[0] with num_tokens to fix the hang issue for multimodal models. However, the same logic error exists in the if self.vllm_config.model_config.use_mla: block on lines 2920-2931. To fully resolve the bug, positions.shape[0] should also be replaced with num_tokens in that block.
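The reviewer's point is that every parameter-update path in this function must receive num_tokens. The sketch below illustrates that principle with recorder stubs; update_mla_params, update_dcp_pcp_params, update_default_params and update_graph_params are hypothetical names, not the real vLLM Ascend functions (the real ones also take the update stream and forward context, omitted here). Checking the MLA flag first is an assumption, since the MLA block at lines 2920-2931 is not shown.

```python
from typing import Callable, List, Tuple

# Records (branch, num_tokens) so we can check which value each path received.
calls: List[Tuple[str, int]] = []


def _recorder(branch: str) -> Callable[[int], None]:
    def update(num_tokens: int) -> None:
        calls.append((branch, num_tokens))
    return update


# Hypothetical stand-ins for the three parameter-update paths.
update_mla_params = _recorder("mla")
update_dcp_pcp_params = _recorder("dcp_pcp")
update_default_params = _recorder("default")


def update_graph_params(num_tokens: int, use_mla: bool,
                        pcp_size: int, dcp_size: int) -> None:
    # Every branch receives the runner's num_tokens, never positions.shape[0].
    if use_mla:  # assumed ordering: MLA check first
        update_mla_params(num_tokens)
    elif pcp_size * dcp_size > 1:
        update_dcp_pcp_params(num_tokens)
    else:
        update_default_params(num_tokens)


for use_mla, pcp, dcp in [(True, 1, 1), (False, 2, 1), (False, 1, 1)]:
    update_graph_params(num_tokens=16, use_mla=use_mla, pcp_size=pcp, dcp_size=dcp)

print(calls)  # [('mla', 16), ('dcp_pcp', 16), ('default', 16)]
```

Asserting that every branch records the same value is a cheap way to catch a path that still reads the count from positions.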

Fix the hang issue when running with DP enabled in graph mode

6183cac

gemini-code-assist bot reviewed Nov 22, 2025


wjy9595 added 2 commits November 22, 2025 22:19

Fix the hang issue when running with DP enabled in graph mode

aadf93c

Merge remote-tracking branch 'origin/main'

9c7e275

wujinyuan1 closed this by deleting the head repository Nov 24, 2025
