
Commit 1cdf9ff

[Bugfix] fix hang in async scheduling (#4233)
### What this PR does / why we need it?

After #4113, there is no synchronization between steps. However, in async scheduling with aclgraph, it is possible that the CPU's record event for the current iteration completes before the previous iteration's graph execution has finished. If the CPU is fast enough, the device will hang on event_wait in iteration i+1 (assuming that event_record executes immediately on the device's update stream):

![timeline before the fix: event_wait in iteration i+1 hangs](https://github.com/user-attachments/assets/373fe655-afe5-4d7d-807e-b0aacf24a543)

After adding the synchronization, the record is launched only after the graph replay:

![timeline after the fix: record launched after graph replay](https://github.com/user-attachments/assets/a8a68053-bd7d-49f5-a79c-9a26ef1285cc)

The bubble time caused by the synchronization is about 85 us on G8600:

![profile showing the ~85 us synchronization bubble](https://github.com/user-attachments/assets/968611ee-f39a-4329-8150-1c4adba25dd1)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main: vllm-project/vllm@2918c1b

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Co-authored-by: hwhaokun <haokun0405@163.com>
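As a reading aid, here is a minimal sketch of the ordering the fix enforces. It assumes an Ascend environment with torch_npu installed; `update_stream`, `replay_with_sync`, and the explicit `event` argument are illustrative stand-ins for the vllm-ascend internals (in the real code the record happens inside update_attn_params), not the actual API.

```python
# Minimal sketch of the synchronize-before-replay ordering; assumes an Ascend
# NPU environment with torch_npu installed. Names below are illustrative
# stand-ins, not the real vllm-ascend internals.
import torch
import torch_npu  # noqa: F401  # registers the torch.npu backend

update_stream = torch.npu.Stream()

def replay_with_sync(graph, event):
    # Without this synchronize, a fast CPU can record `event` for iteration
    # i+1 while iteration i's replay is still running on the device; the
    # event_wait captured inside the graph then hangs.
    torch.npu.synchronize()
    with torch.npu.stream(update_stream):
        event.record()  # stands in for the record inside update_attn_params
    graph.replay()      # the captured graph waits on `event` internally
```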
1 parent 91b6ba8 commit 1cdf9ff

File tree

2 files changed: +29 −1 lines changed


tests/e2e/singlecard/test_ascend_scheduler.py

Lines changed: 23 additions & 1 deletion
```diff
@@ -128,7 +128,7 @@ def test_chunked_prefill_with_scheduler_dynamic_batch(
     )
 
 
-def test_async_scheduling() -> None:
+def test_async_scheduling_eager() -> None:
     prompts = [
         "Hello, my name is",
         "The president of the United States is",
@@ -148,3 +148,25 @@ def test_async_scheduling() -> None:
             async_scheduling=True,
     ) as vllm_model:
         vllm_model.generate(prompts, sampling_params=sampling_params)
+
+
+def test_async_scheduling_with_full_graph() -> None:
+    prompts = [
+        "Hello, my name is",
+        "The president of the United States is",
+        "The capital of France is",
+        "The future of AI is",
+    ] * 10
+    sampling_params = SamplingParams(temperature=0.2,
+                                     max_tokens=10,
+                                     stop_token_ids=None)
+
+    with VllmRunner("Qwen/Qwen3-8B",
+                    max_model_len=4096,
+                    max_num_seqs=50,
+                    dtype="bfloat16",
+                    gpu_memory_utilization=0.9,
+                    async_scheduling=True,
+                    compilation_config={"cudagraph_mode":
+                                        "FULL"}) as vllm_model:
+        vllm_model.generate(prompts, sampling_params=sampling_params)
```
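To exercise only the new full-graph case locally, a minimal runner sketch (assuming a vllm-ascend dev checkout with pytest installed and an Ascend device available) could look like:

```python
# Run just the new full-graph async-scheduling test; assumes a vllm-ascend
# dev checkout with pytest installed and an Ascend device available.
import pytest

raise SystemExit(pytest.main([
    "tests/e2e/singlecard/test_ascend_scheduler.py::test_async_scheduling_with_full_graph",
    "-v",
]))
```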

vllm_ascend/compilation/acl_graph.py

Lines changed: 6 additions & 0 deletions
```diff
@@ -186,6 +186,12 @@ def __call__(self, *args, **kwargs):
                 f"got {new_input_addresses}")
 
         logger.info_once("Replaying aclgraph")
+        # In async scheduling or multi-threaded (MT) scenarios, it is possible that
+        # the CPU's record event (from update_attn_params) for iteration i completes
+        # before the graph replay of iteration i-1.
+        # To ensure proper ordering, we must call synchronize here before replaying,
+        # so that update_attn_params only executes after the previous graph replay has fully completed.
+        torch.npu.synchronize()
         entry.aclgraph.replay()
         return entry.output
 
```
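The fix trades a small per-step bubble (about 85 us on G8600, per the profile above) for correct ordering. A rough way to gauge that fixed cost on your own hardware is to time an idle synchronize; this is a sketch assuming torch_npu is installed, and it approximates rather than reproduces the reported bubble:

```python
# Roughly gauge the fixed cost of torch.npu.synchronize() on idle streams;
# assumes an Ascend environment with torch_npu installed.
import time
import torch
import torch_npu  # noqa: F401  # registers the torch.npu backend

torch.npu.synchronize()  # drain any pending work first
t0 = time.perf_counter()
torch.npu.synchronize()  # an idle synchronize approximates the fixed overhead
print(f"synchronize overhead: {(time.perf_counter() - t0) * 1e6:.1f} us")
```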
