-
Notifications
You must be signed in to change notification settings - Fork 622
[Fix] Remove unnecessary NPU synchronization in MTP proposer #4325
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…formances. Removing this synchronization point improves pipeline efficiency by allowing for better overlap between CPU and NPU operations.. A more proper one is already implemented in vllm-project#4233. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
|
This is a follow up of former discussion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request proposes removing a torch.npu.synchronize() call within the speculative decoding loop in MtpProposer. This change aims to enhance performance by allowing greater overlap between CPU-based metadata preparation and NPU computation. Based on my analysis, the explicit synchronization is redundant because data dependencies are managed by the NPU stream, and a more suitable synchronization point is already in place for graph-based execution. Therefore, this modification is a sound performance optimization, and I have not identified any potential correctness issues.
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR removes an unnecessary NPU synchronization call from the MTP proposer to improve pipeline efficiency by enabling better overlap between CPU and NPU operations. The PR description indicates that a more appropriate synchronization mechanism is already implemented in PR #4233.
- Removed
torch.npu.synchronize()call from the metadata preparation loop in the_proposemethod
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
MengqingCao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
There was an irrlevant lint error. |
…oject#4325) ### What this PR does / why we need it? Remove unnecessary NPU synchronization in MTP proposer to improve performances. Removing this synchronization point improves pipeline efficiency by allowing for better overlap between CPU and NPU operations. A more proper one is already implemented in vllm-project#4233 ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Signed-off-by: nsdie <yeyifan@huawei.com>
…oject#4325) ### What this PR does / why we need it? Remove unnecessary NPU synchronization in MTP proposer to improve performances. Removing this synchronization point improves pipeline efficiency by allowing for better overlap between CPU and NPU operations. A more proper one is already implemented in vllm-project#4233 ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Signed-off-by: nsdie <yeyifan@huawei.com>
…oject#4325) ### What this PR does / why we need it? Remove unnecessary NPU synchronization in MTP proposer to improve performances. Removing this synchronization point improves pipeline efficiency by allowing for better overlap between CPU and NPU operations. A more proper one is already implemented in vllm-project#4233 ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Signed-off-by: nsdie <yeyifan@huawei.com>
…oject#4325) ### What this PR does / why we need it? Remove unnecessary NPU synchronization in MTP proposer to improve performances. Removing this synchronization point improves pipeline efficiency by allowing for better overlap between CPU and NPU operations. A more proper one is already implemented in vllm-project#4233 ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Signed-off-by: nsdie <yeyifan@huawei.com>
…oject#4325) ### What this PR does / why we need it? Remove unnecessary NPU synchronization in MTP proposer to improve performances. Removing this synchronization point improves pipeline efficiency by allowing for better overlap between CPU and NPU operations. A more proper one is already implemented in vllm-project#4233 ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Signed-off-by: nsdie <yeyifan@huawei.com>
…oject#4325) ### What this PR does / why we need it? Remove unnecessary NPU synchronization in MTP proposer to improve performances. Removing this synchronization point improves pipeline efficiency by allowing for better overlap between CPU and NPU operations. A more proper one is already implemented in vllm-project#4233 ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None. - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
What this PR does / why we need it?
Remove unnecessary NPU synchronization in MTP proposer to improve performances.
Removing this synchronization point improves pipeline efficiency by allowing for better overlap between CPU and NPU operations. A more proper one is already implemented in #4233
Does this PR introduce any user-facing change?
None.
How was this patch tested?
None.