-
-
Notifications
You must be signed in to change notification settings - Fork 11.6k
[Core] Send kv events from worker side to scheduler side #28309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[Core] Send kv events from worker side to scheduler side #28309
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a mechanism to send KV cache events from the worker side to the scheduler side, which is a crucial feature for connectors that generate these events. The changes are logical, but I've identified a critical bug in the lmcache_connector.py that could lead to event loss, and an incorrect type hint in kv_connector_model_runner_mixin.py. Addressing these issues will improve the correctness and maintainability of the new functionality.
vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py
Outdated
Show resolved
Hide resolved
fc375f8 to
499a425
Compare
Update comments: - vllm-project#28309 (review) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
This commit follows recommendation from @markmc to use a specific event property instead of piggybacking on stats. PR vllm-project#28309 adds the events property to KVConnectorOutput and this commit picks up the new property and uses it to pass the events from worker side to scheduler side. Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
|
@hickeyma I think you're missing aggregation code in @njhill Let's think how this change fits to the longer-term future.
At the time when I introduced worker->scheduler output aggregation (#19555), my initial suggestion was to allow abstract metadata to flow from worker connectors to the scheduler connector.
My view is that we should rename cc @sdavidbd |
I agree with this in general - e.g. in an earlier iteration of #26172 I found myself adding NIXL-specific semantics to the contents of
I'm not sure I agree with this, though:
But I do think it could be positive to add |
|
Thanks for the great breakdown @orozery.
I understand your point here, but I believe that adding more fields to |
|
I agree with the usefulness of My main point is actually on the other hand, for fields that are useful only for a single connector (and I think this may be the case here?), we should have an abstract field that each connector can use freely (similar to |
Agree 👍
Honestly, I don't yet fully understand the motivation for the lmcache connector to emit its own events (as per #28252 and LMCache/LMCache#1846) in addition to the KV events already emitted by vLLM ... so I don't have a strong sense either way whether this need is highly-specific to lmcache |
|
Thanks @njhill @orozery @markmc @NickLucche for the feedback. I will get to it in due course as bit delayed this week because I'm at KubeCon.
@markmc The reason that a connector like LMCache would need to generate their own events is down to specific details or information. How LMcache stores the tokens/hashes (e.g. LMcache stores tokens with a granularity of 256 as a block and vLLM granularity is 16 as a block) and where it stores them (e.g. LMCache might store them to disk) will differ to vLLM. Therefore, a consumer of the events would be missing the full context of the cache if it only consumed the vLLM events. |
I agree with @NickLucche about having a separate field. @orozery I understand you reservation about only 1 connector possibly using it but still think the separation is best to clearly identify it. |
On LMCache-side, the motivation for LMCache to emit |
|
@ApostaC Can you also take a look? |
|
I like the idea of separating the KVEvents. My understanding is that KVEvents is becoming a standard format of fine-grained KV cache monitoring, and there will be other components using it in the future. Therefore, leaving a standard interface for it makes a lot of sense to me. Regarding |
1763194 to
d6a23a6
Compare
Update comments: - vllm-project#28309 (review) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
|
Quick follow-up regarding the KV event aggregator: @hickeyma There is a class that combines worker processes' the connector outputs to a single output and send it back to the scheduler class: vllm/vllm/distributed/kv_transfer/kv_connector/utils.py Lines 123 to 125 in d8874c6
IIUC, since KV events should work for all the connectors, it's better to add such aggregation logic for KV events in cc @orozery @NickLucche @njhill (please correct me if my understanding is wrong). |
7bcaa13 to
7ad9600
Compare
Update comments: - vllm-project#28309 (review) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Update comments: - vllm-project#28309 (review) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
7a2e362 to
6487bda
Compare
Review comments: - vllm-project#28309 (comment) - vllm-project#28309 (comment) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
|
This pull request has merge conflicts that must be resolved before it can be |
Update comments: - vllm-project#28309 (review) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Clarify that it is the workser side that calls the method. Review comment: - vllm-project#28309 (comment) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Review comments: - vllm-project#28309 (comment) - vllm-project#28309 (comment) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
fc498d8 to
577abd2
Compare
|
I chatted with @orozery today and he raised 2 concerns as follows:
Conclusion: I spoke with @ApostaC about this and he agrees that events needs to be checked against the number of workers.
Conclusion: With that said, he is ok to allow the events field in @orozery Let me know if I captured your thoughts correctly. |
577abd2 to
7cbff40
Compare
Update comments: - vllm-project#28309 (review) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Clarify that it is the workser side that calls the method. Review comment: - vllm-project#28309 (comment) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Review comments: - vllm-project#28309 (comment) - vllm-project#28309 (comment) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Update comments: - vllm-project#28309 (review) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Clarify that it is the workser side that calls the method. Review comment: - vllm-project#28309 (comment) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Review comments: - vllm-project#28309 (comment) - vllm-project#28309 (comment) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
7cbff40 to
6973123
Compare
This is required for when worker side operations like CPU offloading generate KV cache events. This commit enables theses events to be passed to the scheduler side so that they can be published by the engine. Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Update comments: - vllm-project#28309 (review) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
The changes to the connector is for a separate PR and this PR is independent of it for now. Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
LMCache connector changes are not part of this PR as this PR deals with adding ability to pass KV events between worker and scheduler sides and not a specific connector usage of this. The connector implementation is in a separate PR. Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
It is part of the aggregation of kv_connector_output from all workers. For KV cache events, this means combining events from all workers, remvoing any duplications. Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Clarify that it is the workser side that calls the method. Review comment: - vllm-project#28309 (comment) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Equality methods are unnecessary as its provided by the parent class 'msgspec.Struct' which they extend from. Review comment: https://github.com/vllm-project/vllm/pull/28309/files#r2539044714 Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Review comments: - vllm-project#28309 (comment) - vllm-project#28309 (comment) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Workers will generate duplicate kv events in LMCache. This commit adds capability to aggregate on events from workers by returning only those that were emitted by all workers. It also provide an abstract class KVConnectorKVEevnts that is implemented by connectors to handle how they emit events. Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
fdf5784 to
db43f44
Compare
|
Updated the PR to handle duplicate KV events emitted by LMCache workers (as per pt 1. in #28309 (comment)). I have added an abstract class @njhill @orozery @markmc @NickLucche @ApostaC @KuntaiDu @sdavidbd I would appreciate if you could review again. Thanks. |
Purpose
This is required for when worker side operations from a connector generates KV cache events. This PR enables events to be passed to the scheduler side so that they can be published by the engine.
Related to #28252 and from feedback by @markmc in #28252 (comment)
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.