[New blog post] continuous_batching #3176
Conversation
Cannot add reviewers here, so I will just tag @ArthurZucker @McPatate and @pcuenca for any feedback you might want to give! Thanks
LysandreJik
left a comment
Thanks Remi! I think the blog post is good. It's not easy to explain CB simply, and there's a clear through-line toward where you want to get the reader.
I left a few comments.
The main sentiment is: you're diving deep into CB, with chunked prefill and KV caching, yet you're starting from the very beginning (what is a token).
This means that you need to cover a significant spread of information to get the reader up to speed. Right now it feels like the early notions are covered, then some is skipped, and we dive in.
I'd encourage you to either try to explain all technical terms (things like prefill) when you introduce them, or to start with an assumption: "We assume the reader is familiar with the terms tokens, prefill, masking, [...]", with links to explanations of those terms.
continuous_batching.md
Outdated
> Consider the initial prompt "I am sure this project". It is tokenized as 7 tokens: `[<bos>, I, am, sure, this, pro, ject]`. The `<bos>` token is a special token that is added at the start of the prompt (BoS stands for Beginning of Sequence).
>
> As a broad-stroke picture, attention can be represented this way:
>
> [figure: attention overview diagram]
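For illustration, the tokenization step the quoted passage describes can be sketched with a toy tokenizer. The splitting rules below are hypothetical and exist only to reproduce the post's example; a real subword tokenizer (e.g. from `transformers`) learns its own boundaries.

```python
# Toy illustration of the tokenization described in the quoted passage.
# The rules are hypothetical: split on spaces, split "project" into the
# subwords "pro" + "ject", and prepend a Beginning-of-Sequence token.
BOS = "<bos>"

def toy_tokenize(prompt: str) -> list[str]:
    pieces = []
    for word in prompt.split():
        if word == "project":
            pieces.extend(["pro", "ject"])  # hypothetical subword split
        else:
            pieces.append(word)
    return [BOS] + pieces  # special token at the start of every prompt

tokens = toy_tokenize("I am sure this project")
# 7 tokens: ['<bos>', 'I', 'am', 'sure', 'this', 'pro', 'ject']
```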
I find this picture followed by the text a bit complex to go through as it displays a lot of terms we have not seen before. We've heard of tokens and attention, and now we have attention masks, scores, outputs, some Wq, some Wk, some Wv, etc.
Would it make sense to break this down in simpler blocks, and gradually build up to that image?
It's likely not a lot of work: your paragraphs are already split in a way where each concept is isolated. I think it's probably a matter of having isolated images to accompany each concept
Did just that, let me know if the breakdown is comprehensive enough!
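For readers following along, the pieces named in the discussion above (the Wq/Wk/Wv projections, attention scores, and the mask) compose as in this minimal NumPy sketch. Shapes and weights are illustrative, not the post's actual figures, and this is single-head attention without the usual multi-head plumbing.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 7, 16  # e.g. the 7-token prompt from the example

x = rng.standard_normal((seq_len, d_model))   # token embeddings
Wq = rng.standard_normal((d_model, d_model))  # query projection
Wk = rng.standard_normal((d_model, d_model))  # key projection
Wv = rng.standard_normal((d_model, d_model))  # value projection

Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Attention scores, scaled by sqrt(d_model)
scores = Q @ K.T / np.sqrt(d_model)

# Causal mask: each token attends only to itself and earlier tokens
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# Softmax over keys, then weighted sum of values -> attention output
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V  # shape (seq_len, d_model)
```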
merveenoyan
left a comment
just clarified a bit to avoid confusion, very cool blog 🙌🏻
Force-pushed from 91f7dbd to d6a5163
merveenoyan
left a comment
imo it's ready to merge after these comments!
> ## Introduction
maybe drop a small TL;DR here so people know what you will be talking about before reading a lengthy post 🙌🏻
Added this:
TL;DR: in this blog post, starting from attention mechanisms and KV caching, we derive continuous batching by optimizing for throughput.
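Since the TL;DR names KV caching as a starting point, here is a minimal sketch of the idea, assuming illustrative shapes and random weights (not the post's code): during decode, only the new token's key and value are computed, and past entries are reused from a growing cache instead of being recomputed every step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wk = rng.standard_normal((d, d))  # key projection (illustrative)
Wv = rng.standard_normal((d, d))  # value projection (illustrative)

k_cache, v_cache = [], []

def decode_step(new_token_emb):
    # Project only the newest token, then append to the cache.
    k_cache.append(new_token_emb @ Wk)
    v_cache.append(new_token_emb @ Wv)
    # Attention for the new token sees all cached keys/values.
    return np.stack(k_cache), np.stack(v_cache)

for _ in range(5):  # five decode steps
    K, V = decode_step(rng.standard_normal(d))
# The cache now holds one K/V row per processed token.
```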
Co-authored-by: Lysandre Debut <hi@lysand.re>
Co-authored-by: Merve Noyan <merve@huggingface.co>
Force-pushed from d6a5163 to 9ae26e5
pcuenca
left a comment
Great work, covering a lot of ground! Made a few suggestions, mostly for continuity and flow.
continuous_batching.md
Outdated
> ## Introduction
>
> If you've ever used Qwen, Claude, or any other AI chatbot, you've probably noticed something: it takes a while for the first word of the response to appear, and then words appear one-by-one on your screen with (hopefully) a regular and fast-paced frequency. That's because at the heart of it, all LLMs are just fancy next token predictors. During generation, first comes the **prefill** phase, where the model ingests your initial prompt to predict one new token. Then they ingest that token as well, to produce another new token: this is the **decoding** phase. The process is repeated until they feel like they have generated enough.
Do we need to introduce prefill here? I understand we want to explain it at some point, but it feels a bit tangential / distracting when all we are saying in the intro is that LLM generation is iterative and there are multiple implementation optimizations. I'd maybe just explain the sequential nature of next-token generation, and then proceed with the second paragraph.
I agree that it was a bit shoehorned in. I stuck to the auto-regressive explanation and a direct transition to the 2nd paragraph.
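The prefill-then-decode loop the quoted introduction describes can be sketched with a toy stand-in for the model. The "predictor" below is hypothetical (it just increments the last token and emits an EOS id of 99); a real LLM would return logits over a vocabulary, but the loop shape is the same.

```python
# Toy autoregressive generation: prefill ingests the whole prompt once,
# then decode feeds each new token back in, one step at a time.
def toy_next_token(tokens: list[int]) -> int:
    # Hypothetical rule: predict (last token + 1); emit EOS id 99 past 5.
    nxt = tokens[-1] + 1
    return 99 if nxt > 5 else nxt

def generate(prompt: list[int], max_new_tokens: int = 10) -> list[int]:
    tokens = list(prompt)
    # Prefill: the full prompt produces the first new token.
    tokens.append(toy_next_token(tokens))
    # Decode: each new token is fed back in to produce the next one.
    for _ in range(max_new_tokens - 1):
        if tokens[-1] == 99:  # EOS: the model has "generated enough"
            break
        tokens.append(toy_next_token(tokens))
    return tokens

generate([1, 2, 3])  # prefill once, then decode step by step
```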
continuous_batching.md
Outdated
> By removing the batch dimension and using attention masks to control token interactions, continuous batching allows mixing prefill and decode phases in the same batch, dramatically improving efficiency for serving multiple requests. This is why services like ChatGPT can handle thousands of concurrent users efficiently.
>
> In the next article in this series, we'll explore efficient KV cache management. If you'd like to see a deep dive on other continuous batching topics, please let us know in the comments!
Perhaps we could mention "paged attention" as one technique that's usually associated with CB, but not really a part of it, maybe as a potential future deep dive? (or linking to some other resource)
Added "through __paged attention__" after "we'll explore efficient KV cache management".
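The quoted paragraph says CB removes the batch dimension and relies on attention masks to control token interactions. One common way to realize that (an illustrative sketch, not the post's actual implementation) is a block-diagonal causal mask over requests packed along a single sequence axis: attention is causal *within* each request and blocked *across* requests.

```python
import numpy as np

# Sketch of a packed-batch attention mask: several requests are
# concatenated along one sequence axis (no batch dimension), and the
# mask keeps each request from attending to the others.
def packed_causal_mask(lengths: list[int]) -> np.ndarray:
    total = sum(lengths)
    allowed = np.zeros((total, total), dtype=bool)
    start = 0
    for n in lengths:
        # Causal (lower-triangular) block for one request.
        allowed[start:start + n, start:start + n] = np.tril(
            np.ones((n, n), dtype=bool)
        )
        start += n
    return allowed

# One request in prefill (4 prompt tokens) packed with one request in
# decode (1 new token): a single 5-token packed sequence.
mask = packed_causal_mask([4, 1])
```

With this mask, the decode token at position 4 attends only to itself, while the prefill tokens at positions 0-3 attend causally among themselves, which is what lets both phases share one forward pass.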
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
This PR adds the `continuous_batching` article, titled "Continuous batching from first principles". The drafts were reviewed by @McPatate and @ArthurZucker.
I followed the README, so the article should have the right format.