
Conversation

@remi-or
Collaborator

@remi-or remi-or commented Nov 19, 2025

This PR adds the continuous_batching article titled "Continuous batching from first principles"
The drafts were reviewed by @McPatate and @ArthurZucker
I followed the README so the article should have the right format.

@remi-or
Collaborator Author

remi-or commented Nov 19, 2025

Cannot add reviewers here, so I will just tag @ArthurZucker @McPatate and @pcuenca for any feedback you might want to give! Thanks

Member

@LysandreJik LysandreJik left a comment


Thanks Remi! I think the blog post is good. It's not easy to explain CB simply, and there's a clear through-line guiding the reader to where you want them to end up.

I left a few comments.

The main sentiment is: you're diving deep into CB, with chunked prefill and KV caching, yet you're starting from the very beginning (what is a token).

This means that you need to cover a significant spread of information to get the reader up to speed. Right now it feels like the early notions are covered, then some are skipped, and then we dive in.

I'd encourage you to either explain all technical terms (things like prefill) as you introduce them, or to start with an assumption: "We assume the reader is familiar with the terms tokens, prefill, masking, [...]", with links to explanations of those terms.

Consider the initial prompt "I am sure this project". It is tokenized as 7 tokens: `[<bos>, I, am, sure, this, pro, ject]`. The `<bos>` token is a special token that is added at the start of the prompt (BoS stands for Beginning of Sequence).
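To make the example concrete, here is a toy sketch of that lookup (the vocabulary and the `pro`/`ject` split are hypothetical, not a real tokenizer):

```python
# Toy sketch of tokenization: a hypothetical vocabulary mapping each piece
# to an id, with the special <bos> token prepended to every prompt.
toy_vocab = {"<bos>": 0, "I": 1, "am": 2, "sure": 3, "this": 4, "pro": 5, "ject": 6}

def toy_tokenize(pieces):
    """Prepend <bos>, then look each piece up in the toy vocabulary."""
    return [toy_vocab["<bos>"]] + [toy_vocab[p] for p in pieces]

ids = toy_tokenize(["I", "am", "sure", "this", "pro", "ject"])
print(ids)  # [0, 1, 2, 3, 4, 5, 6] -> 7 tokens, <bos> included
```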
As a broad-stroke picture, attention can be represented this way:

![attention.png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/continuous_batching/attention.png)
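The picture can also be read as a few lines of code. Below is a minimal single-head sketch (random weights, NumPy instead of a real model) of the same pipeline: project with `Wq`, `Wk`, `Wv`, score, mask, softmax, mix:

```python
import numpy as np

def toy_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention, mirroring the diagram (a sketch)."""
    # Project the token embeddings into queries, keys and values.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Scaled dot-product scores between every pair of tokens.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Causal attention mask: a token may not attend to later tokens.
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # The attention output mixes the values according to the weights.
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 8))                       # 7 tokens, embedding dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = toy_attention(x, Wq, Wk, Wv)                # shape (7, 8), one row per token
```

Because of the causal mask, the first token's output depends only on its own value vector.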
Member


I find this picture followed by the text a bit complex to go through as it displays a lot of terms we have not seen before. We've heard of tokens and attention, and now we have attention masks, scores, outputs, some Wq, some Wk, some Wv, etc.

Would it make sense to break this down in simpler blocks, and gradually build up to that image?

Member


It's likely not a lot of work: your paragraphs are already split in a way where each concept is isolated. I think it's probably a matter of having isolated images to accompany each concept.

Collaborator Author


Did just that, let me know if the breakdown is comprehensive enough!

Contributor

@merveenoyan merveenoyan left a comment


just clarified a bit to avoid confusion, very cool blog 🙌🏻

@remi-or remi-or force-pushed the remi-or/continuous-batching branch 2 times, most recently from 91f7dbd to d6a5163 Compare November 21, 2025 10:56
Contributor

@merveenoyan merveenoyan left a comment


imo it's ready to merge after these comments!

![Title card](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/continuous_batching/banner.png)

## Introduction

Contributor

@merveenoyan merveenoyan Nov 25, 2025


maybe drop a small TL;DR here so people know what you will be talking about before reading a lengthy post 🙌🏻

Collaborator Author


Added this:
TL;DR: in this blog post, starting from attention mechanisms and KV caching, we derive continuous batching by optimizing for throughput.

remi-or and others added 10 commits November 25, 2025 10:06
Co-authored-by: Lysandre Debut <hi@lysand.re>
Co-authored-by: Merve Noyan <merve@huggingface.co>
Co-authored-by: Merve Noyan <merve@huggingface.co>
Co-authored-by: Merve Noyan <merve@huggingface.co>
Co-authored-by: Merve Noyan <merve@huggingface.co>
@remi-or remi-or force-pushed the remi-or/continuous-batching branch from d6a5163 to 9ae26e5 Compare November 25, 2025 09:27
Member

@pcuenca pcuenca left a comment


Great work, covering a lot of ground! Made a few suggestions, mostly for continuity and flow.


## Introduction

If you've ever used Qwen, Claude, or any other AI chatbot, you've probably noticed something: it takes a while for the first word of the response to appear, and then words appear one by one on your screen with (hopefully) a regular and fast-paced frequency. That's because at heart, all LLMs are just fancy next-token predictors. During generation, first comes the **prefill** phase, where the model ingests your initial prompt to predict one new token. Then it ingests that token as well to produce another new token: this is the **decoding** phase. The process repeats until the model decides it has generated enough.
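The prefill/decode loop sketched above fits in a few lines. In this sketch, `model` is a hypothetical stand-in for a real LLM: any function that maps a list of token ids to the id of the predicted next token.

```python
# Hypothetical sketch of autoregressive generation; `model` is assumed to
# map a list of token ids to the next token id.
def generate(model, prompt_ids, max_new_tokens, eos_id):
    ids = list(prompt_ids)
    # Prefill: ingest the whole prompt to predict the first new token.
    next_id = model(ids)
    # Decode: feed each new token back in to predict the one after it.
    for _ in range(max_new_tokens):
        ids.append(next_id)
        if next_id == eos_id:
            break
        next_id = model(ids)
    return ids

# Toy "model" that always predicts last-token-plus-one:
ids = generate(lambda t: t[-1] + 1, prompt_ids=[1, 2], max_new_tokens=10, eos_id=5)
print(ids)  # [1, 2, 3, 4, 5]
```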
Member


Do we need to introduce prefill here? I understand we want to explain it at some point, but it feels a bit tangential / distracting when all we are saying in the intro is that LLM generation is iterative and there are multiple implementation optimizations. I'd maybe just explain the sequential nature of next-token generation, and then proceed with the second paragraph.

Collaborator Author

@remi-or remi-or Nov 25, 2025


I agree that it was a bit shoehorned in. I stuck to the autoregressive explanation and a direct transition to the 2nd paragraph.


By removing the batch dimension and using attention masks to control token interactions, continuous batching allows mixing prefill and decode phases in the same batch, dramatically improving efficiency for serving multiple requests. This is why services like ChatGPT can handle thousands of concurrent users efficiently.
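The mask-based packing described here can be sketched in a few lines. This is an illustrative toy (function name and NumPy representation are ours, not the Transformers implementation): requests are concatenated along a single sequence axis, and a block-causal mask keeps them from attending to each other.

```python
import numpy as np

def packed_causal_mask(seq_lens):
    """Sketch: attention mask for several requests packed along one
    sequence axis (no batch dimension). True means "may attend". Each
    token sees only earlier tokens of its own request, so prefill and
    decode tokens can share a single forward pass."""
    total = sum(seq_lens)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for n in seq_lens:
        # Causal (lower-triangular) block for this request only.
        mask[start:start + n, start:start + n] = np.tril(np.ones((n, n), dtype=bool))
        start += n
    return mask

# A 4-token prefill packed with a 1-token decode step of another request:
m = packed_causal_mask([4, 1])
```

Note that `m[4, 0]` is False: the decode token of the second request cannot see the first request's tokens, even though they share the batch.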

In the next article in this series, we'll explore efficient KV cache management. If you'd like to see a deep dive on other continuous batching topics, please let us know in the comments!
Member


Perhaps we could mention "paged attention" as one technique that's usually associated with CB, but not really a part of it, maybe as a potential future deep dive? (or link to some other resource)

Collaborator Author


Added "through __paged attention__" after "we'll explore efficient KV cache management".

remi-or and others added 3 commits November 25, 2025 11:41
@McPatate McPatate merged commit 794abb4 into huggingface:main Nov 25, 2025
1 check passed