[New blog post] continuous_batching #3176
Conversation
Cannot add reviewers here, so I will just tag @ArthurZucker @McPatate and @pcuenca for any feedback you might want to give! Thanks
LysandreJik
left a comment
Thanks Remi! I think the blog post is good. It's not easy to explain CB simply, and there's a clear through-line toward where you want to get the reader.
I left a few comments.
The main sentiment is: you're diving deep into CB, with chunked prefill and KV caching, yet you're starting from the very beginning (what is a token).
This means that you need to cover a significant spread of information to get the reader up to speed. Right now it feels like the early notions are covered, then some is skipped, and we dive in.
I'd encourage you to either try to explain all technical terms (things like prefill) when you introduce them, or to start with an assumption: "We assume the reader is familiar with the terms tokens, prefill, masking, [...]", with links to explanations of those terms.
continuous_batching.md
Outdated
> Consider the initial prompt "I am sure this project". It is tokenized as 7 tokens: `[<bos>, I, am, sure, this, pro, ject]`. The `<bos>` token is a special token that is added at the start of the prompt (BoS stands for Beginning of Sequence).
>
> As a broad-stroke picture, attention can be represented this way:
>
> [figure: attention overview diagram]
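For illustration, the tokenization step the quoted passage describes can be sketched with a toy tokenizer. The splitting rules below are hypothetical and exist only to reproduce the post's example; a real subword tokenizer (e.g. from `transformers`) learns its own boundaries.

```python
# Toy illustration of the tokenization described in the quoted passage.
# The rules are hypothetical: split on spaces, split "project" into the
# subwords "pro" + "ject", and prepend a Beginning-of-Sequence token.
BOS = "<bos>"

def toy_tokenize(prompt: str) -> list[str]:
    pieces = []
    for word in prompt.split():
        if word == "project":
            pieces.extend(["pro", "ject"])  # hypothetical subword split
        else:
            pieces.append(word)
    return [BOS] + pieces  # special token at the start of every prompt

tokens = toy_tokenize("I am sure this project")
# 7 tokens: ['<bos>', 'I', 'am', 'sure', 'this', 'pro', 'ject']
```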
I find this picture followed by the text a bit complex to go through as it displays a lot of terms we have not seen before. We've heard of tokens and attention, and now we have attention masks, scores, outputs, some Wq, some Wk, some Wv, etc.
Would it make sense to break this down in simpler blocks, and gradually build up to that image?
It's likely not a lot of work: your paragraphs are already split in a way where each concept is isolated. I think it's probably a matter of having isolated images to accompany each concept
Did just that, let me know if the breakdown is comprehensive enough!
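For readers following along, the pieces named in the discussion above (the Wq/Wk/Wv projections, attention scores, and the mask) compose as in this minimal NumPy sketch. Shapes and weights are illustrative, not the post's actual figures, and this is single-head attention without the usual multi-head plumbing.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 7, 16  # e.g. the 7-token prompt from the example

x = rng.standard_normal((seq_len, d_model))   # token embeddings
Wq = rng.standard_normal((d_model, d_model))  # query projection
Wk = rng.standard_normal((d_model, d_model))  # key projection
Wv = rng.standard_normal((d_model, d_model))  # value projection

Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Attention scores, scaled by sqrt(d_model)
scores = Q @ K.T / np.sqrt(d_model)

# Causal mask: each token attends only to itself and earlier tokens
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# Softmax over keys, then weighted sum of values -> attention output
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V  # shape (seq_len, d_model)
```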
merveenoyan
left a comment
just clarified a bit to avoid confusion, very cool blog 🙌🏻
Force-pushed from 91f7dbd to d6a5163
merveenoyan
left a comment
imo it's ready to merge after these comments!
> ## Introduction
maybe drop a small TL;DR here so people know what you will be talking about before reading a lengthy post 🙌🏻
Added this:
TL;DR: in this blog post, starting from attention mechanisms and KV caching, we derive continuous batching by optimizing for throughput.
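Since the TL;DR names KV caching as a starting point, here is a minimal sketch of the idea, assuming illustrative shapes and random weights (not the post's code): during decode, only the new token's key and value are computed, and past entries are reused from a growing cache instead of being recomputed every step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wk = rng.standard_normal((d, d))  # key projection (illustrative)
Wv = rng.standard_normal((d, d))  # value projection (illustrative)

k_cache, v_cache = [], []

def decode_step(new_token_emb):
    # Project only the newest token, then append to the cache.
    k_cache.append(new_token_emb @ Wk)
    v_cache.append(new_token_emb @ Wv)
    # Attention for the new token sees all cached keys/values.
    return np.stack(k_cache), np.stack(v_cache)

for _ in range(5):  # five decode steps
    K, V = decode_step(rng.standard_normal(d))
# The cache now holds one K/V row per processed token.
```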
Co-authored-by: Lysandre Debut <hi@lysand.re>
Co-authored-by: Merve Noyan <merve@huggingface.co>
Force-pushed from d6a5163 to 9ae26e5
pcuenca
left a comment
Great work, covering a lot of ground! Made a few suggestions, mostly for continuity and flow.
continuous_batching.md
Outdated
> ## Introduction
>
> If you've ever used Qwen, Claude, or any other AI chatbot, you've probably noticed something: it takes a while for the first word of the response to appear, and then words appear one-by-one on your screen with (hopefully) a regular and fast-paced frequency. That's because at the heart of it, all LLMs are just fancy next token predictors. During generation, first comes the **prefill** phase, where the model ingests your initial prompt to predict one new token. Then they ingest that token as well, to produce another new token: this is the **decoding** phase. The process is repeated until they feel like they have generated enough.
Do we need to introduce prefill here? I understand we want to explain it at some point, but it feels a bit tangential / distracting when all we are saying in the intro is that LLM generation is iterative and there are multiple implementation optimizations. I'd maybe just explain the sequential nature of next-token generation, and then proceed with the second paragraph.
I agree that it was a bit shoehorned in. I stuck to the auto-regressive explanation and a direct transition to the 2nd paragraph.
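The prefill-then-decode loop the quoted introduction describes can be sketched with a toy stand-in for the model. The "predictor" below is hypothetical (it just increments the last token and emits an EOS id of 99); a real LLM would return logits over a vocabulary, but the loop shape is the same.

```python
# Toy autoregressive generation: prefill ingests the whole prompt once,
# then decode feeds each new token back in, one step at a time.
def toy_next_token(tokens: list[int]) -> int:
    # Hypothetical rule: predict (last token + 1); emit EOS id 99 past 5.
    nxt = tokens[-1] + 1
    return 99 if nxt > 5 else nxt

def generate(prompt: list[int], max_new_tokens: int = 10) -> list[int]:
    tokens = list(prompt)
    # Prefill: the full prompt produces the first new token.
    tokens.append(toy_next_token(tokens))
    # Decode: each new token is fed back in to produce the next one.
    for _ in range(max_new_tokens - 1):
        if tokens[-1] == 99:  # EOS: the model has "generated enough"
            break
        tokens.append(toy_next_token(tokens))
    return tokens

generate([1, 2, 3])  # prefill once, then decode step by step
```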
continuous_batching.md
Outdated
> By removing the batch dimension and using attention masks to control token interactions, continuous batching allows mixing prefill and decode phases in the same batch, dramatically improving efficiency for serving multiple requests. This is why services like ChatGPT can handle thousands of concurrent users efficiently.
>
> In the next article in this series, we'll explore efficient KV cache management. If you'd like to see a deep dive on other continuous batching topics, please let us know in the comments!
Perhaps we could mention "paged attention" as one technique that's usually associated with CB, but not really a part of it, maybe as a potential future deep dive? (or linking to some other resource)
Added "through __paged attention__" after "we'll explore efficient KV cache management".
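The quoted paragraph says CB removes the batch dimension and relies on attention masks to control token interactions. One common way to realize that (an illustrative sketch, not the post's actual implementation) is a block-diagonal causal mask over requests packed along a single sequence axis: attention is causal *within* each request and blocked *across* requests.

```python
import numpy as np

# Sketch of a packed-batch attention mask: several requests are
# concatenated along one sequence axis (no batch dimension), and the
# mask keeps each request from attending to the others.
def packed_causal_mask(lengths: list[int]) -> np.ndarray:
    total = sum(lengths)
    allowed = np.zeros((total, total), dtype=bool)
    start = 0
    for n in lengths:
        # Causal (lower-triangular) block for one request.
        allowed[start:start + n, start:start + n] = np.tril(
            np.ones((n, n), dtype=bool)
        )
        start += n
    return allowed

# One request in prefill (4 prompt tokens) packed with one request in
# decode (1 new token): a single 5-token packed sequence.
mask = packed_causal_mask([4, 1])
```

With this mask, the decode token at position 4 attends only to itself, while the prefill tokens at positions 0-3 attend causally among themselves, which is what lets both phases share one forward pass.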
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
This PR adds the `continuous_batching` article, titled "Continuous batching from first principles". The drafts were reviewed by @McPatate and @ArthurZucker.
I followed the README, so the article should have the right format.