
Commit 3085f9f

sayakpaul, cbensimon, and pcuenca authored
Add regional AoT compilation (#3057)
* aot comments.
* up
* up
* add a section on reusing a compiled model.
* toc.
* Update zerogpu-aoti.md
  Co-authored-by: Charles <charles@huggingface.co>
* up
* up
* up
* Update zerogpu-aoti.md
  Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

---------

Co-authored-by: Charles <charles@huggingface.co>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
1 parent e4c8141 commit 3085f9f

File tree

1 file changed: +37 −2 lines


zerogpu-aoti.md

Lines changed: 37 additions & 2 deletions
@@ -33,6 +33,8 @@ In this post, we’ll show how to wire up Ahead-of-Time (AoT) compilation in Zer
 - [Dynamic shapes](#dynamic-shapes)
 - [Multi-compile / shared weights](#multi-compile--shared-weights)
 - [FlashAttention-3](#flashattention-3)
+- [Regional compilation](#regional-compilation)
+- [Use a compiled graph from the Hub](#use-a-compiled-graph-from-the-hub)
 - [AoT compiled ZeroGPU Spaces demos](#aot-compiled-zerogpu-spaces-demos)
 - [Conclusion](#conclusion)
 - [Resources](#resources)
@@ -340,6 +342,34 @@ It tries to load a kernel from the [`kernels-community/vllm-flash-attn3`](https:
 
 Here is a [fully working example of an FA3 attention processor](https://gist.github.com/sayakpaul/ff715f979793d4d44beb68e5e08ee067#file-fa3_qwen-py) for the Qwen-Image model.
 
+### Regional compilation
+
+So far, we have been compiling the full model. Depending on the model, full-model compilation can lead to significantly long cold-start times, which make the development experience unpleasant.
+
+We can instead compile _regions_ within a model, significantly reducing cold-start times while retaining almost all the benefits of full-model compilation. Regional compilation is promising when a model has repeated blocks of computation: a standard language model, for example, is made up of a number of identically structured Transformer blocks.
+
+In our example, we can compile the repeated blocks of the Flux transformer ahead of time and propagate the compiled graph to the remaining repeated blocks. The [Flux Transformer](https://github.com/huggingface/diffusers/blob/c2e5ece08bf22d249c62e964f91bc326cf9e3759/src/diffusers/models/transformers/transformer_flux.py) has two kinds of repeated blocks: `FluxTransformerBlock` and `FluxSingleTransformerBlock`.
+
+You can check out [this Space](https://huggingface.co/spaces/cbensimon/FLUX.1-dev-fa3-aoti/tree/main) for a complete example.
+
+> [!TIP]
+> 💡 For Flux.1-Dev, switching to regional compilation cuts the compilation time from _6 minutes_ to just _30 seconds_ while delivering identical speedups.
+
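As a minimal sketch of how the regional path can look in code; this assumes the `spaces.aoti_capture` / `spaces.aoti_compile` / `spaces.aoti_apply` helpers from earlier in the post can target a sub-module, and that one compiled block can be re-applied to its identically structured siblings. The Space linked above has the exact working recipe.

```python
import spaces
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

@spaces.GPU(duration=1500)
def compile_transformer_block():
    # Record real example inputs for one repeated block by running the
    # pipeline once while intercepting that block's forward call.
    blocks = pipe.transformer.transformer_blocks
    with spaces.aoti_capture(blocks[0]) as call:
        pipe("arbitrary example prompt")

    # Export and AoT-compile a single FluxTransformerBlock region.
    exported = torch.export.export(blocks[0], args=call.args, kwargs=call.kwargs)
    return spaces.aoti_compile(exported)

compiled_block = compile_transformer_block()

# Propagate the compiled graph to every identically structured block.
# (Assumption: one compiled artifact can be re-applied per block; the
# FluxSingleTransformerBlock regions in `single_transformer_blocks`
# would be handled the same way.)
for block in pipe.transformer.transformer_blocks:
    spaces.aoti_apply(compiled_block, block)
```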
+### Use a compiled graph from the Hub
+
+Once a model (or even a model block) is compiled ahead of time, we can serialize the compiled graph module as an artifact and reuse it later. In the context of a ZeroGPU-powered demo on Spaces, this significantly cuts down the demo startup time by skipping compilation entirely.
+
+To keep the storage footprint light, we can save just the compiled model graph, without including any model parameters in the artifact.
+
+Check out [this collection](https://huggingface.co/collections/zerogpu-aoti/using-compiled-graph-from-the-hub-68c2afcc03de7609f9f91e35), which shows the full workflow of obtaining a compiled model graph, pushing it to the Hub, and then using it to build a demo.
+
+
 ## AoT compiled ZeroGPU Spaces demos
 
 ### Speedup comparison
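To make that workflow concrete, here is a minimal sketch using PyTorch's AOTInductor packaging utilities and `huggingface_hub`. The repo id is hypothetical, `exported` is the `torch.export.ExportedProgram` from the previous sketch, and the weight-exclusion config knob is an assumption; the collection linked above shows the actual end-to-end recipe.

```python
import torch
from huggingface_hub import hf_hub_download, upload_file

# One-off step: package the compiled graph and push it to the Hub.
# `exported` is the ExportedProgram from the regional-compilation sketch.
package_path = torch._inductor.aoti_compile_and_package(
    exported,
    package_path="flux-transformer-block.pt2",
    # Assumed knob to keep model parameters out of the artifact, so only
    # the compiled graph is stored (keeps the upload light).
    inductor_configs={"aot_inductor.package_constants_in_so": False},
)
upload_file(
    path_or_fileobj=package_path,
    path_in_repo="flux-transformer-block.pt2",
    repo_id="your-username/flux-aoti-artifacts",  # hypothetical repo
)

# At demo startup: download the artifact and load it, skipping compilation.
local_path = hf_hub_download(
    repo_id="your-username/flux-aoti-artifacts",
    filename="flux-transformer-block.pt2",
)
compiled_block = torch._inductor.aoti_load_package(local_path)
```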
@@ -350,7 +380,11 @@ Here is a [fully working example of an FA3 attention processor](https://gist.git
 - [FLUX.1 Kontext](https://huggingface.co/spaces/zerogpu-aoti/FLUX.1-Kontext-Dev)
 - [QwenImage Edit](https://huggingface.co/spaces/multimodalart/Qwen-Image-Edit-Fast)
 - [Wan 2.2](https://huggingface.co/spaces/zerogpu-aoti/wan2-2-fp8da-aoti-faster)
-- [LTX Video](https://huggingface.co/spaces/zerogpu-aoti/ltx-dev-fast)
+
+### Regional compilation
+- [Regional compilation recipe](https://docs.pytorch.org/tutorials/recipes/regional_compilation.html)
+- [Native integration in Diffusers](https://huggingface.co/docs/diffusers/main/en/optimization/fp16)
+- [More performance numbers](https://pytorch.org/blog/torch-compile-and-diffusers-a-hands-on-guide-to-peak-performance/)
 
 ## Conclusion
 
@@ -363,6 +397,7 @@ We demonstrate speedups with Flux.1-Dev, but these techniques are not limited to
 - Visit our [ZeroGPU-AOTI org on the Hub](https://huggingface.co/zerogpu-aoti) to refer to a collection of demos that leverage the techniques discussed in this post.
 - Browse `spaces.aoti_*` APIs [source code](https://pypi-browser.org/package/spaces/spaces-0.40.1-py3-none-any.whl/spaces/zero/torch/aoti.py) to learn more about the interface
 - Check out [Kernels Community org on the hub](https://huggingface.co/kernels-community)
+- Learn more about regional compilation from [here](https://pytorch.org/tutorials/recipes/regional_compilation.html)
 - Upgrade to [Pro](https://huggingface.co/pro) on Hugging Face to create your own ZeroGPU Spaces (and get 25 minutes of H200 usage every day)
 
-*Acknowledgements: Thanks to ChunTe Lee for creating an awesome thumbnail for this post. Thanks to Pedro and Vaibhav for providing feedback on the post.*
+*Acknowledgements: Thanks to ChunTe Lee for creating an awesome thumbnail for this post. Thanks to Pedro and Vaibhav for providing feedback on the post. Thanks to Angela Yi from the PyTorch team for helping us with AOT guidance.*
