Add regional AoT compilation #3057
@@ -33,6 +33,8 @@ In this post, we’ll show how to wire up Ahead-of-Time (AoT) compilation in Zer

- [Dynamic shapes](#dynamic-shapes)
- [Multi-compile / shared weights](#multi-compile--shared-weights)
- [FlashAttention-3](#flashattention-3)
- [Regional compilation](#regional-compilation)
- [Use a compiled graph from the Hub](#use-a-compiled-graph-from-the-hub)
- [AoT compiled ZeroGPU Spaces demos](#aot-compiled-zerogpu-spaces-demos)
- [Conclusion](#conclusion)
- [Resources](#resources)
@@ -340,6 +342,34 @@ It tries to load a kernel from the [`kernels-community/vllm-flash-attn3`](https:

Here is a [fully working example of an FA3 attention processor](https://gist.github.com/sayakpaul/ff715f979793d4d44beb68e5e08ee067#file-fa3_qwen-py) for the Qwen-Image model.
### Regional compilation

So far, we have been compiling the full model. Depending on the model, full-model compilation can lead to very long cold starts, which make the development experience unpleasant.
As an alternative, we can compile _regions_ within a model, which significantly reduces cold start times while retaining almost all the benefits of full-model compilation. Regional compilation is especially promising when a model has repeated blocks of computation: a standard language model, for example, is made up of a number of identically structured Transformer blocks.

In our example, we can compile the repeated blocks of the Flux transformer ahead of time, as sketched below. The [Flux Transformer](https://github.com/huggingface/diffusers/blob/c2e5ece08bf22d249c62e964f91bc326cf9e3759/src/diffusers/models/transformers/transformer_flux.py) has two kinds of repeated blocks: `FluxTransformerBlock` and `FluxSingleTransformerBlock`.
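Here is a minimal sketch of that regional AoT flow, modeled on the `spaces.aoti_*` helpers referenced later in this post; treat the exact capture-and-apply calls as illustrative rather than a verbatim recipe:

```python
import torch
import spaces
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

@spaces.GPU(duration=1500)
def compile_transformer_block():
    # Run the pipeline once to capture real example inputs for one repeated block.
    with spaces.aoti_capture(pipe.transformer.transformer_blocks[0]) as call:
        pipe("arbitrary example prompt")
    # Export and AoT-compile that single block instead of the whole transformer.
    exported = torch.export.export(
        pipe.transformer.transformer_blocks[0],
        args=call.args,
        kwargs=call.kwargs,
    )
    return spaces.aoti_compile(exported)

# Every `FluxTransformerBlock` shares the same structure, so one compiled
# graph can be applied to each of them; `FluxSingleTransformerBlock` would
# get the same treatment with its own capture/compile pass.
compiled_block = compile_transformer_block()
for block in pipe.transformer.transformer_blocks:
    spaces.aoti_apply(compiled_block, block)
```

Because the repeated blocks are structurally identical, one compilation pass covers all of them, which is exactly where the cold-start savings come from.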
You can check out [this Space](https://huggingface.co/spaces/zerogpu-aoti/Qwen-Image-Edit-AoT-Regional) for a complete example.
> [!TIP]
> 💡 For Flux.1-Dev, switching to regional compilation reduces the compilation time from _6 minutes_ to just _30 seconds_ while delivering identical speedups.
### Use a compiled graph from the Hub

Once a model (or even a model block) is compiled ahead of time, we can serialize the compiled graph module as an artifact and reuse it later. In the context of a ZeroGPU-powered demo on Spaces, this significantly cuts down the demo startup time.
To keep the storage light, we can save just the compiled model graph, without including any model parameters in the artifact.
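As a rough sketch of what the producing side can look like (the `archive_file` attribute and the repo layout here are assumptions for illustration, not a documented contract):

```python
import spaces
import torch
from huggingface_hub import upload_file

# `exported` is the torch.export program from the regional-compilation step above.
compiled = spaces.aoti_compile(exported)  # the archive holds the graph, not the weights

# Push only the compiled-graph archive to the Hub; the model parameters
# keep living in the original model repo.
upload_file(
    path_or_fileobj=compiled.archive_file,  # assumed attribute pointing at the .pt2 archive
    path_in_repo="compiled_transformer_block.pt2",
    repo_id="your-username/flux-compiled-graph",  # hypothetical repo
)
```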
Check out [this collection](TODO) for the full workflow of obtaining a compiled model graph, pushing it to the Hub, and then using it to build a demo.
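On the consuming side, a demo can then download and load the compiled graph at startup instead of recompiling. A minimal sketch, assuming a `.pt2` AOTInductor archive like the one above (repo and file names are hypothetical):

```python
from huggingface_hub import hf_hub_download
from torch._inductor import aoti_load_package

# Fetch the compiled-graph artifact from the Hub (cached after the first call).
archive_path = hf_hub_download(
    repo_id="your-username/flux-compiled-graph",  # hypothetical repo
    filename="compiled_transformer_block.pt2",
)

# Load the compiled graph; no compilation happens at this point.
compiled_block = aoti_load_package(archive_path)
```

From there, the loaded graph is wired back into the pipeline the same way as before, e.g. with `spaces.aoti_apply`.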
**Member:** I don't understand this section. What are the benefits of persisting the serialization vs. the code demonstrated in the previous example? Also, the collection is missing.

**Member (Author):** We skip the compilation time by reusing a compiled graph.
## AoT compiled ZeroGPU Spaces demos

### Speedup comparison
@@ -350,7 +380,11 @@ Here is a [fully working example of an FA3 attention processor](https://gist.git

- [FLUX.1 Kontext](https://huggingface.co/spaces/zerogpu-aoti/FLUX.1-Kontext-Dev)
- [QwenImage Edit](https://huggingface.co/spaces/multimodalart/Qwen-Image-Edit-Fast)
- [Wan 2.2](https://huggingface.co/spaces/zerogpu-aoti/wan2-2-fp8da-aoti-faster)
- [LTX Video](https://huggingface.co/spaces/zerogpu-aoti/ltx-dev-fast)

### Regional compilation

- [Regional compilation recipe](https://docs.pytorch.org/tutorials/recipes/regional_compilation.html)
**Contributor:** 👏

**Contributor:** I initially thought that it was your recent tutorial on regional AoT. Still nice to include this one, though.

**Member (Author):** It's about to be merged: pytorch/tutorials#3543
- [Native integration in Diffusers](https://huggingface.co/docs/diffusers/main/en/optimization/fp16)
- [More performance numbers](https://pytorch.org/blog/torch-compile-and-diffusers-a-hands-on-guide-to-peak-performance/)

## Conclusion
@@ -363,6 +397,7 @@ We demonstrate speedups with Flux.1-Dev, but these techniques are not limited to

- Visit our [ZeroGPU-AOTI org on the Hub](https://huggingface.co/zerogpu-aoti) for a collection of demos that leverage the techniques discussed in this post.
- Browse the `spaces.aoti_*` APIs' [source code](https://pypi-browser.org/package/spaces/spaces-0.40.1-py3-none-any.whl/spaces/zero/torch/aoti.py) to learn more about the interface.
- Check out the [Kernels Community org on the Hub](https://huggingface.co/kernels-community).
- Learn more about regional compilation from the [PyTorch recipe](https://docs.pytorch.org/tutorials/recipes/regional_compilation.html).
- Upgrade to [Pro](https://huggingface.co/pro) on Hugging Face to create your own ZeroGPU Spaces (and get 25 minutes of H200 usage every day).

*Acknowledgements: Thanks to ChunTe Lee for creating an awesome thumbnail for this post. Thanks to Pedro and Vaibhav for providing feedback on the post.*