 If you are only interested in performant attention score modifications, please
 head to the `FlexAttention blog <https://pytorch.org/blog/flexattention/>`_ that
 contains a `gym of masks <https://github.com/pytorch-labs/attention-gym>`_.
-
 If you are wondering about what building blocks the ``torch`` library provides
 for writing your own transformer layers and best practices, you are in the
 right place, please keep reading!
 
 
-Introducing the Building Blocks
-===============================
-First, we will briefly introduce the 4 technologies mentioned in the introduction
+"""
+
+################################################################################
+# Introducing the Building Blocks
+# ===============================
+# First, we will briefly introduce the 4 technologies mentioned in the introduction
 
-* `torch.nested <https://pytorch.org/tutorials/prototype/nestedtensor.html>`_
+# * `torch.nested <https://pytorch.org/tutorials/prototype/nestedtensor.html>`_
 
-Nested tensors generalize the shape of regular dense tensors, allowing for
-representation of ragged-sized data with the same tensor UX. In the context of
-transformers, we can think of nested tensors as a tool for representing variable
-sequence lengths. They eliminate the need for the bug-prone practices of explicit
-padding and masking (think ``key_padding_mask`` in ``nn.MultiHeadAttention``).
+# Nested tensors generalize the shape of regular dense tensors, allowing for
+# representation of ragged-sized data with the same tensor UX. In the context of
+# transformers, we can think of nested tensors as a tool for representing variable
+# sequence lengths. They eliminate the need for the bug-prone practices of explicit
+# padding and masking (think ``key_padding_mask`` in ``nn.MultiHeadAttention``).
 
-* `scaled_dot_product_attention <https://pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html>`_
+# * `scaled_dot_product_attention <https://pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html>`_
 
-``scaled_dot_product_attention`` is a primitive for
-:math:`\text{softmax}(\frac{QK^T}{\sqrt{E}} + B)V` that dispatches into either fused
-implementations of the operator or a fallback implementation. It works out of
-the box in eager mode (i.e. the default mode of using PyTorch where operations
-are executed on the fly as they are encountered) and also integrates seamlessly
-with ``torch.compile()``. As of 2.6, it will also offer grouped query attention
-natively.
+# ``scaled_dot_product_attention`` is a primitive for
+# :math:`\text{softmax}(\frac{QK^T}{\sqrt{E}} + B)V` that dispatches into either fused
+# implementations of the operator or a fallback implementation. It works out of
+# the box in eager mode (i.e. the default mode of using PyTorch where operations
+# are executed on the fly as they are encountered) and also integrates seamlessly
+# with ``torch.compile()``. As of 2.6, it will also offer grouped query attention
+# natively.
 
-* `torch.compile() <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_
+# * `torch.compile() <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_
 
-``torch.compile()`` is a compiler introduced in version 2.0 that is able to
-capture a graph of PyTorch code and perform various optimizations on it, such as
-fusing together sequences of ops. Nested tensors with the ``torch.jagged`` layout
-and ``scaled_dot_product_attention`` work seamlessly with compile. In the
-context of transformers, the value add of using compile with nested tensor
-and SDPA is that compile can remove framework overhead ones sees in eager mode
-and fuse sequences of ops in transformers together (e.g. projection and
-activation).
+# ``torch.compile()`` is a compiler introduced in version 2.0 that is able to
+# capture a graph of PyTorch code and perform various optimizations on it, such as
+# fusing together sequences of ops. Nested tensors with the ``torch.jagged`` layout
+# and ``scaled_dot_product_attention`` work seamlessly with compile. In the
+# context of transformers, the value add of using compile with nested tensors
+# and SDPA is that compile can remove the framework overhead one sees in eager mode
+# and fuse sequences of ops in transformers together (e.g. projection and
+# activation).
 
-* `FlexAttention <https://pytorch.org/blog/flexattention/>`_
+# * `FlexAttention <https://pytorch.org/blog/flexattention/>`_
 
-``FlexAttention`` is a primitive that allows users to modify attention scores
-prior to the softmax operation. It generalizes the additive ``B`` term above
-for `scaled_dot_product_attention`, allowing for arbitrary calculation. It
-requires compile to achieve good performance.
+# ``FlexAttention`` is a primitive that allows users to modify attention scores
+# prior to the softmax operation. It generalizes the additive ``B`` term above
+# for ``scaled_dot_product_attention``, allowing for arbitrary calculation. It
+# requires compile to achieve good performance.
 
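To make the list above concrete, here is a minimal sketch of how these pieces compose. It assumes a recent PyTorch (2.5 or later) and a CUDA device for the fused nested-tensor and FlexAttention paths; the sizes, sequence lengths, and the toy ``rel_bias`` score modification are made up for illustration.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention.flex_attention import flex_attention

device = "cuda"  # the fused NJT / FlexAttention paths here assume a CUDA device
batch, n_heads, head_dim = 4, 8, 64
seq_lens = [37, 128, 256, 62]  # hypothetical ragged batch

# torch.nested: one jagged tensor instead of padding plus key_padding_mask.
qkv = torch.nested.nested_tensor(
    [torch.randn(s, n_heads * head_dim, device=device) for s in seq_lens],
    layout=torch.jagged,
)

def to_heads(t):
    # (B, S*, n_heads * head_dim) -> (B, n_heads, S*, head_dim), with S* ragged
    return t.unflatten(-1, [n_heads, head_dim]).transpose(1, 2)

q = k = v = to_heads(qkv)

# scaled_dot_product_attention dispatches to a fused kernel where possible and
# composes with torch.compile, which strips away eager-mode framework overhead.
sdpa = torch.compile(F.scaled_dot_product_attention)
out = sdpa(q, k, v)  # output keeps the same ragged structure as the inputs

# FlexAttention: a user-defined score_mod runs on the scores before softmax,
# generalizing the additive B term of SDPA to arbitrary calculations.
def rel_bias(score, b, h, q_idx, kv_idx):
    return score + 0.01 * (kv_idx - q_idx)  # toy additive bias

dense_q = torch.randn(batch, n_heads, 128, head_dim, device=device)
flex = torch.compile(flex_attention)  # compile is what makes FlexAttention fast
flex_out = flex(dense_q, dense_q, dense_q, score_mod=rel_bias)
```

FlexAttention also accepts nested-tensor inputs, which is how the ALiBi example later in the tutorial combines the two.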
-The above building blocks are "All You Need" (as of October 2024)
-==================================================================
+# The above building blocks are "All You Need" (as of October 2024)
+# ==================================================================
 
-The main premise in this section is that most transformer variations are
-GPT-style, consisting of layers like Embedding, Positional Encoding, Attention
-Blocks and Feed Forward networks. If we were to try to classify the differences
-in this space, we might land on something like:
+# The main premise in this section is that most transformer variations are
+# GPT-style, consisting of layers like Embedding, Positional Encoding, Attention
+# Blocks and Feed Forward networks. If we were to try to classify the differences
+# in this space, we might land on something like:
 
-1. Layer type (activation functions e.g. ``SwiGLU``, normalization functions
-e.g. ``RMSNorm`` etc., positional encodings e.g. Sinusoidal, Rotary etc.)
-2. Layer ordering (where to apply norms, where to apply positional encoding etc.)
-3. Modifications to attention score (``ALiBi``, Relative Positional Bias etc.)
+# 1. Layer type (activation functions, e.g. ``SwiGLU``; normalization functions,
+#    e.g. ``RMSNorm``; positional encodings, e.g. Sinusoidal, Rotary, etc.)
+# 2. Layer ordering (where to apply norms, where to apply positional encoding, etc.)
+# 3. Modifications to attention score (``ALiBi``, Relative Positional Bias, etc.)
 
 
-In a pre-compiler world, one might write their custom transformer and observe
-that it works but is slow. Then, one might write a custom fused kernel for
-the specific series of ops. In a compiler world, one can do the former, compile
-and profit.
+# In a pre-compiler world, one might write their custom transformer and observe
+# that it works but is slow. Then, one might write a custom fused kernel for
+# the specific series of ops. In a compiler world, one can do the former, compile,
+# and profit.
 
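As a hypothetical illustration of "do the former, compile, and profit": a feed-forward block written with plain ``torch`` ops (here a SwiGLU-style MLP, one of the layer-type choices listed above) can be handed to ``torch.compile`` as-is, letting the compiler fuse the projection-and-activation sequence instead of requiring a hand-written kernel. The sizes below are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    """A plain SwiGLU-style MLP written with ordinary torch ops."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A projection followed by an activation: exactly the kind of op
        # sequence torch.compile can fuse without any custom kernel work.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

ff = torch.compile(FeedForward(dim=512, hidden_dim=2048))
y = ff(torch.randn(2, 128, 512))
```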
114- """
115116
116117###############################################################################
117118# MultiheadAttention
@@ -399,13 +400,12 @@ def benchmark(func, *args, **kwargs):
 ######################################################################################
 # For reference some sample outputs on A100:
 #
-# ```
-# padded_time=0.03454, padded_peak_memory=4.14 GB
-# nested_time=0.00612, nested_peak_memory=0.76 GB
-# Difference between vanilla and nested result 0.0
-# Nested speedup: 5.65
-# Nested peak memory reduction 3.39 GB
-# ````
+# .. code::
+#
+#    padded_time=0.03454, padded_peak_memory=4.14 GB
+#    nested_time=0.00612, nested_peak_memory=0.76 GB
+#    Difference between vanilla and nested result 0.0
+#    Nested speedup: 5.65
+#    Nested peak memory reduction 3.39 GB
 #
 # We can also see the same for backward pass
 
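The timings and peak-memory figures above come from the tutorial's ``benchmark`` helper (its signature appears in the hunk headers, though its body is not shown here). A minimal sketch of one way such a helper could be written, assuming a CUDA device and not necessarily matching the tutorial's exact implementation:

```python
import timeit
import torch

def benchmark(func, *args, **kwargs):
    # Wall-clock time and peak CUDA memory for a single call of func.
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    begin = timeit.default_timer()
    output = func(*args, **kwargs)
    torch.cuda.synchronize()
    end = timeit.default_timer()
    return output, end - begin, torch.cuda.max_memory_allocated()
```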
@@ -429,16 +429,16 @@ def benchmark(func, *args, **kwargs):
 ##################################################################################
 # Sample outputs on A100:
 #
-# ```
-# padded_bw_time=2.09337, padded_bw_peak_mem=5.10 GB
-# nested_bw_time=0.01452, nested_bw_peak_mem=3.24 GB
-# Nested backward speedup: 144.13
-# Nested backward peak memory reduction 1.86 GB
-# Difference in out_proj.weight.grad 0.000244140625
-# Difference in packed_proj.weight.grad 0.001556396484375
-# Difference in out_proj.bias.grad 0.0
-# Difference in packed_proj.bias.grad 0.001953125
-# ```
+# .. code::
+#
+#    padded_bw_time=2.09337, padded_bw_peak_mem=5.10 GB
+#    nested_bw_time=0.01452, nested_bw_peak_mem=3.24 GB
+#    Nested backward speedup: 144.13
+#    Nested backward peak memory reduction 1.86 GB
+#    Difference in out_proj.weight.grad 0.000244140625
+#    Difference in packed_proj.weight.grad 0.001556396484375
+#    Difference in out_proj.bias.grad 0.0
+#    Difference in packed_proj.bias.grad 0.001953125
+#
 
 ##################################################################################
 # GPT-style layer
@@ -462,13 +462,13 @@ def benchmark(func, *args, **kwargs):
 # classification of modifications to the transformer architecture, recall that we
 # classified the modifications into layer type, layer ordering, and modifications
 # to the attention score. We trust that changing layer type and layer ordering
-# (e.g. swapping``LayerNorm`` for ``RMSNorm``) is fairly straightforward.
+# (e.g. swapping ``LayerNorm`` for ``RMSNorm``) is fairly straightforward.
 #
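For example, the norm swap really is a drop-in change in a hypothetical layer, since ``nn.RMSNorm`` (available in recent PyTorch releases) follows the same construction and call pattern as ``nn.LayerNorm``; the dimensions below are arbitrary.

```python
import torch
import torch.nn as nn

embed_dim = 512  # illustrative size
x = torch.randn(2, 128, embed_dim)

norm = nn.LayerNorm(embed_dim)  # original layer type
norm = nn.RMSNorm(embed_dim)    # drop-in replacement, same usage
y = norm(x)
```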
 # In this section, we will discuss various functionalities using the
 # aforementioned building blocks. In particular,
 #
 # * Cross Attention
-# * Fully masked rows no longer cause ``NaN``s
+# * Fully masked rows no longer cause NaNs
 # * Modifying attention score: ALiBi with FlexAttention and NJT
 # * Packed Projection
 