
**Author:** `Justin Silver <https://github.com/j-silv>`__

This tutorial explains the subtleties of ``requires_grad``,
``retain_grad``, leaf, and non-leaf tensors using a simple example. It
then covers how to extract and visualize gradients at any layer in a
neural network. By inspecting how information flows from the end of the
network to the parameters we want to optimize, we can debug issues such
as `vanishing or exploding
gradients <https://arxiv.org/abs/1211.5063>`__ that occur during
training.

Before starting, make sure you understand `tensors and how to manipulate
them <https://docs.pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html>`__.
A basic knowledge of `how autograd
works <https://docs.pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html>`__


######################################################################
# Next, we instantiate a simple network to focus on the gradients. This
# will be an affine layer, followed by a ReLU activation, and ending with
# an MSE loss between prediction and label tensors.
#
# .. math::
#
######################################################################
# The distinction between leaf and non-leaf determines whether the
# tensor’s gradient will be stored in the ``grad`` property after the
# backward pass, and thus be usable for `gradient
# descent <https://en.wikipedia.org/wiki/Gradient_descent>`__. We’ll cover
# this some more in the `following section <#retain-grad>`__.
#
# Let’s now investigate how PyTorch calculates and stores gradients for
# the tensors in its computational graph.
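

######################################################################
# As a quick standalone illustration (a minimal sketch that is separate
# from the tutorial’s own network, with made-up variable names), the
# ``is_leaf`` attribute distinguishes the two kinds of tensors:
#

w_demo = torch.randn(3, requires_grad=True)  # created directly -> leaf
z_demo = w_demo * 2                          # result of an op on a tracked tensor -> non-leaf
print(w_demo.is_leaf, z_demo.is_leaf)        # True False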

######################################################################
# ``requires_grad``
# -----------------
#
# To build the computational graph which can be used for gradient
# calculation, we need to pass in the ``requires_grad=True`` parameter to
# a tensor constructor. By default, the value is ``False``, and thus
# PyTorch does not track gradients on any created tensors. To verify this,
# try not setting ``requires_grad``, re-run the forward pass, and then run
# backpropagation. You will see:
#
# ::
#
#    >>> loss.backward()
#    RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
#
# This error means that autograd can’t backpropagate to any leaf tensors
# because ``loss`` is not tracking gradients. If you need to change the
# property, you can call ``requires_grad_()`` on the tensor (notice the
# ``_`` suffix).
#
# We can sanity check which nodes require gradient calculation, just like
# we did above with the ``is_leaf`` attribute:
#

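# (a hedged sketch standing in for the tutorial's own check; the tensor
# names below are made up for illustration)
t_demo = torch.ones(3)                       # leaf, not tracking gradients
print(t_demo.requires_grad)                  # False by default
t_demo.requires_grad_()                      # in-place toggle (note the trailing underscore)
u_demo = t_demo * 2                          # non-leaf built from a tracked tensor
print(u_demo.requires_grad, u_demo.is_leaf)  # True False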


######################################################################
# It’s useful to remember that a non-leaf tensor has
# ``requires_grad=True`` by definition, since backpropagation would fail
# otherwise. If the tensor is a leaf, then it will only have
# ``requires_grad=True`` if it was specifically set by the user. Another
# way to phrase this is that if at least one of the inputs to a tensor
# requires the gradient, then it will require the gradient as well.
#
# There are two exceptions to this rule:
#
# In summary, ``requires_grad`` tells autograd which tensors need to have
# their gradients calculated for backpropagation to work. This is
# different from which tensors have their ``grad`` field populated, which
# is the topic of the next section.
#



######################################################################
# Calling ``backward()`` populates the ``grad`` field of all leaf tensors
# which had ``requires_grad=True``. The ``grad`` is the gradient of the
# loss with respect to the tensor we are probing. Before running
# ``backward()``, this attribute is set to ``None``.
#

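
######################################################################
# For example (a self-contained sketch with made-up names, separate from
# the tutorial’s network), a leaf tensor’s ``grad`` starts out as ``None``
# and is only filled in once ``backward()`` runs:
#

w_leaf_demo = torch.tensor([2.0, 3.0], requires_grad=True)
loss_demo = (w_leaf_demo ** 2).sum()
print(w_leaf_demo.grad)  # None before backpropagation
loss_demo.backward()
print(w_leaf_demo.grad)  # tensor([4., 6.]), i.e. d(loss)/dw = 2 * w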


######################################################################
# PyTorch returns ``None`` for the gradient and also warns us that a
# non-leaf node’s ``grad`` attribute is being accessed. Although autograd
# has to calculate intermediate gradients for backpropagation to work, it
# assumes you don’t need to access the values afterwards. To change this
#
#    >>> x.retain_grad()
#    RuntimeError: can't retain_grad on Tensor that has requires_grad=False
#
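

######################################################################
# As a minimal standalone sketch (the tensor names are made up and are not
# part of the tutorial’s network), retaining a non-leaf gradient looks
# like this:
#

a_demo = torch.tensor([1.0], requires_grad=True)
b_demo = a_demo * 3     # non-leaf tensor
b_demo.retain_grad()    # ask autograd to keep the gradient of b_demo after backward
(b_demo ** 2).backward()
print(b_demo.grad)      # tensor([6.]), i.e. d(b**2)/db = 2 * b
print(a_demo.grad)      # tensor([18.]), i.e. chain rule: 2 * b * 3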


######################################################################
# Summary table
# -------------
#
# Using ``retain_grad()`` and ``retains_grad`` only makes sense for
# non-leaf nodes, since the ``grad`` attribute will already be populated
# for leaf tensors that have ``requires_grad=True``. By default, these
# non-leaf nodes do not retain (store) their gradient after
# backpropagation. We can change that by rerunning the forward pass,
# telling PyTorch to store the gradients, and then performing
# backpropagation.
#
# The table below can be used as a reference which summarizes the above
# discussions. The following scenarios are the only ones that are valid
# for PyTorch tensors.
#
# To illustrate the importance of gradient visualization, we will
# instantiate one version of the network with batch normalization
# (BatchNorm), and one without it. Batch normalization is an extremely
# effective technique to resolve `vanishing/exploding
# gradients <https://arxiv.org/abs/1211.5063>`__, and we will be verifying
# that experimentally.
#
# The model we use has a configurable number of repeating fully-connected
# layers which alternate between ``nn.Linear``, ``norm_layer``, and
# ``nn.Sigmoid``. If batch normalization is enabled, then ``norm_layer``
# will use
# `BatchNorm1d <https://docs.pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html>`__,
# otherwise it will use the
# `Identity <https://docs.pytorch.org/docs/stable/generated/torch.nn.Identity.html>`__
# transformation.
#

def fc_layer(in_size, out_size, norm_layer):


######################################################################
# Because we wrapped up the logic and state of our model in an
# ``nn.Module``, we need another method to access the intermediate
# gradients if we want to avoid modifying the module code directly. This
# is done by `registering a
# hook <https://docs.pytorch.org/docs/stable/notes/autograd.html#backward-hooks-execution>`__.
#
# .. warning::
#
#    Using backward pass hooks attached to output tensors is preferred over using ``retain_grad()`` on the tensors themselves. An alternative method is to directly attach module hooks (for example ``register_full_backward_hook()``) so long as the ``nn.Module`` instance does not perform any in-place operations (a rough sketch of this alternative follows the hook-registration code below). For more information, please refer to `this issue <https://github.com/pytorch/pytorch/issues/61519>`__.
#
# The following code defines our hooks and gathers descriptive names for
# the network’s layers.
#

# note that wrapper functions are used for Python closure
# so that we can pass arguments.

def hook_forward_wrapper(module_name, grads):
    def hook_forward(module, args, output):
        """Forward pass hook which attaches backward pass hooks to intermediate tensors"""
        output.register_hook(hook_backward_wrapper(module_name, grads))
    return hook_forward

def hook_backward_wrapper(module_name, grads):
    def hook_backward(grad):
        """Backward pass hook which appends gradients"""
        grads.append((module_name, grad))
    return hook_backward

def get_all_layers(model, hook_fn):
    """Register forward pass hook (hook_fn) to model outputs

    Returns:
        - layers: a dict with keys as layer/module and values as layer/module names
          e.g. layers[nn.Conv2d] = layer1.0.conv1
        - grads: a list of tuples with module name and tensor output gradient
          e.g. grads[0] == (layer1.0.conv1, torch.Tensor(...))
    """
    layers = dict()
    grads = []
    for name, layer in model.named_modules():
        # skip Sequential and/or wrapper modules
        if not any(layer.children()):
            layers[layer] = name
            layer.register_forward_hook(hook_fn(name, grads))
    return layers, grads

# register hooks
layers_bn, grads_bn = get_all_layers(model_bn, hook_forward_wrapper)
layers_nobn, grads_nobn = get_all_layers(model_nobn, hook_forward_wrapper)
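

######################################################################
# As an aside, module-level hooks can gather roughly the same per-layer
# gradient information as the tensor hooks above. The snippet below is
# only a hedged sketch of that alternative, using the
# ``register_full_backward_hook()`` API on a throwaway model; the names
# ``demo_net``, ``grads_alt``, and ``module_backward_hook_wrapper`` are
# made up for illustration and are not used elsewhere in the tutorial.
#

def module_backward_hook_wrapper(module_name, grads):
    def module_backward_hook(module, grad_input, grad_output):
        # grad_output is a tuple of gradients with respect to the module outputs
        grads.append((module_name, grad_output[0]))
    return module_backward_hook

demo_net = nn.Sequential(nn.Linear(4, 4), nn.Sigmoid(), nn.Linear(4, 1))
grads_alt = []
for demo_name, demo_layer in demo_net.named_modules():
    if not any(demo_layer.children()):
        demo_layer.register_full_backward_hook(module_backward_hook_wrapper(demo_name, grads_alt))

demo_net(torch.randn(8, 4)).sum().backward()
print([demo_name for demo_name, _ in grads_alt])  # module names in backward (output-to-input) order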


######################################################################
# Let’s now train the models for a few epochs:
#

epochs = 10

for epoch in range(epochs):

    # important to clear, because we append to
    # grads every time we do a forward pass
    grads_bn.clear()
    grads_nobn.clear()

    optimizer_bn.zero_grad()
    optimizer_nobn.zero_grad()


######################################################################
# After running the forward and backward pass, the gradients for all the
# intermediate tensors should be present in ``grads_bn`` and
# ``grads_nobn``. We compute the mean absolute value of each gradient
# matrix so that we can compare the two models.
#

def get_grads(grads):
    layer_idx = []
    avg_grads = []
    for idx, (name, grad) in enumerate(grads):
        if grad is not None:
            avg_grad = grad.abs().mean()
            avg_grads.append(avg_grad)
            # idx is backwards since we appended in the backward pass
            layer_idx.append(len(grads) - 1 - idx)
    return layer_idx, avg_grads

layer_idx_bn, avg_grads_bn = get_grads(grads_bn)
layer_idx_nobn, avg_grads_nobn = get_grads(grads_nobn)


######################################################################
# With the average gradients computed, we can now plot them and see how
# the values change as a function of the network depth. Notice that when
# we don’t apply batch normalization, the gradient values in the
# intermediate layers fall to zero very quickly. The batch normalization
# model, however, maintains non-zero gradients in its intermediate layers.
#

fig, ax = plt.subplots()
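
# (hedged sketch) one possible way to plot the two gradient-flow curves;
# the exact styling used in the full tutorial may differ
ax.plot(layer_idx_bn, [g.item() for g in avg_grads_bn], label="with BatchNorm")
ax.plot(layer_idx_nobn, [g.item() for g in avg_grads_nobn], label="without BatchNorm")
ax.set_xlabel("layer index (input to output)")
ax.set_ylabel("mean absolute gradient")
ax.legend()
plt.show()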
# - Try increasing the number of layers (``num_layers``) in our model and
#   see what effect this has on the gradient flow graph
# - How would you adapt the code to visualize average activations instead
#   of average gradients? (*Hint:* in the ``hook_forward()`` function we
#   have access to the raw tensor output)
# - What are some other methods to deal with vanishing and exploding
#   gradients? (one common method is sketched below)
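

######################################################################
# As one possible answer to the last exercise (a hedged, standalone sketch
# that is not part of the tutorial’s training loop, using a throwaway
# ``demo_model``), gradients can be clipped just before the optimizer step
# with ``torch.nn.utils.clip_grad_norm_``:
#

demo_model = nn.Linear(4, 2)
demo_out = demo_model(torch.randn(8, 4)).pow(2).mean()
demo_out.backward()

# rescale the gradients in place so that their total norm is at most max_norm;
# in a real training loop this would run after backward() and before step()
total_norm = torch.nn.utils.clip_grad_norm_(demo_model.parameters(), max_norm=1.0)
print(total_norm)  # the gradient norm measured before clipping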
#   mechanics <https://docs.pytorch.org/docs/stable/notes/autograd.html>`__
# - `Batch Normalization: Accelerating Deep Network Training by Reducing
#   Internal Covariate Shift <https://arxiv.org/abs/1502.03167>`__
# - `On the difficulty of training Recurrent Neural
#   Networks <https://arxiv.org/abs/1211.5063>`__
#