1 | 1 | # -*- coding: utf-8 -*- |
2 | 2 | """ |
3 | | -A Gentle Introduction to ``torch.autograd`` |
4 | | -=========================================== |
5 | | - |
6 | | -``torch.autograd`` is PyTorch’s automatic differentiation engine that powers |
7 | | -neural network training. In this section, you will get a conceptual |
8 | | -understanding of how autograd helps a neural network train. |
9 | | - |
10 | | -Background |
11 | | -~~~~~~~~~~ |
12 | | -Neural networks (NNs) are a collection of nested functions that are |
13 | | -executed on some input data. These functions are defined by *parameters* |
14 | | -(consisting of weights and biases), which in PyTorch are stored in |
15 | | -tensors. |
16 | | - |
17 | | -Training a NN happens in two steps: |
18 | | - |
19 | | -**Forward Propagation**: In forward prop, the NN makes its best guess |
20 | | -about the correct output. It runs the input data through each of its |
21 | | -functions to make this guess. |
22 | | - |
23 | | -**Backward Propagation**: In backprop, the NN adjusts its parameters |
24 | | -proportionate to the error in its guess. It does this by traversing |
25 | | -backwards from the output, collecting the derivatives of the error with |
26 | | -respect to the parameters of the functions (*gradients*), and optimizing |
27 | | -the parameters using gradient descent. For a more detailed walkthrough |
28 | | -of backprop, check out this `video from |
29 | | -3Blue1Brown <https://www.youtube.com/watch?v=tIeHLnjs5U8>`__. |
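In symbols, a single plain gradient-descent step (a minimal sketch added here for
concreteness) nudges each parameter :math:`w` against the gradient of the error
:math:`L`, scaled by a learning rate :math:`\eta`:

.. math::
   w \leftarrow w - \eta \frac{\partial L}{\partial w}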
30 | | - |
| 3 | +:orphan: |
31 | 4 | |
| 5 | +A Gentle Introduction to ``torch.autograd`` |
| 6 | +============================================== |
32 | 7 | |
| 8 | +This tutorial has been deprecated because there is an identical basics tutorial. |
33 | 9 | |
34 | | -Usage in PyTorch |
35 | | -~~~~~~~~~~~~~~~~ |
36 | | -Let's take a look at a single training step. |
37 | | -For this example, we load a pretrained resnet18 model from ``torchvision``. |
38 | | -We create a random data tensor to represent a single image with 3 channels and a height & width of 64, |
39 | | -and a corresponding ``labels`` tensor initialized to random values. For this pretrained model, the |
40 | | -label tensor has shape (1, 1000), one entry per output class. |
| 10 | +Redirecting in 3 seconds... |
41 | 11 | |
42 | | -.. note:: |
43 | | - This tutorial works only on the CPU and will not work on GPU devices (even if tensors are moved to CUDA). |
| 12 | +.. raw:: html |
44 | 13 | |
| 14 | + <meta http-equiv="Refresh" content="3; url='https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html'" /> |
45 | 15 | """ |
46 | | -import torch |
47 | | -from torchvision.models import resnet18, ResNet18_Weights |
48 | | -model = resnet18(weights=ResNet18_Weights.DEFAULT) |
49 | | -data = torch.rand(1, 3, 64, 64) |
50 | | -labels = torch.rand(1, 1000) |
51 | | - |
52 | | -############################################################ |
53 | | -# Next, we run the input data through each of the model's layers to make a prediction. |
54 | | -# This is the **forward pass**. |
55 | | -# |
56 | | - |
57 | | -prediction = model(data) # forward pass |
58 | | - |
59 | | -############################################################ |
60 | | -# We use the model's prediction and the corresponding label to calculate the error (``loss``). |
61 | | -# The next step is to backpropagate this error through the network. |
62 | | -# Backward propagation is kicked off when we call ``.backward()`` on the error tensor. |
63 | | -# Autograd then calculates and stores the gradients for each model parameter in the parameter's ``.grad`` attribute. |
64 | | -# |
65 | | - |
66 | | -loss = (prediction - labels).sum() |
67 | | -loss.backward() # backward pass |
68 | | - |
69 | | -############################################################ |
70 | | -# Next, we load an optimizer, in this case SGD with a learning rate of 0.01 and `momentum <https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d>`__ of 0.9. |
71 | | -# We register all the parameters of the model in the optimizer. |
72 | | -# |
73 | | - |
74 | | -optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9) |
75 | | - |
76 | | -###################################################################### |
77 | | -# Finally, we call ``.step()`` to initiate gradient descent. The optimizer adjusts each parameter by its gradient stored in ``.grad``. |
78 | | -# |
79 | | - |
80 | | -optim.step() #gradient descent |
81 | | - |
82 | | -###################################################################### |
83 | | -# At this point, you have everything you need to train your neural network. |
84 | | -# The below sections detail the workings of autograd - feel free to skip them. |
85 | | -# |
86 | | - |
87 | | - |
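######################################################################
# To tie the steps together, here is a minimal sketch of a complete training
# loop, reusing the ``model``, ``data``, ``labels`` and ``optim`` objects from
# above; the toy loss and the single, repeated batch are assumptions made purely
# for illustration.

for _ in range(2):                       # a couple of illustrative iterations
    optim.zero_grad()                    # clear gradients accumulated in .grad
    prediction = model(data)             # forward pass
    loss = (prediction - labels).sum()   # same toy loss as above
    loss.backward()                      # backward pass populates .grad
    optim.step()                         # gradient descent update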
88 | | -###################################################################### |
89 | | -# -------------- |
90 | | -# |
91 | | - |
92 | | - |
93 | | -###################################################################### |
94 | | -# Differentiation in Autograd |
95 | | -# ~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
96 | | -# Let's take a look at how ``autograd`` collects gradients. We create two tensors ``a`` and ``b`` with |
97 | | -# ``requires_grad=True``. This signals to ``autograd`` that every operation on them should be tracked. |
98 | | -# |
99 | | - |
100 | | -import torch |
101 | | - |
102 | | -a = torch.tensor([2., 3.], requires_grad=True) |
103 | | -b = torch.tensor([6., 4.], requires_grad=True) |
104 | | - |
105 | | -###################################################################### |
106 | | -# We create another tensor ``Q`` from ``a`` and ``b``. |
107 | | -# |
108 | | -# .. math:: |
109 | | -# Q = 3a^3 - b^2 |
110 | | - |
111 | | -Q = 3*a**3 - b**2 |
112 | | - |
113 | | - |
114 | | -###################################################################### |
115 | | -# Let's assume ``a`` and ``b`` to be parameters of an NN, and ``Q`` |
116 | | -# to be the error. In NN training, we want gradients of the error |
117 | | -# w.r.t. parameters, i.e. |
118 | | -# |
119 | | -# .. math:: |
120 | | -# \frac{\partial Q}{\partial a} = 9a^2 |
121 | | -# |
122 | | -# .. math:: |
123 | | -# \frac{\partial Q}{\partial b} = -2b |
124 | | -# |
125 | | -# |
126 | | -# When we call ``.backward()`` on ``Q``, autograd calculates these gradients |
127 | | -# and stores them in the respective tensors' ``.grad`` attribute. |
128 | | -# |
129 | | -# We need to explicitly pass a ``gradient`` argument in ``Q.backward()`` because ``Q`` is a vector. |
130 | | -# ``gradient`` is a tensor of the same shape as ``Q``, and it represents the |
131 | | -# gradient of Q w.r.t. itself, i.e. |
132 | | -# |
133 | | -# .. math:: |
134 | | -# \frac{dQ}{dQ} = 1 |
135 | | -# |
136 | | -# Equivalently, we can also aggregate Q into a scalar and call backward implicitly, like ``Q.sum().backward()``. |
137 | | -# |
138 | | -external_grad = torch.tensor([1., 1.]) |
139 | | -Q.backward(gradient=external_grad) |
140 | | - |
141 | | - |
142 | | -####################################################################### |
143 | | -# Gradients are now deposited in ``a.grad`` and ``b.grad``. |
144 | | - |
145 | | -# check if collected gradients are correct |
146 | | -print(9*a**2 == a.grad) |
147 | | -print(-2*b == b.grad) |
148 | | - |
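######################################################################
# As a quick sketch of the equivalence mentioned above, aggregating ``Q`` into a
# scalar with ``.sum()`` lets us call ``backward()`` with no arguments and yields
# the same gradients. Gradients accumulate in ``.grad``, so we clear them first;
# the graph is also rebuilt, since the previous ``backward()`` call freed it.

a.grad = None
b.grad = None
Q = 3*a**3 - b**2
Q.sum().backward()
print(a.grad)   # tensor([36., 81.]) == 9*a**2
print(b.grad)   # tensor([-12., -8.]) == -2*b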
149 | | - |
150 | | -###################################################################### |
151 | | -# Optional Reading - Vector Calculus using ``autograd`` |
152 | | -# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
153 | | -# |
154 | | -# Mathematically, if you have a vector valued function |
155 | | -# :math:`\vec{y}=f(\vec{x})`, then the gradient of :math:`\vec{y}` with |
156 | | -# respect to :math:`\vec{x}` is a Jacobian matrix :math:`J`: |
157 | | -# |
158 | | -# .. math:: |
159 | | -# |
160 | | -# |
161 | | -# J |
162 | | -# = |
163 | | -# \left(\begin{array}{cc} |
164 | | -# \frac{\partial \bf{y}}{\partial x_{1}} & |
165 | | -# ... & |
166 | | -# \frac{\partial \bf{y}}{\partial x_{n}} |
167 | | -# \end{array}\right) |
168 | | -# = |
169 | | -# \left(\begin{array}{ccc} |
170 | | -# \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\ |
171 | | -# \vdots & \ddots & \vdots\\ |
172 | | -# \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}} |
173 | | -# \end{array}\right) |
174 | | -# |
175 | | -# Generally speaking, ``torch.autograd`` is an engine for computing |
176 | | -# vector-Jacobian products. That is, given any vector :math:`\vec{v}`, it can compute the product |
177 | | -# :math:`J^{T}\cdot \vec{v}`. |
178 | | -# |
179 | | -# If :math:`\vec{v}` happens to be the gradient of a scalar function :math:`l=g\left(\vec{y}\right)`: |
180 | | -# |
181 | | -# .. math:: |
182 | | -# |
183 | | -# |
184 | | -# \vec{v} |
185 | | -# = |
186 | | -# \left(\begin{array}{ccc}\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}\end{array}\right)^{T} |
187 | | -# |
188 | | -# then by the chain rule, the vector-Jacobian product would be the |
189 | | -# gradient of :math:`l` with respect to :math:`\vec{x}`: |
190 | | -# |
191 | | -# .. math:: |
192 | | -# |
193 | | -# |
194 | | -# J^{T}\cdot \vec{v}=\left(\begin{array}{ccc} |
195 | | -# \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\ |
196 | | -# \vdots & \ddots & \vdots\\ |
197 | | -# \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}} |
198 | | -# \end{array}\right)\left(\begin{array}{c} |
199 | | -# \frac{\partial l}{\partial y_{1}}\\ |
200 | | -# \vdots\\ |
201 | | -# \frac{\partial l}{\partial y_{m}} |
202 | | -# \end{array}\right)=\left(\begin{array}{c} |
203 | | -# \frac{\partial l}{\partial x_{1}}\\ |
204 | | -# \vdots\\ |
205 | | -# \frac{\partial l}{\partial x_{n}} |
206 | | -# \end{array}\right) |
207 | | -# |
208 | | -# This characteristic of vector-Jacobian product is what we use in the above example; |
209 | | -# ``external_grad`` represents :math:`\vec{v}`. |
210 | | -# |
211 | | - |
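######################################################################
# As a concrete sketch of this (assuming the same :math:`Q = 3a^3 - b^2` as
# above), we can pass a vector other than all-ones to ``backward`` and verify
# that we get :math:`J^{T}\cdot \vec{v}`. Because ``Q`` is computed element-wise,
# its Jacobian is diagonal, so the result is simply the analytic gradients scaled
# element-wise by :math:`\vec{v}`.

# fresh leaf tensors, so no gradients carry over from the earlier calls
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)
Q = 3*a**3 - b**2

v = torch.tensor([0.5, 2.0])    # an arbitrary vector v
Q.backward(gradient=v)

print(a.grad)   # 9*a**2 * v -> tensor([ 18., 162.])
print(b.grad)   # -2*b * v   -> tensor([ -6., -16.])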
212 | | - |
213 | | - |
214 | | -###################################################################### |
215 | | -# Computational Graph |
216 | | -# ~~~~~~~~~~~~~~~~~~~ |
217 | | -# |
218 | | -# Conceptually, autograd keeps a record of data (tensors) & all executed |
219 | | -# operations (along with the resulting new tensors) in a directed acyclic |
220 | | -# graph (DAG) consisting of |
221 | | -# `Function <https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function>`__ |
222 | | -# objects. In this DAG, leaves are the input tensors and roots are the output |
223 | | -# tensors. By tracing this graph from roots to leaves, you can |
224 | | -# automatically compute the gradients using the chain rule. |
225 | | -# |
226 | | -# In a forward pass, autograd does two things simultaneously: |
227 | | -# |
228 | | -# - run the requested operation to compute a resulting tensor, and |
229 | | -# - maintain the operation’s *gradient function* in the DAG. |
230 | | -# |
231 | | -# The backward pass kicks off when ``.backward()`` is called on the DAG |
232 | | -# root. ``autograd`` then: |
233 | | -# |
234 | | -# - computes the gradients from each ``.grad_fn``, |
235 | | -# - accumulates them in the respective tensor’s ``.grad`` attribute, and |
236 | | -# - using the chain rule, propagates all the way to the leaf tensors. |
237 | | -# |
238 | | -# Below is a visual representation of the DAG in our example. In the graph, |
239 | | -# the arrows are in the direction of the forward pass. The nodes represent the backward functions |
240 | | -# of each operation in the forward pass. The leaf nodes in blue represent our leaf tensors ``a`` and ``b``. |
241 | | -# |
242 | | -# .. figure:: /_static/img/dag_autograd.png |
243 | | -# |
244 | | -# .. note:: |
245 | | -# **DAGs are dynamic in PyTorch** |
246 | | -# An important thing to note is that the graph is recreated from scratch; after each |
247 | | -# ``.backward()`` call, autograd starts populating a new graph. This is |
248 | | -# exactly what allows you to use control flow statements in your model; |
249 | | -# you can change the shape, size and operations at every iteration if |
250 | | -# needed. |
251 | | -# |
252 | | -# Exclusion from the DAG |
253 | | -# ^^^^^^^^^^^^^^^^^^^^^^ |
254 | | -# |
255 | | -# ``torch.autograd`` tracks operations on all tensors which have their |
256 | | -# ``requires_grad`` flag set to ``True``. For tensors that don’t require |
257 | | -# gradients, setting this attribute to ``False`` excludes them from the |
258 | | -# gradient computation DAG. |
259 | | -# |
260 | | -# The output tensor of an operation will require gradients even if only a |
261 | | -# single input tensor has ``requires_grad=True``. |
262 | | -# |
263 | | - |
264 | | -x = torch.rand(5, 5) |
265 | | -y = torch.rand(5, 5) |
266 | | -z = torch.rand((5, 5), requires_grad=True) |
267 | | - |
268 | | -a = x + y |
269 | | -print(f"Does `a` require gradients?: {a.requires_grad}") |
270 | | -b = x + z |
271 | | -print(f"Does `b` require gradients?: {b.requires_grad}") |
272 | | - |
273 | | - |
274 | | -###################################################################### |
275 | | -# In a NN, parameters that don't compute gradients are usually called **frozen parameters**. |
276 | | -# It is useful to "freeze" part of your model if you know in advance that you won't need the gradients of those parameters |
277 | | -# (this offers some performance benefits by reducing autograd computations). |
278 | | -# |
279 | | -# In finetuning, we freeze most of the model and typically only modify the classifier layers to make predictions on new labels. |
280 | | -# Let's walk through a small example to demonstrate this. As before, we load a pretrained resnet18 model, and freeze all the parameters. |
281 | | - |
282 | | -from torch import nn, optim |
283 | | - |
284 | | -model = resnet18(weights=ResNet18_Weights.DEFAULT) |
285 | | - |
286 | | -# Freeze all the parameters in the network |
287 | | -for param in model.parameters(): |
288 | | - param.requires_grad = False |
289 | | - |
290 | | -###################################################################### |
291 | | -# Let's say we want to finetune the model on a new dataset with 10 labels. |
292 | | -# In resnet, the classifier is the last linear layer ``model.fc``. |
293 | | -# We can simply replace it with a new linear layer (unfrozen by default) |
294 | | -# that acts as our classifier. |
295 | | - |
296 | | -model.fc = nn.Linear(512, 10) |
297 | | - |
298 | | -###################################################################### |
299 | | -# Now all parameters in the model, except the parameters of ``model.fc``, are frozen. |
300 | | -# The only parameters that compute gradients are the weights and bias of ``model.fc``. |
301 | | - |
302 | | -# Optimize only the classifier |
303 | | -optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9) |
304 | | - |
305 | | -########################################################################## |
306 | | -# Notice that although we register all the parameters in the optimizer, |
307 | | -# the only parameters that compute gradients (and hence are updated in gradient descent) |
308 | | -# are the weights and bias of the classifier. |
309 | | -# |
310 | | -# The same exclusionary functionality is available as a context manager in |
311 | | -# `torch.no_grad() <https://pytorch.org/docs/stable/generated/torch.no_grad.html>`__ |
312 | | -# |
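# A quick sketch of both points: after freezing and replacing the classifier,
# only ``model.fc``'s parameters still require gradients, and ``torch.no_grad()``
# disables tracking for everything inside its block.
print([name for name, p in model.named_parameters() if p.requires_grad])
# ['fc.weight', 'fc.bias']

with torch.no_grad():
    frozen_out = model(torch.rand(1, 3, 64, 64))
print(frozen_out.requires_grad)   # False - nothing inside the block was tracked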
313 | | - |
314 | | -###################################################################### |
315 | | -# -------------- |
316 | | -# |
317 | | - |
318 | | -###################################################################### |
319 | | -# Further readings: |
320 | | -# ~~~~~~~~~~~~~~~~~~~ |
321 | | -# |
322 | | -# - `In-place operations & Multithreaded Autograd <https://pytorch.org/docs/stable/notes/autograd.html>`__ |
323 | | -# - `Example implementation of reverse-mode autodiff <https://colab.research.google.com/drive/1VpeE6UvEPRz9HmsHh1KS0XxXjYu533EC>`__ |
324 | | -# - `Video: PyTorch Autograd Explained - In-depth Tutorial <https://www.youtube.com/watch?v=MswxJw-8PvE>`__ |