Automatic differentiation

Author

Marie-Hélène Burle

PyTorch has automatic differentiation capabilities—meaning that it can track all the operations performed on tensors during the forward pass and compute all the gradients automatically for the backpropagation—thanks to its package torch.autograd.

Let’s have a look at this.

Some definitions

Derivative of a function:
Rate of change of a function with a single variable w.r.t. its variable.

Partial derivative:
Rate of change of a function with multiple variables w.r.t. one of its variables, while the other variables are held constant.

Gradient:
Vector of the partial derivatives of a function with several variables.

Differentiation:
Calculation of the derivatives of a function.

Chain rule:
Formula to calculate the derivatives of composite functions (a one-variable form is written out after these definitions).

Automatic differentiation:
Automatic computation of partial derivatives by algorithms.
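
For reference, here is the standard one-variable form of the chain rule mentioned above (a generic statement of the formula, not anything PyTorch-specific):

$$\frac{d}{dx} f\big(g(x)\big) = f'\big(g(x)\big)\, g'(x)$$

Backpropagation applies this rule repeatedly, operation by operation, to obtain the gradient of the loss w.r.t. every parameter.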

Backpropagation

First, we need to talk about backpropagation: the backward pass that follows each forward pass and adjusts the model’s parameters to minimize the output of the loss function.

The last two videos of 3Blue1Brown’s neural network series explain backpropagation and its manual calculation very well.

What is backpropagation?

14 min video.

There is one minor terminological error in this video: they call the use of mini-batches stochastic gradient descent. In fact, this is called mini-batch gradient descent. Stochastic gradient descent uses a single example at each iteration.

How does backpropagation work?

10 min video.

Automatic differentiation

If we had to do all this manually, it would be absolute hell. Thankfully, many tools—including PyTorch—can do this automatically.

Tracking computations

To automate the calculation of all those derivatives through the chain rule, PyTorch needs to track computations during the forward pass.

PyTorch does not however track all the computations on all tensors (this would be extremely memory intensive!). To start tracking computations on a tensor, set its requires_grad attribute to True:

import torch

x = torch.ones(2, 4, requires_grad=True)
x

The grad_fn attribute

Whenever a tensor is created by an operation involving a tracked tensor, it gets a grad_fn attribute recording that operation:

y = x + 1
y
y.grad_fn
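
One detail worth noting: only tensors produced by an operation get a grad_fn; a tensor created directly by the user is a leaf of the computation graph and its grad_fn is None. A quick check, using the x and y defined just above:

print(x.grad_fn)   # None: x is a leaf tensor created by the user
print(y.grad_fn)   # an AddBackward0 object: y results from an operation on x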

Judicious tracking

You don’t want to track more than is necessary. There are multiple ways to avoid tracking what you don’t want.

You can stop tracking computations on a tensor with the in-place method detach_ (detach, without the underscore, returns a detached copy instead):

x
x.detach_()

You can change its requires_grad flag:

x = torch.zeros(2, 3, requires_grad=True)
x
x.requires_grad_(False)

Alternatively, you can wrap any code you don’t want to track in a with torch.no_grad(): block:

x = torch.ones(2, 4, requires_grad=True)

with torch.no_grad():
    y = x + 1

y

Compare this with what we just did above: without torch.no_grad(), the result y had a grad_fn; here the computation was not tracked, so y has no grad_fn and its requires_grad flag is False.
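
To make the comparison explicit, here is a quick side-by-side check (y_tracked is a name of my choosing; it simply repeats the addition outside the no_grad block):

y_tracked = x + 1                                    # tracked computation
print(y_tracked.requires_grad, y_tracked.grad_fn)    # True, an AddBackward0 object

print(y.requires_grad, y.grad_fn)                    # False, None: computed under no_grad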

Calculating gradients

Let’s calculate gradients manually, then use autograd, in a very simple case: imagine that x, y, and z are tensors containing the parameters of a model and that the error e could be calculated with the equation:

$$e = 2x^4 - y^3 + 3z^2$$

Manual derivative calculation

Let’s see how we would do this manually.

First, we need the model parameters tensors:

x = torch.tensor([1., 2.])
y = torch.tensor([3., 4.])
z = torch.tensor([5., 6.])

We calculate e following the above equation:

e = 2*x**4 - y**3 + 3*z**2

The gradients of the error e w.r.t. the parameters x, y, and z are:

$$\frac{\partial e}{\partial x} = 8x^3 \qquad \frac{\partial e}{\partial y} = -3y^2 \qquad \frac{\partial e}{\partial z} = 6z$$

We can calculate them with:

gradient_x = 8*x**3
gradient_x
gradient_y = -3*y**2
gradient_y
gradient_z = 6*z
gradient_z

Automatic derivative calculation

For this method, we need to define our model parameters with requires_grad set to True:

x = torch.tensor([1., 2.], requires_grad=True)
y = torch.tensor([3., 4.], requires_grad=True)
z = torch.tensor([5., 6.], requires_grad=True)

e is calculated in the same fashion (except that here, all the computations on x, y, and z are tracked):

e = 2*x**4 - y**3 + 3*z**2

The backward propagation is done automatically. Because e is not a scalar here, backward needs a gradient argument with the same shape as e; passing a tensor of ones gives us the plain partial derivatives:

e.backward(torch.tensor([1., 1.]))

And we have our 3 partial derivatives:

print(x.grad)
print(y.grad)
print(z.grad)
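
In a real training loop, the loss is usually reduced to a single scalar before backpropagating, in which case backward needs no argument. Here is a minimal self-contained sketch (summing e is equivalent to passing a gradient of ones above):

import torch

x = torch.tensor([1., 2.], requires_grad=True)
y = torch.tensor([3., 4.], requires_grad=True)
z = torch.tensor([5., 6.], requires_grad=True)

# Reduce the error to a scalar, then backpropagate without any argument
e = (2*x**4 - y**3 + 3*z**2).sum()
e.backward()

print(x.grad)   # tensor([ 8., 64.])
print(y.grad)   # tensor([-27., -48.])
print(z.grad)   # tensor([30., 36.])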

Comparison

The result is the same, as can be tested with:

8*x**3 == x.grad
-3*y**2 == y.grad
6*z == z.grad

Of course, calculating the gradients manually here was extremely easy, but imagine how tedious and lengthy it would be to apply the chain rule by hand to all the composite functions of a neural network…
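
As a small closing illustration (a sketch of my own, not part of the example above), autograd applies the chain rule through a composite function without us ever writing it out:

import torch

x = torch.tensor([0.5, -1.0], requires_grad=True)

# A composite function: square of the sine, summed to a scalar
loss = (torch.sin(x) ** 2).sum()
loss.backward()

# Autograd's result...
print(x.grad)

# ...matches the chain rule derived by hand: d/dx sin(x)^2 = 2 sin(x) cos(x)
with torch.no_grad():
    print(2 * torch.sin(x) * torch.cos(x))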