What are the two main features of PyTorch?
- An n-dimensional Tensor that can run on GPUs
- Autograd, which enables automatic differentiation for building and training neural networks
Why PyTorch Tensor and not Numpy?
NumPy provides an n-dimensional array, which is similar to an n-dimensional tensor. However, NumPy is a general scientific computing framework: it doesn't know anything about computation graphs, deep learning, or gradients, and it cannot utilise GPUs to accelerate numerical computations.
This is why the Tensor is used here. A Tensor is conceptually identical to a NumPy array, except that it can keep track of a computational graph and gradients, and it can utilise GPUs.
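As a quick illustration of the point above, here is a minimal sketch showing that a NumPy array can be wrapped as a Tensor and moved to a GPU without changing the rest of the code (the array contents are arbitrary):

```python
import numpy as np
import torch

a = np.random.randn(3, 4)    # plain NumPy array: no graph, no gradients, CPU only
t = torch.from_numpy(a)      # zero-copy conversion to a Tensor (shares memory with a)

# The same Tensor API works on a GPU simply by moving the data:
if torch.cuda.is_available():
    t = t.to(torch.device('cuda:0'))
```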
PyTorch on a 2-layer Neural Network
1. Import dependencies and initialise input and weights
import torch

dtype = torch.float
device = torch.device('cpu')       # running on CPU
# device = torch.device('cuda:0')  # running on GPU
N = 64       # batch size
D_in = 1000  # input dimension
H = 100      # hidden dimension
D_out = 10   # output dimension
x = torch.randn(N, D_in, device=device, dtype=dtype)   # randomly generated input
y = torch.randn(N, D_out, device=device, dtype=dtype)  # randomly generated output
torch.Size([64, 1000])
torch.Size([64, 10])
w1 = torch.randn(D_in, H, device=device, dtype=dtype)   # randomly initialise weight1
w2 = torch.randn(H, D_out, device=device, dtype=dtype)  # randomly initialise weight2
torch.Size([1000, 100])
torch.Size([100, 10])
learning_rate = 1e-6
for t in range(500):
    # Forward
    h = x.mm(w1)             # matrix multiplication of x (64, 1000) and w1 (1000, 100)
    h_relu = h.clamp(min=0)  # ReLU non-linearity
    y_pred = h_relu.mm(w2)   # matrix multiplication of h_relu (64, 100) and w2 (100, 10)

    # Compute loss (sum of squared errors)
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop: manually compute gradients of loss w.r.t. w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0        # gradient of ReLU: zero where the pre-activation was negative
    grad_w1 = x.t().mm(grad_h)

    # Update weights with gradient descent
    w1 = w1 - learning_rate * grad_w1
    w2 = w2 - learning_rate * grad_w2
99 308.5810546875
199 0.8532795310020447
299 0.00400793831795454
399 0.00013242007116787136
499 2.9330951292649843e-05
In the above we had to manually implement both the forward and backward passes. As our neural network gets larger, the implementation of the backward pass becomes increasingly complex. PyTorch's autograd can compute the backward pass for us automatically. When using autograd, the forward pass defines a computational graph in which the nodes are Tensors and the edges are functions that produce output Tensors from input Tensors. Backpropagating through this graph then lets us compute gradients easily.
If x is a Tensor with requires_grad = True, then x has an attribute x.grad which stores the gradient of some scalar (typically the loss) with respect to x after a backward pass.
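A minimal sketch of what x.grad holds after a backward pass (the values here are chosen for illustration):

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   # y = x1^2 + x2^2, a scalar
y.backward()         # populates x.grad with dy/dx = 2x
print(x.grad)        # tensor([4., 6.])
```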
1. Initialise inputs and weights
Remember to set requires_grad = True for the variables you want to backprop through!
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)
2. Training (with autograd)
for t in range(500):
    # Forward pass
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())  # .item() extracts the scalar value from a one-element tensor

    # Backprop using autograd. This computes the gradient of loss w.r.t. all Tensors
    # with requires_grad = True, which in our case are w1 and w2.
    loss.backward()

    with torch.no_grad():
        # Update the weights inside torch.no_grad() so the updates are not tracked by autograd.
        # In-place subtraction instead of assigning to a new tensor.
        w1.sub_(w1.grad * learning_rate)
        w2.sub_(w2.grad * learning_rate)

        # Set the gradients to zero after the weight update
        w1.grad.zero_()
        w2.grad.zero_()
99 513.0426635742188
199 1.752084493637085
299 0.008917410857975483
399 0.00017502835544291884
499 3.062380346818827e-05
PyTorch nn module
The nn package provides high-level abstractions over raw computational graphs. It includes a set of Modules that are roughly equivalent to neural network layers. A Module takes input Tensors, produces output Tensors, and can hold internal state such as learnable parameters.
The nn package also has a set of common loss functions.
The optim package provides implementations of commonly used optimisation algorithms.
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
1. Using nn package to define different layers in a two-layer network
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
2. Using nn package to define the loss function
loss_fn = torch.nn.MSELoss(reduction = 'sum')
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr = learning_rate)
for t in range(500):
    # Forward
    y_pred = model(x)

    # Loss
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero the gradients before running backprop (gradients are accumulated by default)
    optimizer.zero_grad()

    # Backprop
    loss.backward()

    # Update weights
    optimizer.step()
99 47.96703338623047
199 1.0740092992782593
299 0.04033464938402176
399 0.0012894074898213148
499 1.719343345030211e-05
How to create your own complex model using torch.nn.Module
- Create a subclass inheriting from torch.nn.Module
- Define the forward function that takes in input Tensors and returns output Tensors
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        # Initialise all the layers of your model
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        # Here, we define our forward pass!
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred
model = TwoLayerNet(D_in, H, D_out)
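To tie things together, here is a sketch of how such a custom model could be trained, reusing the same dimensions, loss function, and the sum-of-squares setup from the sections above; the SGD optimiser and learning rate are illustrative choices, not prescribed by the text:

```python
import torch

N, D_in, H, D_out = 64, 1000, 100, 10

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        return self.linear2(self.linear1(x).clamp(min=0))

model = TwoLayerNet(D_in, H, D_out)
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # illustrative optimiser choice

losses = []
for step in range(500):
    y_pred = model(x)               # forward pass through the custom Module
    loss = criterion(y_pred, y)
    losses.append(loss.item())

    optimizer.zero_grad()           # gradients are accumulated by default
    loss.backward()
    optimizer.step()
```

Because TwoLayerNet is an nn.Module, model.parameters() automatically collects the weights of both Linear layers, so the training loop is identical to the nn.Sequential version above.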