##### What are the two main features of PyTorch?

- N-dimensional Tensor that can run on GPUs
- Autograd, which enables automatic differentiation for building and training neural networks

##### Why PyTorch Tensor and not Numpy?

NumPy provides n-dimensional arrays, which are similar to n-dimensional tensors. However, NumPy is a general scientific computing framework: it knows nothing about computation graphs, deep learning, or gradients, and it cannot utilise GPUs to accelerate numerical computations.

This is why Tensors are used here. A Tensor is conceptually identical to a NumPy array, except that it can keep track of a computational graph and gradients, and it can utilise GPUs.
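As a minimal sketch of that relationship, a Tensor and a NumPy array can be converted back and forth, and (unlike the array) the Tensor can be moved to a GPU when one is available:

```python
import numpy as np
import torch

# A NumPy array and a Tensor share the same n-dimensional layout
a = np.ones((2, 3))
t = torch.from_numpy(a)   # conversion from NumPy (shares memory, no copy)
back = t.numpy()          # and back again

# Unlike a NumPy array, a Tensor can be moved to a GPU
if torch.cuda.is_available():
    t = t.to('cuda')
```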

### PyTorch on 2-layer Neural Network

#### 1. Import dependencies and initialise input and weights

```
import torch
dtype = torch.float
device = torch.device('cpu') # running on CPU
# device = torch.device('cuda:0') # running on GPU
```

```
N = 64 # batch_size
D_in = 1000 # input dimension
H = 100 # hidden dimension
D_out = 10 # output dimension
```

```
x = torch.randn(N, D_in, device = device, dtype = dtype) # randomly generated input
y = torch.randn(N, D_out, device = device, dtype = dtype) # randomly generated output
```

```
print(x.shape)
print(y.shape)
```

```
w1 = torch.randn(D_in, H, device = device, dtype = dtype) # randomly initialise weight1
w2 = torch.randn(H, D_out, device = device, dtype = dtype) # randomly initialise weight2
```

```
print(w1.shape)
print(w2.shape)
```

`learning_rate = 1e-6`

#### 2. Training

```
for t in range(500):
    # Forward
    h = x.mm(w1)  # matrix multiplication of x (64, 1000) and w1 (1000, 100)
    h_relu = h.clamp(min=0)  # ReLU non-linearity
    y_pred = h_relu.mm(w2)  # matrix multiplication of the layer-1 output (64, 100) and w2 (100, 10)

    # Compute loss
    loss = (y_pred - y).pow(2).sum().item()  # sum of squared errors
    if t % 100 == 99:
        print(t, loss)

    # Backprop: manually compute gradients of the loss w.r.t. w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights
    w1 = w1 - learning_rate * grad_w1
    w2 = w2 - learning_rate * grad_w2
```

### AutoGrad

In the above we had to implement both the forward and backward passes manually. As a neural network grows larger, implementing the backward pass by hand becomes increasingly complex. PyTorch's autograd can compute backward passes for us automatically. When using autograd, the forward pass defines a computational graph whose nodes are Tensors and whose edges are the functions that produce output Tensors from input Tensors. Backpropagating through this graph then gives us the gradients with little effort.

If x is a Tensor with requires_grad=True, then after backprop x will have an attribute x.grad, which stores the gradient of the loss with respect to x.
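A tiny example of this attribute in action, for y = x₁² + x₂²:

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   # y = x1^2 + x2^2 = 13
y.backward()         # populates x.grad with dy/dx
print(x.grad)        # dy/dx = 2x -> tensor([4., 6.])
```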

#### 1. Initialise inputs and weights

Remember to set requires_grad=True for the variables whose gradients you need during backprop!

```
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)
```

#### 2. Training (with autograd)

```
for t in range(500):
    # Forward pass
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())  # .item() extracts the Python scalar from a single-element tensor

    # Backprop using autograd. This computes the gradient of the loss w.r.t. all
    # Tensors with requires_grad=True, which in our case is w1 and w2.
    loss.backward()

    with torch.no_grad():
        # Update the weights inside torch.no_grad() so autograd doesn't track the updates.
        # In-place subtraction instead of assigning to a new tensor:
        w1.sub_(learning_rate * w1.grad)
        w2.sub_(learning_rate * w2.grad)

        # Reset the gradients to zero after the weight update
        w1.grad.zero_()
        w2.grad.zero_()
```

### PyTorch nn module

The nn package provides higher-level abstractions over raw computational graphs. It includes a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors, computes output Tensors, and may hold internal state such as learnable parameters.

The nn package also has a set of common loss functions.
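For instance, a quick sketch of MSELoss (the loss used below) applied to two small tensors:

```python
import torch

loss_fn = torch.nn.MSELoss(reduction='sum')  # sum of squared errors
pred = torch.tensor([1.0, 2.0])
target = torch.tensor([0.0, 0.0])
loss = loss_fn(pred, target)  # 1^2 + 2^2 = 5
print(loss.item())
```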

#### optim

The optim package provides implementations of commonly used optimisation algorithms.

```
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
```

#### 1. Using the nn package to define the layers of a two-layer network

```
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
```

#### 2. Using the nn package to define the loss function

`loss_fn = torch.nn.MSELoss(reduction = 'sum')`

#### 3. Training

```
learning_rate = 1e-4
```

```
optimizer = torch.optim.Adam(model.parameters(), lr = learning_rate)
```

```
for t in range(500):
    # Forward
    y_pred = model(x)

    # Loss
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero the gradients before running backprop (gradients are accumulated by default)
    optimizer.zero_grad()

    # Backprop
    loss.backward()

    # Update weights
    optimizer.step()
```

### How to create your own complex model using torch.nn.Module

- Create a subclass inheriting from torch.nn.Module
- Define a forward function that takes input Tensors and returns output Tensors

```
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        # Initialise all the layers of your model
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        # Here, we define our forward pass!
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred
```

`model = TwoLayerNet(D_in, H, D_out)`
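Putting the pieces together, a self-contained sketch of training this custom model with the same loop pattern as before (the class definition is repeated here so the snippet runs on its own; SGD is one reasonable optimizer choice, Adam as used earlier would also work):

```python
import torch

class TwoLayerNet(torch.nn.Module):   # as defined above
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        return self.linear2(self.linear1(x).clamp(min=0))

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = TwoLayerNet(D_in, H, D_out)
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for t in range(500):
    y_pred = model(x)              # forward pass through our custom Module
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())
    optimizer.zero_grad()          # zero accumulated gradients
    loss.backward()                # backprop
    optimizer.step()               # update weights
```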