Neural net from scratch with MNIST

Kerem Dede
8 min read · Oct 21, 2023

What is the core idea of an ML model? What is going on in a model so that it makes sense out of images? I’ll walk you through my recent experimentation with the MNIST dataset and try to answer those questions.

💡 MNIST is an image dataset that includes handwritten digits such as the digit 3.

Getting the MNIST images

We’ll use fastai to download the images. Make sure to have the imports in place like so:

import torch
from fastai.vision.all import URLs, untar_data, Image
import numpy as np

URLs.MNIST gives us the location to download the resource from. untar_data downloads and extracts the data into the .fastai directory in your home path (~). Image lets us actually open an image file. I’ve prepared a get_data function like below:

# load data into arrays
def get_data(type_of_data):
    mnist = untar_data(URLs.MNIST)
    mnist_train = mnist/type_of_data

    x, y = [], []

    for current_cat_dir in mnist_train.ls():
        # the folder name is the label, e.g. ".../training/3" -> 3
        current_cat = int(current_cat_dir.__str__().split('/')[-1])

        for current_img in current_cat_dir.ls():
            # read the image and scale the pixel values to the range 0-1
            current_val = torch.tensor(np.array(Image.open(current_img))) / 255

            x.append(current_val)
            y.append(current_cat)

    x = torch.stack(x).view((len(x), -1))
    y = torch.tensor(y)

    return x, y

The reason for the get_data function is that we need the images to be in tensors so that we can process them with PyTorch. Let’s go over the main points of get_data:

  • we download the dataset and assign it to the mnist variable.
  • The type_of_data arg tells whether we want the training or the testing data.
  • current_cat contains the category (or label) of the current images. We extract that category from the file path by taking the last piece, e.g. for ~/.fastai/data/mnist_png/training/3 the label is 3. By the way, it needs to be numeric.
  • current_val has the pixel data of a single image, scaled to the range 0–1.
  • The line with stack and view converts x from a regular Python list to a PyTorch tensor and reshapes it with view from (60000, 28, 28) to (60000, 784); a quick check of these shapes follows right after this list.
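
As a quick sanity check (the variable names here are just for this check, the actual training variables come later):

train_x, train_y = get_data('training')
print(train_x.shape)   # torch.Size([60000, 784]): one flattened 28x28 image per row
print(train_y.shape)   # torch.Size([60000]): one integer label per image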

What a model is

Now that the data is ready for use, we can train our model on it. Before we jump in, let’s pause for a moment and think about what a model actually is.

Assuming you gave it a thought, here is an answer: it is a function that mostly consists of matrices. So we need to create some matrices.

These matrices serve two purposes: weights and biases. A pair of them makes up a single (linear) layer of a neural net. We multiply this single layer with the images, or more generally with the independent variables.

prediction = imageAsMatrix * weight + bias

Two things to notice here. First, the above equation on its own is a linear function. Second, because this is a matrix multiplication, we need to take care of the dimensions.

What is the problem with linear functions? Their expressiveness is limited to linear problems, so if your dataset contains non-linear relationships, the above equation won’t work. We need to make it non-linear, but how?

There are so-called activation functions such as ReLU (Rectified Linear Unit) or Sigmoid. These functions break up the linear relation between the input (e.g. an image) and the output (e.g. a prediction).

The ReLU function
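
For reference, ReLU simply replaces negative values with zero. A minimal sketch of it in PyTorch (which also ships it built in as torch.relu) could look like this:

def relu(t):
    # keep positive values as they are, clamp everything below zero to zero
    return t.clamp_min(0.)
    # relu(torch.tensor([-2., 0., 3.])) -> tensor([0., 0., 3.])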

Our new non-linear layer then looks like this:

prediction = relu ( imageAsMatrix * weight + bias )

With such a function, we can describe much more complex relationships given the correct weights and biases. Finding good parameters is a question of training. We’ll go into that in a bit.

Let’s see how we solve the issue with the matrix dimensions. As you remember, we have 60000 images and each image is 28 x 28. We then reshaped each image into a one-dimensional vector (a rank-1 tensor). Now, these images are in x, which has the dimensions (60000, 784). Therefore, we need the weights to be (784, 10). The height is 784 because it has to match the width of x. The second dimension is 10 because there are 10 classes of images and the model predicts a score for each class. The result of the multiplication has the dimensions (60000, 10).

Width of x and height of weight must match

What about the bias? It needs to match the second dimension of the weights, 10. The bias is added to each of the 60000 rows, one per image.

If you were careful with the numbers, you noticed that we have only one set of 10 numbers in the bias, but 60000 sets of 10 numbers in the result of x @ weights. Why does it work? There is something called broadcasting in PyTorch that automatically matches the dimensions of the smaller tensor to the bigger one’s. When PyTorch performs the addition, it treats the bias as if its dimensions were (60000, 10).

After the images go through this layer, we end up with a matrix of (60000, 10). If this were the last layer of our neural net, the dimensions of this resulting matrix would have the following meaning: each of the 60000 rows stands for one image, and the 10 columns hold the predicted scores for the 10 categories. The predicted category is simply the index of the largest value in that row.
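
To make the shapes concrete, here is a small illustration with random tensors (the names are just for this demo, no real training happens):

import torch

imgs = torch.rand(60000, 784)    # 60000 flattened images
w = torch.rand(784, 10)          # width of imgs must match height of w
bias = torch.rand(10)            # one number per class

out = imgs @ w + bias            # (60000, 784) @ (784, 10) -> (60000, 10); bias broadcasts over the rows
print(out.shape)                 # torch.Size([60000, 10])
print(out.argmax(dim=1).shape)   # torch.Size([60000]): one predicted category per image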

Training the model — background

What does it mean to train a model? As we have just seen, a model contains a bunch of numbers. Let’s call all the weights and biases in a model params, short for parameters. We start with random numbers that are usually normally distributed around zero with a standard deviation of one (e.g. 0.3, -0.4, 0.2).

In the first epoch, the model’s predictive capabilities are basically zero because its output is only random. How do we move from random numbers to meaningful ones? With gradients!

What do we do with gradients? We use them to adjust our params. Because a gradient tells us the slope of the loss with respect to a param (whether nudging that param up or down increases or decreases the loss), we know what to do to that param so that it becomes more useful for our objective, which is to classify images into 10 categories.

How do we calculate the gradients? This is the easy part thanks to PyTorch. You just call backward and you have them. (You’ll see exactly how in the next section.)
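
If you’d like a tiny standalone illustration of what backward does (nothing to do with MNIST yet), this is all it takes:

import torch

w = torch.tensor(2.0, requires_grad=True)
loss = (w * 3 - 12) ** 2   # a toy "loss" that is smallest when w == 4
loss.backward()            # PyTorch computes dloss/dw and stores it in w.grad
print(w.grad)              # tensor(-36.): a negative slope, so increasing w lowers the loss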

We also have the loss, which tells us how badly our model predicts.

The last important piece is the learning rate. The gradients tell us in which direction to nudge the params, but the learning rate decides how much of that recommendation we apply.

So, to sum it up, we multiply the images with our randomly initialized params. Then, we calculate the loss. Based on the loss, we let PyTorch tell us the gradients. Combining the gradients with the learning rate, we adjust the params so that we get a better prediction next time. That is it. With enough iterations, we approach the ideal params.

Training the model — example

Let’s get our independent and dependent variables:

x, y = get_data('training')

x here is the independent variable that contains the images. y is the dependent variable, which holds the labels for the images.

Then come the params, which are the weights and the bias for our single-layer neural net.

weights = torch.rand((28 * 28, 10), requires_grad=True)

b = torch.rand(10, requires_grad=True)

There are two interesting points regarding this initialization. The first is the dimensions: remember that 28 * 28 corresponds to the dimensions of the images. The second is requires_grad. It tells PyTorch to track those variables because later it will need to calculate gradients for them.
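
One small aside: torch.rand draws from a uniform distribution on [0, 1). If you prefer the zero-centered normal initialization mentioned earlier, torch.randn is the drop-in alternative; either works for this small example:

weights = torch.randn((28 * 28, 10), requires_grad=True)  # normally distributed around zero
b = torch.randn(10, requires_grad=True)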

lr = 1e-3
ce_loss = torch.nn.CrossEntropyLoss()

lr is our learning rate. If you don’t know what it should be, 0.001 is usually a good starting point.

ce_loss is the cross-entropy loss, which works well for classification problems.
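
To see what CrossEntropyLoss expects, here is a tiny made-up example: raw scores (logits) of shape (batch, num_classes) and integer class labels of shape (batch,):

dummy_preds = torch.tensor([[2.0, 0.1, 0.3],
                            [0.2, 3.0, 0.1]])   # 2 samples, 3 classes, raw scores
dummy_labels = torch.tensor([0, 1])             # the correct class index for each sample
print(ce_loss(dummy_preds, dummy_labels))       # a small loss, because the right classes already score highest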

Now, it is time for the rubber to hit the road. Let’s calculate our first prediction by multiplying the weights with the images and adding the biases.

preds = x @ weights + b

We have our initial predictions and let’s get the loss.

loss = ce_loss(preds, y)

Let PyTorch calculate the gradients for us by calling backward like so:

loss.backward()

Now, adjust the params based on the gradients and the learning rate:

with torch.no_grad():
    weights.data -= weights.grad * lr
    b.data -= b.grad * lr

Set the gradients to zero before the next epoch. PyTorch accumulates gradients, and because we have already adjusted our params, we don’t want the current gradients to leak into future updates. We will calculate fresh gradients in each epoch.

weights.grad.zero_()
b.grad.zero_()

Because we need to iterate those exact steps multiple times, it’s more convenient to put them into a loop, like so:

num_of_epoch = 100
for i in range(num_of_epoch):
    preds = x @ weights + b

    loss = ce_loss(preds, y)
    loss.backward()
    print(f'{i}. loss: {loss:.2f}')

    weights.data -= weights.grad * lr
    b.data -= b.grad * lr

    weights.grad.zero_()
    b.grad.zero_()

Once we run this training loop, we see the loss getting smaller at each iteration. That means we are training our model successfully and adjusting the trainable parameters (weights and b) towards better predictions.

Alright, we are training our model successfully, but how do we know if it’s actually getting better? The loss is mainly meant for the machine, not for humans. It’s only a proxy that tells the computer whether it’s doing a better or worse job. What we actually care about are metrics. Let’s see how we can calculate one.

Calculating a metric: accuracy

At the end of the day, we want our model to be accurate. It should classify images correctly. Whether the loss is a little down or up, as long as the accuracy satisfies our requirements, we are happy.

Another important difference between the loss and a metric is that while the loss is calculated on the training data, we measure the metric on validation data. We don’t really care if the model is an oracle-like predictor on the training data. The real test is whether it can predict on data it has never seen during training. Only then can we conclude that it actually learned the underlying patterns in our data rather than just memorizing the training data.

Import the validation data like so:

valid_x, valid_y = get_data('testing')

Let’s define our function that calculates the accuracy for us in each epoch:

@torch.no_grad()
def get_acc(weights, b):
    # predicted class = index of the highest score in each row
    val_preds = (valid_x @ weights + b).argmax(dim=-1)
    # a prediction counts as correct when it matches the label
    classifications = (val_preds - valid_y).abs() < 0.5
    acc = classifications.float().sum() / len(valid_y)
    return acc

Here, we get the predictions first so that we can compare them with the correct labels. classifications tells whether each prediction matches its label. Lastly, we convert the booleans to floats and take the mean, which is the fraction of correct predictions.
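
You could hook get_acc into the training loop to watch the accuracy climb, for example by extending the loop from the previous section (this sketch reuses x, y, weights, b, lr, ce_loss and num_of_epoch defined above):

for i in range(num_of_epoch):
    preds = x @ weights + b
    loss = ce_loss(preds, y)
    loss.backward()

    weights.data -= weights.grad * lr
    b.data -= b.grad * lr
    weights.grad.zero_()
    b.grad.zero_()

    # report the loss on the training data and the accuracy on the validation data
    print(f'{i}. loss: {loss:.2f}, accuracy: {get_acc(weights, b):.2%}')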

I hope this clarified the inner workings of an ML model. I’d appreciate your feedback; leave it here or reach me on LinkedIn.

Have a good day,

Kerem Dede
