Collaborative Filtering from scratch

Kerem Dede
6 min read · Nov 6, 2023

Using simple lists, PyTorch’s Embeddings, and fastai

Collaborative filtering is a mechanism that predicts how likely a particular user is to prefer a particular item. An item can be a movie, an e-commerce product, a song, etc.

We need to know the characteristics of users and items so that a meaningful prediction is feasible. Most of the time we do not have complete information about either the user or the item. With collaborative filtering, we don’t have to: the model learns those missing pieces on its own.

To store the characteristics of items and users, we’ll use an array for each user and item. These arrays are also called latent factors.

On one side, we have a dataset containing user, item, and rating columns; on the other side, we have the latent factors. The behaviour we want is that the dot product of a user’s and an item’s latent factors comes out larger when the user is likely to prefer the item. To compute that dot product, we need to fetch the latent factors using the indexes from the user-item-rating dataset. Usually you would just index into the latent-factor matrices, but our models are built from differentiable operations such as matrix products, and we want the lookup expressed in that same form so that gradients can flow into the latent factors during training. This is where the concept of an embedding comes into play: an embedding is exactly this lookup, representable as a matrix product with a one-hot encoded index.
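To make that concrete, here is a tiny, hypothetical sketch (the numbers are made up, not from MovieLens): multiplying a one-hot row vector with a latent-factor matrix simply picks out one row, and the dot product of a user’s and a movie’s factors gives the preference score.

import torch

# Three hypothetical users and two hypothetical movies, 2 latent factors each
user_factors = torch.tensor([[0.9, 0.1],
                             [0.2, 0.8],
                             [0.5, 0.5]])
movie_factors = torch.tensor([[0.8, 0.0],
                              [0.1, 0.9]])

one_hot_user_1 = torch.tensor([0., 1., 0.])   # selects the user at index 1
picked = one_hot_user_1 @ user_factors        # same result as user_factors[1]
score = (picked * movie_factors[1]).sum()     # dot product -> preference score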

The best way to understand something is to build it yourself. So, let’s create the embeddings from scratch. Once you understand how they are built, you can use the optimized version from PyTorch with confidence.

The dataset you’ll be working with is the MovieLens 100k dataset. It contains users’ ratings of movies on a scale from 1 to 5.

Let’s download the dataset.

from fastai.vision.all import untar_data, URLs
import torch
import pandas as pd

movie_lens_data = untar_data(URLs.ML_100k)
movie_lens = pd.read_csv(movie_lens_data/'u.data', delimiter='\t', names=['user', 'movie', 'rating', 'time'])

movie_lens_items = pd.read_csv(movie_lens_data/'u.item', usecols=(0, 1), names=('movie', 'title'), encoding='latin-1', delimiter='|')
users = movie_lens['user']
movies = movie_lens['movie']

First, create two one-hot encoded (OHE) matrices for looking up the users’ and movies’ latent factors. These OHEs are just another representation of the given training dataset. Nothing crazy.

users_ohe = torch.from_numpy(pd.get_dummies(users).to_numpy()).float()
movies_ohe = torch.from_numpy(pd.get_dummies(movies).to_numpy()).float()

Similarly, define the ratings:

ratings = torch.from_numpy(movie_lens['rating'].to_numpy()).float()

Now is a good time to create the latent factors themselves.

num_of_users = users.nunique()
num_of_movies = movies.nunique()

user_latent_factors = torch.rand((num_of_users, 50),requires_grad=True)
movie_latent_factors = torch.rand((num_of_movies, 50),requires_grad=True)

These latent factors are the parameters that need training because they will be representing the characteristics of each user/item.

To look up the latent factors, you just multiply the one-hot matrices by the latent-factor matrices, like so:

looked_up_u_latent_factors = users_ohe @ user_latent_factors    
looked_up_m_latent_factors = movies_ohe @ movie_latent_factors
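
If you want to convince yourself that this one-hot multiplication is just a row lookup, here is a quick sanity check. It assumes the user and movie ids are contiguous and 1-based (which they are in ML-100k), so id i ends up in row i - 1:

# Sanity check: the one-hot matmul matches direct indexing (assumes contiguous, 1-based ids)
user_idx = torch.from_numpy(users.to_numpy() - 1)
movie_idx = torch.from_numpy(movies.to_numpy() - 1)

assert torch.allclose(looked_up_u_latent_factors, user_latent_factors[user_idx])
assert torch.allclose(looked_up_m_latent_factors, movie_latent_factors[movie_idx])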

Remember that the prediction is the result of that multiplication: the element-wise product of a user’s and a movie’s latent factors, summed up (their dot product). The loss is then the difference between the actual rating and this prediction for each data point. Let’s calculate the mean squared error (MSE) and update the latent factors (a.k.a. the weights) accordingly:

users_dot_movies = looked_up_u_latent_factors * looked_up_m_latent_factors
users_dot_movies_sum = users_dot_movies.sum(dim=-1)

loss = (users_dot_movies_sum - ratings).square().mean()
loss.backward()
print(f'loss: {loss:.2f}')

lr = 1.5  # learning rate, reused in the full training loop below
with torch.no_grad():
    user_latent_factors.data = torch.sub(user_latent_factors.data, user_latent_factors.grad * lr)
    movie_latent_factors.data = torch.sub(movie_latent_factors.data, movie_latent_factors.grad * lr)

    user_latent_factors.grad.zero_()
    movie_latent_factors.grad.zero_()

If you put these steps in a training loop and let it train for a number of epochs, you’ll see that it actually works: the loss keeps decreasing.

lr = 1.5
num_of_epochs = 50
for i in range(num_of_epochs):
    looked_up_u_latent_factors = users_ohe @ user_latent_factors
    looked_up_m_latent_factors = movies_ohe @ movie_latent_factors

    users_dot_movies = looked_up_u_latent_factors * looked_up_m_latent_factors
    users_dot_movies_sum = users_dot_movies.sum(dim=-1)

    loss = (users_dot_movies_sum - ratings).square().mean()
    loss.backward()
    print(f'{i}. loss:{loss:.2f}')

    with torch.no_grad():
        user_latent_factors.data = torch.sub(user_latent_factors.data, user_latent_factors.grad * lr)
        movie_latent_factors.data = torch.sub(movie_latent_factors.data, movie_latent_factors.grad * lr)

        user_latent_factors.grad.zero_()
        movie_latent_factors.grad.zero_()
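
With the trained latent factors in hand, you can score any user-movie pair yourself. Here is a minimal sketch, again assuming contiguous, 1-based ids so that id i lives in row i - 1:

with torch.no_grad():
    u = users.iloc[0] - 1    # row of the first rating's user
    m = movies.iloc[0] - 1   # row of the first rating's movie
    predicted = (user_latent_factors[u] * movie_latent_factors[m]).sum()
    print(f'predicted: {predicted:.2f} - actual: {ratings[0]:.0f}')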

Now that you understand better what embeddings are made of and how they are used, let’s go one abstraction level higher and use PyTorch’s optimized Embedding class. And, while we are at it, let’s wrap our model into a proper PyTorch module as well.

PyTorch Embedding and Model

Sidenote: There will be lots of parameters to train in this model. When I was training it, it took around 40 GB of memory, so I recommend starting with a portion (~10%) of the original 100k dataset.

Selecting a portion of the dataset

One caveat of only keeping a subset of the rows is that not all distinct users and movies will appear in that 10%, so record the original counts first (the embedding tables still need a slot for every possible id), like so:

original_num_of_users = movie_lens['user'].nunique()
original_num_of_movies = movie_lens['movie'].nunique()

Reducing the original dataset to 10% of its size.

movie_lens = movie_lens[:int(len(movie_lens) * 0.1)]
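
Taking the first 10% of the rows is the simplest option. If you prefer a subset whose users and movies are spread more evenly, a random sample works too; this sketch would replace the line above (not what the original code does):

# Randomly sample 10% of the ratings instead of taking the first rows
movie_lens = movie_lens.sample(frac=0.1, random_state=42).reset_index(drop=True)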

Preparing the data

Let’s do it in a “better” way this time by splitting the data into training and validation sets. As I mentioned in the previous blog post about MNIST image classification, you care more about the model’s behaviour on new data than on the training data. You measure that behaviour with a metric of your choice on the validation set.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(movie_lens[['user', 'movie']].to_numpy(), movie_lens['rating'].to_numpy())

users_ohe = torch.from_numpy(pd.get_dummies(x_train[:, 0]).to_numpy()).float()
movies_ohe = torch.from_numpy(pd.get_dummies(x_train[:, 1]).to_numpy()).float()
x_train = torch.tensor(x_train)
y_train = torch.tensor(y_train)

Define the latent factors and their biases within the model’s initialization function:

embedding_size = 50
class ColabModel(torch.nn.Module):
    def __init__(self, user_size, movie_size):
        super(ColabModel, self).__init__()
        # Latent factors and biases: 10 values per user and per movie
        self.u_latents = torch.nn.Embedding(user_size, 10)
        self.m_latents = torch.nn.Embedding(movie_size, 10)

        self.u_bias = torch.nn.Embedding(user_size, 10)
        self.m_bias = torch.nn.Embedding(movie_size, 10)

        # The forward pass concatenates 4 embeddings of size 10 -> 40 inputs
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(40, embedding_size),
            torch.nn.ReLU(),
            torch.nn.Linear(embedding_size, 1)
        )

You don’t need to set requires_grad for the embeddings, as their weights are already registered as PyTorch Parameters.

Let’s write the forward function of the model.

    def forward(self, x):
        user_embeds = self.u_latents(x[:, 0])
        movie_embeds = self.m_latents(x[:, 1])

        return self.layers(torch.cat((user_embeds, movie_embeds, self.u_bias(x[:, 0]), self.m_bias(x[:, 1])), 1))

Here, x represents the two-column user-movie data. Therefore, x[:, 0] selects all rows of the user column and, similarly, x[:, 1] selects all rows of the movie column. For each row, you look up the user and movie latent factors together with their bias embeddings, concatenate them, and feed the result through the linear layers, which produce one predicted rating per row.
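
A tiny, self-contained illustration of that column slicing, with made-up ids:

x = torch.tensor([[1, 20],
                  [2, 31],
                  [3, 42]])   # hypothetical (user, movie) id pairs
print(x[:, 0])   # tensor([ 1,  2,  3])  -> user column
print(x[:, 1])   # tensor([20, 31, 42]) -> movie column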

Let’s define a not-so-fast-but-also-not-so-slow learning rate (1e-3), instantiate the model, and create an Adam optimizer with a bit of weight decay.

lr = 1e-3
# +1 because the raw MovieLens ids are 1-based, so the largest id equals the count
colab_model = ColabModel(original_num_of_users + 1, original_num_of_movies + 1)
optim = torch.optim.Adam(colab_model.parameters(), lr, weight_decay=1e-3)
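
If you want to verify the claim above that the embedding weights really are trainable parameters (and will therefore be updated by the optimizer), you can list them; this is just an inspection step, not part of training:

for name, param in colab_model.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)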

Putting everything together in a training loop:

num_of_epochs = 50
for i in range(num_of_epochs):
    # The model outputs shape (N, 1); flatten to match y_train's shape (N,)
    users_dot_movies_sum_activated = colab_model(x_train).squeeze(1)

    loss = (users_dot_movies_sum_activated - y_train).square().mean()
    loss.backward()

    optim.step()

    # Evaluate on the held-out set without tracking gradients
    with torch.no_grad():
        y_test_preds = colab_model(torch.tensor(x_test)).squeeze(1)
        test_loss = (torch.tensor(y_test) - y_test_preds).square().mean()

    print(f'{i}. loss:{loss:.2f} - test loss {test_loss:.2f}')

    colab_model.zero_grad()

Once you start the training, you’ll notice that the performance is a lot better than that of the model with the hand-built one-hot-encoding lookups.
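
To see what the trained model actually predicts, you can run a few test rows through it, for example (a quick sketch, not from the original training code):

with torch.no_grad():
    sample = torch.tensor(x_test[:5])
    print(colab_model(sample).squeeze(1))   # predicted ratings
    print(y_test[:5])                       # actual ratings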

Using Fastai with custom model

Sidenote: If you are like me and running your notebook on a CPU, don’t forget to set use_cuda = False before you use fastai.

from fastai.torch_core import defaults

defaults.use_cuda = False

Let’s load our data through CollabDataLoaders, because a Learner expects to read its data from a DataLoaders object.

from fastai.vision.all import untar_data, URLs, Learner
from fastai.collab import CollabDataLoaders, MSELossFlat
from fastai.torch_core import defaults

defaults.use_cuda = False

movie_lens_data = untar_data(URLs.ML_100k)
movie_lens = pd.read_csv(movie_lens_data/'u.data', delimiter='\t', names=['user', 'movie', 'rating', 'time'])

dls = CollabDataLoaders.from_df(movie_lens)
dls.show_batch()
A sample output of `show_batch`

It’s time to define the model. Let’s do something different here: instead of producing the prediction with a plain dot product of the latent factors, as in the from-scratch version, let a small neural net map the concatenated embeddings to a rating. In a similar fashion to above, define the model like so:

in_layer_size = 50
embedding_size = 50

class ColabModel(torch.nn.Module):
    def __init__(self, user_size, movie_size):
        super(ColabModel, self).__init__()
        self.u_latents = torch.nn.Embedding(user_size, embedding_size)
        self.m_latents = torch.nn.Embedding(movie_size, embedding_size)

        self.u_bias = torch.nn.Embedding(user_size, embedding_size)
        self.m_bias = torch.nn.Embedding(movie_size, embedding_size)

        self.layers = torch.nn.Sequential(
            torch.nn.Linear(embedding_size * 4, in_layer_size),
            torch.nn.ReLU(),
            torch.nn.Linear(in_layer_size, 1)
        )

    def forward(self, x):
        user_embeds = self.u_latents(x[:, 0])
        movie_embeds = self.m_latents(x[:, 1])

        return self.layers(torch.cat((user_embeds, movie_embeds, self.u_bias(x[:, 0]), self.m_bias(x[:, 1])), 1))

Notice that there is no multiplication in the forward method; the embeddings are only concatenated and fed into the linear layers. Amazing that it actually works, isn’t it?
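
If you are curious what the classic dot-product-plus-bias formulation would look like instead, here is a rough sketch (not the model used in this post; it assumes the bias embeddings were created with size 1 rather than embedding_size):

    # Alternative (classic) formulation: dot product of the latent factors plus biases.
    # Assumes self.u_bias and self.m_bias are Embedding(..., 1).
    def forward(self, x):
        user_embeds = self.u_latents(x[:, 0])
        movie_embeds = self.m_latents(x[:, 1])
        dot = (user_embeds * movie_embeds).sum(dim=1, keepdim=True)
        return dot + self.u_bias(x[:, 0]) + self.m_bias(x[:, 1])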

Now, let’s instantiate the model, the learner and make the model learn.

colab_model = ColabModel(original_num_of_users+1, original_num_of_movies+1)
learner = Learner(dls, colab_model, loss_func=MSELossFlat())
learner.fit_one_cycle(10, 5e-3, wd=0.5)
The training results

You can check the actual predictions your model is doing, like so:

learner.get_preds()
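
get_preds returns the predictions together with the targets for the validation set, so you can compare them side by side:

preds, targets = learner.get_preds()
print(preds[:5], targets[:5])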

Using the built-in fastai collab_learner

Preparing the dataloaders is the same, and there is no need to define a custom model. You can just use collab_learner.

from fastai.collab import collab_learner

learner = collab_learner(dls, use_nn=True, layers=[100, 50], y_range=(0.5, 5.5))
learner.fit_one_cycle(5, 1e-3, wd=0.1)

We told collab_learner to use a neural net (use_nn=True) with two hidden layers of 100 and 50 units (layers=[100, 50]).
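
The y_range=(0.5, 5.5) argument squashes the raw network output into the rating range with a scaled sigmoid. Conceptually it behaves like this minimal sketch (my own simplified version, not fastai’s exact code):

def scale_to_range(x, lo=0.5, hi=5.5):
    # Sigmoid squashes to (0, 1); rescale to (lo, hi)
    return torch.sigmoid(x) * (hi - lo) + lo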

Similarly, you can also use it without a neural net like so:

learner = collab_learner(dls, y_range=(1, 5))
learner.fit_one_cycle(5, 5e-3)

I hope you learned something new from this post. If you have any feedback, leave it here or reach me on LinkedIn.
