Online streaming platforms like Netflix hold large catalogs of movies. A recommendation system that suggests relevant movies to users based on their historical interactions can improve customer satisfaction and, in turn, the platform's revenue. The techniques used here apply to any kind of item for which a recommendation system is appropriate.
This project features three kinds of recommendation systems:
I used the ratings dataset, which can be downloaded as a .csv file here.
The ratings dataset contains the following attributes:
Installing surprise library
# Install surprise library
!pip install surprise
# Used to ignore the warning given as output of the code
import warnings
warnings.filterwarnings('ignore')
# Basic libraries of python for numeric and dataframe computations
import numpy as np
import pandas as pd
# Basic library for data visualization
import matplotlib.pyplot as plt
# Slightly advanced library for data visualization
import seaborn as sns
# A dictionary output that does not raise a key error
from collections import defaultdict
# Performance metrics in surprise
from surprise import accuracy
# Class is used to parse a file containing ratings, data should be in structure - user ; item ; rating
from surprise.reader import Reader
# Class for loading datasets
from surprise.dataset import Dataset
# For tuning model hyper-parameters
from surprise.model_selection import GridSearchCV
# For splitting the rating data in train and test dataset
from surprise.model_selection import train_test_split
# For implementing similarity based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic
# For implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD
# For implementing cross validation
from surprise.model_selection import KFold
# Import the dataset
rating = pd.read_csv('./ratings.csv')
Let's check the info of the data
rating.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 userId 100004 non-null int64
1 movieId 100004 non-null int64
2 rating 100004 non-null float64
3 timestamp 100004 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
# Dropping timestamp column
rating = rating.drop(['timestamp'], axis=1)
Let's explore the dataset and answer some basic data-related questions:
What do the top 5 rows of the data set look like?
| | userId | movieId | rating |
---|---|---|---|
0 | 1 | 31 | 2.5 |
1 | 1 | 1029 | 3.0 |
2 | 1 | 1061 | 3.0 |
3 | 1 | 1129 | 2.0 |
4 | 1 | 1172 | 4.0 |
plt.figure(figsize = (12, 4))
sns.countplot(x = "rating", data = rating)
plt.tick_params(labelsize = 10)
plt.title("Distribution of Ratings ", fontsize = 10)
plt.xlabel("Ratings", fontsize = 10)
plt.ylabel("Number of Ratings", fontsize = 10)
plt.show()
Rating '4.0' has the highest count of ratings (>30k). Rating '3.0' is second with 20K+ and Rating '5.0' is third in count of ratings with around 15K.
The ratings are biased towards these 3 numbers significantly more than others.
# Finding number of unique users
rating['userId'].nunique()
671
There are 671 unique users in the data set.
# Finding number of unique movies
rating['movieId'].nunique()
9066
There are 9066 unique movies in the data set.
rating.groupby(['userId', 'movieId']).count()
userId | movieId | rating |
---|---|---|
1 | 31 | 1 |
1 | 1029 | 1 |
1 | 1061 | 1 |
1 | 1129 | 1 |
1 | 1172 | 1 |
... | ... | ... |
671 | 6268 | 1 |
671 | 6269 | 1 |
671 | 6365 | 1 |
671 | 6385 | 1 |
671 | 6565 | 1 |
100004 rows × 1 columns
rating.groupby(['userId', 'movieId']).count()['rating'].sum()
100004
The grouped frame has 100004 rows, one per (userId, movieId) pair, and the counts sum to 100004, the total number of ratings. Every pair therefore appears exactly once: there is only one interaction between any given user and movie.
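The uniqueness check can also be written out directly. The following is a minimal sketch using only the standard library, with a small made-up list of pairs rather than the real dataset; the variable names are illustrative assumptions.

```python
from collections import Counter

# Hypothetical (userId, movieId) pairs; the last one repeats the first.
pairs = [(1, 31), (1, 1029), (2, 31), (1, 31)]

pair_counts = Counter(pairs)

# Pairs that occur more than once, i.e. repeated interactions
duplicates = {pair: n for pair, n in pair_counts.items() if n > 1}

# Summing the per-pair counts always gives back the number of rows,
# so the informative check is whether the number of distinct pairs matches it.
total_rows = sum(pair_counts.values())
all_unique = len(pair_counts) == len(pairs)

print(duplicates, total_rows, all_unique)
```

In this toy data the check fails because one pair is repeated; in the ratings data the number of groups equals the number of rows, which is what rules out repeats.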
rating['movieId'].value_counts()
356 341
296 324
318 311
593 304
260 291
...
98604 1
103659 1
104419 1
115927 1
6425 1
Name: movieId, Length: 9066, dtype: int64
The movie with the ID 356 is the most interacted-with movie in the dataset.
# Plotting distributions of ratings for 341 interactions with movieid 356
plt.figure(figsize=(7,7))
rating[rating['movieId'] == 356]['rating'].value_counts().plot(kind='bar')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()
This movie appears to be popular in a positive way, as a relatively high proportion of its ratings are 4.0 or 5.0 relative to the average of the whole data set.
rating['userId'].value_counts()
547 2391
564 1868
624 1735
15 1700
73 1610
...
296 20
289 20
249 20
221 20
1 20
Name: userId, Length: 671, dtype: int64
The user with the ID 547 interacted the most with movies in the dataset.
# Finding user-movie interactions distribution
count_interactions = rating.groupby('userId').count()['movieId']
count_interactions
userId
1 20
2 76
3 51
4 204
5 100
...
667 68
668 20
669 37
670 31
671 115
Name: movieId, Length: 671, dtype: int64
# Plotting user-movie interactions distribution
plt.figure(figsize=(15,7))
sns.histplot(count_interactions)
plt.xlabel('Number of Interactions by Users')
plt.show()
The distribution is highly skewed right. Very few users interacted with more than 50 movies.
Rank-based recommendation systems provide recommendations based on the most popular items. This kind of recommendation system is useful when we have cold start problems. Cold start refers to the situation where a new user enters the system and the model cannot recommend movies to them, because the user has no historical interactions in the dataset. In those cases, we can use a rank-based recommendation system to recommend movies to the new user.
To build the rank-based recommendation system, we take the average of all the ratings provided to each movie and then rank the movies by their average rating.
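As a rough illustration of how a rank-based list can serve as the cold-start fallback, here is a minimal sketch. The function `recommend`, the lookup tables, and all ids are hypothetical and only stand in for whatever personalized model and popularity ranking are available.

```python
def recommend(user_id, known_users, personalized_recs, popular_movies, n):
    """Return n movie ids: personalized if the user has history, else the popular list."""
    if user_id in known_users:
        return personalized_recs[user_id][:n]
    # Cold start: no history for this user, fall back to the rank-based list
    return popular_movies[:n]

# Hypothetical data for illustration only
known_users = {1, 2}
personalized_recs = {1: [10, 20, 30], 2: [40, 50, 60]}
popular_movies = [858, 318, 1221]

existing_user_recs = recommend(1, known_users, personalized_recs, popular_movies, 2)
new_user_recs = recommend(99, known_users, personalized_recs, popular_movies, 2)
print(existing_user_recs, new_user_recs)
```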
# Calculating average ratings
average_rating = rating.groupby('movieId').mean()['rating']
# Calculating the count of ratings
count_rating = rating.groupby('movieId').count()['rating']
# Making a dataframe with the count and average of ratings
final_rating = pd.DataFrame({'avg_rating':average_rating, 'rating_count':count_rating})
final_rating.head()
movieId | avg_rating | rating_count |
---|---|---|
1 | 3.872470 | 247 |
2 | 3.401869 | 107 |
3 | 3.161017 | 59 |
4 | 2.384615 | 13 |
5 | 3.267857 | 56 |
Now, let's create a function to find the top n movies for a recommendation based on the average ratings of movies. We can also add a threshold for a minimum number of interactions for a movie to be considered for recommendation.
def top_n_movies(data, n, min_interaction=100):
    # Finding movies with minimum number of interactions
    recommendations = data[data['rating_count'] >= min_interaction]
    # Sorting values w.r.t. average rating
    recommendations = recommendations.sort_values(by='avg_rating', ascending=False)
    return recommendations.index[:n]
We can use this function with different values of n and min_interaction to get movies to recommend:
list(top_n_movies(final_rating, 5, 50))
[858, 318, 969, 913, 1221]
list(top_n_movies(final_rating, 5, 100))
[858, 318, 1221, 50, 527]
list(top_n_movies(final_rating, 5, 200))
[858, 318, 50, 527, 608]
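The reason for the interaction threshold is easiest to see on a toy example: without it, a movie with a single perfect rating outranks well-established movies. The sketch below uses only the standard library and made-up ratings; `top_n_by_average` is a hypothetical stand-in for the pandas function above, not a replacement for it.

```python
from statistics import mean

# Hypothetical ratings per movie id; movie 99 has a single 5.0 rating.
ratings_by_movie = {
    10: [4.0, 4.5, 5.0, 4.0],
    20: [3.0, 3.5, 4.0],
    99: [5.0],
}

def top_n_by_average(ratings, n, min_interaction):
    """Rank movie ids by mean rating, keeping only movies with enough ratings."""
    eligible = {m: mean(r) for m, r in ratings.items() if len(r) >= min_interaction}
    return sorted(eligible, key=eligible.get, reverse=True)[:n]

no_threshold = top_n_by_average(ratings_by_movie, 2, min_interaction=1)
with_threshold = top_n_by_average(ratings_by_movie, 2, min_interaction=3)
print(no_threshold, with_threshold)
```

With `min_interaction=1` the single-rating movie 99 comes first; raising the threshold to 3 removes it from the ranking.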
Now let's assume we've got some additional data to work with. We can utilize Collaborative Filtering Based Recommendation Systems to better understand the needs of the user and thereby improve UX.
In this type of recommendation system, we do not need any information about the users or items. We only need user item interaction data to build a collaborative recommendation system. For example -
Types of Collaborative Filtering
Similarity/Neighborhood based
User-User Similarity Based
Item-Item similarity based
Model based
We start with a similarity-based recommendation system that uses cosine similarity and KNN to find the users that are the nearest neighbors of a given user. We use the surprise library to build this and the remaining models; the necessary classes and functions from this library are imported at the top. We'll load the rating dataset, which is a pandas DataFrame, into a different format called surprise.dataset.DatasetAutoFolds, which is required by this library. To do this, we will be using the classes Reader and Dataset. Finally, we'll split the data into train and test sets.
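To make the similarity step concrete, here is a minimal sketch of cosine similarity between two users, computed over the movies both have rated. It uses only the standard library; the function name and the three example users are assumptions for illustration, and the library's own implementation may differ in details such as minimum-support handling.

```python
from math import sqrt

def cosine_similarity(ratings_u, ratings_v):
    """Cosine similarity between two users over the movies both have rated.

    Each argument maps movieId -> rating. Returns 0.0 when there is no overlap.
    """
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0
    dot = sum(ratings_u[m] * ratings_v[m] for m in common)
    norm_u = sqrt(sum(ratings_u[m] ** 2 for m in common))
    norm_v = sqrt(sum(ratings_v[m] ** 2 for m in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical users
user_a = {1: 4.0, 2: 3.0, 3: 5.0}
user_b = {1: 4.0, 2: 3.0, 4: 1.0}
user_c = {5: 2.0}

sim_ab = cosine_similarity(user_a, user_b)  # identical on the two shared movies
sim_ac = cosine_similarity(user_a, user_c)  # no shared movies
print(sim_ab, sim_ac)
```

Users a and b agree exactly on the two movies they share, so their similarity is 1 even though their other ratings differ; this is one reason the number of common items matters when interpreting similarities.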
# Instantiating Reader scale with expected rating scale
reader = Reader(rating_scale=(0, 5))
# Loading the rating dataset
data = Dataset.load_from_df(rating[['userId', 'movieId', 'rating']], reader)
# Splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
sim_options = {'name': 'cosine',
               'user_based': True}
# Defining Nearest neighbour algorithm
algo_knn_user = KNNBasic(sim_options=sim_options,verbose=False)
# Train the algorithm on the trainset or fitting the model on train dataset
algo_knn_user.fit(trainset)
# Predict ratings for the testset
predictions = algo_knn_user.test(testset)
# Then compute RMSE
accuracy.rmse(predictions)
RMSE: 0.9925
0.9924509041520163
The RMSE for the baseline system is 0.9925.
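For reference, RMSE is the square root of the mean squared difference between actual and estimated ratings, and MAE (used later in the grid search) is the mean absolute difference. The sketch below computes both by hand on a few made-up (actual, estimated) pairs; it is only an illustration of the definitions, not the library's code.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (actual, estimated) rating pairs."""
    return sqrt(sum((actual - est) ** 2 for actual, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (actual, estimated) rating pairs."""
    return sum(abs(actual - est) for actual, est in pairs) / len(pairs)

# Hypothetical (actual, estimated) ratings
pairs = [(4.0, 3.0), (3.0, 3.0), (5.0, 3.0), (2.0, 4.0)]

rmse_value = rmse(pairs)  # errors 1, 0, 2, -2 -> sqrt(9 / 4)
mae_value = mae(pairs)    # (1 + 0 + 2 + 2) / 4
print(rmse_value, mae_value)
```

Because the errors are squared before averaging, RMSE penalizes large misses more heavily than MAE does.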
Let's now predict the rating for the user with userId=4 and the movie with movieId=10:
algo_knn_user.predict(4, 10, r_ui=4, verbose=True)
user: 4 item: 10 r_ui = 4.00 est = 3.62 {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.6244912065910952, details={'actual_k': 40, 'was_impossible': False})
Movie 10 has an estimated rating of 3.62 for user 4.
Let's predict the rating for the same userId=4 but for a movie this user has not interacted with, i.e. movieId=3:
algo_knn_user.predict(4, 3, verbose=True)
user: 4 item: 3 r_ui = None est = 3.20 {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.202703552548654, details={'actual_k': 40, 'was_impossible': False})
Movie 3 has an estimated rating of 3.20 for user 4.
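Below we tune the hyperparameters of the KNNBasic algorithm. The grid covers k (the maximum number of neighbors used for a prediction), min_k (the minimum number of neighbors required for a prediction), and sim_options (the similarity measure and whether similarities are computed between users or between items).
For more details, please refer to the official documentation: https://surprise.readthedocs.io/en/stable/knn_inspired.html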
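To make the roles of k and min_k concrete, here is a simplified sketch of a KNN-style prediction: a similarity-weighted average over at most k neighbors, falling back to the global mean when fewer than min_k usable neighbors exist. This reflects my understanding of how KNNBasic behaves; the function name, its arguments, and the example values are assumptions for illustration, not the library's code.

```python
def knn_predict(neighbor_ratings, k, min_k, global_mean):
    """Estimate a rating from (similarity, rating) pairs of neighbors who rated the item.

    Returns (estimate, number_of_neighbors_used).
    """
    # Keep the k most similar neighbors, then drop non-positive similarities
    nearest = sorted(neighbor_ratings, key=lambda x: x[0], reverse=True)[:k]
    nearest = [(sim, r) for sim, r in nearest if sim > 0]
    if len(nearest) < min_k:
        # Not enough neighbors: fall back to the global mean rating
        return global_mean, len(nearest)
    weighted_sum = sum(sim * r for sim, r in nearest)
    sim_sum = sum(sim for sim, _ in nearest)
    return weighted_sum / sim_sum, len(nearest)

# Hypothetical neighbors: (similarity to the target user, their rating of the movie)
neighbors = [(0.9, 4.0), (0.5, 2.0), (0.1, 5.0)]

est, used = knn_predict(neighbors, k=2, min_k=1, global_mean=3.5)
fallback_est, fallback_used = knn_predict(neighbors, k=2, min_k=3, global_mean=3.5)
print(est, used, fallback_est, fallback_used)
```

A larger k smooths the estimate over more neighbors, while a larger min_k makes the model fall back to the global mean more often for sparsely rated movies.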
# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [20, 30, 40], 'min_k': [3, 6, 9],
              'sim_options': {'name': ['msd', 'cosine'],
                              'user_based': [True]}
              }
# Performing 3-fold cross validation to tune the hyperparameters
grid_obj = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)
# Fitting the data
grid_obj.fit(data)
# Best RMSE score
print(grid_obj.best_score['rmse'])
# Combination of parameters that gave the best RMSE score
print(grid_obj.best_params['rmse'])
0.9652553929644568
{'k': 20, 'min_k': 3, 'sim_options': {'name': 'msd', 'user_based': True}}
Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above.
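The grid search prefers msd over cosine here. As a rough sketch of what that measure does: the mean squared difference is taken over the movies two users have both rated and then turned into a similarity as 1 / (msd + 1). The function and example users below are illustrative assumptions written with the standard library only.

```python
def msd_similarity(ratings_u, ratings_v):
    """Mean-squared-difference similarity over commonly rated movies.

    Each argument maps movieId -> rating. Returns 0.0 when there is no overlap.
    """
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0
    msd = sum((ratings_u[m] - ratings_v[m]) ** 2 for m in common) / len(common)
    # Adding 1 avoids division by zero when the users agree exactly
    return 1.0 / (msd + 1.0)

# Hypothetical users
user_a = {1: 4.0, 2: 3.0}
user_b = {1: 2.0, 2: 3.0}

sim_ab = msd_similarity(user_a, user_b)  # msd = (4 + 0) / 2 = 2 -> 1 / 3
sim_aa = msd_similarity(user_a, user_a)  # identical ratings -> 1.0
print(sim_ab, sim_aa)
```

Unlike cosine similarity, this measure is sensitive to the size of the rating differences, so two users who rate the same movies on different levels are not treated as identical.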
Below we analyze the evaluation metrics, RMSE and MAE, on each split to see the impact of each hyperparameter value.
results_df = pd.DataFrame.from_dict(grid_obj.cv_results)
results_df.head()
| | split0_test_rmse | split1_test_rmse | split2_test_rmse | mean_test_rmse | std_test_rmse | rank_test_rmse | split0_test_mae | split1_test_mae | split2_test_mae | mean_test_mae | std_test_mae | rank_test_mae | mean_fit_time | std_fit_time | mean_test_time | std_test_time | params | param_k | param_min_k | param_sim_options |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.965568 | 0.961561 | 0.968637 | 0.965255 | 0.002897 | 1 | 0.744661 | 0.738655 | 0.743292 | 0.742203 | 0.002570 | 1 | 0.129260 | 0.006148 | 3.364002 | 0.037595 | {'k': 20, 'min_k': 3, 'sim_options': {'name': ... | 20 | 3 | {'name': 'msd', 'user_based': True} |
1 | 0.994451 | 0.992335 | 0.996891 | 0.994559 | 0.001861 | 14 | 0.770900 | 0.766768 | 0.769493 | 0.769054 | 0.001715 | 12 | 0.743921 | 0.021103 | 3.064117 | 0.138485 | {'k': 20, 'min_k': 3, 'sim_options': {'name': ... | 20 | 3 | {'name': 'cosine', 'user_based': True} |
2 | 0.970929 | 0.965168 | 0.971683 | 0.969260 | 0.002910 | 4 | 0.748282 | 0.741616 | 0.745940 | 0.745279 | 0.002761 | 3 | 0.113563 | 0.005318 | 3.041815 | 0.100373 | {'k': 20, 'min_k': 6, 'sim_options': {'name': ... | 20 | 6 | {'name': 'msd', 'user_based': True} |
3 | 0.998277 | 0.994462 | 0.998090 | 0.996943 | 0.001756 | 15 | 0.773384 | 0.768445 | 0.770557 | 0.770795 | 0.002023 | 15 | 0.665888 | 0.040078 | 3.042022 | 0.025506 | {'k': 20, 'min_k': 6, 'sim_options': {'name': ... | 20 | 6 | {'name': 'cosine', 'user_based': True} |
4 | 0.975982 | 0.970484 | 0.978132 | 0.974866 | 0.003220 | 7 | 0.752162 | 0.746114 | 0.751751 | 0.750009 | 0.002759 | 6 | 0.123360 | 0.019757 | 3.219795 | 0.086318 | {'k': 20, 'min_k': 9, 'sim_options': {'name': ... | 20 | 9 | {'name': 'msd', 'user_based': True} |
Now, let's build the final model by using tuned values of the hyperparameters, which we received by using grid search cross-validation.
# Creating an instance of KNNBasic for user-user based collaborative filtering
# Note: sim_options (cosine), k=40 and min_k=6 differ from the best grid-search
# combination printed above (msd, k=20, min_k=3); the outputs below come from these values
similarity_algo_optimized_user = KNNBasic(sim_options = sim_options, k=40, min_k=6,verbose=False)
# Training the algorithm on the trainset
similarity_algo_optimized_user.fit(trainset)
# Predicting ratings for the testset
predictions = similarity_algo_optimized_user.test(testset)
# Computing RMSE on testset
accuracy.rmse(predictions)
RMSE: 0.9908
0.9907613369496804
We can see from above that after tuning hyperparameters, the RMSE on the test set has dropped from 0.9925 to 0.9908. Thus, hyperparameter tuning has slightly improved our model.
Let's now predict the rating for the user with userId=4 and the movie with movieId=10 with the optimized model:
similarity_algo_optimized_user.predict(4,10, r_ui=4, verbose=True)
user: 4 item: 10 r_ui = 4.00 est = 3.62 {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.6244912065910952, details={'actual_k': 40, 'was_impossible': False})
The predicted rating for the optimized algorithm is still 3.62.
Below we predict the rating for the same userId=4 but for a movie this user has not interacted with, i.e. movieId=3, using the optimized model:
similarity_algo_optimized_user.predict(4,3, verbose=True)
user: 4 item: 3 r_ui = None est = 3.20 {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.202703552548654, details={'actual_k': 40, 'was_impossible': False})
The predicted rating for the optimized algorithm is still 3.20.
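We can also find the nearest neighbors of a given user with this KNNBasic model. Note that get_neighbors works with surprise's inner ids, so both the argument and the returned values are inner ids rather than raw userId values. Below we find the 5 nearest neighbors of the user with inner id 4, based on the similarity measure the model was fitted with (cosine, from sim_options above):
similarity_algo_optimized_user.get_neighbors(4, k=5)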
[357, 220, 590, 491, 647]
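Below we implement a function that takes the rating data, a user id, the number of recommendations to return (top_n), and a fitted algorithm, and returns the top_n movies with the highest predicted ratings among those the user has not interacted with.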
def get_recommendations(data, user_id, top_n, algo):
    # Creating an empty list to store the recommended movie ids
    recommendations = []
    # Creating an user item interactions matrix
    user_item_interactions_matrix = data.pivot(index='userId', columns='movieId', values='rating')
    # Extracting those movie ids which the user_id has not interacted yet
    non_interacted_movies = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()
    # Looping through each of the movie id which user_id has not interacted yet
    for item_id in non_interacted_movies:
        # Predicting the ratings for those non interacted movie ids by this user
        est = algo.predict(user_id, item_id).est
        # Appending the predicted ratings
        recommendations.append((item_id, est))
    # Sorting the predicted ratings in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)
    return recommendations[:top_n] # returning top n highest predicted rating movies for this user
recommendations = get_recommendations(rating,4,5,similarity_algo_optimized_user)
recommendations
[(98491, 4.832340578646058),
(116, 4.753206589295344),
(6669, 4.748048450384675),
(1221, 4.662571141751736),
(1192, 4.65824768595177)]
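The selection logic of get_recommendations (skip what the user has seen, score the rest, keep the top n) can be sketched without building the full user-item matrix. The version below is a hedged alternative using `heapq.nlargest`; `top_n_unseen` and `fake_estimates` are hypothetical names, and the lookup table stands in for `algo.predict(user_id, item_id).est`.

```python
import heapq

def top_n_unseen(all_items, seen_items, predict, n):
    """Return the n (item, estimate) pairs with the highest estimate among unseen items."""
    candidates = ((item, predict(item)) for item in all_items if item not in seen_items)
    # nlargest keeps only n candidates in memory and returns them sorted descending
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Hypothetical stand-in for algo.predict(user_id, item_id).est
fake_estimates = {1: 3.2, 2: 4.8, 3: 4.1, 4: 2.5, 5: 4.9}

top = top_n_unseen(fake_estimates.keys(), seen_items={5}, predict=fake_estimates.get, n=2)
print(top)
```

Movie 5 has the highest estimate but is excluded because the user has already interacted with it.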
# Defining similarity measure
sim_options = {'name': 'cosine',
               'user_based': False}
# Defining Nearest neighbour algorithm
algo_knn_item = KNNBasic(sim_options=sim_options, verbose=False)
# Train the algorithm on the trainset or fitting the model on train dataset
algo_knn_item.fit(trainset)
# Predict ratings for the testset
predictions = algo_knn_item.test(testset)
# Then compute RMSE
accuracy.rmse(predictions)
RMSE: 1.0032
1.003221450633729
The baseline item-based system has an RMSE of 1.0032.
Let's now predict the rating for the user with userId=4 and the movie with movieId=10:
algo_knn_item.predict(4,10, r_ui=4, verbose=True)
user: 4 item: 10 r_ui = 4.00 est = 4.37 {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=4.373794871885004, details={'actual_k': 40, 'was_impossible': False})
The system predicts a rating of 4.37 for user 4 for movie 10.
Let's predict the rating for the same userId=4 but for a movie this user has not interacted with, i.e. movieId=3:
algo_knn_item.predict(4,3, verbose=True)
user: 4 item: 3 r_ui = None est = 4.07 {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=4.071601862880049, details={'actual_k': 40, 'was_impossible': False})
The system predicts a rating of 4.07 for user 4 for movie 3.
# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [20, 30,40], 'min_k': [3,6,9],
              'sim_options': {'name': ['msd', 'cosine'],
                              'user_based': [False]}
              }
# Performing 3-fold cross validation to tune the hyperparameters
grid_obj = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)
# Fitting the data
grid_obj.fit(data)
# Best RMSE score
print(grid_obj.best_score['rmse'])
# Combination of parameters that gave the best RMSE score
print(grid_obj.best_params['rmse'])
0.9401320571134547
{'k': 40, 'min_k': 6, 'sim_options': {'name': 'msd', 'user_based': False}}
Once the grid search is complete, we can get the optimal values for each of those hyperparameters, as shown above.
Below we analyze the evaluation metrics, RMSE and MAE, on each split to see the impact of each hyperparameter value.
results_df = pd.DataFrame.from_dict(grid_obj.cv_results)
results_df.head()
| | split0_test_rmse | split1_test_rmse | split2_test_rmse | mean_test_rmse | std_test_rmse | rank_test_rmse | split0_test_mae | split1_test_mae | split2_test_mae | mean_test_mae | std_test_mae | rank_test_mae | mean_fit_time | std_fit_time | mean_test_time | std_test_time | params | param_k | param_min_k | param_sim_options |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.951277 | 0.950129 | 0.950116 | 0.950508 | 0.000544 | 8 | 0.734416 | 0.734226 | 0.733322 | 0.733988 | 0.000477 | 7 | 7.439870 | 0.545805 | 11.326026 | 0.512808 | {'k': 20, 'min_k': 3, 'sim_options': {'name': ... | 20 | 3 | {'name': 'msd', 'user_based': False} |
1 | 1.012004 | 1.016567 | 1.014096 | 1.014222 | 0.001865 | 17 | 0.789165 | 0.793400 | 0.791117 | 0.791227 | 0.001731 | 16 | 20.378437 | 1.080052 | 10.987197 | 0.362469 | {'k': 20, 'min_k': 3, 'sim_options': {'name': ... | 20 | 3 | {'name': 'cosine', 'user_based': False} |
2 | 0.951213 | 0.950136 | 0.950120 | 0.950490 | 0.000512 | 7 | 0.734455 | 0.734210 | 0.733590 | 0.734085 | 0.000364 | 8 | 6.380261 | 0.199738 | 11.128263 | 0.456450 | {'k': 20, 'min_k': 6, 'sim_options': {'name': ... | 20 | 6 | {'name': 'msd', 'user_based': False} |
3 | 1.011952 | 1.016582 | 1.014071 | 1.014202 | 0.001892 | 16 | 0.789239 | 0.793386 | 0.791373 | 0.791333 | 0.001693 | 17 | 19.012726 | 0.650604 | 11.196412 | 0.564934 | {'k': 20, 'min_k': 6, 'sim_options': {'name': ... | 20 | 6 | {'name': 'cosine', 'user_based': False} |
4 | 0.951610 | 0.950883 | 0.950264 | 0.950919 | 0.000550 | 9 | 0.734750 | 0.734794 | 0.733791 | 0.734445 | 0.000463 | 9 | 5.797267 | 0.316200 | 11.779861 | 0.483101 | {'k': 20, 'min_k': 9, 'sim_options': {'name': ... | 20 | 9 | {'name': 'msd', 'user_based': False} |
Now let's build the final model by using tuned values of the hyperparameters which we received by using grid search cross-validation.
# Creating an instance of KNNBasic with tuned hyperparameter values
# Note: k=30 differs from the best grid-search value printed above (k=40);
# the outputs below come from k=30
similarity_algo_optimized_item = KNNBasic(sim_options={'name': 'msd', 'user_based': False}, k=30, min_k=6,verbose=False)
# Training the algorithm on the trainset
similarity_algo_optimized_item.fit(trainset)
# Predicting ratings for the testset
predictions = similarity_algo_optimized_item.test(testset)
# Computing RMSE on testset
accuracy.rmse(predictions)
RMSE: 0.9465
0.9465120620317036
The final model has an RMSE of 0.9465, down from 1.0032 for the baseline item-based model.
Let's now predict the rating for the user with userId=4 and the movie with movieId=10 with the optimized model:
similarity_algo_optimized_item.predict(4,10, r_ui=4, verbose=True)
user: 4 item: 10 r_ui = 4.00 est = 4.30 {'actual_k': 30, 'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=4.298279280483517, details={'actual_k': 30, 'was_impossible': False})
The predicted rating for movie 10 is 4.298.
Let's predict the rating for the same userId=4 but for a movie this user has not interacted with, i.e. movieId=3, using the optimized model:
similarity_algo_optimized_item.predict(4, 3, verbose=True)
user: 4 item: 3 r_ui = None est = 3.86 {'actual_k': 30, 'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.859023126306401, details={'actual_k': 30, 'was_impossible': False})
The predicted rating for movie 3 is 3.859.
We can also find nearest neighbors with this KNNBasic model. Because this model is item-based, get_neighbors returns similar items rather than similar users, and it works with surprise's inner ids rather than raw movieId values. Below we find the 5 nearest neighbors of the item with inner id 4, based on the msd distance metric:
similarity_algo_optimized_item.get_neighbors(4, k=5)
[77, 85, 115, 119, 127]
recommendations = get_recommendations(rating, 4, 5, similarity_algo_optimized_item)
recommendations
[(84, 5), (1040, 5), (2481, 5), (3515, 5), (4521, 5)]
Model-based collaborative filtering is a personalized recommendation system: the recommendations are based on the past behavior of the user and do not depend on any additional information. We use latent features to find recommendations for each user.
SVD is used to compute the latent features from the user-item matrix. However, classical SVD does not work when values are missing in the user-item matrix.
First we need to convert the movie-rating dataset into a user-item matrix. We have already done this above while computing cosine similarities.
SVD decomposes this above matrix into three separate matrices:
An n x k matrix, where:
A k x k matrix, where:
A k x n matrix, where:
# Using SVD matrix factorization
algo_svd = SVD()
# Training the algorithm on the trainset
algo_svd.fit(trainset)
# Predicting ratings for the testset
predictions = algo_svd.test(testset)
# Computing RMSE on the testset
accuracy.rmse(predictions)
RMSE: 0.9031
0.9031390885282595
The baseline SVD-based system has an RMSE of 0.9031.
Let's now predict the rating for the user with userId=4 and the movie with movieId=10:
algo_svd.predict(4, 10, r_ui=4, verbose=True)
user: 4 item: 10 r_ui = 4.00 est = 3.92 {'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.9176599589678984, details={'was_impossible': False})
The SVD system predicts a rating of 3.918 for movie 10 for user 4.
Let's predict the rating for the same userId=4 but for a movie this user has not interacted with, i.e. movieId=3:
algo_svd.predict(4, 3, verbose=True)
user: 4 item: 3 r_ui = None est = 3.64 {'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.6373797205620138, details={'was_impossible': False})
The SVD system predicts a rating of 3.637 for movie 3 for user 4.
In SVD, the rating is predicted as:
$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u$$
If user $u$ is unknown, then the bias $b_u$ and the factors $p_u$ are assumed to be zero. The same applies for item $i$ with $b_i$ and $q_i$.
To estimate all the unknowns, we minimize the following regularized squared error:
$$\sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda\left(b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)$$
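The prediction rule and the convention for unknown users or items can be sketched in a few lines. This is a plain-Python illustration of the formula, not the library's implementation; `svd_estimate` and the numbers are made up, and the clipping of estimates to the rating scale is deliberately left out.

```python
def svd_estimate(mu, b_u=0.0, b_i=0.0, p_u=None, q_i=None):
    """Global mean + user bias + item bias + dot product of the factor vectors.

    Leaving b_u/p_u (or b_i/q_i) at their defaults mimics an unknown user (or item),
    whose bias and factors are taken to be zero.
    """
    interaction = 0.0
    if p_u is not None and q_i is not None:
        interaction = sum(p * q for p, q in zip(p_u, q_i))
    return mu + b_u + b_i + interaction

# Hypothetical learned parameters
known_user_est = svd_estimate(mu=3.5, b_u=0.2, b_i=-0.1, p_u=[0.5, -1.0], q_i=[1.0, 0.25])
# Unknown user: only the global mean and the item bias contribute
unknown_user_est = svd_estimate(mu=3.5, b_i=-0.1, q_i=[1.0, 0.25])
print(known_user_est, unknown_user_est)
```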
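The minimization is performed by stochastic gradient descent. There are many hyperparameters to tune in this algorithm; the surprise documentation has the full list.
Below we tune only three hyperparameters: n_epochs (the number of SGD iterations), lr_all (the learning rate for all parameters), and reg_all (the regularization term for all parameters).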
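To show where the learning rate and the regularization term enter, here is a sketch of a single stochastic gradient descent update for one observed rating. The update rules follow the usual form for this regularized objective; the function names and starting values are assumptions for illustration, and n_epochs would correspond to how many times such updates are repeated over all training ratings.

```python
def estimate(mu, b_u, b_i, p_u, q_i):
    """Predicted rating: global mean + biases + dot product of the factors."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr, reg):
    """One SGD update for a single observed rating r_ui; returns the new parameters."""
    err = r_ui - estimate(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the values from before the step
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# Hypothetical starting point
mu, r_ui = 3.0, 4.0
b_u, b_i, p_u, q_i = 0.0, 0.0, [0.1, 0.1], [0.1, 0.1]

error_before = abs(r_ui - estimate(mu, b_u, b_i, p_u, q_i))
b_u, b_i, p_u, q_i = sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.01, reg=0.2)
error_after = abs(r_ui - estimate(mu, b_u, b_i, p_u, q_i))
print(error_before, error_after)
```

A larger learning rate moves the parameters further per step, while a larger regularization term pulls them back toward zero, which is the trade-off the grid search explores.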
# Set the parameter space to tune
param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
              'reg_all': [0.2, 0.4, 0.6]}
# Performing 3-fold gridsearch cross validation
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)
# Fitting data
gs.fit(data)
# Best RMSE score
print(gs.best_score['rmse'])
# Combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])
0.8937463944379922
{'n_epochs': 30, 'lr_all': 0.01, 'reg_all': 0.2}
Once the grid search is complete, we can get the optimal values for each of those hyperparameters, as shown above.
Below we analyze the evaluation metrics, RMSE and MAE, on each split to see the impact of each hyperparameter value.
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df.head()
| | split0_test_rmse | split1_test_rmse | split2_test_rmse | mean_test_rmse | std_test_rmse | rank_test_rmse | split0_test_mae | split1_test_mae | split2_test_mae | mean_test_mae | std_test_mae | rank_test_mae | mean_fit_time | std_fit_time | mean_test_time | std_test_time | params | param_n_epochs | param_lr_all | param_reg_all |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.937088 | 0.941831 | 0.950693 | 0.943204 | 0.005638 | 25 | 0.734681 | 0.736846 | 0.742574 | 0.738034 | 0.003330 | 25 | 4.822517 | 0.412795 | 0.588699 | 0.047995 | {'n_epochs': 10, 'lr_all': 0.001, 'reg_all': 0.2} | 10 | 0.001 | 0.2 |
1 | 0.941132 | 0.946314 | 0.955085 | 0.947510 | 0.005759 | 26 | 0.739717 | 0.742279 | 0.747904 | 0.743300 | 0.003419 | 26 | 5.001963 | 0.058948 | 0.622447 | 0.061483 | {'n_epochs': 10, 'lr_all': 0.001, 'reg_all': 0.4} | 10 | 0.001 | 0.4 |
2 | 0.946071 | 0.951274 | 0.960601 | 0.952649 | 0.006011 | 27 | 0.744921 | 0.747856 | 0.753383 | 0.748720 | 0.003508 | 27 | 5.115259 | 0.133874 | 0.694518 | 0.083622 | {'n_epochs': 10, 'lr_all': 0.001, 'reg_all': 0.6} | 10 | 0.001 | 0.6 |
3 | 0.899909 | 0.907059 | 0.913570 | 0.906846 | 0.005579 | 10 | 0.698638 | 0.702193 | 0.706866 | 0.702566 | 0.003369 | 9 | 4.998776 | 0.122980 | 0.601300 | 0.095280 | {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.2} | 10 | 0.005 | 0.2 |
4 | 0.907404 | 0.913606 | 0.920989 | 0.914000 | 0.005553 | 15 | 0.706287 | 0.709524 | 0.714526 | 0.710112 | 0.003389 | 15 | 4.929785 | 0.117247 | 0.628400 | 0.108293 | {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4} | 10 | 0.005 | 0.4 |
Now, we will build the final model by using tuned values of the hyperparameters, which we received using grid search cross-validation above.
# Building the optimized SVD model using the hyperparameter search
# Note: n_epochs=20 differs from the best grid-search value printed above (n_epochs=30);
# the outputs below come from n_epochs=20
svd_algo_optimized = SVD(n_epochs=20, lr_all=0.01, reg_all=0.2)
# Training the algorithm on the trainset
svd_algo_optimized.fit(trainset)
# Predicting ratings for the testset
predictions = svd_algo_optimized.test(testset)
# Computing RMSE
accuracy.rmse(predictions)
RMSE: 0.8973
0.8972580357427976
Let's now predict the rating for the user with userId=4 and the movie with movieId=10 with the optimized model:
svd_algo_optimized.predict(4, 10, r_ui=4, verbose=True)
user: 4 item: 10 r_ui = 4.00 est = 3.97 {'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.9746300660681904, details={'was_impossible': False})
The predicted rating of movie 10 for user 4 using SVD collab filtering is 3.975.
Let's predict the rating for the same userId=4 but for a movie this user has not interacted with, i.e. movieId=3:
svd_algo_optimized.predict(4, 3, verbose=True)
user: 4 item: 3 r_ui = None est = 3.65 {'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.6494509461174185, details={'was_impossible': False})
The predicted rating of movie 3 for user 4 using SVD collab filtering is 3.649.
get_recommendations(rating, 4, 5, svd_algo_optimized)
[(926, 4.938831611532595),
(1192, 4.911663964684779),
(1217, 4.876880839499725),
(3035, 4.864153522050881),
(232, 4.8568399745706605)]
Below we compare the predicted ratings with the actual ratings for movies a user has already watched. This helps us understand how close our predictions are to the ratings users actually gave.
def predict_already_interacted_ratings(data, user_id, algo):
    # Creating an empty list to store (movieId, actual rating, predicted rating) tuples
    recommendations = []
    # Creating an user item interactions matrix
    user_item_interactions_matrix = data.pivot(index='userId', columns='movieId', values='rating')
    # Extracting those movie ids which the user_id has interacted already
    interacted_movies = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].notnull()].index.tolist()
    # Looping through each of the movie id which user_id has interacted already
    for item_id in interacted_movies:
        # Extracting actual ratings
        actual_rating = user_item_interactions_matrix.loc[user_id, item_id]
        # Predicting the ratings for those already interacted movie ids by this user
        predicted_rating = algo.predict(user_id, item_id).est
        # Appending the actual and predicted ratings
        recommendations.append((item_id, actual_rating, predicted_rating))
    # Sorting by actual rating in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)
    # Returning actual and predicted ratings for every movie this user has interacted with
    return pd.DataFrame(recommendations, columns=['movieId', 'actual_rating', 'predicted_rating'])
Here we are comparing the predicted ratings by the similarity-based recommendation system against actual ratings for userId=7
predicted_ratings_for_interacted_movies = predict_already_interacted_ratings(rating, 7, similarity_algo_optimized_item)
df = predicted_ratings_for_interacted_movies.melt(id_vars='movieId', value_vars=['actual_rating', 'predicted_rating'])
sns.displot(data=df, x='value', hue='variable', kde=True);
Below we are comparing the predicted ratings by the matrix factorization based recommendation system against actual ratings for userId=7
predicted_ratings_for_interacted_movies = predict_already_interacted_ratings(rating, 7, svd_algo_optimized)
df = predicted_ratings_for_interacted_movies.melt(id_vars='movieId', value_vars=['actual_rating', 'predicted_rating'])
sns.displot(data=df, x='value', hue='variable', kde=True);
# Instantiating Reader scale with expected rating scale
reader = Reader(rating_scale=(0, 5))
# Loading the rating dataset
data = Dataset.load_from_df(rating[['userId', 'movieId', 'rating']], reader)
# Splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
RMSE is not the only metric we can use here. We can also examine two fundamental measures, precision and recall, evaluated on the top k predictions for each user.
Precision@k - It is the fraction of recommended items that are relevant in the top k predictions. The value of k is the number of recommendations to be provided to the user. One can choose a variable number of recommendations to be given to a unique user.
Recall@k - It is the fraction of relevant items that are recommended to the user in the top k predictions.
Recall - It is the fraction of actually relevant items that are recommended to the user, i.e. if out of 10 relevant movies, 6 are recommended to the user, then recall is 0.60. The higher the recall, the better the model. It is one of the metrics used to assess the performance of classification models.
Precision - It is the fraction of recommended items that are actually relevant, i.e. if out of 10 recommended items, 6 are found relevant by the user, then precision is 0.60. The higher the precision, the better the model. It is one of the metrics used to assess the performance of classification models.
To know more about precision recall in recommendation systems, you can refer to the documentation or this Medium article.
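As a quick illustration of the definitions above, here is a minimal sketch that computes precision@k and recall@k by hand for a single user. The rating pairs, k=3 and the 3.5 relevance threshold are made-up values chosen only for this example; they are not taken from the ratings dataset.

```python
# Toy (estimated, true) rating pairs for one hypothetical user.
user_ratings = [(4.8, 5.0), (4.5, 3.0), (4.1, 4.0), (3.9, 4.5), (3.2, 4.0), (2.5, 2.0)]
k = 3            # assumed number of recommendations
threshold = 3.5  # assumed cut-off for "relevant" / "recommended"

# Rank by estimated rating, highest first, and keep the top k.
top_k = sorted(user_ratings, key=lambda pair: pair[0], reverse=True)[:k]

# Relevant items: true rating at or above the threshold (over all rated items).
n_rel = sum(true_r >= threshold for _, true_r in user_ratings)
# Recommended items: estimated rating at or above the threshold, within the top k.
n_rec_k = sum(est >= threshold for est, _ in top_k)
# Items in the top k that are both recommended and relevant.
n_rel_and_rec_k = sum(est >= threshold and true_r >= threshold for est, true_r in top_k)

precision_at_k = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
recall_at_k = n_rel_and_rec_k / n_rel if n_rel else 0
print(round(precision_at_k, 3), round(recall_at_k, 3))  # prints: 0.667 0.5
```

Two of the three recommended movies are relevant (precision 2/3), and two of the four relevant movies made it into the top 3 (recall 2/4).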
# Function can be found in the Surprise documentation FAQ
def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return precision and recall at k metrics for each user"""
    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))
    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():
        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])
        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])
        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set it to 0.
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0
        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set it to 0.
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    return precisions, recalls
# A basic cross-validation iterator.
kf = KFold(n_splits=5)
# Make list of k values
K = [5, 10]
# Make list of models
models = [algo_knn_user, similarity_algo_optimized_user, algo_knn_item, similarity_algo_optimized_item, algo_svd, svd_algo_optimized]
for k in K:
    for model in models:
        print('> k={}, model={}'.format(k, model.__class__.__name__))
        p = []
        r = []
        for trainset, testset in kf.split(data):
            model.fit(trainset)
            predictions = model.test(testset, verbose=False)
            precisions, recalls = precision_recall_at_k(predictions, k=k, threshold=3.5)
            # Precision and recall can then be averaged over all users
            p.append(sum(prec for prec in precisions.values()) / len(precisions))
            r.append(sum(rec for rec in recalls.values()) / len(recalls))
        print('-----> Precision: ', round(sum(p) / len(p), 3))
        print('-----> Recall: ', round(sum(r) / len(r), 3))
> k=5, model=KNNBasic
-----> Precision: 0.764
-----> Recall: 0.414
> k=5, model=KNNBasic
-----> Precision: 0.77
-----> Recall: 0.42
> k=5, model=KNNBasic
-----> Precision: 0.604
-----> Recall: 0.322
> k=5, model=KNNBasic
-----> Precision: 0.684
-----> Recall: 0.358
> k=5, model=SVD
-----> Precision: 0.756
-----> Recall: 0.386
> k=5, model=SVD
-----> Precision: 0.749
-----> Recall: 0.384
> k=10, model=KNNBasic
-----> Precision: 0.754
-----> Recall: 0.549
> k=10, model=KNNBasic
-----> Precision: 0.75
-----> Recall: 0.561
> k=10, model=KNNBasic
-----> Precision: 0.597
-----> Recall: 0.475
> k=10, model=KNNBasic
-----> Precision: 0.665
-----> Recall: 0.505
> k=10, model=SVD
-----> Precision: 0.738
-----> Recall: 0.522
> k=10, model=SVD
-----> Precision: 0.731
-----> Recall: 0.523
The baseline user-based and item-based collaborative filtering models have nearly the same RMSE values. The tuned collaborative filtering models clearly performed better than the baseline models, and the tuned user-user model performed better than the tuned item-item model, with an RMSE of 0.9908.
The collaborative filtering models use the user-item ratings data to find similarities and make predictions, rather than predicting a random rating based on the distribution of the data. This could be a reason why collaborative filtering performed well.
Collaborative filtering searches for neighbors based on similarity of item preferences and recommends items that those neighbors interacted with, while matrix factorization works by decomposing the user-item matrix into the product of two lower-dimensional rectangular matrices.
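To make the neighbor idea concrete, here is a minimal sketch of a user-based prediction on a tiny hand-made rating matrix. It assumes cosine similarity over co-rated movies and a similarity-weighted average of the two nearest neighbors. The matrix `R` and the helpers `cosine_on_common` and `predict_user_based` are hypothetical names for this illustration and are not how Surprise's KNNBasic is implemented internally.

```python
import numpy as np

# Hypothetical 4 users x 5 movies rating matrix; np.nan marks a missing rating.
R = np.array([
    [5.0, 4.0, np.nan, 1.0, np.nan],
    [4.0, 5.0, 2.0, np.nan, 1.0],
    [1.0, np.nan, 5.0, 4.0, 5.0],
    [np.nan, 2.0, 4.0, 5.0, 4.0],
])

def cosine_on_common(u, v):
    """Cosine similarity computed only over items both users rated."""
    common = ~np.isnan(u) & ~np.isnan(v)
    if not common.any():
        return 0.0
    a, b = u[common], v[common]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_user_based(R, user, item, n_neighbors=2):
    """Similarity-weighted average of the ratings the nearest neighbors gave to `item`."""
    candidates = [
        (cosine_on_common(R[user], R[other]), R[other, item])
        for other in range(R.shape[0])
        if other != user and not np.isnan(R[other, item])
    ]
    # Keep the most similar users who actually rated the item.
    top = sorted(candidates, key=lambda c: c[0], reverse=True)[:n_neighbors]
    total_sim = sum(sim for sim, _ in top)
    if total_sim == 0:
        return float(np.nanmean(R))  # fall back to the global mean rating
    return float(sum(sim * rating for sim, rating in top) / total_sim)

# Predicted rating of user 0 for movie 2, which user 0 has not rated.
print(round(predict_user_based(R, user=0, item=2), 3))
```

The prediction is pulled toward the rating given by the most similar user, which is the behavior the neighborhood models above rely on.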
The RMSE for matrix factorization is lower than for the collaborative filtering models. Tuning the SVD matrix factorization model does not improve much on the baseline SVD. Matrix factorization likely achieves a lower RMSE because it assumes that both items and users can be represented in a low-dimensional latent space describing their properties, and it recommends an item based on its proximity to the user in that latent space.
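The latent-space idea can be sketched in a few lines of NumPy. The example below is a simplified illustration that learns user and item vectors by stochastic gradient descent on a handful of made-up ratings. It is not Surprise's SVD, which also learns user and item bias terms, and the learning rate, regularization strength and number of factors are arbitrary values assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed (user, item, rating) triples.
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 2.0),
           (2, 2, 5.0), (2, 3, 4.0), (3, 1, 1.0), (3, 3, 5.0)]
n_users, n_items, n_factors = 4, 4, 2

# Small random latent vectors: one row per user (P) and one per item (Q).
P = rng.normal(scale=0.1, size=(n_users, n_factors))
Q = rng.normal(scale=0.1, size=(n_items, n_factors))

def rmse(P, Q):
    """Root mean squared error of the dot-product predictions on the observed ratings."""
    errors = [r - P[u] @ Q[i] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))

rmse_before = rmse(P, Q)
lr, reg = 0.05, 0.02  # illustrative learning rate and L2 penalty
for _ in range(500):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]  # prediction error for this rating
        p_u = P[u].copy()      # keep the old user vector for the item update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_u - reg * Q[i])
rmse_after = rmse(P, Q)

print(round(rmse_before, 3), round(rmse_after, 3))
# A rating the user never gave is predicted from proximity in the latent space.
print(round(float(P[0] @ Q[2]), 3))
```

Each predicted rating is the dot product of a user vector and an item vector, so unseen user-movie pairs get a prediction as soon as both vectors have been learned.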
In this case study, we saw three different ways of building recommendation systems:
We also looked at the advantages and disadvantages of these recommendation systems and when to use each kind. Once we build these recommendation systems, we can use A/B testing to measure their effectiveness. Here is an article explaining how Amazon uses A/B testing to measure the effectiveness of its recommendation systems.