Movie Recommendation System

by Keanu Sida

Context

Online streaming platforms like Netflix hold vast movie catalogues. A recommendation system that surfaces relevant movies to users, based on their historical interactions, improves customer satisfaction and, in turn, the platform's revenue. The techniques used here apply to any item for which a recommendation system is appropriate.


Objective

This project features three kinds of recommendation systems:

  1. Knowledge/Rank based recommendation system
  2. Similarity-Based Collaborative filtering
  3. Matrix Factorization Based Collaborative Filtering

I used the ratings dataset, which can be downloaded as a .csv file here.


Dataset

The ratings dataset contains the following attributes:

  1. userId - unique identifier of the user
  2. movieId - unique identifier of the movie
  3. rating - the rating the user gave the movie
  4. timestamp - the time at which the rating was recorded (dropped later in the analysis)

Installing surprise library

# Install surprise library
!pip install surprise
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-surprise->surprise) (1.2.0)
Requirement already satisfied: numpy>=1.11.2 in /usr/local/lib/python3.7/dist-packages (from scikit-surprise->surprise) (1.21.6)
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-surprise->surprise) (1.7.3)
Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python3.7/dist-packages (from scikit-surprise->surprise) (1.15.0)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1633970 sha256=46c3bd64da03464dab9be98362f87962a4c28fa86ad1bfe137814015d8759182
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.1 surprise-0.1

Importing the necessary libraries and overview of the dataset

# Used to ignore the warning given as output of the code
import warnings                                 
warnings.filterwarnings('ignore')

# Basic libraries of python for numeric and dataframe computations
import numpy as np                              
import pandas as pd

# Basic library for data visualization
import matplotlib.pyplot as plt     

# Slightly advanced library for data visualization            
import seaborn as sns                           

# A dictionary output that does not raise a key error
from collections import defaultdict             

# Performance metrics in surprise
from surprise import accuracy

# Class is used to parse a file containing ratings, data should be in structure - user ; item ; rating
from surprise.reader import Reader

# Class for loading datasets
from surprise.dataset import Dataset

# For tuning model hyper-parameters
from surprise.model_selection import GridSearchCV

# For splitting the rating data in train and test dataset
from surprise.model_selection import train_test_split

# For implementing similarity based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic

# For implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD

# For implementing cross validation
from surprise.model_selection import KFold

Loading the data

# Import the dataset
rating = pd.read_csv('./ratings.csv') 

Let's check the info of the data

rating.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100004 non-null  int64  
 1   movieId    100004 non-null  int64  
 2   rating     100004 non-null  float64
 3   timestamp  100004 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
# Dropping timestamp column
rating = rating.drop(['timestamp'], axis=1)

Exploring the Dataset

Let's explore the dataset and answer some basic data-related questions:

What do the top 5 rows of the data set look like?

rating.head()
userId movieId rating
0 1 31 2.5
1 1 1029 3.0
2 1 1061 3.0
3 1 1129 2.0
4 1 1172 4.0

Describing the distribution of ratings:

plt.figure(figsize = (12, 4))

sns.countplot(x = "rating", data = rating)

plt.tick_params(labelsize = 10)
plt.title("Distribution of Ratings ", fontsize = 10)
plt.xlabel("Ratings", fontsize = 10)
plt.ylabel("Number of Ratings", fontsize = 10)
plt.show()

Rating '4.0' has the highest count of ratings (>30K). Rating '3.0' is second with 20K+, and rating '5.0' is third with around 15K.

The ratings are biased towards these three values significantly more than the others.
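A quick way to quantify this bias (a minimal check, not part of the original analysis) is to look at the share each rating value holds in the dataset:

# Share of each rating value in the dataset, sorted by rating value
rating['rating'].value_counts(normalize=True).sort_index()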

Total number of unique users and unique movies:

# Finding number of unique users

rating['userId'].nunique()
671

There are 671 unique users in the data set.

# Finding number of unique movies

rating['movieId'].nunique()
9066

There are 9066 unique movies in the data set.

Movies with which the same user interacted more than once:

rating.groupby(['userId', 'movieId']).count()
rating
userId movieId
1 31 1
1029 1
1061 1
1129 1
1172 1
... ... ...
671 6268 1
6269 1
6365 1
6385 1
6565 1

100004 rows × 1 columns

rating.groupby(['userId', 'movieId']).count()['rating'].sum()
100004

The sum of these counts equals the total number of ratings, which implies that each user-movie pair appears at most once, i.e. no user rated the same movie more than once.
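The same fact can be verified directly with a one-line sanity check:

# Number of duplicated (userId, movieId) pairs - 0 confirms one rating per pair
rating.duplicated(subset=['userId', 'movieId']).sum()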

The most interacted-with movie in the dataset:


rating['movieId'].value_counts()
356       341
296       324
318       311
593       304
260       291
         ... 
98604       1
103659      1
104419      1
115927      1
6425        1
Name: movieId, Length: 9066, dtype: int64

The movie with the ID 356 is the most interacted-with movie in the dataset.

# Plotting the distribution of the 341 ratings for movieId 356
plt.figure(figsize=(7,7))

rating[rating['movieId'] == 356]['rating'].value_counts().plot(kind='bar')

plt.xlabel('Rating')

plt.ylabel('Count')

plt.show()

This movie appears to be popular in a positive sense: a higher proportion of its ratings are 4.0 or 5.0 than in the data set as a whole.

Which users have the highest interactivity:


rating['userId'].value_counts()
547    2391
564    1868
624    1735
15     1700
73     1610
       ... 
296      20
289      20
249      20
221      20
1        20
Name: userId, Length: 671, dtype: int64

The user with the ID 547 interacted the most with movies in the dataset.

Distribution of the user-movie interactions:

# Finding user-movie interactions distribution
count_interactions = rating.groupby('userId').count()['movieId']
count_interactions
userId
1       20
2       76
3       51
4      204
5      100
      ... 
667     68
668     20
669     37
670     31
671    115
Name: movieId, Length: 671, dtype: int64
# Plotting user-movie interactions distribution

plt.figure(figsize=(15,7))

sns.histplot(count_interactions)

plt.xlabel('Number of Interactions by Users')

plt.show()

The distribution is highly right-skewed: most users have rated relatively few movies, while a small number of heavy users have rated far more (the most active user, 547, has 2,391 ratings).
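Summary statistics make this skew concrete; a quick check (not part of the original analysis):

# Summary of interactions per user - a mean well above the median indicates right skew
count_interactions.describe()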

Now that we've explored the data, let's start building some recommendation systems!

Creating A Rank-Based Recommendation System

Context

Rank-based recommendation systems provide recommendations based on the most popular items. This kind of recommendation system is useful for the cold start problem: when a new user enters the system, the machine cannot recommend movies based on historical interactions, because there are none. In those cases, we can use a rank-based recommendation system to recommend movies to the new user.

To build the rank-based recommendation system, we take the average of all ratings provided to each movie and then rank movies by their average rating.



# Calculating average ratings
average_rating = rating.groupby('movieId').mean()['rating']

# Calculating the count of ratings
count_rating = rating.groupby('movieId').count()['rating']

# Making a dataframe with the count and average of ratings
final_rating = pd.DataFrame({'avg_rating':average_rating, 'rating_count':count_rating})
final_rating.head()
avg_rating rating_count
movieId
1 3.872470 247
2 3.401869 107
3 3.161017 59
4 2.384615 13
5 3.267857 56

Now, let's create a function to find the top n movies for a recommendation based on the average ratings of movies. We can also add a threshold for a minimum number of interactions for a movie to be considered for recommendation.

def top_n_movies(data, n, min_interaction=100):
    
    # Finding movies with the minimum number of interactions
    recommendations = data[data['rating_count'] >= min_interaction]
    
    # Sorting values by average rating
    recommendations = recommendations.sort_values(by='avg_rating', ascending=False)
    
    return recommendations.index[:n]

We can call this function with different values of n and different minimum interaction thresholds to get movies to recommend.

Recommending top 5 movies with 50 minimum interactions based on popularity


list(top_n_movies(final_rating, 5, 50))
[858, 318, 969, 913, 1221]

Recommending top 5 movies with 100 minimum interactions based on popularity


list(top_n_movies(final_rating, 5, 100))
[858, 318, 1221, 50, 527]

Recommending top 5 movies with 200 minimum interactions based on popularity


list(top_n_movies(final_rating, 5, 200))
[858, 318, 50, 527, 608]

Rank-based recommendations are the same for every user. Using the same interaction data, we can build Collaborative Filtering based recommendation systems to better understand the needs of individual users and thereby improve UX.

User-Based Collaborative Filtering Recommendation System

In this type of recommendation system, we do not need any information about the users or items themselves. We only need user-item interaction data to build a collaborative recommendation system. For example:

  1. Ratings provided by users, e.g. ratings of books on Goodreads or movie ratings on IMDb
  2. Likes of users on different Facebook posts, likes on YouTube videos
  3. Use/purchase of a product by users, e.g. buying different items on e-commerce sites
  4. Reading of articles by readers on various blogs

Types of Collaborative Filtering

  1. User-User: recommend items liked by users whose rating behavior is similar to the target user's, as sketched below
  2. Item-Item: recommend items whose rating pattern is similar to that of items the target user has already rated highly
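To make "similarity" concrete, here is a minimal sketch (illustrative values, not from this dataset) of the cosine similarity between two users' rating vectors over the same five movies:

import numpy as np

# Hypothetical ratings of the same five movies by two users (0 = not rated)
user_a = np.array([5.0, 3.0, 0.0, 4.0, 4.0])
user_b = np.array([4.0, 0.0, 0.0, 5.0, 4.0])

# Cosine similarity: dot product divided by the product of the vector norms
cos_sim = user_a @ user_b / (np.linalg.norm(user_a) * np.linalg.norm(user_b))
print(cos_sim)  # close to 1 => similar rating behavior

KNNBasic computes a matrix of such similarities (cosine or msd) between all pairs of users (or items) and aggregates the ratings of the nearest neighbors to predict an unseen rating.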

Building a baseline user-user similarity based recommendation system

We'll load the rating dataset, which is a pandas DataFrame, into surprise.dataset.DatasetAutoFolds, the format this library requires. To do this, we use the Reader and Dataset classes. Finally, we'll split the data into train and test sets.

Making the dataset into surprise dataset and splitting it into train and test set

# Instantiating Reader scale with expected rating scale
reader = Reader(rating_scale=(0, 5))

# Loading the rating dataset
data = Dataset.load_from_df(rating[['userId', 'movieId', 'rating']], reader)

# Splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

Build the first baseline similarity based recommendation system using cosine similarity and KNN


sim_options = {'name': 'cosine',
               'user_based': True}

# Defining Nearest neighbour algorithm
algo_knn_user = KNNBasic(sim_options=sim_options,verbose=False)

# Train the algorithm on the trainset or fitting the model on train dataset 
algo_knn_user.fit(trainset)

# Predict ratings for the testset
predictions = algo_knn_user.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)
RMSE: 0.9925
0.9924509041520163

RMSE for baseline user based collaborative filtering recommendation system

The RMSE for the baseline system is 0.9925.

Predicted rating for a user and a specific movie (e.g. userId=4 with movieId=10 or movieId=3)

Let us now predict the rating for the user with userId=4 and movieId=10

algo_knn_user.predict(4, 10, r_ui=4, verbose=True)
user: 4          item: 10         r_ui = 4.00   est = 3.62   {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.6244912065910952, details={'actual_k': 40, 'was_impossible': False})

Movie 10 has an estimated rating of 3.62 for user 4.

Let's predict the rating for the same userId=4, but for a movie this user has not interacted with before, i.e. movieId=3

algo_knn_user.predict(4, 3, verbose=True)
user: 4          item: 3          r_ui = None   est = 3.20   {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.202703552548654, details={'actual_k': 40, 'was_impossible': False})

Movie 3 has an estimated rating of 3.20 for user 4.

Improving user-user similarity based recommendation system by tuning its hyper-parameters:

Below we will be tuning hyper-parameters for the KNNBasic algorithm. Its main hyperparameters are:

  1. k - the (maximum) number of neighbors taken into account for aggregation
  2. min_k - the minimum number of neighbors required; if fewer are available, the prediction falls back to the global mean rating
  3. sim_options - options for the similarity measure, such as its name ('msd', 'cosine', 'pearson') and whether it is computed between users or between items (user_based)

For more details please refer the official documentation https://surprise.readthedocs.io/en/stable/knn_inspired.html

Performing hyperparameter tuning for the baseline user based collaborative filtering recommendation system and finding the RMSE for the tuned user based collaborative filtering recommendation system:


# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [20, 30, 40], 'min_k': [3, 6, 9],
              'sim_options': {'name': ['msd', 'cosine'],
                              'user_based': [True]}
              }

# Performing 3-fold cross validation to tune the hyperparameters
grid_obj = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)

# Fitting the data
grid_obj.fit(data)

# Best RMSE score
print(grid_obj.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print(grid_obj.best_params['rmse'])
0.9652553929644568
{'k': 20, 'min_k': 3, 'sim_options': {'name': 'msd', 'user_based': True}}

Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above.

Below we examine the evaluation metrics RMSE and MAE at every split to analyze the impact of each hyperparameter value.

results_df = pd.DataFrame.from_dict(grid_obj.cv_results)
results_df.head()
split0_test_rmse split1_test_rmse split2_test_rmse mean_test_rmse std_test_rmse rank_test_rmse split0_test_mae split1_test_mae split2_test_mae mean_test_mae std_test_mae rank_test_mae mean_fit_time std_fit_time mean_test_time std_test_time params param_k param_min_k param_sim_options
0 0.965568 0.961561 0.968637 0.965255 0.002897 1 0.744661 0.738655 0.743292 0.742203 0.002570 1 0.129260 0.006148 3.364002 0.037595 {'k': 20, 'min_k': 3, 'sim_options': {'name': ... 20 3 {'name': 'msd', 'user_based': True}
1 0.994451 0.992335 0.996891 0.994559 0.001861 14 0.770900 0.766768 0.769493 0.769054 0.001715 12 0.743921 0.021103 3.064117 0.138485 {'k': 20, 'min_k': 3, 'sim_options': {'name': ... 20 3 {'name': 'cosine', 'user_based': True}
2 0.970929 0.965168 0.971683 0.969260 0.002910 4 0.748282 0.741616 0.745940 0.745279 0.002761 3 0.113563 0.005318 3.041815 0.100373 {'k': 20, 'min_k': 6, 'sim_options': {'name': ... 20 6 {'name': 'msd', 'user_based': True}
3 0.998277 0.994462 0.998090 0.996943 0.001756 15 0.773384 0.768445 0.770557 0.770795 0.002023 15 0.665888 0.040078 3.042022 0.025506 {'k': 20, 'min_k': 6, 'sim_options': {'name': ... 20 6 {'name': 'cosine', 'user_based': True}
4 0.975982 0.970484 0.978132 0.974866 0.003220 7 0.752162 0.746114 0.751751 0.750009 0.002759 6 0.123360 0.019757 3.219795 0.086318 {'k': 20, 'min_k': 9, 'sim_options': {'name': ... 20 9 {'name': 'msd', 'user_based': True}

Now, let's build the final model by using tuned values of the hyperparameters, which we received by using grid search cross-validation.

# Creating an instance of KNNBasic with tuned hyperparameter values
similarity_algo_optimized_user = KNNBasic(sim_options=sim_options, k=40, min_k=6, verbose=False)

# Training the algorithm on the trainset
similarity_algo_optimized_user.fit(trainset)

# Predicting ratings for the testset
predictions = similarity_algo_optimized_user.test(testset)

# Computing RMSE on testset
accuracy.rmse(predictions)
RMSE: 0.9908
0.9907613369496804

We can see from above that after tuning hyperparameters, the test-set RMSE has dropped from 0.9925 to 0.9908. Thus, hyperparameter tuning has slightly improved our model.

Predicted rating for userId=4 on movieId=10 and movieId=3 using tuned user based collaborative filtering:

Let's now predict the rating for the user with userId=4 and movieId=10 with the optimized model


similarity_algo_optimized_user.predict(4,10, r_ui=4, verbose=True)
user: 4          item: 10         r_ui = 4.00   est = 3.62   {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.6244912065910952, details={'actual_k': 40, 'was_impossible': False})

The predicted rating for the optimized algorithm is still 3.62.

Next we predict the rating for the same userId=4, but for a movie this user has not interacted with before, i.e. movieId=3, using the optimized model:


similarity_algo_optimized_user.predict(4,3, verbose=True)
user: 4          item: 3          r_ui = None   est = 3.20   {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.202703552548654, details={'actual_k': 40, 'was_impossible': False})

The predicted rating for the optimized algorithm is still 3.20.

Identifying similar users to a given user (nearest neighbors)

We can also find the users most similar to a given user, i.e. its nearest neighbors, with this KNNBasic algorithm. Below we find the 5 users most similar to userId=4 under the model's similarity measure.

similarity_algo_optimized_user.get_neighbors(4, k=5)
[357, 220, 590, 491, 647]
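One subtlety worth noting: surprise's get_neighbors works with internal (inner) ids rather than raw userIds, so for id-safe lookups the raw id should be mapped first. A minimal sketch, assuming the trainset fitted above:

# get_neighbors expects inner ids, so map raw userId <-> inner id explicitly
inner_uid = trainset.to_inner_uid(4)                         # raw userId -> inner id
neighbors = similarity_algo_optimized_user.get_neighbors(inner_uid, k=5)
raw_neighbors = [trainset.to_raw_uid(i) for i in neighbors]  # inner ids -> raw userIds
print(raw_neighbors)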

Implementing the recommendation algorithm based on optimized KNNBasic model

Below we implement a function whose input parameters are:

  1. data - the rating dataset
  2. user_id - the user for whom we want recommendations
  3. top_n - the number of movies to recommend
  4. algo - the trained algorithm used to predict ratings

def get_recommendations(data, user_id, top_n, algo):

    # Creating an empty list to store the recommended movie ids
    recommendations = []

    # Creating a user-item interactions matrix
    user_item_interactions_matrix = data.pivot(index='userId', columns='movieId', values='rating')

    # Extracting those movie ids which the user_id has not interacted with yet
    non_interacted_movies = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()

    # Looping through each movie id which user_id has not interacted with yet
    for item_id in non_interacted_movies:

        # Predicting the rating for this non-interacted movie id for this user
        est = algo.predict(user_id, item_id).est

        # Appending the predicted rating
        recommendations.append((item_id, est))

    # Sorting the predicted ratings in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)

    return recommendations[:top_n]  # returning the top n highest predicted rating movies for this user

Predicting the top 5 movies for userId=4 with the similarity based recommendation system:

recommendations = get_recommendations(rating, 4, 5, similarity_algo_optimized_user)
recommendations
[(98491, 4.832340578646058),
 (116, 4.753206589295344),
 (6669, 4.748048450384675),
 (1221, 4.662571141751736),
 (1192, 4.65824768595177)]

Item based Collaborative Filtering Recommendation System

# Defining similarity measure
sim_options = {'name': 'cosine',
               'user_based': False}

# Defining Nearest neighbour algorithm
algo_knn_item = KNNBasic(sim_options=sim_options, verbose=False)

# Train the algorithm on the trainset or fitting the model on train dataset 
algo_knn_item.fit(trainset)

# Predict ratings for the testset
predictions = algo_knn_item.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)
RMSE: 1.0032
1.003221450633729

RMSE for baseline item based collaborative filtering recommendation system

The baseline item-based system has an RMSE of 1.0032.

Predicted rating for userId=4 on movieId=10 and movieId=3:

Let's now predict the rating for the user with userId=4 and movieId=10.

algo_knn_item.predict(4,10, r_ui=4, verbose=True)
user: 4          item: 10         r_ui = 4.00   est = 4.37   {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=4.373794871885004, details={'actual_k': 40, 'was_impossible': False})

The system predicts a rating of 4.37 for user 4 for movie 10.

Let's predict the rating for the same userId=4, but for a movie this user has not interacted with before, i.e. movieId=3

algo_knn_item.predict(4,3, verbose=True)
user: 4          item: 3          r_ui = None   est = 4.07   {'actual_k': 40, 'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=4.071601862880049, details={'actual_k': 40, 'was_impossible': False})

The system predicts a rating of 4.07 for user 4 for movie 3.

Performing hyperparameter tuning for the baseline item based collaborative filtering recommendation system and finding the RMSE for the tuned item based collaborative filtering recommendation system:



# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [20, 30, 40], 'min_k': [3, 6, 9],
              'sim_options': {'name': ['msd', 'cosine'],
                              'user_based': [False]}
              }

# Performing 3-fold cross validation to tune the hyperparameters
grid_obj = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)

# Fitting the data
grid_obj.fit(data)

# Best RMSE score
print(grid_obj.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print(grid_obj.best_params['rmse'])
0.9401320571134547
{'k': 40, 'min_k': 6, 'sim_options': {'name': 'msd', 'user_based': False}}

Once the grid search is complete, we can get the optimal values for each of those hyperparameters, as shown above.

Below we examine the evaluation metrics RMSE and MAE at every split to analyze the impact of each hyperparameter value.

results_df = pd.DataFrame.from_dict(grid_obj.cv_results)
results_df.head()
split0_test_rmse split1_test_rmse split2_test_rmse mean_test_rmse std_test_rmse rank_test_rmse split0_test_mae split1_test_mae split2_test_mae mean_test_mae std_test_mae rank_test_mae mean_fit_time std_fit_time mean_test_time std_test_time params param_k param_min_k param_sim_options
0 0.951277 0.950129 0.950116 0.950508 0.000544 8 0.734416 0.734226 0.733322 0.733988 0.000477 7 7.439870 0.545805 11.326026 0.512808 {'k': 20, 'min_k': 3, 'sim_options': {'name': ... 20 3 {'name': 'msd', 'user_based': False}
1 1.012004 1.016567 1.014096 1.014222 0.001865 17 0.789165 0.793400 0.791117 0.791227 0.001731 16 20.378437 1.080052 10.987197 0.362469 {'k': 20, 'min_k': 3, 'sim_options': {'name': ... 20 3 {'name': 'cosine', 'user_based': False}
2 0.951213 0.950136 0.950120 0.950490 0.000512 7 0.734455 0.734210 0.733590 0.734085 0.000364 8 6.380261 0.199738 11.128263 0.456450 {'k': 20, 'min_k': 6, 'sim_options': {'name': ... 20 6 {'name': 'msd', 'user_based': False}
3 1.011952 1.016582 1.014071 1.014202 0.001892 16 0.789239 0.793386 0.791373 0.791333 0.001693 17 19.012726 0.650604 11.196412 0.564934 {'k': 20, 'min_k': 6, 'sim_options': {'name': ... 20 6 {'name': 'cosine', 'user_based': False}
4 0.951610 0.950883 0.950264 0.950919 0.000550 9 0.734750 0.734794 0.733791 0.734445 0.000463 9 5.797267 0.316200 11.779861 0.483101 {'k': 20, 'min_k': 9, 'sim_options': {'name': ... 20 9 {'name': 'msd', 'user_based': False}

Now let's build the final model using the tuned hyperparameter values obtained from grid search cross-validation.

# Creating an instance of KNNBasic with tuned hyperparameter values
similarity_algo_optimized_item = KNNBasic(sim_options={'name': 'msd', 'user_based': False}, k=30, min_k=6, verbose=False)

# Training the algorithm on the trainset
similarity_algo_optimized_item.fit(trainset)

# Predicting ratings for the testset
predictions = similarity_algo_optimized_item.test(testset)

# Computing RMSE on testset
accuracy.rmse(predictions)
RMSE: 0.9465
0.9465120620317036

The final model has an RMSE of 0.9465.

Predicted rating for userId=4 on movieId=10 and movieId=3 using tuned item based collaborative filtering:

Let's now predict the rating for the user with userId=4 and movieId=10 with the optimized model, as shown below

similarity_algo_optimized_item.predict(4,10, r_ui=4, verbose=True)
user: 4          item: 10         r_ui = 4.00   est = 4.30   {'actual_k': 30, 'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=4.298279280483517, details={'actual_k': 30, 'was_impossible': False})

The predicted rating for movie 10 is 4.298.

Let's predict the rating for the same userId=4, but for a movie this user has not interacted with before, i.e. movieId=3, using the optimized model:

similarity_algo_optimized_item.predict(4, 3, verbose=True)
user: 4          item: 3          r_ui = None   est = 3.86   {'actual_k': 30, 'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.859023126306401, details={'actual_k': 30, 'was_impossible': False})

The predicted rating for movie 3 is 3.859.

Identifying similar items to a given item (nearest neighbors)

Because this model is item-based (user_based=False), the similarity matrix is computed between items, so get_neighbors returns similar items rather than similar users. Below we find the 5 items most similar to the item with inner id 4.

similarity_algo_optimized_item.get_neighbors(4, k=5)
[77, 85, 115, 119, 127]

Predicting the top 5 movies for userId=4 with the item based similarity recommendation system:

recommendations = get_recommendations(rating, 4, 5, similarity_algo_optimized_item)
recommendations
[(84, 5), (1040, 5), (2481, 5), (3515, 5), (4521, 5)]

Model-Based Collaborative Filtering (Matrix Factorization using SVD)

Model-based collaborative filtering is a personalized recommendation approach: recommendations are based on the past behavior of the user and do not depend on any additional information. We use latent features to find recommendations for each user.

Singular Value Decomposition (SVD)

SVD is used to compute the latent features from the user-item matrix, but classic SVD does not work when values are missing from the user-item matrix.

First we need to convert the movie-rating dataset into a user-item matrix, which we have already done above while computing similarities.

SVD decomposes this matrix into three separate matrices:

U-matrix

An n x k matrix, where n is the number of users and each row represents a user in terms of k latent features.

Sigma-matrix

A k x k diagonal matrix, whose diagonal entries (the singular values) indicate the importance of each latent feature.

V-transpose matrix

A k x m matrix, where m is the number of items and each column represents an item in terms of the k latent features.
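As a minimal illustration (a toy, fully observed matrix, not the actual ratings data), numpy's SVD can decompose a small user-item matrix and rebuild a low-rank approximation from its factors:

import numpy as np

# Toy 3-user x 4-movie rating matrix (fully observed, unlike real rating data)
R = np.array([[5.0, 4.0, 1.0, 1.0],
              [4.0, 5.0, 2.0, 1.0],
              [1.0, 1.0, 5.0, 4.0]])

# Full SVD: R = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep k = 2 latent features and rebuild an approximation of R
k = 2
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(R_approx, 2))

Because real rating matrices are mostly empty, surprise's SVD algorithm does not compute this exact decomposition; instead it learns the user and item factors directly by gradient descent, as described next.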

Build a baseline matrix factorization recommendation system

# Using SVD matrix factorization
algo_svd = SVD()

# Training the algorithm on the trainset
algo_svd.fit(trainset)

# Predicting ratings for the testset
predictions = algo_svd.test(testset)

# Computing RMSE on the testset
accuracy.rmse(predictions)
RMSE: 0.9031
0.9031390885282595

RMSE for baseline SVD based collaborative filtering recommendation system:

The baseline SVD-based system has an RMSE of 0.9031.

Predicted rating for userId=4 on movieId=10 and movieId=3:

Let's now predict the rating for the user with userId=4 and movieId=10

algo_svd.predict(4, 10, r_ui=4, verbose=True)
user: 4          item: 10         r_ui = 4.00   est = 3.92   {'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.9176599589678984, details={'was_impossible': False})

The SVD system predicts a rating of 3.918 for movie 10 for user 4.

Let's predict the rating for the same userId=4, but for a movie this user has not interacted with before, i.e. movieId=3:

algo_svd.predict(4, 3, verbose=True)
user: 4          item: 3          r_ui = None   est = 3.64   {'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.6373797205620138, details={'was_impossible': False})

The SVD system predicts a rating of 3.637 for movie 3 for user 4.

Improving matrix factorization based recommendation system by tuning its hyper-parameters

In SVD, the rating is predicted as:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u$$

If user $u$ is unknown, then the bias $b_u$ and the factors $p_u$ are assumed to be zero. The same applies for item $i$ with $b_i$ and $q_i$.

To estimate all the unknowns, we minimize the following regularized squared error:

$$\sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda\left(b_i^2 + b_u^2 + \|q_i\|^2 + \|p_u\|^2\right)$$
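As a small worked example of the prediction rule (all numbers hypothetical, not learned from this dataset):

import numpy as np

# Hypothetical learned parameters for one user u and one item i
mu, b_u, b_i = 3.5, 0.2, -0.1
q_i = np.array([0.3, -0.5])   # item latent factors
p_u = np.array([0.8,  0.1])   # user latent factors

# Predicted rating: r_hat = mu + b_u + b_i + q_i . p_u
r_hat = mu + b_u + b_i + q_i @ p_u
print(r_hat)  # 3.5 + 0.2 - 0.1 + (0.24 - 0.05) = 3.79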

The minimization is performed by stochastic gradient descent. There are many hyperparameters to tune in this algorithm; you can find a full list of hyperparameters here.

Below we will be tuning only three hyperparameters:

  1. n_epochs - the number of iterations of the SGD procedure
  2. lr_all - the learning rate for all parameters
  3. reg_all - the regularization term for all parameters

Performing hyperparameter tuning for the baseline SVD based collaborative filtering recommendation system and finding the RMSE for tuned SVD based collaborative filtering recommendation system:

# Set the parameter space to tune
param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
              'reg_all': [0.2, 0.4, 0.6]}

# Performing 3-fold gridsearch cross validation
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)

# Fitting data
gs.fit(data)

# Best RMSE score
print(gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])
0.8937463944379922
{'n_epochs': 30, 'lr_all': 0.01, 'reg_all': 0.2}

Once the grid search is complete, we can get the optimal values for each of those hyperparameters, as shown above.

Below we examine the evaluation metrics RMSE and MAE at every split to analyze the impact of each hyperparameter value.

results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df.head()
split0_test_rmse split1_test_rmse split2_test_rmse mean_test_rmse std_test_rmse rank_test_rmse split0_test_mae split1_test_mae split2_test_mae mean_test_mae std_test_mae rank_test_mae mean_fit_time std_fit_time mean_test_time std_test_time params param_n_epochs param_lr_all param_reg_all
0 0.937088 0.941831 0.950693 0.943204 0.005638 25 0.734681 0.736846 0.742574 0.738034 0.003330 25 4.822517 0.412795 0.588699 0.047995 {'n_epochs': 10, 'lr_all': 0.001, 'reg_all': 0.2} 10 0.001 0.2
1 0.941132 0.946314 0.955085 0.947510 0.005759 26 0.739717 0.742279 0.747904 0.743300 0.003419 26 5.001963 0.058948 0.622447 0.061483 {'n_epochs': 10, 'lr_all': 0.001, 'reg_all': 0.4} 10 0.001 0.4
2 0.946071 0.951274 0.960601 0.952649 0.006011 27 0.744921 0.747856 0.753383 0.748720 0.003508 27 5.115259 0.133874 0.694518 0.083622 {'n_epochs': 10, 'lr_all': 0.001, 'reg_all': 0.6} 10 0.001 0.6
3 0.899909 0.907059 0.913570 0.906846 0.005579 10 0.698638 0.702193 0.706866 0.702566 0.003369 9 4.998776 0.122980 0.601300 0.095280 {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.2} 10 0.005 0.2
4 0.907404 0.913606 0.920989 0.914000 0.005553 15 0.706287 0.709524 0.714526 0.710112 0.003389 15 4.929785 0.117247 0.628400 0.108293 {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4} 10 0.005 0.4

Now we will build the final model using the tuned hyperparameter values we obtained from grid search cross-validation above.

# Building the optimized SVD model using optimal hyperparameter search
svd_algo_optimized = SVD(n_epochs=20, lr_all=0.01, reg_all=0.2)

# Training the algorithm on the trainset
svd_algo_optimized.fit(trainset)

# Predicting ratings for the testset
predictions = svd_algo_optimized.test(testset)

# Computing RMSE
accuracy.rmse(predictions)
RMSE: 0.8973
0.8972580357427976

Predicted rating for userId=4 on movieId=10 and movieId=3 using SVD based collaborative filtering:

Let us now predict the rating for the user with userId=4 and movieId=10 with the optimized model

svd_algo_optimized.predict(4, 10, r_ui=4, verbose=True)
user: 4          item: 10         r_ui = 4.00   est = 3.97   {'was_impossible': False}
Prediction(uid=4, iid=10, r_ui=4, est=3.9746300660681904, details={'was_impossible': False})

The predicted rating of movie 10 for user 4 using SVD collab filtering is 3.975.

Let's predict the rating for the same userId=4, but for a movie this user has not interacted with before, i.e. movieId=3:

svd_algo_optimized.predict(4, 3, verbose=True)
user: 4          item: 3          r_ui = None   est = 3.65   {'was_impossible': False}
Prediction(uid=4, iid=3, r_ui=None, est=3.6494509461174185, details={'was_impossible': False})

The predicted rating of movie 3 for user 4 using SVD collab filtering is 3.649.

Predicting the top 5 movies for userId=4 with SVD based recommendation system:

get_recommendations(rating, 4, 5, svd_algo_optimized)
[(926, 4.938831611532595),
 (1192, 4.911663964684779),
 (1217, 4.876880839499725),
 (3035, 4.864153522050881),
 (232, 4.8568399745706605)]

Predicting ratings for already interacted movies:

Below we compare the predicted ratings against the actual ratings for movies a user has already watched. This will help us understand how good our predictions are compared to the ratings users actually provided.

def predict_already_interacted_ratings(data, user_id, algo):

    # Creating an empty list to store the (movie, actual, predicted) tuples
    recommendations = []

    # Creating a user-item interactions matrix
    user_item_interactions_matrix = data.pivot(index='userId', columns='movieId', values='rating')

    # Extracting those movie ids which the user_id has already interacted with
    interacted_movies = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].notnull()].index.tolist()

    # Looping through each movie id which user_id has already interacted with
    for item_id in interacted_movies:

        # Extracting the actual rating
        actual_rating = user_item_interactions_matrix.loc[user_id, item_id]

        # Predicting the rating for this movie and user
        predicted_rating = algo.predict(user_id, item_id).est

        # Appending the actual and predicted ratings
        recommendations.append((item_id, actual_rating, predicted_rating))

    # Sorting by actual rating in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)

    return pd.DataFrame(recommendations, columns=['movieId', 'actual_rating', 'predicted_rating'])

Here we compare the ratings predicted by the similarity based recommendation system against the actual ratings for userId=7.

predicted_ratings_for_interacted_movies = predict_already_interacted_ratings(rating, 7, similarity_algo_optimized_item)
df = predicted_ratings_for_interacted_movies.melt(id_vars='movieId', value_vars=['actual_rating', 'predicted_rating'])
sns.displot(data=df, x='value', hue='variable', kde=True);

Below we compare the ratings predicted by the matrix factorization based recommendation system against the actual ratings for userId=7.

predicted_ratings_for_interacted_movies = predict_already_interacted_ratings(rating, 7, svd_algo_optimized)
df = predicted_ratings_for_interacted_movies.melt(id_vars='movieId', value_vars=['actual_rating', 'predicted_rating'])
sns.displot(data=df, x='value', hue='variable', kde=True);

Precision and Recall @ k

RMSE is not the only metric we can use here. We can also examine two fundamental measures, precision and recall, together with a cutoff parameter k, the number of recommendations shown to the user.

Precision@k - the fraction of recommended items that are relevant within the top k predictions. The value of k is the number of recommendations provided to the user, and it can vary per user.

Recall@k - the fraction of relevant items that are recommended to the user within the top k predictions.

Recall - the fraction of actually relevant items that are recommended to the user, i.e. if 6 out of 10 relevant movies are recommended, recall is 0.60. The higher the recall, the better the model. It is a standard metric for assessing classification models.

Precision - the fraction of recommended items that are actually relevant, i.e. if 6 out of 10 recommended items are found relevant by the user, precision is 0.60. The higher the precision, the better the model. It is likewise a standard metric for assessing classification models.

To know more about precision recall in recommendation systems, you can refer to the documentation or this Medium article.

Computing the precision and recall, for each of the 6 models, at k = 5 and 10:

# Function can be found in the surprise documentation FAQs
def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: proportion of recommended items that are relevant.
        # When n_rec_k is 0, precision is undefined; we set it to 0 here.
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: proportion of relevant items that are recommended.
        # When n_rel is 0, recall is undefined; we set it to 0 here.
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    return precisions, recalls
# A basic cross-validation iterator.
kf = KFold(n_splits=5)

# Make list of k values
K = [5, 10]

# Make list of models
models = [algo_knn_user, similarity_algo_optimized_user, algo_knn_item, similarity_algo_optimized_item, algo_svd, svd_algo_optimized]

for k in K:
    for model in models:
        print('> k={}, model={}'.format(k, model.__class__.__name__))
        p = []
        r = []
        for trainset, testset in kf.split(data):
            model.fit(trainset)
            predictions = model.test(testset, verbose=False)
            precisions, recalls = precision_recall_at_k(predictions, k=k, threshold=3.5)

            # Precision and recall can then be averaged over all users
            p.append(sum(prec for prec in precisions.values()) / len(precisions))
            r.append(sum(rec for rec in recalls.values()) / len(recalls))

        print('-----> Precision: ', round(sum(p) / len(p), 3))
        print('-----> Recall: ', round(sum(r) / len(r), 3))
> k=5, model=KNNBasic
-----> Precision:  0.764
-----> Recall:  0.414
> k=5, model=KNNBasic
-----> Precision:  0.77
-----> Recall:  0.42
> k=5, model=KNNBasic
-----> Precision:  0.604
-----> Recall:  0.322
> k=5, model=KNNBasic
-----> Precision:  0.684
-----> Recall:  0.358
> k=5, model=SVD
-----> Precision:  0.756
-----> Recall:  0.386
> k=5, model=SVD
-----> Precision:  0.749
-----> Recall:  0.384
> k=10, model=KNNBasic
-----> Precision:  0.754
-----> Recall:  0.549
> k=10, model=KNNBasic
-----> Precision:  0.75
-----> Recall:  0.561
> k=10, model=KNNBasic
-----> Precision:  0.597
-----> Recall:  0.475
> k=10, model=KNNBasic
-----> Precision:  0.665
-----> Recall:  0.505
> k=10, model=SVD
-----> Precision:  0.738
-----> Recall:  0.522
> k=10, model=SVD
-----> Precision:  0.731
-----> Recall:  0.523

Discussion

The baseline user-based and item-based collaborative models have nearly the same RMSE (0.9925 and 1.0032). The tuned collaborative filtering models performed better than their baselines: the tuned user-user model achieves an RMSE of 0.9908, and the tuned item-item model does better still at 0.9465.

The collaborative models use the user-item-ratings data to find similarities and make predictions, rather than just predicting a rating from the overall distribution of the data. This could be a reason why collaborative filtering performed well.

Collaborative filtering searches for neighbors based on similarity of preferences and recommends items those neighbors interacted with, while matrix factorization works by decomposing the user-item matrix into the product of two lower-dimensional rectangular matrices.

The RMSE for matrix factorization is better than that of the collaborative filtering models, although tuning the SVD model did not improve much on the baseline SVD. Matrix factorization achieves a lower RMSE because it assumes that both items and users live in a low-dimensional latent space describing their properties, and it recommends an item based on its proximity to the user in that space.

Conclusions

In this case study, we saw three different ways of building recommendation systems:

  1. Knowledge/rank based recommendation system
  2. Similarity based collaborative filtering (user-user and item-item)
  3. Matrix factorization based collaborative filtering (SVD)

We also saw the advantages and disadvantages of these recommendation systems and when to use each kind. Once these systems are built, we can use A/B testing to measure their effectiveness. Here is an article explaining how Amazon uses A/B testing to measure the effectiveness of its recommendation systems.