AI Movies Recommendation System Based on K-Means Clustering Algorithm

Overview of Article

Syed Muhammad Asad
25 min read · Aug 18, 2020

In this article, we’ll build an artificial intelligence movie recommendation system using k-means, a clustering algorithm. We’ll recommend movies to users that are relevant to them based on their previous history. We’ll import only the records where users have rated movies 4+, since we want to recommend only the movies users liked most. Throughout this article we use the Python programming language with its associated libraries, i.e. NumPy, Pandas, Matplotlib and Scikit-Learn. We assume the reader is familiar with Python and the aforementioned libraries.

Introduction to AI Movies Recommendation System

In this busy life, people don’t have time to search for their desired item; they want it brought to them with little effort. So recommendation systems have become an important tool, both to help users make the right choice and to grow a product. Data is increasing day by day, and with such large databases it has become difficult to find the items most relevant to our interests: often we can’t find an item we’d like from just a title, and sometimes it is even harder than that. A recommendation system helps by surfacing the most relevant items in the database for each individual.

In this article, we’ll build a movie recommendation system. Recommendation systems have become an essential part of movie websites because an individual can’t tell from just a title or genre which movies will interest them. Someone may like action movies, but he/she will not like every action movie. To handle this problem, many authors have proposed recommending movies to user 1 from the watch list or favorite movies of another user 2 whose movie history is most similar to user 1’s. That is, if two people have the same taste, each will probably like the other’s favorites. Many tech giants, such as YouTube and Netflix, use recommendation systems like this in their applications.

For this task, machine learning (ML) models help us build recommendation systems from users’ previous watch history. ML models learn from watch history and categorize users into groups with similar taste. Different types of ML models have been used for this, such as clustering algorithms and deep learning models.

K-Means Clustering Algorithm

K-means is an unsupervised machine learning algorithm that categorizes data into different groups. In this article we’ll use it to categorize users based on the movies they have rated 4+. I won’t describe the full mathematical background of this algorithm, only a little intuition. If you want the mathematical background, I suggest searching for it; many authors have written articles on it. Since the complete mathematics behind this algorithm is handled by the Scikit-Learn library, we’ll only build an understanding of it and implement it.

Note: the plots in this section are made from random data, purely for intuition about the k-means algorithm.

Figure 1 — Scatter Plot Before K-Means Clustering

Suppose we have 2-dimensional data in the form (x₁, x₂), plotted in Figure (1). We want to divide this data into groups. Looking at the data, we can observe that it falls naturally into three groups; in this plot, designed purely for intuition, anyone can see the three groups. But sometimes we have very complex or big data, or data in 3, 4, 100, 1000 or even more dimensions. Then it is not possible for a human to categorize the data, and we can’t even plot data of such high dimension. Also, we often don’t know the optimal number of clusters for our data. So we use clustering algorithms, which work even for big data with thousands of dimensions, together with methods that tell us the optimal number of clusters.

Figure 2 — Scatter Plot After K-Means Clustering

Figure (2) shows a demonstration of k-means clustering: the data of Figure (1) has been categorized into three groups, presented with a unique color for each group.

A question arises: how does k-means actually categorize the data?

To categorize data into groups containing the same type of items, the k-means algorithm follows 6 steps. Figure (3) presents these steps.

Figure 3 — Graphical Abstract of K-Means Algorithm

Figure (3) describes the following steps of the k-means algorithm.

  1. First, select the number of clusters k we want for our dataset. Later, the elbow method will be explained for selecting the optimal number of clusters.
  2. Then select k random points, called centroids, which are not necessarily from our dataset. To avoid the random initialization trap, which can get stuck in bad clusters, we’ll use k-means++ to initialize the k centroids; Scikit-Learn’s k-means provides it.
  3. The algorithm assigns each data point to its closest centroid, which gives us k clusters.
  4. Each centroid is then re-centered to the mean of its own cluster, becoming the new centroid.
  5. All clusters are reset and every data point is assigned to its new closest centroid.
  6. If the new clusters are the same as the previous ones OR the maximum number of iterations has been reached, the algorithm stops and gives us the final clusters of our dataset. Otherwise, it goes back to step 4. (A minimal code sketch of this loop follows.)
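To make these steps concrete, here is a minimal NumPy sketch of the k-means loop, purely for intuition. The function name kmeans_sketch is mine, it uses plain random initialization rather than k-means++ (which Scikit-Learn will handle for us later), and it assumes no cluster ever becomes empty:

import numpy as np

def kmeans_sketch(X, k, max_iterations = 300, seed = 0):
    # Steps 1-2: choose k and pick k initial centroids (a plain random
    # choice from the data here, not k-means++)
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), size = k, replace = False)]
    for _ in range(max_iterations):
        # Steps 3/5: assign every point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis = 2)
        labels = distances.argmin(axis = 1)
        # Step 4: re-center each centroid to the mean of its own cluster
        new_centroids = np.array([X[labels == i].mean(axis = 0) for i in range(k)])
        # Step 6: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids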

Elbow Method

The elbow method is the best way to find the optimal number of clusters. For this, we need the within-cluster sum of squares (WCSS). WCSS is the sum of squared distances of each point from its cluster centroid, and its mathematical formula is the following:

WCSS = Σᵢ Σⱼ ‖Pᵢ,ⱼ − Cᵢ‖², where i runs over the clusters 1, …, K and j over the points 1, …, Nᵢ of cluster i,

and K is the total number of clusters, Nᵢ is the size of the i’th cluster (the number of data points in it), Cᵢ is the centroid of the i’th cluster and Pᵢ,ⱼ is the j’th data point of the i’th cluster.
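For intuition, here is a small sketch (the helper name wcss is mine) that computes WCSS directly from this formula for NumPy arrays; Scikit-Learn exposes the same quantity for a fitted model as KMeans.inertia_, which we’ll rely on later:

def wcss(X, labels, centroids):
    # Sum over clusters of the squared distances of each point P_ij
    # from its own cluster centroid C_i
    total = 0.0
    for i, c in enumerate(centroids):
        cluster_points = X[labels == i]   # the N_i points of cluster i
        total += ((cluster_points - c) ** 2).sum()
    return total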

So, what we’ll do with WCSS?

WCSS tells us how far the data points are from their centroids. As we increase the number of clusters, WCSS gets smaller; after some value of K it decreases only slowly, and that is where we stop and pick the optimal number of clusters. I suggest searching for more worked examples of the elbow method. Here is a figure for intuition.

Figure 4 — Elbow Method Plot

A demonstration of the elbow method is shown in Figure (4). We can observe that as the number of clusters K moves from 1 to 5, the WCSS value decreases rapidly, from roughly 2500 to roughly 400. But from cluster 6 onward it decreases slowly. So we can judge that 5 clusters is a good choice for this dataset. Further, the curve looks like an arm, and the joint of the elbow is the optimal number of clusters, which in this case is 5. Later we’ll see that we don’t always get such a smooth curve, so in this work I describe another way to observe the changes in WCSS and find the optimal number of clusters.

Methodology Used in this Article

In this article, we’ll build a clustering-based recommender that categorizes users into groups of similar interest using the k-means algorithm. We’ll use only the ratings where users rated movies 4+, on the supposition that if a user rates a movie 4+ then he/she likes it. We downloaded The Movies Dataset (a MovieLens dataset) from Kaggle.com. In the following sections we describe the whole project: Importing the Dataset -> Data Engineering -> Building the K-Means Clustering Model -> Analyzing the Optimal Number of Clusters -> Training the Model and Predicting -> Fixing Clusters -> Saving the Training -> Finally, Making Recommendations for Users. The complete movie recommendation project can be downloaded from my GitHub repository AI Movies Recommendation System Based on K-means Clustering Algorithm. A Jupyter notebook of this article is also provided in the repository; you can download it and play with it.

URL: https://github.com/asdkazmi/AI-Movies-Recommendation-System-K-Means-Clustering
URL: https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings.csv

Now let’s start working on the code:

Importing All Required Libraries

import pandas as pd
print('Pandas version: ', pd.__version__)

import numpy as np
print('NumPy version: ', np.__version__)

import matplotlib
print('Matplotlib version: ', matplotlib.__version__)

from matplotlib import pyplot as plt

import sklearn
print('Scikit-Learn version: ', sklearn.__version__)

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.cluster import KMeans


import pickle
print('Pickle version: ', pickle.format_version)

import sys
print('Sys version: ', sys.version[0:5])

from sys import exc_info

import ast

Out:

Pandas version:  0.25.1
NumPy version: 1.16.5
Matplotlib version: 3.1.1
Scikit-Learn version: 0.21.3
Pickle version: 4.0
Sys version: 3.7.4

Data Engineering

This section is divided into two subsections. First, we’ll import the data and reduce it to a sub-DataFrame, so that we can focus on our model and inspect which movies users have rated and what gets recommended to them as a result. Second, we’ll perform feature engineering so that the data is in a form valid for a machine learning algorithm.

Preparing Data for Model

We downloaded the MovieLens dataset from Kaggle.com. First we import the ratings dataset, because we want users’ ratings of movies; then we’ll filter it down to the rows where users gave 4+ ratings.

ratings = pd.read_csv('./Prepairing Data/From Data/ratings.csv', usecols = ['userId', 'movieId','rating'])
print('Shape of ratings dataset is: ',ratings.shape, '\n')
print('Max values in dataset are \n',ratings.max(), '\n')
print('Min values in dataset are \n',ratings.min(), '\n')

Out:

Shape of ratings dataset is:  (26024289, 3) 

Max values in dataset are
userId 270896.0
movieId 176275.0
rating 5.0
dtype: float64

Min values in dataset are
userId 1.0
movieId 1.0
rating 0.5
dtype: float64

Next, we’ll filter this dataset to only 4+ ratings.

# Filtering data for only 4+ ratings
ratings = ratings[ratings['rating'] >= 4.0]
print('Shape of ratings dataset is: ',ratings.shape, '\n')
print('Max values in dataset are \n',ratings.max(), '\n')
print('Min values in dataset are \n',ratings.min(), '\n')

Out:

Shape of ratings dataset is:  (12981742, 3) 

Max values in dataset are
userId 270896.0
movieId 176271.0
rating 5.0
dtype: float64

Min values in dataset are
userId 1.0
movieId 1.0
rating 4.0
dtype: float64

So now the minimum rating given by users is 4.0, and the dataset has shrunk from 2.6e⁷ rows to about 1.3e⁷, slightly less than half of the original. But the dataset is still large, and we want to reduce it more.

For the purposes of this article, I want to work with a small dataset, so we’ll take a subset covering only the first 200 movies. Later, when we reduce it further to the first 100 users, we may be left with fewer than 200 movies that those users rated; we want to end up with around 100 movies.

movies_list = np.unique(ratings['movieId'])[:200]
ratings = ratings.loc[ratings['movieId'].isin(movies_list)]
print('Shape of ratings dataset is: ',ratings.shape, '\n')
print('Max values in dataset are \n',ratings.max(), '\n')
print('Min values in dataset are \n',ratings.min(), '\n')

Out:

Shape of ratings dataset is:  (776269, 3) 

Max values in dataset are
userId 270896.0
movieId 201.0
rating 5.0
dtype: float64

Min values in dataset are
userId 1.0
movieId 1.0
rating 4.0
dtype: float64

The dataset is still large, so we take another subset of the ratings, keeping not all users but only the first 100.

users_list = np.unique(ratings['userId'])[:100]
ratings = ratings.loc[ratings['userId'].isin(users_list)]
print('Shape of ratings dataset is: ',ratings.shape, '\n')
print('Max values in dataset are \n',ratings.max(), '\n')
print('Min values in dataset are \n',ratings.min(), '\n')
print('Total Users: ', np.unique(ratings['userId']).shape[0])
print('Total Movies which are rated by 100 users: ', np.unique(ratings['movieId']).shape[0])

Out:

Shape of ratings dataset is:  (447, 3) 

Max values in dataset are
userId 157.0
movieId 198.0
rating 5.0
dtype: float64

Min values in dataset are
userId 1.0
movieId 1.0
rating 4.0
dtype: float64

Total Users: 100
Total Movies which are rated by 100 users: 83

And finally, it’s done. We have a dataset of shape (447, 3) which includes 4+ ratings of 83 movies by 100 users. We started with 200 movies, but when we restricted to the first 100 users it turned out that 117 of those movies were not rated by any of them.

We are no longer worried about the ratings column: we have supposed that every movie a user rated 4+ is of interest to him/her, so if a movie interests user 1, it should also interest another user 2 of the same taste. We can now drop the column, since every remaining movie counts as a favorite of its user.

users_fav_movies = ratings.loc[:, ['userId', 'movieId']]

Since we filtered the DataFrame, the index may no longer be in proper order. Now we want to reset it.

users_fav_movies = users_fav_movies.reset_index(drop = True)

And finally, here is our final DataFrame of the first 100 users’ favorite movies drawn from the list of the first 200 movies. The DataFrame below is printed transposed.

users_fav_movies.T

Now, let’s save this DataFrame to a CSV file locally, so that we can use it later.

users_fav_movies.to_csv('./Prepairing Data/From Data/filtered_ratings.csv')

Feature Engineering

In this section, we’ll create the sparse matrix that we’ll feed to k-means. First, let’s define a function that returns each user’s list of movies from the dataset.

def moviesListForUsers(users, users_data):
    # users = a list of user IDs
    # users_data = a DataFrame of users' favourite (or watched) movies
    users_movies_list = []
    for user in users:
        # "[1, 47]" -> "1, 47": one comma-separated string per user
        users_movies_list.append(str(list(users_data[users_data['userId'] == user]['movieId'])).split('[')[1].split(']')[0])
    return users_movies_list

The method moviesListForUsers returns a list containing, for each user, a string of that user’s favorite movie IDs. Later we’ll use CountVectorizer to extract features from these strings.

Note: the method moviesListForUsers returns the list in the same order as the users list. So, to avoid any mix-up, we keep the users list in sorted order (as produced by np.unique).

The method defined above needs a list of users and the users_data DataFrame. We already have the DataFrame; now let’s prepare the users list.

users = np.unique(users_fav_movies['userId'])
print(users.shape)

Out:

(100,)

Now, let’s prepare the list of movies for each user.

users_movies_list = moviesListForUsers(users, users_fav_movies)
print('Movies list for', len(users_movies_list), ' users')
print('A list of first 10 users favourite movies: \n', users_movies_list[:10])

Out:

Movies list for 100  users
A list of first 10 users favourite movies:
['147', '64, 79', '1, 47', '1, 150', '150, 165', '34', '1, 16, 17, 29, 34, 47, 50, 82, 97, 123, 125, 150, 162, 175, 176, 194', '6', '32, 50, 111, 198', '81']

Above are the favorite-movie strings for the first 10 users: the first string contains the first user’s favorite movie IDs, the second the second user’s, and so on. The 7th user’s list is noticeably longer than the others.

Now, we’ll prepare a sparse matrix with one row per user and one column per movie: 1 if the user has the movie in their favorites, else 0.

Let us first define a function for the sparse matrix.

def prepSparseMatrix(list_of_str):
    # list_of_str = a list of strings, each containing one user's favourite
    # movie IDs separated by commas ","
    # Returns the sparse matrix and the feature names it is defined on,
    # i.e. the movie IDs in the same order as the matrix columns
    cv = CountVectorizer(token_pattern = r'[^\,\ ]+', lowercase = False)
    sparseMatrix = cv.fit_transform(list_of_str)
    return sparseMatrix.toarray(), cv.get_feature_names()
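As a quick check, here is a toy run of prepSparseMatrix on made-up favorite lists (IDs chosen for illustration, not from our dataset). The token pattern treats everything between commas and spaces as one movie ID, and the columns follow CountVectorizer’s sorted vocabulary:

toy_matrix, toy_features = prepSparseMatrix(['1, 47', '1, 150', '47'])
print(toy_features)   # ['1', '150', '47']  (lexicographically sorted IDs)
print(toy_matrix)
# [[1 0 1]
#  [1 1 0]
#  [0 0 1]]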

Now, let’s prepare the sparse matrix.

sparseMatrix, feature_names = prepSparseMatrix(users_movies_list)

Now let’s put it into a DataFrame for a clearer presentation: the columns will represent the movies and the index will represent the user IDs.

df_sparseMatrix = pd.DataFrame(sparseMatrix, index = users, columns = feature_names)
df_sparseMatrix

Now, let’s verify that the matrix we defined above is exactly what we want. We’ll check it for a few users.

Let’s take a look at some users’ favorite movie lists.

first_6_users_SM = users_fav_movies[users_fav_movies['userId'].isin(users[:6])].sort_values('userId')
first_6_users_SM.T

Now, let’s check that the users with the above IDs have the value 1 in the columns of their favorite movies and 0 otherwise. Remember that in the DataFrame df_sparseMatrix the indexes are user IDs.

df_sparseMatrix.loc[np.unique(first_6_users_SM['userId']), list(map(str, np.unique(first_6_users_SM['movieId'])))]

We can observe from the two DataFrames above that our sparse matrix is correct, with values in the proper places. We’re done with data engineering; now let’s create our machine learning clustering model with the k-means algorithm.

Clustering Model

To cluster the data, we first need to find the optimal number of clusters. For this purpose, we’ll define a class for the elbow method containing two methods: one to run the k-means algorithm for different numbers of clusters, and another to show the plot.

class elbowMethod():
    def __init__(self, sparseMatrix):
        self.sparseMatrix = sparseMatrix
        self.wcss = list()
        self.differences = list()
    def run(self, init, upto, max_iterations = 300):
        for i in range(init, upto + 1):
            kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = max_iterations, n_init = 10, random_state = 0)
            kmeans.fit(self.sparseMatrix)
            self.wcss.append(kmeans.inertia_)
        # Recompute the differences between each two consecutive WCSS values
        self.differences = list()
        for i in range(len(self.wcss) - 1):
            self.differences.append(self.wcss[i] - self.wcss[i + 1])
    def showPlot(self, boundary = 500, upto_cluster = None):
        if upto_cluster is None:
            WCSS = self.wcss
            DIFF = self.differences
        else:
            WCSS = self.wcss[:upto_cluster]
            DIFF = self.differences[:upto_cluster - 1]
        plt.figure(figsize = (15, 6))
        plt.subplot(121).set_title('Elbow Method Graph')
        plt.plot(range(1, len(WCSS) + 1), WCSS)
        plt.grid(b = True)
        plt.subplot(122).set_title('Differences in Each Two Consecutive Clusters')
        len_differences = len(DIFF)
        X_differences = range(1, len_differences + 1)
        plt.plot(X_differences, DIFF)
        plt.plot(X_differences, np.ones(len_differences) * boundary, 'r')
        plt.plot(X_differences, np.ones(len_differences) * (-boundary), 'r')
        plt.grid()
        plt.show()

Why do we write the elbow method as a class?

Since we don’t know in advance where the elbow (the optimal number of clusters) will appear, we write it as a class so that the WCSS values are kept in an attribute and never lost. We might first run the elbow method for cluster counts 1–10, plot it, and find we haven’t reached the elbow joint yet and need to run further. Then we can run the same instance from 11–20, and so on, until we find the elbow’s joint. This saves us the time of re-running from 1–20, and we lose no data from previous runs.

You may notice that in the showPlot method above I draw two plots. Here I’m using another strategy for when we can’t observe a clear elbow: plotting the difference between each two consecutive WCSS values, with a boundary line for a clearer view of how WCSS is changing. When the changes in the WCSS value stay inside our required boundary, we’ll say we have found the elbow after which the changes are small. See the plots below.

First, let’s analyze clusters 1–10 with a boundary of 10, i.e. once the changes in the WCSS value remain inside the boundary, we’ll say we have found an elbow after which the change is small.

Remember that the DataFrame df_sparseMatrix was only for presentation of sparseMatrix. For the algorithm, we always use the matrix sparseMatrix itself.

Let’s first create an instance of the elbow method on our sparseMatrix.

elbow_method = elbowMethod(sparseMatrix) 

First we’ll run it for 1–10 clusters, i.e. k-means will run for k=1, then k=2, and so on up to k=10.

elbow_method.run(1, 10)
elbow_method.showPlot(boundary = 10)

We don’t have a clear elbow yet, and the differences are not yet inside the boundary. Now let’s run it for clusters 11–30.

elbow_method.run(11, 30)
elbow_method.showPlot(boundary = 10)

What happened?

We still don’t have an elbow, but the differences graph does settle inside the boundary. Looking at it, we observe that after cluster 14 the differences stay almost entirely inside the boundary. So we’ll run k-means with 15 clusters, because the 14th difference is the difference between k=14 and k=15. We’re done analyzing the optimal number of clusters k; now let’s move on to fitting the model and making recommendations.

Fitting Data on Model

Now let’s create the k-means model with these settings and run it to make predictions.

kmeans = KMeans(n_clusters=15, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
clusters = kmeans.fit_predict(sparseMatrix)

Now, let’s create a DataFrame where we can see each user’s cluster number.

users_cluster = pd.DataFrame(np.concatenate((users.reshape(-1,1), clusters.reshape(-1,1)), axis = 1), columns = ['userId', 'Cluster'])
users_cluster.T

Now we’ll define a function that creates a list of DataFrames, one per cluster, where each DataFrame contains each movieId and the count for that movie (count: the number of users who have that movie in their favorites list). A movie with a higher count will be of more interest to the other users in the cluster who have not watched it yet.
For example, we’ll create a list of the following form:
[dataframe_for_Cluster_1, dataframe_for_Cluster_2, ..., dataframe_for_Cluster_N]
where each DataFrame has the following format,

in which the Count column represents the total number of users in the cluster who have watched that particular movie. We sort the movies by count in order to prioritize the movies most watched, and therefore most favored, by the users in the cluster.

Now we want to create the list of all users’ movies in each cluster. For this, we first define a method that builds the movies DataFrame of each cluster.

def clustersMovies(users_cluster, users_data):
    clusters = list(users_cluster['Cluster'])
    each_cluster_movies = list()
    for i in range(len(np.unique(clusters))):
        # All users of cluster i, and all of their favourite movies
        users_list = list(users_cluster[users_cluster['Cluster'] == i]['userId'])
        users_movies_list = list()
        for user in users_list:
            users_movies_list.extend(list(users_data[users_data['userId'] == user]['movieId']))
        # Count how many users of the cluster favourited each movie
        users_movies_counts = list()
        users_movies_counts.extend([[movie, users_movies_list.count(movie)] for movie in np.unique(users_movies_list)])
        each_cluster_movies.append(pd.DataFrame(users_movies_counts, columns = ['movieId', 'Count']).sort_values(by = ['Count'], ascending = False).reset_index(drop = True))
    return each_cluster_movies

cluster_movies = clustersMovies(users_cluster, users_fav_movies)

Now, let’s take a look at one of the DataFrames in cluster_movies.

cluster_movies[1].T

We have 30 movies in cluster 1, where the movie with ID 1, a favorite of 19 users, has top priority, followed by the movie with ID 150, a favorite of 8 users.

Now, let’s see how many users we have in each cluster.

for i in range(15):
    len_users = users_cluster[users_cluster['Cluster'] == i].shape[0]
    print('Users in Cluster ' + str(i) + ' -> ', len_users)

Out:

Users in Cluster 0 ->  35
Users in Cluster 1 -> 19
Users in Cluster 2 -> 1
Users in Cluster 3 -> 5
Users in Cluster 4 -> 8
Users in Cluster 5 -> 1
Users in Cluster 6 -> 12
Users in Cluster 7 -> 2
Users in Cluster 8 -> 1
Users in Cluster 9 -> 1
Users in Cluster 10 -> 1
Users in Cluster 11 -> 11
Users in Cluster 12 -> 1
Users in Cluster 13 -> 1
Users in Cluster 14 -> 1

As we can see, some clusters contain only 1, 2 or 5 users. We don’t want such small clusters, where we can’t recommend enough movies: a user alone in a cluster of size one would get no recommendations at all, and a user in a cluster of size two wouldn’t get many. So we have to fix these small clusters.

Fixing Small Clusters

Since many clusters include only a few users, and we don’t want any user in a cluster alone, let’s say we want at least 6 users in each cluster. So we’ll move each user from a small cluster into the large cluster whose movies are most relevant to that user.

First of all, we’ll write a function to get a user’s favorite movies list.

def getMoviesOfUser(user_id, users_data):
    return list(users_data[users_data['userId'] == user_id]['movieId'])

Now, we’ll define a function to fix the clusters.

def fixClusters(clusters_movies_dataframes, users_cluster_dataframe, users_data, smallest_cluster_size = 11):
    # clusters_movies_dataframes: a list containing the movies DataFrame of each cluster
    # users_cluster_dataframe: a DataFrame containing user IDs and their cluster numbers
    # smallest_cluster_size: the smallest cluster size we are willing to keep
    each_cluster_movies = clusters_movies_dataframes.copy()
    users_cluster = users_cluster_dataframe.copy()
    # Convert each DataFrame in each_cluster_movies to a list containing only movie IDs
    each_cluster_movies_list = [list(df['movieId']) for df in each_cluster_movies]
    # First we prepare a list containing the users of each cluster -> [[Cluster 0 Users], [Cluster 1 Users], ..., [Cluster N Users]]
    usersInClusters = list()
    total_clusters = len(each_cluster_movies)
    for i in range(total_clusters):
        usersInClusters.append(list(users_cluster[users_cluster['Cluster'] == i]['userId']))
    uncategorizedUsers = list()
    i = 0
    # Now we remove the small clusters and put their users into "uncategorizedUsers".
    # When a cluster is removed, we must also shift down the cluster numbers of the
    # users in the clusters that come after the deleted one.
    # E.g. if we delete cluster 4, users in clusters 5, 6, 7, ..., N move to 4, 5, 6, ..., N-1.
    for j in range(total_clusters):
        if len(usersInClusters[i]) < smallest_cluster_size:
            uncategorizedUsers.extend(usersInClusters[i])
            usersInClusters.pop(i)
            each_cluster_movies.pop(i)
            each_cluster_movies_list.pop(i)
            users_cluster.loc[users_cluster['Cluster'] > i, 'Cluster'] -= 1
            i -= 1
        i += 1
    # Re-assign every uncategorized user to the surviving cluster that covers
    # the largest fraction of his/her favourite movies
    for user in uncategorizedUsers:
        elemProbability = list()
        user_movies = getMoviesOfUser(user, users_data)
        if len(user_movies) == 0:
            print(user)
        user_missed_movies = list()
        for movies_list in each_cluster_movies_list:
            count = 0
            missed_movies = list()
            for movie in user_movies:
                if movie in movies_list:
                    count += 1
                else:
                    missed_movies.append(movie)
            elemProbability.append(count / len(user_movies))
            user_missed_movies.append(missed_movies)
        user_new_cluster = np.array(elemProbability).argmax()
        users_cluster.loc[users_cluster['userId'] == user, 'Cluster'] = user_new_cluster
        # Add the user's movies that the new cluster was missing, each with Count 1
        if len(user_missed_movies[user_new_cluster]) > 0:
            each_cluster_movies[user_new_cluster] = each_cluster_movies[user_new_cluster].append(
                [{'movieId': new_movie, 'Count': 1} for new_movie in user_missed_movies[user_new_cluster]],
                ignore_index = True)
    return each_cluster_movies, users_cluster

Now, run it.

movies_df_fixed, clusters_fixed = fixClusters(cluster_movies, users_cluster, users_fav_movies, smallest_cluster_size = 6)

To observe the effect of fixing the clusters, let’s first look at the data we had before, and then at the data after fixing.

First we’ll print the clusters that contain at most 5 users.

j = 0
for i in range(15):
    len_users = users_cluster[users_cluster['Cluster'] == i].shape[0]
    if len_users < 6:
        print('Users in Cluster ' + str(i) + ' -> ', len_users)
        j += 1
print('Total Cluster which we want to remove -> ', j)

Out:

Users in Cluster 2 ->  1
Users in Cluster 3 -> 5
Users in Cluster 5 -> 1
Users in Cluster 7 -> 2
Users in Cluster 8 -> 1
Users in Cluster 9 -> 1
Users in Cluster 10 -> 1
Users in Cluster 12 -> 1
Users in Cluster 13 -> 1
Users in Cluster 14 -> 1
Total Cluster which we want to remove -> 10

Now let’s look at the users-cluster DataFrame.

print('Length of total clusters before fixing is -> ', len(cluster_movies))
print('Max value in users_cluster dataframe column Cluster is -> ', users_cluster['Cluster'].max())
print('And dataframe is following')
users_cluster.T

Out:

Length of total clusters before fixing is ->  15
Max value in users_cluster dataframe column Cluster is -> 14
And dataframe is following

So we want the max value in the Cluster column to become 4 (counting from index 0), since we’ll remove the 10 smallest clusters and have 5 remaining.

Now, let’s see what happened after fixing the data.

We want all those 10 small clusters removed, and the users_cluster DataFrame shouldn’t contain any user with an invalid cluster number.

print('Length of total clusters after fixing is -> ', len(movies_df_fixed))
print('Max value in users_cluster dataframe column Cluster is -> ', clusters_fixed['Cluster'].max())
print('And fixed dataframe is following')
clusters_fixed.T

Out:

Length of total clusters after fixing is ->  5
Max value in users_cluster dataframe column Cluster is -> 4
And fixed dataframe is following

Now let’s see what happened when the 10 clusters were deleted, and how the cluster numbers of users already in large clusters were adjusted.

Take cluster 11 as an example. Cluster 11 already contained enough users (11 of them), so we didn’t want to delete it; but since we now have only 5 clusters and the max value of the Cluster column is 4, what happened to cluster 11? There were 7 small clusters before cluster 11 that were removed, so its value should have shifted down from 11 to 4.
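We can verify this shift with quick arithmetic, counting the removed clusters that sit below 11 (using the cluster IDs from the output above):

removed = [2, 3, 5, 7, 8, 9, 10, 12, 13, 14]   # clusters with fewer than 6 users
old_cluster = 11
new_cluster = old_cluster - sum(1 for c in removed if c < old_cluster)
print(new_cluster)   # -> 4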

print('Users cluster dataFrame for cluster 11 before fixing:')
users_cluster[users_cluster['Cluster'] == 11].T

Out:

Users cluster dataFrame for cluster 11 before fixing:

Now let’s look at cluster 4 after fixing.

print('Users cluster dataFrame for cluster 4 after fixing which should be same as 11th cluster before fixing:')
clusters_fixed[clusters_fixed['Cluster'] == 4].T

Out:

Users cluster dataFrame for cluster 4 after fixing which should be same as 11th cluster before fixing:

Both DataFrames contain the same user IDs, so we didn’t disturb any large cluster, and the same holds for the list of movies DataFrames for each cluster.

Now let’s take a look at the list of movie DataFrames.

print('Size of movies dataframe after fixing -> ', len(movies_df_fixed)) 

Out:

Size of movies dataframe after fixing ->  5

Now, let’s look at the sizes of the clusters.

for i in range(len(movies_df_fixed)):
    len_users = clusters_fixed[clusters_fixed['Cluster'] == i].shape[0]
    print('Users in Cluster ' + str(i) + ' -> ', len_users)

Out:

Users in Cluster 0 ->  45
Users in Cluster 1 -> 21
Users in Cluster 2 -> 8
Users in Cluster 3 -> 15
Users in Cluster 4 -> 11

Each cluster now contains enough users that we can make recommendations. Let’s take a look at the size of each cluster’s movies list.

for i in range(len(movies_df_fixed)):
    print('Total movies in Cluster ' + str(i) + ' -> ', movies_df_fixed[i].shape[0])

Out:

Total movies in Cluster 0 ->  64
Total movies in Cluster 1 -> 39
Total movies in Cluster 2 -> 15
Total movies in Cluster 3 -> 50
Total movies in Cluster 4 -> 25

We’re done training the k-means model, predicting a cluster for each user, and fixing the resulting issues. Finally, we need to store this training so that we can use it later. For this, we’ll use the pickle library, which we imported earlier, to save and load the trainings.

Let me first design a class to save and load the trainings. We’ll write methods for saving/loading our particular files, as well as general save/load methods.

class saveLoadFiles:
    def save(self, filename, data):
        try:
            file = open('datasets/' + filename + '.pkl', 'wb')
            pickle.dump(data, file)
        except:
            err = 'Error: {0}, {1}'.format(exc_info()[0], exc_info()[1])
            print(err)
            return [False, err]
        else:
            file.close()
            return [True]
    def load(self, filename):
        try:
            file = open('datasets/' + filename + '.pkl', 'rb')
        except:
            err = 'Error: {0}, {1}'.format(exc_info()[0], exc_info()[1])
            print(err)
            return [False, err]
        else:
            data = pickle.load(file)
            file.close()
            return data
    def loadClusterMoviesDataset(self):
        return self.load('clusters_movies_dataset')
    def saveClusterMoviesDataset(self, data):
        return self.save('clusters_movies_dataset', data)
    def loadUsersClusters(self):
        return self.load('users_clusters')
    def saveUsersClusters(self, data):
        return self.save('users_clusters', data)

In the class above, exc_info (imported from the sys library) is used for error handling and error messages.

We’ll use the saveClusterMoviesDataset/loadClusterMoviesDataset methods to save/load the list of cluster movies DataFrames, and the saveUsersClusters/loadUsersClusters methods to save/load the users’ clusters DataFrame. Now let’s try it. We’ll print the responses to check whether any error occurs; if a save returns True, it means our file has been saved successfully in the proper place.

saveLoadFile = saveLoadFiles()
print(saveLoadFile.saveClusterMoviesDataset(movies_df_fixed))
print(saveLoadFile.saveUsersClusters(clusters_fixed))

Out:

[True]
[True]

The response is True for both save methods, so our trained data is now saved and we can use it later. Let’s check that we can load it.

load_movies_list, load_users_clusters = saveLoadFile.loadClusterMoviesDataset(), saveLoadFile.loadUsersClusters()
print('Type of Loading list of Movies dataframes of 5 Clusters: ', type(load_movies_list), ' and Length is: ', len(load_movies_list))
print('Type of Loading 100 Users clusters Data: ', type(load_users_clusters), ' and Shape is: ', load_users_clusters.shape)

Out:

Type of Loading list of Movies dataframes of 5 Clusters:  <class 'list'>  and Length is:  5
Type of Loading 100 Users clusters Data: <class 'pandas.core.frame.DataFrame'> and Shape is: (100, 2)

We have successfully saved and loaded our data using the pickle library.

We worked with a very small dataset here, but movie recommendation systems often work with very large datasets, like the one we had initially, where every cluster contains enough movies to make recommendations.

Now, we need to design the functions that make recommendations to users.

Recommendations for Users

Here we’ll create a class that recommends to a user the cluster’s most favorited movies that the user has not already added to his/her favorites. Also, when a user adds another movie to his/her favorite list, we have to update the cluster movies dataset accordingly.

class userRequestedFor:
    def __init__(self, user_id, users_data):
        self.users_data = users_data.copy()
        self.user_id = user_id
        # Find the user's cluster
        users_cluster = saveLoadFiles().loadUsersClusters()
        self.user_cluster = int(users_cluster[users_cluster['userId'] == self.user_id]['Cluster'])
        # Load the movies DataFrame of the user's cluster
        self.movies_list = saveLoadFiles().loadClusterMoviesDataset()
        self.cluster_movies = self.movies_list[self.user_cluster]  # DataFrame
        self.cluster_movies_list = list(self.cluster_movies['movieId'])  # list
    def updatedFavouriteMoviesList(self, new_movie_Id):
        if new_movie_Id in self.cluster_movies_list:
            self.cluster_movies.loc[self.cluster_movies['movieId'] == new_movie_Id, 'Count'] += 1
        else:
            self.cluster_movies = self.cluster_movies.append([{'movieId': new_movie_Id, 'Count': 1}], ignore_index = True)
        self.cluster_movies.sort_values(by = ['Count'], ascending = False, inplace = True)
        self.movies_list[self.user_cluster] = self.cluster_movies
        saveLoadFiles().saveClusterMoviesDataset(self.movies_list)

    def recommendMostFavouriteMovies(self):
        try:
            user_movies = getMoviesOfUser(self.user_id, self.users_data)
            # Remove the movies the user already has in favourites
            cluster_movies_list = self.cluster_movies_list.copy()
            for user_movie in user_movies:
                if user_movie in cluster_movies_list:
                    cluster_movies_list.remove(user_movie)
            return [True, cluster_movies_list]
        except KeyError:
            err = "User history does not exist"
            print(err)
            return [False, err]
        except:
            err = 'Error: {0}, {1}'.format(exc_info()[0], exc_info()[1])
            print(err)
            return [False, err]

Now let’s try making recommendations and an update-favorites request. First, we’ll import data containing not only the IDs but movie details like title, genre, etc.

movies_metadata = pd.read_csv(
    './Prepairing Data/From Data/movies_metadata.csv',
    usecols = ['id', 'genres', 'original_title'])

movies_metadata = movies_metadata.loc[
    movies_metadata['id'].isin(list(map(str, np.unique(users_fav_movies['movieId']))))].reset_index(drop = True)
print("Let's take a look at the movie metadata for all the movies we had in our dataset")
movies_metadata

Out:

Let's take a look at the movie metadata for all the movies we had in our dataset

Here is the list of movies that the user with ID 12 has added to his favorites.

user12Movies = getMoviesOfUser(12, users_fav_movies)
for movie in user12Movies:
    title = list(movies_metadata.loc[movies_metadata['id'] == str(movie)]['original_title'])
    if title != []:
        print('Movie title: ', title, ', Genres: [', end = '')
        genres = ast.literal_eval(movies_metadata.loc[movies_metadata['id'] == str(movie)]['genres'].values[0].split('[')[1].split(']')[0])
        for genre in genres:
            print(genre['name'], ', ', end = '')
        print(end = '\b\b]')
        print('')

Out:

Movie title:  ['Dancer in the Dark'] , Genres: [Drama , Crime , Music , ]
Movie title: ['The Dark'] , Genres: [Horror , Thriller , Mystery , ]
Movie title: ['Miami Vice'] , Genres: [Action , Adventure , Crime , Thriller , ]
Movie title: ['Tron'] , Genres: [Science Fiction , Action , Adventure , ]
Movie title: ['The Lord of the Rings'] , Genres: [Fantasy , Drama , Animation , Adventure , ]
Movie title: ['48 Hrs.'] , Genres: [Thriller , Action , Comedy , Crime , Drama , ]
Movie title: ['Edward Scissorhands'] , Genres: [Fantasy , Drama , Romance , ]
Movie title: ['Le Grand Bleu'] , Genres: [Adventure , Drama , Romance , ]
Movie title: ['Saw'] , Genres: [Horror , Mystery , Crime , ]
Movie title: ["Le fabuleux destin d'Amélie Poulain"] , Genres: [Comedy , Romance , ]

And finally, these are the top recommended movies for that user (we scan the first 15 recommended IDs and print the ones whose metadata we have).

user12Recommendations = userRequestedFor(12, users_fav_movies).recommendMostFavouriteMovies()[1]
for movie in user12Recommendations[:15]:
    title = list(movies_metadata.loc[movies_metadata['id'] == str(movie)]['original_title'])
    if title != []:
        print('Movie title: ', title, ', Genres: [', end = '')
        genres = ast.literal_eval(movies_metadata.loc[movies_metadata['id'] == str(movie)]['genres'].values[0].split('[')[1].split(']')[0])
        for genre in genres:
            print(genre['name'], ', ', end = '')
        print(']', end = '')
        print()

Out:

Movie title:  ['Trois couleurs : Rouge'] , Genres: [Drama , Mystery , Romance , ]
Movie title: ["Ocean's Eleven"] , Genres: [Thriller , Crime , ]
Movie title: ['Judgment Night'] , Genres: [Action , Thriller , Crime , ]
Movie title: ['Scarface'] , Genres: [Action , Crime , Drama , Thriller , ]
Movie title: ['Back to the Future Part II'] , Genres: [Adventure , Comedy , Family , Science Fiction , ]
Movie title: ["Ocean's Twelve"] , Genres: [Thriller , Crime , ]
Movie title: ['To Be or Not to Be'] , Genres: [Comedy , War , ]
Movie title: ['Back to the Future Part III'] , Genres: [Adventure , Comedy , Family , Science Fiction , ]
Movie title: ['A Clockwork Orange'] , Genres: [Science Fiction , Drama , ]
Movie title: ['Minority Report'] , Genres: [Action , Thriller , Science Fiction , Mystery , ]

And with that, we have successfully recommended movies to a user based on his/her interests, using the favorite movies of similar users.

You’re Done

Thanks for reading this article. If you want the whole project as deployment-ready code, please visit my GitHub repository AI Movies Recommendation System Based on K-means Clustering Algorithm and download it; it is completely free for everyone.

Thank You


Syed Muhammad Asad

MS (Computational Mathematics), Data Scientist and Machine Learning Engineer, Mathematician, Programmer, Research Scientist, Writer.