Learning new things should be easy in our time: all the information we need is one search away. If you have just started learning about recommendation systems, then we are in the same boat, because both blog posts and recommendation systems are way out of my comfort zone. In my experience, the best way to learn something new is to step out of your comfort zone and not worry about failure.

In this post, I’m going to explain how to create a simple movie recommendation app and how the relevant algorithms work. You can later apply the same algorithms to any other recommendation app or system. Most references are from the amazing book Programming Collective Intelligence by Toby Segaran.

The application recommends movies based on other users’ scores for each individual movie. A user can rate movies from 1 to 5, and each rating updates that user’s similarity scores to every other user who has rated the same movies. First, we need to explain how the algorithms work, then how they are implemented in the application.

## Algorithm

An algorithm is a process or set of rules followed in calculations to solve a given problem. Our problem is: “*Which movie should I watch next?*” Let’s see how we would go about finding a movie we would enjoy. The best thing to do would be to ask a friend or someone with a similar taste in movies to recommend one (*I realize that the best thing today would be to go on IMDB or something and find one, but bear with me*). We are going to do the same thing here, but instead of us asking our friends for a movie, the algorithm is going to do the asking for us. The two algorithms we will explain here are:

- Euclidean distance
- Pearson correlation

You can see how the coefficients change for each user by registering on the application and checking the user profile page. All the mutual movie scores will be shown, as well as the Euclidean distance and the Pearson correlation coefficient (*pcc*).

## Euclidean Distance

The definition of **Euclidean distance** is “the straight-line distance between two points in **Euclidean space**”. Say we take the points from the definition and transform them into users (*figuratively*). Let’s take two movies (“Star Wars: A New Hope”, “A Beautiful Mind”) and the scores each user gave them. User **A** has given the score pair **(5, 3)** and user **B** **(3, 4)**. If we take these pairs and plot them as points on a graph, we can determine the similarity between the two users’ movie tastes.

We can see that both users liked “A Beautiful Mind” and gave it close scores, but for “Star Wars” the scores are way off. By calculating the distance between these two points, we get a coefficient of similarity between the two users, determined by the movie scores they gave. The Euclidean distance between points (x₁, y₁) and (x₂, y₂) is:

d = √((x₁ − x₂)² + (y₁ − y₂)²)

Applying the formula to our two users, we get:

d = √((5 − 3)² + (3 − 4)²) = √5 ≈ 2.24

That’s it, roughly **2.24**! Using the distance like this isn’t very helpful, because greater distances produce higher numbers. What we need is lower numbers for greater distances and higher numbers for smaller distances. The formula below achieves exactly that:

similarity = 1 / (1 + d)

The space in which we drew our “points” was two-dimensional because we only used two movies. The odds of using only two movies to determine the similarity between two users are next to none, so we will add more movies. By adding more movies, our space becomes n-dimensional, where n is the number of movies (also, we are then no longer drawing points but something else, though that is not important here). The Euclidean distance between users x and y, where xᵢ and yᵢ are their scores for the i-th mutual movie, then looks like:

d = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²)
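
The distance and the 1 / (1 + distance) similarity can be sketched in a few lines of plain JavaScript. This is only an illustration, not the application code shown later, and it assumes the two rating arrays are aligned so that the i-th entries refer to the same movie:

```javascript
// Straight-line distance between two n-dimensional rating vectors
function distance (ratings1, ratings2) {
  let sum = 0
  for (let i = 0; i < ratings1.length; i++) {
    sum += Math.pow(ratings1[i] - ratings2[i], 2)
  }
  return Math.sqrt(sum)
}

// Map the distance into (0, 1]: identical users score exactly 1
function euclideanSimilarity (ratings1, ratings2) {
  return 1 / (1 + distance(ratings1, ratings2))
}

distance([5, 3], [3, 4])            // ≈ 2.24
euclideanSimilarity([5, 3], [3, 4]) // ≈ 0.31
```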

That’s it! Congratulations, you now know how to naively determine the similarity between two users based on their movie scores.

## Pearson Correlation Coefficient

The **Pearson correlation coefficient** is a measure of the linear correlation between two variables. It gives a number between -1 and 1, where -1 is a total negative linear correlation, 0 means no linear correlation, and 1 is a total positive linear correlation. This algorithm handles non-normalized data better than the Euclidean distance, for example when one user consistently rates movies harsher than the other. Let’s visualize it, so it’s easier to understand.

The dashed line is called the *best-fit* line because it tries to pass through all the dots or stay at an equal distance from them. All four graphs have different Pearson correlation coefficients **ρ**. If the coefficient is 0, there is no linear correlation; otherwise, there is a connection between the plotted points. Just to clarify, we are not calculating the distance between two plotted points, but the distance between a point and the best-fit line.

Calculating the Pearson correlation coefficient requires the number of mutual movies, the sums of movie scores for both users, the sums of squared movie scores, and the sum of the products of their movie scores. The variables are listed below:

- **n**: the number of mutual movies between the two users
- **user1MovieScoreSum**, **user2MovieScoreSum**: the sum of each user’s movie scores
- **user1MovieScoreSqSum**, **user2MovieScoreSqSum**: the sum of each user’s squared movie scores
- **pSum**: the sum of the products of the two users’ scores for each mutual movie

Putting it all together, the coefficient is:

pcc = (pSum − user1MovieScoreSum · user2MovieScoreSum / n) / √[(user1MovieScoreSqSum − user1MovieScoreSum² / n) · (user2MovieScoreSqSum − user2MovieScoreSum² / n)]

The denominator can sometimes be 0, so keep this in mind when implementing the algorithm in your code. That’s it! The higher the coefficient, the more similar the two users. You now know how to use two cool recommendation algorithms.
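
As a quick sanity check, here is the whole calculation written out for our two example users. This is a standalone sketch over aligned rating arrays; note that with only two mutual movies the coefficient is always ±1 (or 0 when the denominator is 0), so real data needs more overlap to be meaningful:

```javascript
// Pearson correlation between two aligned rating arrays
function pearson (ratings1, ratings2) {
  const n = ratings1.length
  if (n === 0) return 0
  const sum1 = ratings1.reduce((a, b) => a + b, 0)
  const sum2 = ratings2.reduce((a, b) => a + b, 0)
  const sqSum1 = ratings1.reduce((a, b) => a + b * b, 0)
  const sqSum2 = ratings2.reduce((a, b) => a + b * b, 0)
  let pSum = 0
  for (let i = 0; i < n; i++) pSum += ratings1[i] * ratings2[i]
  const num = pSum - (sum1 * sum2) / n
  const den = Math.sqrt((sqSum1 - (sum1 * sum1) / n) * (sqSum2 - (sum2 * sum2) / n))
  return den === 0 ? 0 : num / den
}

// User A rated (5, 3) and user B rated (3, 4): A likes what B doesn't
pearson([5, 3], [3, 4]) // → -1
```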

## Application

The application uses LevelDB to store users, movies, reviews, and the similarities between users. It uses TMDB to sync the latest movies, and includes one “seed user” who has randomly rated all the movies currently in the database. Lodash is used to spare us from writing tedious code; if you are not familiar with lodash, don’t worry, the function calls are intuitive. Both algorithm implementations are shown below:

```
function euclideanDistance (user1, user2) {
  // Both review arrays hold only mutual movies, in the same order
  const n = _.size(user1.reviews)
  let coefficient = 0
  if (n === 0) {
    return n
  }
  for (let i = 0; i < n; i++) {
    coefficient += Math.pow(user1.reviews[i].rating - user2.reviews[i].rating, 2)
  }
  return 1 / (1 + Math.sqrt(coefficient))
}

function squaredNum (review) {
  return Math.pow(review.rating, 2)
}

// Pearson correlation coefficient
function pcc (user1, user2) {
  const n = _.size(user1.reviews)
  if (n === 0) {
    return n
  }
  const user1MovieScoreSum = _.sumBy(user1.reviews, 'rating')
  const user2MovieScoreSum = _.sumBy(user2.reviews, 'rating')
  const user1MovieScoreSqSum = _.sumBy(user1.reviews, squaredNum)
  const user2MovieScoreSqSum = _.sumBy(user2.reviews, squaredNum)
  let pSum = 0
  for (let i = 0; i < n; i++) {
    pSum += (user1.reviews[i].rating * user2.reviews[i].rating)
  }
  const num = pSum - ((user1MovieScoreSum * user2MovieScoreSum) / n)
  const den = Math.sqrt(
    (user1MovieScoreSqSum - (Math.pow(user1MovieScoreSum, 2) / n)) *
    (user2MovieScoreSqSum - (Math.pow(user2MovieScoreSum, 2) / n))
  )
  if (den === 0) {
    return 0
  }
  return num / den
}
```

Each algorithm compares two users through their reviews and returns a coefficient. There are no novelties here; it simply implements the formulas described in the previous sections. The Pearson correlation coefficient has an additional check for a denominator equal to 0.

## User similarity calculation

When a user rates a movie, it triggers a similarity update against every other user. If the other user has no mutual movies, the coefficient will be 0. Let’s look at the code.

```
async function updateUserSimilarityScores (username) {
  const users = await db.user.getVals()
  const mainUser = await db.user.get(username)
  mainUser.reviews = await getUserMovieReviews(mainUser)
  const mainUserMovies = _.map(mainUser.reviews, 'movieId')
  return Promise.map(users, async function (user) {
    user.reviews = await getUserMutualMovieReviews(user, mainUserMovies)
    if (user.username === mainUser.username) {
      return
    }
    const clonedUsers = filterUserMutualMovies(mainUser, user)
    return db.similarity.put([user.username, mainUser.username].sort().join('-'), {
      euclideanDistance: algorithms.euclideanDistance(clonedUsers[0], clonedUsers[1]),
      manhattanDistance: algorithms.manhattanDistance(clonedUsers[0], clonedUsers[1]),
      pcc: algorithms.pcc(clonedUsers[0], clonedUsers[1]),
      users: [user.username, mainUser.username].sort(),
    })
  })
}
```

The first few lines query for the user who rated the movie, their reviews, and all the currently registered users. We then iterate over all the registered users, query their movie reviews, and filter them so only the mutual movies are left. The users are cloned because changing an object inside a function would also change the original object. Lastly, we update the database with the new coefficients for each user pair, and voilà.
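
One detail worth noting is the similarity key: sorting both usernames before joining them makes the key order-independent, so a pair is stored under one entry no matter which user triggered the update. A quick standalone illustration:

```javascript
// Build a symmetric key for a pair of users: sorting first
// guarantees (a, b) and (b, a) map to the same database entry
function similarityKey (user1, user2) {
  return [user1, user2].sort().join('-')
}

similarityKey('bob', 'alice') // → 'alice-bob'
similarityKey('alice', 'bob') // → 'alice-bob'
```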

## Similar users

Querying for similar users is just a matter of ordering the user similarity coefficients. Once the coefficients are ordered we return the users.

```
router.get('/user', auth, async function (req, res) {
  const {username} = req
  const similarity = helper.sortByAlgorithm(await db.similarity.getBy(function (user) {
    return _.includes(user.users, username)
  }), 'pcc')
  const usersSimilarities = await Promise.map(similarity, async function (similarityData) {
    return {
      ...similarityData,
      user: await db.user.get(_.remove(similarityData.users, function (user) {
        return user !== username
      })[0]),
    }
  })
  res.render('users', {usersSimilarities})
})
```

## Recommending movies

In order to recommend movies, we need the highest user similarity coefficients. The code example below queries the top 5 user similarity coefficients, so the data needs to be ordered first. After querying the top coefficients, we need the movie reviews for each of those users. We also fetch the reviews of the currently logged-in user, because we don’t want to recommend movies they have already reviewed. Next, each review’s rating is multiplied by the user similarity coefficient to get a weighted rating for each movie, and finally the movies are ordered by these weighted ratings. If there are no movies to show for the user, we fall back to querying the database for all movies.

```
router.get('/movie', auth, async function (req, res) {
  const {username} = req
  const similarityData = _(await db.similarity.getBy(function (similarity) {
    return _.includes(similarity.users, username)
  })).orderBy('pcc', 'desc').take(5).value()
  const users = _(similarityData).map(function (similarity) {
    return {
      ..._.omit(similarity, 'users'),
      username: _.remove(similarity.users, function (user) {
        return user !== username
      })[0],
    }
  }).value()
  const mainUser = await db.user.get(username)
  mainUser.reviewedMovies = _.map(await helper.getUserMovieReviews({username}), 'movieId')
  await Promise.map(users, async function (user) {
    user.reviews = _(await db.review.getBy(function (review) {
      return review.username === user.username && !_.includes(mainUser.reviewedMovies, review.movieId)
    })).orderBy('rating', 'desc').take(5).value()
  })
  let movieRecommendations = []
  _.forEach(users, function (user) {
    _.forEach(user.reviews, function (review) {
      movieRecommendations.push({
        movieId: review.movieId,
        rating: review.rating * user.pcc,
      })
    })
  })
  movieRecommendations = _(movieRecommendations)
    .orderBy('rating', 'desc')
    .unionBy('movieId')
    .value()
  let movies = await Promise.map(movieRecommendations, function (movieRecommendation) {
    return db.movie.get(movieRecommendation.movieId)
  })
  if (!_.size(movies)) {
    movies = await db.movie.getVals()
  }
  return res.render('movies', {movies})
})
```
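
The weighting-and-deduplication step above can be illustrated on its own. This sketch uses plain JavaScript in place of lodash’s `orderBy` and `unionBy` (which keeps the first occurrence per key, so sorting by rating first retains the highest-weighted entry for each movie); the sample data is made up:

```javascript
// Weight each review by the reviewer's similarity, then keep the
// single highest-weighted entry per movie
function rankRecommendations (users) {
  const weighted = []
  users.forEach(function (user) {
    user.reviews.forEach(function (review) {
      weighted.push({movieId: review.movieId, rating: review.rating * user.pcc})
    })
  })
  // Sort by weighted rating, descending
  weighted.sort(function (a, b) { return b.rating - a.rating })
  // Deduplicate: the first (highest-rated) entry per movieId wins
  const seen = new Set()
  return weighted.filter(function (entry) {
    if (seen.has(entry.movieId)) return false
    seen.add(entry.movieId)
    return true
  })
}

rankRecommendations([
  {pcc: 0.9, reviews: [{movieId: 1, rating: 5}, {movieId: 2, rating: 3}]},
  {pcc: 0.4, reviews: [{movieId: 1, rating: 4}, {movieId: 3, rating: 5}]},
])
// → movieIds in order: 1, 2, 3 (weighted ratings ≈ 4.5, 2.7, 2.0)
```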

## Info

You can find the source code on github.com/blazing-edge-labs/movie-recommendation and play with the app itself on movierec.blazingedge.io.

## Conclusion

The algorithms used give us naive recommendations, but they provide an insight into a whole new area of programming, and I find that fascinating. Multiplying the user similarity by their movie scores gave better recommendations than just listing the most similar users and their movies. You can also extend the current code by adding movie genres and using them as one more variable when recommending movies, or by distinguishing between content types such as movies, TV shows, etc.

If you are interested in innovating and growing your product, let's do it together. Reach out to us today.