4. Collaborative-Filtering Recommender Systems
In my previous article of this recommender systems series, we took a look at the content-based approach to building recommender systems. In this article, we will take a look at another popular approach used in modern recommender systems, called collaborative-filtering (CF). There are two types of methods that are commonly used in CF: memory based and model based. In this article we'll take a look at how memory-based CF works.
The main idea around collaborative-filtering is quite simple. Suppose we have a user x to whom we want to make recommendations. First, we look for a group of other users whose likes and dislikes are similar to those of user x. For example, in the case of movies, we want to find a group of users that liked the same movies as user x and disliked the same movies as user x. We call this set of users N, the neighborhood of user x. Once we have defined this set of users N, we then look for movies that are liked by a lot of the users in the set N and recommend those to the user x.
The key here is to find the set of users that are similar to user x (the neighborhood N of user x) and to do that we need to define a notion of similarity between users. The figure below shows an example of a rating matrix with 4 users (A, B, C, and D) and 7 items (i1, i2, i3, i4, i5, i6, i7)
User A has rated three items (i1, i4, i5), user B has also rated three items (i1, i2, i3), like user C (i4, i5, i6), while user D has rated only two items (i2,i7). Let's refer to each row in the rating matrix as the user's rating vector. Now, given two users x and y with rating vectors rx ry, we need a similarity metric that looks at the rating vectors and computes similarity. What's interesting here, is that there are a lot of items the users did not rate, so the key in defining the similarity function is how we deal with these unknown values (unknown ratings). We'd like to define the rating vector in a way that captures the intuition that users with similar tastes should have higher similarity as opposed to users with dissimilar tastes. In the case of the example above, we can notice that A and B rated only one item in common (i1), however, they both rated that item highly. Whereas users A and C have rated two items in common (i4, i5) but in a very dissimilar way (items A liked, C didn't like and vice versa). It seems, intuitively, that users A and C are dissimilar while users A and B are similar. So we would like to be able to capture this intuition when defining the notion of similarity, therefore we would like the similarity between A and B to be higher than the similarity between A and C. Let's see how to do this.
Once again the cosine similarity metric is what we want to use (refer to my previous article if you are not familiar with this metric). So the similarity between A and B will be calculated as the cosine similarity between the two vectors ra and rb (rows in the rating matrix). If we compute the cosine similar this is what we obtain:
sim(A,B) = 0.38, sim(A,C) = 0.32
Notice that, to compute the cosine similarity, we have to assign a value to the unknown values in the rating matrix, and the simplest thing to do is to assign the value zero (0) to the unknown values as shown in the figure below.
As we can see looking at the computed similarity values, we obtain:
sim(A,B) > sim(A,C)
but not by much. These values don't seem to capture that A and B are much more similar to each other than A and C are. The problem with cosine similarity here is that we are treating unknown values as negative values (negative ratings). We are assuming that, if a user hasn't rated an item then his rating for that item is zero, the worst value possible (since we are in 5 stat rating example). This is actually a bad assumption. To fix this we use a variation of cosine similarity called the centered cosine similarity. Before performing the cosine similarity we "center" the rating values around the zero value. To do so we normalize the ratings of a given user by subtracting the user's average rating from each rating value.
The next figure shows the average user rating, computed before performing normalization.
We then normalize the rating values subtracting the user's average rating from each known rating value in its rating vector (the user's row in the rating matrix) as shown in the next figure.
And the figure below shows the rating matrix after performing the normalization step.
It's interesting to notice that if you add the values in each row you obtain zero. What we have actually done here is we centered the ratings of each user around zero, by doing so, zero becomes the average rating value for every users. Now positive ratings indicate that a user likes an item more than average, and negative ratings indicate that the user likes the item less than average. Once we have done this "centering" operation, we can compute the cosine similarity using these "new" values. This is what we end up with:
sim(A,B) = 0.09, sim(A,C) = -0.56
Again we obtain sim(A,B) > sim(A,C), but now the difference between the two similarities is bigger, and this captures the fact that user A and user C are quite dissimilar users. The items that A likes C doesn't like and the items C likes A doesn't like. The bigger gap between sim(A,B) and sim(A,C),indicates that A and B are much more alike than A and C are. The centered cosine similarity captures our intuition of similar users much better than the simple cosine similarity, and this is because the missing ratings are no longer treated as negative ratings, but as average ratings instead. It also turns out to be a good way to handle "tough raters" and "easy raters". In the world of Statistics the centered cosine similarity is also referred to as the Pearson correlation.
So far we have come up with a way to estimante similarity between users, but how do we actually make predictions for a user?
Let rx be the vector of user x's ratings (a row in the rating matrix). We'll use the notion of centered cosine similarity to find the set N of k users most similar to x that have also rated an item i. Once we have determined the set N of users that are similar to x, we can make predictions for the user x and item i.
One first simple option to make a prediction is to take the average rating for item i from the neighborhood:
This option is very simple but it actually ignores the similarity values between users. The neighborhood N contains users that are similar to the target user x, but the similarity values may be very different. There may be users very similar to user x and user that are not that similar to user x. So what we really want to do is to weigh the average rating with the similarity values, which leads us to the following rating formula:
The technique we have described so far is called user-based collaborative filtering, because, given an active user it tries to find other users that are similar to that active user, and then leverages the ratings of these similar users to make predictions for the active user.
A dual approach to user-based collaborative filtering is item-based collaborative filtering. Instead of starting off with a user, we start off with an item, and we try to find similar items. Then we compute ratings prediction for the target item, based on the ratings of similar items. We can use the same similarity metrics and prediction functions as per the user-based approach. Let's walk through an example to see what this means.
Given the rating matrix shown in the figure above, let's see how item-based collaborative filtering works, using a neighborhood N that contains only 2 items (given an item i, we will be looking at the two nearest neighbors of item i). Our goal is to estimate (predict) the rating of user u5 for item i1. The figure below shows our goal.
The first step is to take item i1 and find other items that are similar to i1. To do so we calculate the centered cosine similarity (or Pearson correlation) between item i1 and all the other items, and then we concentrate on those other items that user u5 has also rated.
The figure above shows the values of the similarity between item i1 and every other item. For purely didactic purposes we calculated also the similarity between item i1 and itself and, as expected, we obtained 1.0 since the two vectors are identical. After computing the centered cosine similarity between the items, we concentrate on the items that u5 has rated and we pick the top-2 items with the highest centered cosine similarity (remember in this example we are considering a neighborhood of size 2).
As shown in the figure above the two items that are the most similar to item i1, are item i6 and item i3. Therefore, i6 and i3 are the neighbors of item i1, and we can use the weighed average formula to predict the rating for item i1 by user u5. Let's recall the rating prediction formula we are going to use first (figure below).
Now, let's compute the rating prediction for item i1 by user u5.
r15 = (0.59*3 + 0.41*2) / (0.59+0.41) = 2.59 ~ 2.6
So the predicted rating for item i1 by user u5 is 2.6, as show in the figure below.
Should we recommend an item to a user for which we have predicted a rating value of 2.6 (in a 5-star rating context)? We may introduce a threshold and not recommend items that have a predicted rating value below this threshold, but there are other elements that play an important role in recommender systems, for instance coverage. Coverage is simply the percentage of possible recommendation a recommender systems is able to provide. If you enforce a high quality threshold on the recommendation you might improve your accuracy at expense of coverage. Finding the balance of where and when exactly you are better off recommending nothing at all can be delicate. Coverage can be also important to watch because it gives you the sense of how quickly new items in your catalog will start to appear in your recommendations. When a new book appears on Amazon, it won't be recommended until at least a few people buy it, therefore establishing patterns with the purchase of other items. But here we are entering the topic of tuning and evaluating recommender systems, that might be the subject of a next upcoming article.
Recommender systems are often consider to be closer to an art than to science.
User-based and item-based are the two ways you can perform collaborative filtering. In theory these are dual approaches and should have similar kind of performance, but in practice they don't perform similarly at all. It has been observed that item-based collaborative filtering outperforms user-based collaborative filtering in many use cases.
Items are "simpler" than users
This is because items are "simpler" than users. Items belong to a small set of genres, while users tend to have varied tastes. A user, for example, may like both pop music and classical music, while it's very unlikely that an item will belong to both those genres. So it turns out that the notion of item similarity is more meaningful than the notion of user similarity. That's why item-based collaborative filtering works better than user-based collaborative filtering in most cases.
Both of these techniques belong to a category called memory-based collaborative filtering methods (or neighborhood-based methods). Because of their simple and intuitive approach, they are easy to implement and it is often easy to justify why a specific item is recommended, and the interpretability of item-based methods is particularly significant. The main disadvantage of these methods is their limited coverage because of data sparsity. Data sparsity also creates challenges for robust similarity computation when the number of mutually rated items by two users is small. Also these methods suffer the cold-start problem both for new items and new users.
That's all for now. But there is still a lot to say about the fascinating world of recommender systems, so... stay tuned!
This article is written by Riccardo Saccomandi, Co-founder and CTO of Kickdynamic.