Evaluating Design Choices of User-User Collaborative Filtering Across Datasets
Collaborative filtering is a class of techniques to create personalized recommendations based on ratings data.
Building a collaborative filtering recommender system involves many design choices, and making the right ones will hopefully improve the quality of the recommendations.
Here, we will discuss some of the most common design choices of a specific type of collaborative filtering algorithm called user-user collaborative filtering. Using two public datasets, one book ratings dataset and one movie ratings dataset, we will evaluate the effectiveness of each design choice.
Based on significance tests and critical difference diagrams, we will illustrate which design choices significantly improve recommendation quality across the two datasets.
User-User Collaborative Filtering
User-user collaborative filtering is a type of collaborative filtering algorithm. Given explicit ratings of users on items, the method predicts the ratings of items a user has not rated yet. It can thus provide personalized and automated product recommendations to users. The user-user collaborative filtering algorithm can be described by the following three steps (a minimal code sketch follows the list):
- Similarity Calculation: First, based on the ratings of a target user, calculate how similar all other users in the dataset are to that user, given their rating patterns. The simplest way to do this is to calculate the Pearson correlation coefficient between the target user's ratings and the ratings of each other user in the dataset. The coefficient is 1 (perfect similarity) if the two users' ratings agree perfectly on the items that they both rated, and -1 if they are perfectly opposed. Partial agreement leads to a similarity between -1 and 1.
- Neighborhood Selection: Optionally select a number of similar users. For example, take users with a positive similarity, that is with a correlation coefficient larger than zero, as neighbors. Another option is to select a fixed number of the most similar users.
- Prediction: For items (such as books or movies) that the target user hasn’t rated yet, calculate predicted ratings based on the ratings of the neighbors. The prediction is the average of the neighbors’ ratings of that item, weighted by their similarity to the target user.
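To make the three steps concrete, here is a minimal sketch of the baseline algorithm. It assumes the ratings are stored in a pandas DataFrame with one row per user, one column per item, and NaN for missing ratings; the function and variable names are illustrative and not taken from the linked implementation.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def predict_rating(ratings: pd.DataFrame, target_user, target_item) -> float:
    """Predict the target user's rating of target_item from a users-by-items
    DataFrame of explicit ratings (NaN marks items a user has not rated)."""
    target = ratings.loc[target_user]

    # Step 1: similarity of every other user to the target user, computed
    # on the items both users have rated.
    similarities = {}
    for user, row in ratings.drop(index=target_user).iterrows():
        common = target.notna() & row.notna()
        if common.sum() < 2:  # Pearson needs at least two common items
            continue
        sim = pearsonr(target[common], row[common])[0]
        if not np.isnan(sim):
            similarities[user] = sim

    # Step 2: neighborhood selection -- keep all users with positive similarity.
    neighbors = {user: sim for user, sim in similarities.items() if sim > 0}

    # Step 3: prediction as the similarity-weighted average of the neighbors'
    # ratings of the target item.
    numerator, denominator = 0.0, 0.0
    for user, sim in neighbors.items():
        rating = ratings.at[user, target_item]
        if not np.isnan(rating):
            numerator += sim * rating
            denominator += sim
    return numerator / denominator if denominator > 0 else np.nan
```

With this structure, each of the design choices discussed below only changes one of the three steps.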
Since it is based on similarities between users, this method is called user-user or user-based collaborative filtering¹. See also this post for a Python implementation of such a recommender system.
Design Choices of User-User Collaborative Filters
There exist many adaptions to the baseline user-based collaborative filtering algorithm outlined above that are intuitively expected to improve the quality of the recommendations. Below is a list of some possible design choices that we will evaluate.
In the context of the similarity calculation between the users, we evaluate two adaptions:
- Instead of the Pearson correlation coefficient, one can use a different similarity measure. Here, we also evaluate Spearman’s rank correlation coefficient for the similarity computation.
- The correlation coefficients between two users are calculated based on the items that both users have rated in common. If this overlap is very small, we should put less trust in the obtained similarity. Imagine two users have rated just two items in common and gave them identical ratings. The Pearson and Spearman correlation coefficients will then return a perfect similarity (similarity of 1), although in principle the two users might have very different tastes. One method to counteract this is to require a minimum number of items that each pair of users must have rated in common to be considered similar; otherwise, we consider them as not similar (similarity of 0). Below we evaluate the effect that different values for this minimum overlap have on the recommendations. Both adaptions are sketched in code after this list.
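Both adaptions only affect the similarity step. A hedged sketch of a similarity function using Spearman’s rank correlation coefficient together with a minimum-overlap requirement might look as follows (the min_overlap parameter name is purely illustrative):

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def similarity(user_a: pd.Series, user_b: pd.Series, min_overlap: int = 1) -> float:
    """Spearman similarity between two users' rating vectors (NaN = unrated).
    Pairs with fewer than min_overlap co-rated items get a similarity of 0."""
    common = user_a.notna() & user_b.notna()
    if common.sum() < max(min_overlap, 2):  # correlations need at least 2 points
        return 0.0
    sim = spearmanr(user_a[common], user_b[common])[0]
    return 0.0 if np.isnan(sim) else sim
```

Swapping the Pearson-based similarity in the sketch above for a function like this covers both adaptions at once; the experiments below vary the correlation function and the minimum overlap independently.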
In addition, when we calculate the predicted ratings as a weighted average over other users’ ratings, it can be beneficial to restrict the average to the most similar users (the neighbors). Common design choices for the neighborhood selection are:
- Taking a fixed number of most similar neighbors.
- Defining a minimal similarity and taking all users with a larger similarity as neighbors. Recall that in the baseline version explained above, we take all users with a similarity greater than zero as neighbors. Both selection strategies are sketched below.
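Both selection strategies amount to a small filter over the similarity scores from the first step. The following sketch is illustrative, not the exact code from the repository:

```python
from typing import Dict, Optional


def select_neighbors(similarities: Dict[str, float],
                     k: Optional[int] = None,
                     min_similarity: float = 0.0) -> Dict[str, float]:
    """Select either the k most similar users, or all users whose
    similarity exceeds min_similarity."""
    if k is not None:
        ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
        return dict(ranked[:k])
    return {user: sim for user, sim in similarities.items() if sim > min_similarity}


# Examples: a fixed number of neighbors, or a similarity threshold.
# neighbors = select_neighbors(similarities, k=800)
# neighbors = select_neighbors(similarities, min_similarity=0.2)
```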
Finally, a common design choice for the prediction of the ratings is mean adjustment:
- Different users rate on different scales: some users tend to rate in the 3–5 star range, others more in the 1–4 star range. Normalizing their ratings using their respective mean rating usually improves the recommendations. That is, for each user we mean-adjust their ratings by subtracting their mean rating from the data. When predicting for a particular user with the algorithm above, we then add that user’s mean back to the predicted ratings. In doing so, we account for the different scales over which users tend to rate, as sketched below.
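A sketch of a mean-adjusted prediction step, reusing the users-by-items DataFrame and a neighbor dictionary as in the sketches above (names are again illustrative):

```python
import numpy as np
import pandas as pd


def predict_mean_adjusted(ratings: pd.DataFrame, neighbors: dict,
                          target_user, target_item) -> float:
    """Weighted average over mean-centered neighbor ratings, shifted back
    by the target user's mean rating."""
    user_means = ratings.mean(axis=1)  # each user's mean over the items they rated
    numerator, denominator = 0.0, 0.0
    for user, sim in neighbors.items():
        rating = ratings.at[user, target_item]
        if not np.isnan(rating):
            numerator += sim * (rating - user_means[user])
            denominator += sim
    if denominator == 0:
        return np.nan
    return user_means[target_user] + numerator / denominator
```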
Datasets
To evaluate the different adaptions to the user-user collaborative filtering algorithm that we just discussed, we use two publicly available datasets. Both datasets contain explicit ratings on a scale from 1 to 5 stars.
- goodbooks-10k dataset, which contains over five million ratings from over 50,000 users on the 10,000 most popular books, made publicly available by Goodreads².
- MovieLens-100k dataset, which contains 100,000 ratings from 943 users on 1682 movies³.
Evaluation
We evaluate each design option of the user-based recommender system using the two datasets mentioned above. We do so by performing a double-leave-one-out cross-validation as follows:
- We randomly select a single user from the dataset, which we call the target user.
- We randomly select a rating from the target user’s ratings and remove it; we call that rating the target rating, and the corresponding item the target item.
- Based on the remaining ratings of the target user, we construct the user-user collaborative filter by calculating the similarities between the target user and all other users in the respective dataset, selecting a neighborhood, and creating a prediction for the target item.
- We then calculate the absolute difference between the predicted rating and the target rating. For some design choices, the collaborative filter might not be able to provide a prediction. Therefore, we also calculate a prediction coverage value, which is 1 if the algorithm is able to produce a rating for the target item, and 0 otherwise.
- We repeat this process 5000 times for the movie dataset and 500 times for the book dataset. We then present the mean absolute difference (the mean absolute error, or mae for short) and the mean prediction coverage, together with confidence intervals⁴. A sketch of this evaluation loop follows the list.
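Under the assumptions made above (a users-by-items DataFrame and a predict_rating-style function), one possible way to implement this evaluation loop is the following sketch:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)


def evaluate(ratings: pd.DataFrame, predict_fn, n_iterations: int = 500):
    """Hide one random rating of a random user, predict it, and record the
    absolute error and whether a prediction could be made (coverage)."""
    errors, covered = [], []
    for _ in range(n_iterations):
        target_user = rng.choice(ratings.index)                   # target user
        rated = ratings.columns[ratings.loc[target_user].notna()]
        target_item = rng.choice(rated)                           # target item
        target_rating = ratings.at[target_user, target_item]      # target rating

        held_out = ratings.copy()
        held_out.at[target_user, target_item] = np.nan            # remove the target rating

        prediction = predict_fn(held_out, target_user, target_item)
        if np.isnan(prediction):
            covered.append(0)
        else:
            covered.append(1)
            errors.append(abs(prediction - target_rating))
    return np.mean(errors), np.mean(covered)
```

Confidence intervals can then be computed from the collected per-iteration errors, following the definition referenced in footnote 4.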
Results
As the baseline user-user collaborative filtering algorithm, we use
- Pearson correlation coefficient as similarity measure
- All users with a similarity greater than zero as neighbors
- No mean-adjustment in the prediction
and take the minimum number of items that each pair of users must have rated in common to be considered similar equal to one. With these settings, we get an mae (confidence interval) of the baseline of 0.746 (0.693–0.797) on the book dataset and 0.829 (0.811–0.846) on the movie dataset. Note that the larger amount of data in the book dataset might explain why the baseline mae is better on the book dataset than on the movie dataset.
Correlation Function
As a first experiment, we replace the Pearson correlation coefficient in the baseline algorithm’s similarity calculation with the Spearman’s rank correlation coefficient. Running the evaluation as described above, we get the following mae, together with the baseline metrics, on the book dataset
- Baseline: 0.746 (0.693-0.797)
- Spearman: 0.734 (0.687-0.784)
and the movie dataset
- Baseline: 0.829 (0.811-0.846)
- Spearman: 0.828 (0.811-0.846).
Note that using Spearman’s rank correlation coefficient leads to a slightly better performance on average on both datasets.
Minimum Number of Items Rated in Common
As mentioned above, if only very few items are rated in common between two users, we might refrain from considering them as similar due to insufficient evidence. We investigate this assumption by increasing, in the baseline algorithm, the minimum number of items that each pair of users must have rated in common to be considered similar. In the figure below, we plot the mean mae and the mean prediction coverage for the book and movie dataset, respectively, together with the confidence intervals. Note that when the minimum number of items rated in common becomes large, the coverage drops. This is because there are no longer any users who have rated enough items in common with the target user while also having rated the target item.

We marked the value for which the mae is small while the coverage is still large (at a value of 5 for both datasets) with a green circle. We will use that value to estimate the statistical significance of that choice later. The metrics, together with the baseline metrics, are for the book dataset
- Baseline: 0.746 (0.693–0.797)
- Minimum 5 items rated in common: 0.722 (0.677-0.767)
and the movie dataset
- Baseline: 0.829 (0.811–0.846)
- Minimum 5 items rated in common: 0.812 (0.796–0.829).
Minimal Similarity
In the neighborhood selection, we can select neighbors depending on whether their similarity exceeds a certain threshold. In the baseline method, we use all users with a similarity greater than zero as neighbors. In the figure below, we show the mae and average coverage when increasing the threshold from 0.0 to 0.9. For large thresholds, we observe a drop in coverage, which can be explained by the fact that the number of neighbors decreases with increasing similarity requirements.

We find that for the book dataset, a similarity threshold of 0.4 provides good results, whereas for the movie dataset 0.2 seems better. Both values are marked with blue circles in the plots above, and yield for the book dataset
- Baseline: 0.746 (0.693–0.797)
- Minimal similarity of 0.4: 0.728 (0.683-0.772)
and the movie dataset
- Baseline: 0.829 (0.811–0.846)
- Minimal similarity of 0.2: 0.816 (0.799-0.833).
Number of Most Similar Neighbors
Instead of using a minimal similarity for the neighborhood selection, one can also take a fixed number of the most similar neighbors. We visualize the resulting metrics with such a neighborhood selection below.

We again mark the values for which we get the best results with green circles (9000 neighbors for the book dataset, and 800 for the movie dataset), which lead to the following metrics on the book dataset
- Baseline: 0.746 (0.693–0.797)
- 9000 neighbors: 0.722 (0.675–0.771)
and the movie dataset
- Baseline: 0.829 (0.811–0.846)
- 800 neighbors: 0.802 (0.785–0.819).
Mean Adjustment
Finally, we can account for the different scales in user ratings by mean-adjusting the rating data. We report the resulting metrics for the baseline and the mean-adjusted predictions below. Notice how the mean adjustment greatly reduces the observed mae on both the book and movie dataset.
Books
- Baseline: 0.746 (0.693–0.797)
- Mean-adjusted: 0.670 (0.623-0.720)
Movies
- Baseline: 0.829 (0.811–0.846)
- Mean-adjusted: 0.746 (0.730-0.762)
Critical Difference
Note that every design choice evaluated above reduces the mae on average, and thus potentially increases the recommendation quality. However, due to limited computational resources, large uncertainty intervals are associated with the observed average metrics. To get a better understanding of which design choices significantly reduce the mae, we perform a statistical hypothesis test (a so-called Friedman test) across both datasets and visualize the results in a critical difference diagram⁵. Algorithms that are not deemed significantly different are connected with a solid black bar.

The horizontal position corresponds to the average rank each method obtained across both datasets. A rank of 1 means that method always ranked first and thus always had the smallest mae compared to the other adaptions.
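As an illustration of how such a comparison could be set up, the per-iteration absolute errors of the different methods can be fed into scipy’s Friedman test, and the average ranks can be computed with rankdata. The error arrays below are placeholders, not the actual experiment results; the diagram itself can then be drawn with a post-hoc test, for example via the scikit-posthocs package.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Placeholder per-iteration absolute errors for three methods; row i of every
# array refers to the same held-out rating.
errors = {
    "baseline":      np.array([0.9, 1.2, 0.4, 0.8, 1.0, 0.7]),
    "spearman":      np.array([0.8, 1.1, 0.5, 0.7, 0.9, 0.6]),
    "mean_adjusted": np.array([0.6, 0.9, 0.3, 0.6, 0.8, 0.5]),
}

# Friedman test: do the methods differ significantly at all?
statistic, p_value = friedmanchisquare(*errors.values())
print(f"Friedman p-value: {p_value:.3f}")

# Average rank of each method (rank 1 = smallest error in an iteration).
matrix = np.column_stack(list(errors.values()))
average_ranks = rankdata(matrix, axis=1).mean(axis=0)
for name, rank in zip(errors, average_ranks):
    print(f"{name}: average rank {rank:.2f}")
```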
From the critical difference diagrams, we can observe that the mean adjustment is significantly better than the baseline method on both datasets. In fact, it is the only adaption to the baseline user-based collaborative filtering method evaluated here that significantly reduced the mean absolute error of the predictions.
However, for the right choice of parameters, all design choices considered here improved the average mae (smaller average rank in the plot above), even though the differences might not be statistically significant for the data collected here.
Going Forward
Besides running more cross-validation iterations to enhance the statistics, other aspects can be important when evaluating recommender systems, some of which are:
- Other metrics might be more relevant and informative than the mean absolute error and prediction coverage⁶. Depending on the use case and context, those could be true positive rates, ranking metrics, serendipity, or latency if one is interested in fast recommendations⁷.
- We evaluated only one single adaption to the baseline user-user collaborative filtering algorithm at a time. Combinations of multiple design choices might yield even better metrics.
- We touch upon only a few design choices of collaborative filters. There are, however, many more design choices in user-user collaborative filters⁸.
- We evaluated the design choices on two datasets only, and only for a limited number of iterations. The results above are thus only indicative, and more extensive testing might lead to more conclusive results, in particular concerning the statistical significance of the design choices.
- Both datasets considered here contain explicit ratings on a scale from 1 to 5. For implicit data or different rating scales, other design choices might work better.
The code to reproduce these results is available at https://github.com/fkemeth/book_collaborative_filtering/tree/evaluation.
References
[1] J. B. Schafer, D. Frankowski, J. Herlocker, S. Sen: Collaborative filtering recommender systems, https://doi.org/10.1145/3130348.3130372
[2] Data: goodbooks-10k, https://github.com/zygmuntz/goodbooks-10k
[3] Data: MovieLens-100k, https://www.kaggle.com/datasets/prajitdatta/movielens-100k-dataset/
[4] We use the definition from https://stackoverflow.com/a/50859611/12438757 to calculate confidence intervals
[5] J. Demsar, Statistical Comparisons of Classifiers over Multiple Data Sets, https://dl.acm.org/doi/10.5555/1248547.1248548
[6] J. L. Herlocker, J. A. Konstan, L. G. Terveen and J. Riedl, Evaluating Collaborative Filtering Recommender Systems, https://grouplens.org/site-content/uploads/evaluating-TOIS-20041.pdf
[7] M. Ge, C. Delgado-Battenfeld, D. Jannach, Beyond accuracy: evaluating recommender systems by coverage and serendipity, https://dl.acm.org/doi/10.1145/1864708.1864761
[8] J. L. Herlocker, J. A. Konstan, A. Borchers and J. Riedl, An Algorithmic Framework for Performing Collaborative Filtering, https://dl.acm.org/doi/10.1145/3130348.3130372