Machine Learning. Explanation of Collaborative Filtering vs Content Based Filtering.

Published in

codeburst

9 min readMay 8, 2018

Recommender systems help users select similar items when something is being chosen online. Companies such as Netflix or Amazon, would suggest users different movies that might interest them and are worth watching. Yelp is using similar algorithms to suggest different restaurants and services. These types of algorithms lead to service improvement and customers satisfaction. Exploring and evaluating recommender systems for Yelp to recommend the best sushi place to user by creating profiles for users and sushi places based on discovered ratings and restaurant features. The method is based on content and collaborative filtering approach that captures correlation between user preferences and item features.

Introduction

Mass customization is becoming more popular than ever. Current recommendation systems such as content-based filtering and collaborative filtering use different information sources to make recommendations [1]. Content-based filtering, makes recommendations based on user preferences for product features. Collaborative filtering mimics user-to-user recommendations. It predicts users preferences as a linear, weighted combination of other user preferences.

Both methods have limitations. Content-based filtering can recommend a new item, but needs more data of user preference in order to incorporate best match. Similar, collaborative filtering needs large dataset with active users who rated a product before in order to make accurate predictions. Combination of these different recommendation systems called hybrid systems [2].

They can mix the features of the item itself and the preferences of other users.

Research Question

Question as an example being explored “what is the best sushi place to be recommended to user?”. Finding users that are related to user likes and the best recommender system improves user experience by recommending best matched products, reducing search times and frustration.

Proposed Solution

Two basic recommender systems are being used for recommendations. Content-based filtering and Collaborative filtering.

First method, Content-based filtering. It relies on similarities between features of the items. It recommends items to a customer based on previously rated highest items by the same customer. List of features about these items needs to be generated.

Each item will have an item profile
A table structure will list these properties
Comparing what and how many features match and collect scores
Recommend highest scored item
Code will be based on an algorithm, by given some item, the most similar item will be found
Best scoring match will be provided to the user
This method relies on item features only, and not the user preferences.

Second method, Collaborative filtering. It relies on how other users responded to these same items. It doesn’t rely of features of the item, but the preferences from other users. Similar users survey needs to be done.

Users will have a table with different rated items of what they choose or liked
Based on the similarities, prediction can be make of what the user might like, based on what similar users did.
The list will be filtered and matched to users who used the same items for comparison and recommendations
Everything will be summed up and highest score will be recommended
Code will be created based on an algorithm, by given a user x, recommend an item that x might like
Item with the highest score will be recommended

Problem with this method is that you need to have a data to make recommendations. More data you have, better recommendations will be.

Data Processing

Dataset is being extracted from Yelp dataset challenge online at https://www.yelp.com/dataset/download. It’s being filtered by user ratings of sushi places and demanded features around Los Angeles area. Users information and ratings are being collected.

4 user profiles and 5 item profiles are being constructed as an example, based on top 5 sushi places in Los Angeles. Sushi A, Sushi B, Sushi C, Sushi D and Sushi E.

Content based filtering table was based on features of meaningful words in the sushi places. TF-IDF was being used as shown below in the Figure 1.

Collaborative filtering table was based on user ratings of the sushi places as shown in Figure 2 below.

Alex rated all of the 5 sushi places: Sushi A, Sushi B, Sushi C, Sushi D and Sushi E
Bertha rated 3 sushi places: Sushi B, Sushi C and Sushi E
Colin only rated 2 sushi places: Sushi C and Sushi E
Ethan rated 1 sushi place: Sushi E

Evaluation Methods and Results

Content-based evaluation item profile can be seen as vector. Measured 0 or 1, depends on YES or NO of containing a feature inside that item.

Profile item was created by evaluating the item and finding meaningful words.

To pick important words, TF-IDF method was used. It looked for featured and important words to user in the sushi places. TFij stands for feature x in document y is number of times the feature x appears in document y divided by number of times that same feature appeared in a document. For example word salmon, appeared in a document 5 times. But in the other document it appeared 23 times. Then, the TFij = 5/23. The documents needed to be normalized in order to compare longer documents. The more times word appear, more important it is.

However, some words are more important that the others. For example little words “the” might appear a thousand times, but it’s not important at all. That’s where the document frequency comes in.

nx= number of documents that mention term x
N = total number of documents
IDFx (inversed document frequency for the word x) = log (N/nx)

More common the term, larger the x and lower IDF. It gives lower rate to common words and higher score to rare. Then, a given document was being calculated term in scores.

TF-IDF score: wxy (word x in y) = TFxy (IDFx)

So given document you computer TF-IDF scores for every term in a document, and pick highest score and that would be the document profile.

The formula for the TF-IDF score is:

Featured words were: Salmon, Tuna, Eel, Crab and Ramen as shown in Figure 3 below.

Boolean values were being used to build the table below. Similarities were being compared between 5 sushi places, to see how many matches. It showed that there 4 matches between Sushi A and Sushi B, 5 matches between Sushi A and Sushi C, 4 matches between Sushi A and Sushi D and 3 matches between Sushi A and Sushi E.

The Best match for Sushi A is Sushi C.

Below is a coded data structure for that called sushi:

typedef struct { bool Salmon; bool Tuna; bool Eel; bool Crab; bool Ramen; } sushi;

To compute if the items similar or not, and what their score is, a set of pseudocode to generate the same function is:

Given a sushi s1, find the most popular sushi

Set BestScore to some minimum value (0)

For each other sushi s2

Set MatchScore = 0

For each feature

Compare s1 and s2 on that feature

If they match, increment MatchScore

If MatchScore > BestScore

then BestMatch = s2 and

BestScore = MatchScore

Return BestMatch

Next, Collaborative filtering method was being used to create user profiles. It’s was being generated from other users preferences. Users with different types of sushi places were scored.

As was shown in Figure 2 above, Sushi A rated 1. Sushi B rated 5,4. Sushi C rated 3,5,2, Sushi D rated 1, and Sushi E rated 5,5,5,5. Just by looking at the number it seemed that all users have differences in scores except Sushi E. Similarities needed to be defined.

Centered Cosine or Pearson Correlation method was used. It captures intuition better. The way it works is by finding row mean.

To recommend the best next sushi place to try, ratings need to be normalized by subtracting user’s rating mean (3.7) from the user ratings.

41 (sum of scores)/11(number of ratings) = 3.7
Sushi A profile weight = -2.7/1 = -2.7
Sushi B profile weight = 1.6/2 = .8
Sushi C profile weight = 1.3/3 = .43
Sushi D profile weight = -2.7/1 = -2.7
Sushi E profile weight = 5.2/4 = 1.3

Results — user didn’t like Sushi A and Sushi D.

Some users didn’t rate all sushi places and the challenge was to decide on how to deal with these missing values. When we add a user who only rated Sushi C, the next sushi match that is recommended will be Sushi E, since it has the highest rating score. Ethan scores are being eliminated for comparison, since he didn’t rate Sushi C before, as shown in Figure 4 below.

Below is a coded data structure for that called user. It shows features of which sushi places user liked:

typedef struct {bool LikedSushiA; bool LikedSushiB; bool LikedSushiC; bool LikedSushiD; bool LikedSushiE; } user;

Additionally to compute what user x might like, a set of pseudocode can be generated as:

Given a user x, recommend a sushi that x might like

For all sushi s

Set score [s] = 0

For each other user y

If y’s preferences match x’s preferences

increment score[s] for all s that u likes

Find the sushi with the highest score and return it

To make better predictions, almost all the major systems use hybrid recommender systems.

It combines user profiles with item profiles and comparing to figure out what the rating will be for the user and the item.

To estimate the angel, a cosine similarity is being calculated. The smaller angel, the more similar the item is. The cosine similarity is being calculated between the user and items catalog. Then, the highest cosine score is being recommended to the user.

Conclusion

Content-based filtering outperforms user collaborative filtering. Items are more similar and make more sense than users similarities.

The experiment showed that if a user liked Sushi Place A before, then the next recommendation should be Sushi C, since it scored the highest.

However, if the recommendation based on other users preferences based on similarities, then it showed that users didn’t like Sushi A and Sushi D. Based on their weighed scores, it showed that Sushi E would be the best recommendation.

Good thing about content based approach that you don’t need data about other users in order to make recommendations.

Problem with collaborative filtering is that when a unique user has a unique taste, there might not be similar matches of other users. Meanwhile, the content based approach can be build based on user and item profiles. Items can be recommended based on previous choices.

However, if a user never rated an item, it won’t be in the recommendations list.

The best approach would be to use a combination of different approaches. Mix of collaborative and content based filtering. Some of it will depend on preferences of the users and some on item features.

References

[1] Ansari, A. (n.d.). Internet Recommendation Systems. Retrieved August/September, 2000, from https://www0.gsb.columbia.edu/mygsb/faculty/research/pubfiles/385/Internet Recommendation Systems.pdf

[2] Shi, C. (2017, June 27). A Hybrid Recommender with Yelp Challenge. Retrieved from https://nycdatascience.com/blog/student-works/yelp-recommender-part-1/

[3] Maryam Khademi, M. (n.d.). Predicting a Business’ Star in Yelp from Its Reviews’ Text Alone. Retrieved from https://arxiv.org/pdf/1401.0864.pdf

[4] Jiaotong, X. (2013, October 1). Recommendation via user’s personality and social contextual. Retrieved from https://www.researchgate.net/profile/Xueming_Qian/publication/262328219_Recommendation_via_user's_personality_and_social_contextual/links/55472daa0cf234bdb21dbc85.pdf

[5] Xiaozhong, L., & Turtle, H. (2013, June 12). Real‐time user interest modeling for real‐time ranking. Retrieved from https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.22862

codeburst

Machine Learning. Explanation of Collaborative Filtering vs Content Based Filtering.

Introduction

Research Question

Proposed Solution

Data Processing

Evaluation Methods and Results

Conclusion

References

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Published in codeburst

Written by Elena Kirzhner

No responses yet