Ever wondered why you get a recommendation for Raiders of the Lost Ark after giving Star Wars a good rating on your favorite movie rental service? (Yes, back to the 80s...) That's the job of a recommendation engine, a tool widely used on the web for all kinds of products and services.
As I have been following the adventures of the excellent Edwin Chen lately and read about his implementation of a recommendation engine using Scalding -- the MapReduce framework used by Twitter -- I decided to try writing other VERY basic implementations in different languages. For a more complete overview and even more code samples, check out our how-to guide on recommendation engines.
Please note that this is neither optimized nor exhaustive. It simply illustrates that you can quickly build such an engine using the technologies at hand (depending on the size of your data). The scripts below can certainly be enhanced, and of course other software or languages can be used.
I'll be using the MovieLens 100K dataset, a collection of movie ratings with anonymized user IDs. The purpose of the recommendation engine is to help users find new movies to watch, given their preferences (as seen in their ratings). The algorithm is pretty simple: it computes a Pearson correlation coefficient between the rating vectors of each pair of co-rated movies (i.e., find every user who rated both movie A and movie B, along with the associated ratings). The coefficient ranges between -1 and 1: the closer to 1, the more similarly the two movies are rated by the same users. I'll skip the long explanation and go straight to the point.
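To make the similarity measure concrete, here is a minimal sketch of the Pearson coefficient for one pair of movies. The function name `pearson`, the sample ratings, and the choice to return 0 when one vector has no variance are all assumptions made for illustration; they are not taken from the implementations discussed in this post.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length rating vectors."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalized because the 1/n factors cancel
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat an undefined correlation as 0
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings from four users who rated both movie A and movie B
ratings_a = [5, 4, 3, 1]
ratings_b = [4, 5, 2, 1]
similarity = pearson(ratings_a, ratings_b)
```

With these made-up ratings the two movies come out strongly correlated (roughly 0.86), since the same users tend to like or dislike both.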
Implementation in Pig
If you have a massive dataset and access to a Hadoop cluster, you can build a scalable recommender in Pig, a high-level data language built on top of Hadoop:
As a quick note, Pig can be easily installed on Mac OS X with Homebrew. For development or testing purposes, you can run Pig in local mode (pig -x local), which does not require HDFS.
Implementation in SQL
Now let's say your data is not too large, or you have access to an MPP database (Teradata, Netezza, Vertica, Aster Data, Greenplum...). In that case, you can go the SQL way. Here is an implementation with MySQL:
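As a rough, self-contained illustration of the SQL approach, the sketch below uses SQLite through Python's standard `sqlite3` module rather than MySQL. It assumes one way to express the computation: a self-join on the user ID to build the co-ratings, followed by per-pair aggregates from which the Pearson coefficient is derived. The table name `ratings`, its columns, and the toy data are hypothetical, and this is not the MySQL code the post refers to.

```python
import sqlite3
from math import sqrt

# Hypothetical toy data: (user_id, movie_id, rating)
rows = [
    (1, "A", 5), (1, "B", 4), (1, "C", 1),
    (2, "A", 4), (2, "B", 5), (2, "C", 2),
    (3, "A", 3), (3, "B", 2), (3, "C", 5),
    (4, "A", 1), (4, "B", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id INTEGER, movie_id TEXT, rating REAL)")
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)

# Self-join on user_id: one row per (user, movie pair), then aggregate per pair.
# The movie_id < movie_id condition keeps each unordered pair only once.
query = """
SELECT r1.movie_id, r2.movie_id,
       COUNT(*), SUM(r1.rating), SUM(r2.rating),
       SUM(r1.rating * r1.rating), SUM(r2.rating * r2.rating),
       SUM(r1.rating * r2.rating)
FROM ratings r1
JOIN ratings r2
  ON r1.user_id = r2.user_id AND r1.movie_id < r2.movie_id
GROUP BY r1.movie_id, r2.movie_id
"""

similarities = {}
for a, b, n, sx, sy, sxx, syy, sxy in conn.execute(query):
    # Pearson coefficient computed from the running sums
    denom = sqrt(n * sxx - sx * sx) * sqrt(n * syy - sy * sy)
    similarities[(a, b)] = (n * sxy - sx * sy) / denom if denom else 0.0
conn.close()
```

Keeping the heavy lifting (the join and the sums) inside the database is the point of this approach; only one small row per movie pair comes back to the client.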
Implementation in SAS
Even though R is gaining a lot of traction and becoming the new standard among statistical software packages, there are still plenty of organizations using SAS for analytics, especially in the "enterprise" world. So the third implementation is in SAS:
SAS is mainly I/O bound. If you have access to a machine with large, fast disks, this will scale to a lot of data.
Implementation in Python
Just for fun, the last implementation is an attempt in Python:
Note that this will not scale well, as the process of creating the co-ratings is very expensive...
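To see where the cost comes from, here is a minimal sketch of the co-rating step in plain Python. The function name `coratings`, the tuple layout, and the sample data are assumptions for illustration only, not the script from this post.

```python
from collections import defaultdict
from itertools import combinations

def coratings(ratings):
    """Group ratings by user, then emit every pair of movies that user rated.

    `ratings` is an iterable of (user_id, movie_id, rating) tuples.
    Returns {(movie_a, movie_b): [(rating_a, rating_b), ...]}.
    """
    by_user = defaultdict(list)
    for user_id, movie_id, rating in ratings:
        by_user[user_id].append((movie_id, rating))

    pairs = defaultdict(list)
    for user_ratings in by_user.values():
        # A user with n ratings contributes n * (n - 1) / 2 pairs
        for (m1, r1), (m2, r2) in combinations(sorted(user_ratings), 2):
            pairs[(m1, m2)].append((r1, r2))
    return pairs

# Hypothetical toy data: (user_id, movie_id, rating)
sample = [
    (1, "A", 5), (1, "B", 4), (1, "C", 1),
    (2, "A", 4), (2, "B", 5),
]
pairs = coratings(sample)
total_coratings = sum(len(v) for v in pairs.values())
```

The number of pairs grows quadratically with the number of ratings per user: a single user with 100 ratings already generates 4,950 co-ratings, all held in memory in this sketch.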
So, what about the results?
So back to Star Wars. Using this dataset, the top 5 recommendations are:
- Empire Strikes Back, The (1980) 0.748
- Return of the Jedi (1983) 0.672
- Raiders of the Lost Ark (1981) 0.536
- When We Were Kings (1996) 0.515
- Some Folks Call It a Sling Blade (1993) 0.509
Once again, this is a very simplistic approach to creating recommendations, but the point is that you can get to a working prototype pretty fast, using various technologies.
The source code can be found in this GitHub repo. And don't forget to download the complete guide to building your own recommendation engine for more!