Movie Recommendation: Should I Trust IMDb?

Use Cases & Projects, Dataiku Product Jean-Baptiste Rouquier

Nice comments about the top 12 secret shortcuts of Dataiku DSS convinced me to write another one: this time it's about plugins, APIs, movies, and how to choose the best one to watch with friends.

Choose a Good Movie

A friend of mine has a list of all her DVDs (and takes note of who borrowed what). How do I choose the top five movies I'd like to borrow?

I'm not a movie buff and I don't know most of the titles in her collection. And when I do know one, it's usually because I have already seen it (and I would prefer to watch a new one). So I need a source other than my own movie knowledge to choose from.

There is a way: I have noticed that I often agree with IMDb and Metacritic ratings. So the task becomes this: get ratings from various sites for all those DVDs and filter them to keep only the highest-rated ones.

Once the question is well defined, the first step is getting the data. I found two nice free APIs: OMDb and TMDb.

  • OMDb provides IMDb, Metacritic, and Rotten Tomatoes scores.
  • TMDb is better at matching approximate and foreign titles.

I coded a quick plugin to request those APIs and got this flow:

Movie recommendation flow

The red recipes are the ones created thanks to the plugin: they request OMDb/TMDb and enrich DVDs_list with movie ratings. The yellow recipe joins all info into the final dataset.
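The plugin code itself is not shown in this post. As a rough sketch of what a single enrichment step could look like in plain Python, here is one way to build an OMDb title lookup and pull out a few fields. It assumes the OMDb query parameters `t` and `apikey` and the response fields `imdbRating`, `Metascore`, and `Runtime`, with `"N/A"` marking missing values; the function names and output keys are hypothetical, so check them against the current OMDb documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed OMDb endpoint; the API key is a placeholder you must supply.
OMDB_ENDPOINT = "https://www.omdbapi.com/"


def build_omdb_url(title, api_key):
    """Build a lookup-by-title URL (hypothetical helper, not the plugin's code)."""
    return OMDB_ENDPOINT + "?" + urlencode({"t": title, "apikey": api_key})


def to_number(value):
    """Convert a rating string to float; return None for "N/A" or missing values."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def extract_ratings(payload):
    """Keep only the fields of interest from a decoded JSON response (a dict)."""
    # Runtime is assumed to look like "98 min"; keep the leading number only.
    runtime = (payload.get("Runtime") or "").split(" ")[0]
    return {
        "imdb_rating": to_number(payload.get("imdbRating")),
        "metascore": to_number(payload.get("Metascore")),
        "runtime_min": to_number(runtime),
    }
```

Calling `extract_ratings` on each decoded response and attaching the result to the matching row of DVDs_list would give one enriched record per DVD, which is roughly what the red recipes do.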

Now I just need to open the resulting dataset, use the Dataiku DSS filter to keep high enough ratings, and voilà, a list of the first movies I'd like to watch with my friend.
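The same filter can be sketched outside the DSS interface with pandas. The column names, thresholds, and rows below are made-up placeholders chosen only to make the example runnable; they are not the actual dataset.

```python
import pandas as pd

# Placeholder rows, NOT real ratings: they only make the sketch runnable.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_rating": [8.1, 6.2, 7.6, None],
    "metascore": [85, 90, 72, 95],
})


def top_movies(df, min_imdb=7.5, min_metascore=70, n=5):
    """Keep movies above both (hypothetical) thresholds, best IMDb rating first."""
    # Rows with a missing rating compare as False and are dropped.
    keep = (df["imdb_rating"] >= min_imdb) & (df["metascore"] >= min_metascore)
    return df[keep].sort_values("imdb_rating", ascending=False).head(n)


shortlist = top_movies(movies)
print(shortlist)
```

Requiring both ratings to be high is a design choice: it favors movies on which users and critics agree, at the cost of discarding titles with a missing score.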

As a bonus, a column contains the runtime to help us choose a short or lengthy movie. I also like the “tomato consensus,” which tells us where the movie shines (deep characters, gorgeous scenery, great jokes...) without any spoilers.

Choose a Recommendation Site

Quick question: Which site has the best ratings? Or, more precisely, where are the ratings most useful to me?

With the above tools, it would be a shame not to run some cool stats. Let's see if we can gain some insights from those numbers.

I have rated most movies I've seen on SensCritique.com (the site predicts a rating for you based on ratings by people with similar taste). To get my data back, I used the import.io plugin (not detailed in this post). Then I ran the same flow: I augmented this list of films with info from TMDb and OMDb. This yields a dataset of movies with several rating sources: IMDb, Metacritic, Rotten Tomatoes, TMDb users... and me. ;)

Can we find some correlations?

Correlations

  • As could be expected, the highest correlation is between tomatoMeter and tomatoRating: the meter is the percentage of positive reviews, while the rating is the average of all reviews. So they contain very similar information.
  • Metascore and the Tomato scores are pretty highly correlated too... and they indeed have the same kind of sources: they both aggregate lots of professional reviews.
  • The next highest is tomatoUser with IMDb: both aggregate scores from many non-professional users.
  • All of these correlate much less with SensCritique user ratings. I guess the French taste differs from the mainstream taste on American sites. ;)
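For readers who want to reproduce this kind of comparison, here is a minimal pandas sketch. It assumes one row per movie and one numeric column per rating source; the values are placeholders, not the ratings used in this post, and `myRating` is a hypothetical column name.

```python
import pandas as pd

# Placeholder values, NOT the ratings analyzed in this post.
ratings = pd.DataFrame({
    "imdbRating": [7.0, 8.0, 6.0, 9.0],
    "tomatoMeter": [70, 85, 55, 95],
    "myRating": [6, 8, 7, 9],
})

# Pairwise Pearson correlation between all rating columns.
# Missing values are excluded pair by pair, so sparse sources still work.
corr = ratings.corr(method="pearson")

# Which sources track my own ratings best?
closest = corr["myRating"].drop("myRating").sort_values(ascending=False)
print(corr.round(2))
print(closest)
```

Pearson correlation is insensitive to the scale of each source, so a 0-10 rating and a 0-100 percentage can be compared directly without rescaling.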

On the following more detailed plot, we see that IMDb ratings never fall below five for these movies, while SensCritique is more demanding and uses most of the possible ratings (TMDb essentially uses only three distinct values).

Pandas visualization

For quite some time now, I have been checking IMDb ratings before watching a movie, and I'm probably unconsciously influenced by them. If I'm told a movie is good, I will tend to like it more... so the correlation between my scores and IMDb ratings is probably slightly biased.

On the other hand, I take great care not to read the user scores on SensCritique before giving my own score, to avoid any bias. It's interesting to see that my taste correlates most with SensCritique.

Finally, my personal takeaway is that Rotten Tomatoes might be closer to my taste than Metacritic. I will pay more attention to it from now on.
