I recently built a movie recommendation application in Data Science Studio. This time I'll show you how to do clustering in DSS without writing a single line of Python code. For more fun, we'll cluster actors based on features we can extract from the movies they’ve played in.
Of course, this example can be applied by companies wanting to do the same on their own customer databases. The most common error I see is people rushing to the clustering algorithm without preparing the data. You cannot throw your client database into a K-means and expect it to come out magically clustered. You have to spoon-feed the algorithm!
For this, your dataset must be formatted and aggregated in the right way.
Formatting your data.
My initial dataset is a movie dataset. First, I have to transform it into an actor dataset. One column of my data is a list of actors; I use the split-and-fold processor of DSS to reshape it correctly:
My dataset of 7K movies is now a dataset of 17K actors.
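For readers curious about what split-and-fold does under the hood, here is a rough pandas equivalent (the column names and sample data are hypothetical, not taken from the actual dataset):

```python
import pandas as pd

# Hypothetical movie dataset: one row per movie, with actors packed
# into a single comma-separated column.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "actors": ["Actor 1,Actor 2", "Actor 2,Actor 3"],
})

# Split the actor list, then "fold" it: one row per (movie, actor) pair.
actors = (
    movies.assign(actor=movies["actors"].str.split(","))
          .explode("actor")
          .drop(columns="actors")
)
```

Each movie row is duplicated once per actor, which is why the row count grows from 7K movies to 17K actor appearances.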
Now I want to aggregate my dataset by actor, because some actors appear in several movies. To do this I use the DSS grouping recipe. To keep it simple, I aggregate only on numerical features, with operations like average or sum:
- Proportion of American movies they played in.
- Average duration of their movies.
- Sum of fans of their movies.
- Average press and spectator ratings.
- Sum of user ratings.
I also activate a post filter to keep only actors who played in at least 5 movies.
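The grouping recipe plus post filter amounts to a group-by with aggregations and a row filter. A pandas sketch of the same step (column names and the toy data are illustrative assumptions; the filter threshold is lowered to 2 so the toy example keeps some rows):

```python
import pandas as pd

# Hypothetical per-(movie, actor) rows after the split-and-fold step.
df = pd.DataFrame({
    "actor": ["A", "A", "B", "B", "B", "C"],
    "is_american": [1, 0, 1, 1, 1, 0],
    "duration": [90, 110, 120, 100, 95, 80],
    "fans": [10, 20, 300, 250, 400, 5],
    "rating": [3.0, 4.0, 4.5, 4.0, 3.5, 2.0],
})

# One row per actor: averages and sums, like the DSS grouping recipe.
per_actor = df.groupby("actor").agg(
    movies=("actor", "size"),
    pct_american=("is_american", "mean"),
    avg_duration=("duration", "mean"),
    total_fans=("fans", "sum"),
    avg_rating=("rating", "mean"),
)

# Post filter: keep only actors with at least N movies (N=2 for the toy data).
per_actor = per_actor[per_actor["movies"] >= 2]
```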
Clustering the data with an algorithm.
My dataset is now ready to feed into a clustering algorithm. Clustering is a method for discovering patterns in data. In my actor clustering, I expect to see a cluster of Hollywood superstars, another of popular French actors, and so on. There are several clustering algorithms in DSS: K-means, Ward hierarchical clustering, spectral clustering, and DBSCAN.
Here are some examples of scikit-learn clustering algorithms applied to data points of different shapes.
For the features, it is recommended to rescale them by mean and standard deviation. DSS does this by default for your numerical features. Then, in the algorithm tab, you can select several algorithms and several numbers of clusters, and run all the experiments at a glance.
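In scikit-learn terms, this rescale-then-cluster step looks like the following sketch (the toy feature matrix stands in for the aggregated actor features):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy feature matrix: two well-separated groups of points.
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])

# Rescale each feature to zero mean and unit standard deviation,
# which is what DSS does by default for numerical features.
X_scaled = StandardScaler().fit_transform(X)

# Fit K-means with a chosen number of clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
```

Rescaling matters because K-means uses Euclidean distances: without it, a feature like "sum of fans" (in the thousands) would dominate one like "average rating" (on a small scale).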
Now you may ask: how many clusters should I pick? You may or may not already have an idea, but often it isn’t obvious. To find the best number, or at least a correct one, you can check the silhouette. A silhouette coefficient near 1 indicates that the sample is far away from the neighboring clusters; a value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters.
So the higher the silhouette score, the better your clustering should be!
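The "run several cluster counts and compare silhouettes" experiment that DSS does for you can be sketched with scikit-learn like this (the three-blob toy data is an assumption standing in for the actor features):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Toy data: three well-separated blobs standing in for the actor features.
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0, 5, 10)])
X = StandardScaler().fit_transform(X)

# Try several cluster counts and keep the one with the best silhouette.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

On this toy data the silhouette peaks at the true number of blobs; on real data the peak is usually less pronounced, so treat it as a guide rather than a verdict.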
Here, my best clustering seems to be a K-means with 4 clusters. Let's have a look.
The first thing I usually check is the heat map of my clusters:
It reveals how the clustering algorithm groups the data. Here I can guess that cluster 1 is the Hollywood superstars, with lots of fans. I'll guess that cluster 2 represents the "all fame" actors: good movies but few fans. Clusters 0 and 2 contain more French movies; clusters 1 and 3 contain more American ones.
You'll notice that the American actors have much better ratings and more fans than the French ones! Also, the superstars play in longer movies and aren't always liked by the press.
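The heat map is essentially a table of per-cluster feature averages. A minimal sketch of how you could build the same table yourself, assuming you have the (hypothetical) scaled actor features with cluster labels attached:

```python
import numpy as np
import pandas as pd

# Hypothetical scaled actor features with an assigned cluster label.
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(8, 3)),
                  columns=["total_fans", "avg_rating", "avg_duration"])
df["cluster"] = [0, 0, 1, 1, 2, 2, 3, 3]

# The heat map is this matrix: the mean of each feature per cluster.
cluster_profile = df.groupby("cluster").mean()
```

Rendering `cluster_profile` as a color-coded grid (e.g. with matplotlib's `imshow`) gives the heat map, which is why it makes cluster interpretation so quick.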
Visualize the clusters.
As soon as my clustering was done, I created an insight to visualize it.
Cluster 0: the French mafia.
Cluster 1: the Hollywood superstars.
Cluster 2: the "legend" actors.
Cluster 3: American stars.
Voilà! And this literally took me one hour. If you want to find out more or try this out for yourself, download Dataiku’s free Community Edition here.
Till next time!