I recently built a movie recommendation application in Data Science Studio. This time, I'll show you how you can do clustering with DSS without using a single line of Python code. For added fun, we'll cluster actors, based on features we can get from the movies they’ve played in.
Of course, this example can be applied by companies wanting to do the same on their own customer databases. The common error I see is people rushing to the clustering algorithm without preparing the data. You cannot put your client database in a K-means and expect to have it magically clustered. You have to spoon-feed the algorithm!
For this, your dataset must be formatted and aggregated in the right way.
Formatting your data
My initial dataset is a movie dataset. First, I have to transform it to an actor dataset. A column of my data is a list of actors; I use the split and fold processor in DSS to reshape it correctly:
My dataset of 7K movies is now a dataset of 17K actors.
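For readers who want to see what this reshaping does outside DSS, here is a rough pandas sketch of a split-and-fold step. The toy table, the column names (`title`, `actors`), the comma separator and the helper `split_and_fold` are all assumptions made for illustration; this is not the DSS processor itself.

```python
import pandas as pd

# Hypothetical toy movie table; column names and separator are assumptions.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "actors": ["Alice, Bob", "Bob, Carol, Dan"],
})

def split_and_fold(df, column, sep=","):
    """Split a delimited column, then emit one row per split value."""
    out = df.copy()
    out[column] = out[column].str.split(sep)   # "Alice, Bob" -> ["Alice", " Bob"]
    out = out.explode(column)                  # one row per list element
    out[column] = out[column].str.strip()      # drop stray whitespace
    return out.reset_index(drop=True)

actor_rows = split_and_fold(movies, "actors")
print(actor_rows)
```

Each movie row is repeated once per actor, which is why the row count grows after the step.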
Now I want to aggregate my dataset by actors because some actors appear in several movies. To do this, I will use the DSS grouping recipe. To keep it simple, I just aggregate on numerical features and do operations like average or sum:
- Proportion of American movies they were in.
- Average duration of their movies.
- Sum of fans of their movies.
- Average press and spectator rating.
- Sum of user ratings.
I also activate a post filter to keep only actors who were in at least 5 movies.
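The grouping and post filter described above could look roughly like this in pandas. The column names, the toy values and the helper `aggregate_actors` are hypothetical, and press and spectator ratings are averaged separately here, which is only one reading of the list above.

```python
import pandas as pd

# Hypothetical actor-level rows (one row per actor and movie).
rows = pd.DataFrame({
    "actor": ["Alice"] * 5 + ["Bob"] * 2,
    "title": ["M1", "M2", "M3", "M4", "M5", "M1", "M6"],
    "is_american": [1, 1, 0, 1, 1, 0, 0],
    "duration": [100, 110, 90, 120, 130, 100, 95],
    "fans": [10, 20, 30, 40, 50, 10, 5],
    "press_rating": [3.0, 3.5, 4.0, 2.5, 3.0, 3.0, 4.5],
    "spectator_rating": [3.5, 3.0, 4.0, 3.0, 3.5, 3.5, 4.0],
    "user_ratings": [200, 150, 300, 100, 250, 200, 50],
})

def aggregate_actors(df, min_movies=5):
    """Aggregate numerical features per actor, then apply the post filter."""
    grouped = df.groupby("actor").agg(
        movie_count=("title", "count"),
        american_share=("is_american", "mean"),   # proportion of American movies
        avg_duration=("duration", "mean"),
        total_fans=("fans", "sum"),
        avg_press_rating=("press_rating", "mean"),
        avg_spectator_rating=("spectator_rating", "mean"),
        total_user_ratings=("user_ratings", "sum"),
    )
    # Post filter: keep only actors with at least `min_movies` movies.
    return grouped[grouped["movie_count"] >= min_movies]

actors = aggregate_actors(rows)
print(actors)
```

In this toy data only one actor passes the five-movie threshold, so the other is dropped before clustering.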
Clustering the data with an algorithm
My dataset is now ready to be fed into a clustering algorithm. Clustering is a method of discovering patterns in data. In my actor clustering, I expect to see a cluster with Hollywood superstars, another with popular French actors, and so on. There are several clustering algorithms in DSS: K-means, Ward hierarchical clustering, spectral clustering, DBSCAN.
Here's an example of scikit-learn clustering algorithms applied to differently shaped sets of data points.
For the features, it is recommended to rescale them using their mean and standard deviation. DSS does this by default for your numerical features. Then, in the algorithm tab, you can select several algorithms and several numbers of clusters, and run all the experiments in one go.
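To make the rescaling concrete, here is a minimal NumPy sketch of standardizing by mean and standard deviation. The helper `standardize` and the toy numbers are assumptions for illustration; I am not claiming this is how DSS implements it internally.

```python
import numpy as np

def standardize(X):
    """Rescale each column to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0   # leave constant columns at zero instead of dividing by zero
    return (X - mean) / std

# Hypothetical features on very different scales: average duration, total fans.
X = np.array([
    [95.0, 120.0],
    [110.0, 45000.0],
    [102.0, 3000.0],
    [125.0, 250000.0],
])
X_scaled = standardize(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Without this step, a distance-based algorithm such as K-means would be dominated by the column with the largest raw values, here the number of fans.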
Now, you may or may not have an idea of how many clusters you want, but, often, it isn’t that obvious. To find the best one, or at least a correct one, you can check the silhouette. Silhouette coefficients near 1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters.
So the higher the silhouette score, the better your clustering should be!
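If you want to reproduce this kind of comparison by hand, a sketch with scikit-learn might look like the following. The synthetic data (four well-separated blobs), the range of cluster counts and the helper `silhouette_by_k` are assumptions for illustration, not the actor dataset or the DSS experiment runner.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: four blobs around hypothetical centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

def silhouette_by_k(X, k_values):
    """Fit K-means for each k and return the mean silhouette score per k."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores

scores = silhouette_by_k(X, range(2, 7))
best_k = max(scores, key=scores.get)   # k with the highest mean silhouette
for k, s in scores.items():
    print(k, round(s, 3))
print("best k:", best_k)
```

The score is a guide rather than a verdict: a slightly lower silhouette can still be the better choice if the clusters are easier to interpret.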
Here, my best clustering seems to be a K-means with 4 clusters. Let's have a look.
The first thing I usually check is the heat map of my clusters:
It reveals how the clustering algorithm groups the data. Here I can guess that cluster 1 is Hollywood superstars with lots of fans. I'll guess that cluster 2 represents “hall of fame” actors: good movies but few fans. Clusters 0 and 2 contain more French movies; clusters 1 and 3 contain more American ones.
You'll notice that American actors have much better ratings and more fans than the French ones! Also, superstars appear in longer movies and are not always liked by the press.
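One way to build a table similar to such a heat map yourself is to standardize each feature and average it per cluster. This is only a sketch under my own assumptions: the helper `cluster_profile`, the toy values and the use of standardized means are illustrative, and the DSS heat map may be computed differently.

```python
import pandas as pd

def cluster_profile(features, labels):
    """Mean standardized value of each feature within each cluster."""
    z = (features - features.mean()) / features.std(ddof=0)
    clusters = pd.Series(list(labels), index=features.index, name="cluster")
    return z.groupby(clusters).mean()

# Hypothetical aggregated actor features and cluster labels.
features = pd.DataFrame({
    "total_fans": [1.0, 1.0, 9.0, 9.0],
    "avg_duration": [90.0, 110.0, 90.0, 110.0],
})
labels = [0, 0, 1, 1]

profile = cluster_profile(features, labels)
print(profile)
```

Positive cells mean the cluster sits above the overall average for that feature, negative cells below it, which is the kind of contrast used to name the clusters.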
Visualize the clusters
As soon as my clustering was done, I created an insight to visualize it.
Cluster 0: the French mafia.
Cluster 1: the Hollywood superstars.
Cluster 2: the "legend" actors.
Cluster 3: the American stars.
Voila! And this literally took me one hour. If you want to find out more or try this out for yourself, download Dataiku’s free Community Edition here.
Till next time!