Spoiler alert: This post contains the full list of dead Game of Thrones characters... including those that will most likely die in season six.
What would you say if I told you that I can use machine learning to predict character deaths in the Game of Thrones season to come (season six)? You'd probably say: NOOOOO, don't spoil this for us!
Predict the future deaths of GOT characters?
And you're right! There is absolutely no way I'm going to spoil season six of Game of Thrones — for my own sake and for yours. So instead of trying to predict who is going to die in season six, I decided to have a look at the very last words of characters that have died throughout all the seasons and to see what would come out of this. Were they angry, disgusted, happy (that would be weird...), confused? Something else?
What Data Are We Talking About?
I used the Genius website to retrieve the episode scripts of Game of Thrones, from the first season through the last. I also used Wikiquote's Game of Thrones page to retrieve the full list of characters that have died since the series started. Then I pulled all of this data into one dataset:
Dataset containing the scripts of the episodes
Dataiku DSS is pretty awesome for working with text data, including movie scripts. I used several processors of the preparation script to clean the data. The most important ones are:
- Regular expressions to remove all the text between brackets or parentheses (which was actually information about the scene)
- Simplify text to normalize the text, stem words, and remove stop words.
Prepare recipe for text mining
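For readers who prefer code to screenshots, here is a minimal Python sketch of those two cleaning steps outside of Dataiku. It is not the Dataiku processors themselves: the stop-word list is a tiny made-up sample, the example line is invented, and stemming is left out to keep the sketch short.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "you", "i", "it"}

# Matches text inside [...] or (...), which in the scripts holds scene information.
SCENE_NOTE = re.compile(r"\[[^\]]*\]|\([^)]*\)")


def strip_scene_notes(line: str) -> str:
    """Remove bracketed or parenthesized notes and collapse leftover whitespace."""
    without_notes = SCENE_NOTE.sub(" ", line)
    return re.sub(r"\s+", " ", without_notes).strip()


def simplify(line: str) -> list[str]:
    """Lowercase, keep alphabetic tokens, and drop stop words (no stemming here)."""
    tokens = re.findall(r"[a-z']+", line.lower())
    return [t for t in tokens if t not in STOP_WORDS]


# Invented example line, only for illustration.
raw = "[He kneels] The man who passes the sentence (quietly) should swing the sword."
clean = strip_scene_notes(raw)
print(clean)
print(simplify(clean))
```

The regular expression assumes notes are not nested; a script with brackets inside parentheses would need a more careful pattern.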
The last action I took was to extract the last sentence spoken by each character before their death.
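As a rough illustration of that extraction step, here is a small Python sketch. The row layout (season, episode, line number, speaker, text), the character names, and the lines themselves are all hypothetical placeholders, not the actual dataset.

```python
import re

# Hypothetical script rows: (season, episode, line_no, speaker, text).
script_rows = [
    (1, 1, 1, "Character A", "We should go back."),
    (1, 1, 2, "Character B", "Do the dead frighten you?"),
    (1, 2, 1, "Character A", "I saw what I saw. I broke my oath."),
    (1, 3, 1, "Character C", "I am still here."),
]
# Hypothetical list of characters known to have died.
dead_characters = {"Character A", "Character B"}


def last_sentence(text):
    """Return the final sentence of a spoken line, split on ., ! or ?."""
    parts = [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]
    return parts[-1] if parts else ""


def last_words(rows, dead):
    """Return {character: last sentence of their final line} for each dead character."""
    final = {}
    # Sorting by (season, episode, line_no) lets later lines overwrite earlier ones.
    for season, episode, line_no, speaker, text in sorted(rows):
        if speaker in dead:
            final[speaker] = last_sentence(text)
    return final


print(last_words(script_rows, dead_characters))
```

This treats a character's final spoken line as their last words, which is only an approximation: a character who dies off-screen or several episodes after their last line would still be matched to that line.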
Now if you know Game of Thrones inside and out (like I do), you should be able to answer this simple question: to which characters do these last words belong?
Don't know the answer? Take a look at the web app at the end of this post and find out.
The big thing about Game of Thrones is that deaths are always unexpected (think about the Red Wedding... nobody expected that to happen). But I was hoping that, given the characters' last words, we could estimate their state of mind at the time of their deaths:
- Angry
- Sad
- Surprised
- Disgusted
- Scared
In order to do so, I decided to perform what is called sentiment analysis on the dataset.
Sentiment What?
Sentiment analysis is a technique used to infer the general feeling of a text. Historically, it was used so that machines could automatically determine whether a sentence was positive, negative, or neutral. If you look up "sentiment analysis" on the web, you'll find lots of examples.
To do so with my dataset, I used the list of words available on this Git repo and built a model using the sentences I pulled from the scripts. The idea was to train a model on the sentences I downloaded (removing, of course, those corresponding to the last words of dead characters) and then apply the model to the last-words dataset!
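The post does not spell out how the word list becomes training labels, so the following Python sketch is only one plausible reading: label each sentence with the sentiment whose words appear most often, and keep the last words out of the training set. The lexicon entries and function names are hypothetical.

```python
# Hypothetical emotion lexicon (word -> sentiment), standing in for the word list.
LEXICON = {"hate": "angry", "kill": "angry", "cry": "sad", "afraid": "scared"}


def label_sentence(tokens, lexicon=LEXICON):
    """Return the most frequent lexicon label among the tokens, or None if no match."""
    counts = {}
    for token in tokens:
        label = lexicon.get(token)
        if label is not None:
            counts[label] = counts.get(label, 0) + 1
    if not counts:
        return None
    # Ties resolve to the alphabetically first label so the result is deterministic.
    return min(counts, key=lambda lab: (-counts[lab], lab))


def split_train_holdout(sentences, last_words):
    """Keep last words out of training so the model never sees them beforehand."""
    held_out = set(last_words)
    train = [s for s in sentences if s not in held_out]
    return train, sorted(held_out)


print(label_sentence(["i", "hate", "you", "and", "will", "kill", "you"]))
print(split_train_holdout(["a b", "c d", "e f"], ["c d"]))
```

Holding the last words out matters because the model is later applied to exactly those sentences; training on them would make its predictions look better than they are.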
Training a model using text features is super easy (and quick) with Dataiku. You can also select how to handle the text features with the following:
- TF/IDF vectorization
- Tokenize and hash
- Tokenize, hash, and apply SVD
- Counts vectorization
Text feature handling in Dataiku DSS
I decided to go with a classic TF/IDF vectorization. This way Dataiku automatically creates sparse features, one per word or n-gram, each weighted by how often the term appears in a sentence and how rare it is across the whole dataset.
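To make the idea concrete, here is a minimal pure-Python sketch of TF/IDF weighting. It uses one common smoothed variant of the formula and is not necessarily what Dataiku computes internally; the three toy documents are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Weight = raw term count * (ln((1 + N) / (1 + df)) + 1), where N is the
    number of documents and df is how many documents contain the term.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


# Toy tokenized documents, only for illustration.
docs = [["winter", "is", "coming"], ["winter", "is", "here"], ["fire", "and", "blood"]]
for vector in tf_idf(docs):
    print(vector)
```

Each dict stores only the terms present in that document, which is why the features are called sparse: rarer words such as "coming" get a higher weight than words shared by several documents such as "winter".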
Once my random forest was trained (based on the features generated by the TF/IDF vectorization), I applied it to my last sentences dataset to determine the dominant sentiment of these last words. If you want to see how the algorithm classified the dead characters, just have a look at the following web app.
Of course, the classification is not meaningful in all cases: when a character's last words are only "Olly...", it is difficult to predict the right label (if one exists).
Stop talking. We want to play with the web app!
Let's Visualize All of This!
For your (and of course my) pleasure, I decided to create a web app within Dataiku to refresh our memories on character deaths with data visualization. I adapted one of Mike Bostock's scripts so that we can see how characters fell to their deaths, what their last words were, and how the algorithm classified them (each node corresponds to a sentiment).
Instructions: When you click on a character, check out the top of the web app to see the character's name and last words, then click on "death scene" to watch it on YouTube! When you click on a node, you'll find out what sentiment it represents. Enjoy :)
If you have any questions about this post, don't hesitate to reach out to me. And if you liked this post, you might also like to see data visualization about Marvel comics or the Tour de France.