Today's subject is natural language processing with a dataset of choice: Hillary Clinton's emails. These emails are a hot topic in the current Democratic primaries. In this blog post, we will use data rather than opinions to analyse this controversy.
You all know that the race for the United States presidency has already begun. While the Republican primaries are a bit messy with no clear winner at this point, not so long ago, it seemed the Democrats already had a definite winner.
However, a few months ago, an important controversy revolving around Hillary Clinton’s emails arose. Indeed, Hillary made the mistake of using a personal email address for work instead of the one provided by the Federal Government while she was Secretary of State, which, some say, could have resulted in the hacking and disclosure of classified information - and therefore posed a potential threat to National Security.
To cut the controversy short, Hillary Clinton decided to hand over 30,000 emails from her private server, dating from her time at the State Department, that she believed belonged in the public domain. That's how Kaggle made some of these emails available in this nice dataset.
During her brilliant performance at the first Democratic presidential debate, one of her opponents, Bernie Sanders, declared that "the American people are sick and tired of hearing about your damn emails". Hillary couldn’t agree more.
However, from a data scientist's point of view, such a dataset is just too tempting. I couldn't dream of a better corpus to apply some Natural Language Processing tricks than a bunch of emails from the former United States Secretary of State. In what you are about to read, I will share some insights I gained by analyzing this dataset and I’ll attempt to answer the question most Americans are asking themselves - “Did those emails contain information that could have been dangerous if leaked?” - by applying efficient natural language processing tools to this corpus of nearly 8000 emails.
Here is the flow we will go through today:
Kaggle completed part of the data munging for us by extracting the emails from the raw PDFs and gathering them in CSV format. Of course, I will use this as the starting point for this blog post. You can visit the Kaggle data page for detailed explanations of the datasets and their features. In this short study, I decided to use the emails dataset and the Person dataset.
Joining, Cleaning, and First Insights
Did you know that there are several ways to join two datasets in DSS? In my case, I needed to clean the text retrieved from the emails and to assign a sender with a normalized name to each email. Hence, I used a shaker, which enabled me to clean the TextBody feature of the Email dataset and join the Person dataset on the Sender ID in one recipe.
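For readers who don't have DSS at hand, here is a minimal sketch of the same join-and-clean idea in plain Python. It is not the shaker recipe itself: the sample rows are invented, the field names (`SenderPersonId`, `TextBody`, `Id`, `Name`) are assumptions for illustration, and the cleaning is reduced to whitespace normalization.

```python
import re

# Hypothetical miniature versions of the two tables; the field names
# (SenderPersonId, TextBody, Id, Name) are assumptions for illustration.
emails = [
    {"Id": 1, "SenderPersonId": 10, "TextBody": "FYI\n\n  see   attached.  "},
    {"Id": 2, "SenderPersonId": 11, "TextBody": "Call me\tat 5pm"},
    {"Id": 3, "SenderPersonId": 99, "TextBody": None},
]
persons = [
    {"Id": 10, "Name": "Sender A"},
    {"Id": 11, "Name": "Sender B"},
]


def clean_text(text):
    """Collapse runs of whitespace and trim; treat missing bodies as empty."""
    if not text:
        return ""
    return re.sub(r"\s+", " ", text).strip()


def join_and_clean(emails, persons):
    """Left-join emails to persons on the sender id and clean each body."""
    name_by_id = {p["Id"]: p["Name"] for p in persons}
    joined = []
    for email in emails:
        joined.append({
            "Id": email["Id"],
            # Senders missing from the persons table are kept, not dropped.
            "Sender": name_by_id.get(email["SenderPersonId"], "unknown"),
            "TextBody": clean_text(email["TextBody"]),
        })
    return joined


rows = join_and_clean(emails, persons)
```

The left join keeps emails whose sender is missing from the persons table, which seems the safer default when counting emails per sender afterwards.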
DSS 2.1 now lets us visualise pivot tables in the chart engine. Let's look at the top email senders:
You are looking at the list of Hillary's closest collaborators!
Retrieve information on states
The US Secretary of State mainly deals with foreign policy. Hence, it would be interesting to tag emails by the countries they talk about. I did so in a Python recipe. The main ingredients in this recipe were:
pycountry: a package which provides a list of countries and other utilities
nltk: the Python package for natural language processing
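The tagging step can be sketched roughly as follows. This is an illustration rather than the recipe itself: it uses a three-country stand-in list and a regex tokenizer so that it runs with the standard library only, and the function names `tokenize` and `tag_countries` are hypothetical.

```python
import re

# A tiny stand-in list. With pycountry installed, the names could instead
# come from: [c.name for c in pycountry.countries]
COUNTRY_NAMES = ["Haiti", "Afghanistan", "Germany"]


def tokenize(text):
    """Lowercase word tokenizer (a regex stand-in for an nltk tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tag_countries(text, country_names=COUNTRY_NAMES):
    """Return the sorted list of countries whose name appears as a token."""
    tokens = set(tokenize(text))
    return sorted(name for name in country_names if name.lower() in tokens)


tags = tag_countries("Talks with GERMANY and Afghanistan.")
```

Note that matching on single tokens misses multi-word names such as "United States"; a fuller version would need phrase matching for those.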
Once done tagging emails by country, I implemented a second python recipe that computes the mutual information (using scikit-learn) between a word presence vector and the country presence vector. In plain English, I computed a score between word and country: the higher the score, the more a word appears in the same emails as a country. To explore the results, I created a web app with a country selector that displays a word cloud of relevant words for that country. The size of the word is a function of the mutual information between that word and the selected country.
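To make the score concrete, here is a small sketch that computes mutual information by hand between two presence vectors (1 if the word or country appears in an email, 0 otherwise). The recipe used scikit-learn; this pure-Python version is only meant to show the arithmetic, and the toy vectors are invented. As far as I know, `sklearn.metrics.mutual_info_score` computes the same quantity, also in nats.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length 0/1 sequences."""
    n = len(x)
    joint = Counter(zip(x, y))   # counts of each (x, y) pair
    count_x = Counter(x)         # marginal counts of x
    count_y = Counter(y)         # marginal counts of y
    mi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b), written with counts to avoid extra divisions
        ratio = n_ab * n / (count_x[a] * count_y[b])
        mi += p_ab * math.log(ratio)
    return mi


country = [1, 1, 0, 0]   # toy data: country present in emails 1 and 2
word_a = [1, 1, 0, 0]    # always appears alongside the country
word_b = [1, 0, 1, 0]    # appears independently of the country

score_a = mutual_information(word_a, country)
score_b = mutual_information(word_b, country)
```

A word that always co-occurs with the country gets the maximum score for this split (log 2), while a word that appears independently of it scores zero, so `word_a` would be drawn much larger in the word cloud.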
Here are a few examples:
Haiti mainly retrieves words about the tragic January 12, 2010 earthquake and the humanitarian crisis that followed
Afghanistan words mainly deal with US military presence in Afghanistan and terrorism
The Germany word cloud held a few surprises. Indeed, we retrieve expected words like "euro" but also a lot of intriguing words... There are a lot of references to Nazi Germany that don't make much sense in the 2010s. We also notice lots of absolutely ludicrous words like "hardcore", "rite" and "pot", which appear in reference to Germany.
Let's take a closer look by retrieving the emails containing those words. I opened an IPython notebook and queried the dataset to explore the emails containing those words:
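A query of this kind might look like the following sketch. The helper name `emails_mentioning`, the `TextBody` field, and the sample rows are all invented for illustration; the actual notebook queried the full dataset.

```python
import re


def emails_mentioning(emails, words):
    """Return emails whose text contains every given word (whole word, any case)."""
    patterns = [
        re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        for word in words
    ]
    return [e for e in emails if all(p.search(e["TextBody"]) for p in patterns)]


# Invented sample rows, standing in for the cleaned email dataset.
sample = [
    {"Id": 1, "TextBody": "Article on Germany and the euro crisis"},
    {"Id": 2, "TextBody": "A long piece mentioning Germany and a hardcore faction"},
    {"Id": 3, "TextBody": "Schedule for tomorrow"},
]

hits = emails_mentioning(sample, ["germany", "hardcore"])
```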
Some WWII references come from an article about a Republican Neo-Nazi candidate. Others come from a long piece written by Max Blumenthal about conservatism in the United States. The confusion came from the Nazi references used by the extreme conservative political wing in the United States.
By browsing through the emails, I noticed that, interestingly, a lot of them were used to share news articles or to deal with public opinion. It seems that nothing of major interest has been disclosed in these emails, but they give us a glimpse of how a United States Secretary of State works.
Stay tuned for our next data adventure and don't hesitate to email me if you want the Python code in the recipes above.