World Geopolitics Through Hillary Clinton's Emails

Use Cases & Projects, Dataiku Product Hanna Julienne

Today's subject is natural language processing with a dataset of choice: Hillary Clinton's emails. These emails are a hot topic in the current Democratic primaries. In this blog post, we will use data rather than opinions to analyze this controversy.

You all know that the race for the U.S. presidency has already begun. While the Republican primaries are a bit messy with no clear winner at this point, not so long ago, it seemed the Democrats already had a definite winner.

However, a few months ago, a major controversy arose around Hillary Clinton's emails. While she was Secretary of State, Hillary made the mistake of using a personal email address for work instead of the one provided by the federal government. Some say this could have resulted in the hacking and disclosure of classified information, and therefore posed a potential threat to national security.

A similar scandal is now affecting the CIA Chief but, in this case, the hacking did happen and some private information was disclosed on WikiLeaks.

To cut the controversy short, Hillary Clinton decided to hand over 30,000 emails from her private server, covering her time at the State Department, that she believed belonged in the public domain. That's how Kaggle made some of these emails available in this nice dataset.

During her brilliant performance at the first Democratic presidential debate, one of her opponents, Bernie Sanders, declared that "American people are sick and tired of hearing about your damn emails". Hillary couldn't agree more.

However, from a data scientist's point of view, such a dataset is just too tempting. I couldn't dream of a better corpus to apply some Natural Language Processing tricks than a bunch of emails from the former U.S. Secretary of State. In what you are about to read, I will share some insights I gained by analyzing this dataset and I’ll attempt to answer the question most Americans are asking themselves — “Did those emails contain information that could have been dangerous if leaked?” — by applying efficient natural language processing tools to this corpus of nearly 8,000 emails.

Here is the flow we will go through today:

[Figure: the flow in Dataiku DSS]

Kaggle completed part of the data munging for us by extracting the emails from raw PDFs and gathering them in .csv format. Of course, I will use this as the starting point for this blog post. You can visit the Kaggle data page for detailed explanations of the datasets and their features. In this short study, I decided to use the Email and Person datasets.

Joining, Cleaning, and First Insights

Did you know that there are several ways to join two datasets in Dataiku DSS? In my case, I needed to clean the text retrieved from the emails and assign each email a sender with a normalized name. Hence, I used a shaker, which enabled me to clean the TextBody feature of the Email dataset and join the Person dataset on the Sender ID in one recipe:

[Screenshot: the recipe that cleans TextBody and joins the Person dataset]
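For readers without DSS at hand, the same clean-and-join step can be sketched in plain Python. This is only an illustration, not the recipe used in this post: the column names (`Id`, `SenderPersonId`, `TextBody`, `Name`) and the toy rows are assumptions, and the cleaning shown here (collapsing whitespace) is just one plausible choice.

```python
import re

# Toy rows standing in for the Email and Person datasets.
# Column names and values are made up for illustration.
emails = [
    {"Id": 1, "SenderPersonId": 10, "TextBody": "FYI\n\n  see   attached.\r\n"},
    {"Id": 2, "SenderPersonId": 20, "TextBody": "Call me   when you land"},
    {"Id": 3, "SenderPersonId": None, "TextBody": ""},
]
persons = [
    {"Id": 10, "Name": "Sender A"},
    {"Id": 20, "Name": "Sender B"},
]


def clean_text(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text or "").strip()


def join_sender(emails, persons):
    """Left join: attach a normalized sender name to every email."""
    name_by_id = {p["Id"]: p["Name"] for p in persons}
    joined = []
    for email in emails:
        row = dict(email)  # copy so the input rows are left untouched
        row["TextBody"] = clean_text(email["TextBody"])
        # Emails with an unknown sender keep a None name instead of being dropped.
        row["SenderName"] = name_by_id.get(email["SenderPersonId"])
        joined.append(row)
    return joined


joined = join_sender(emails, persons)
```

A left join is used so that emails whose sender could not be identified still remain in the corpus for the text analysis.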

Dataiku DSS 2.1 now lets us visualize pivot tables in the chart engine. Let's look at the top email senders:

[Chart: top email senders]
You are looking at the list of closest Hillary collaborators!
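Outside the chart engine, the same ranking is a simple count per sender. The sketch below assumes a list of sender names, one per email, as produced by a join like the one above; the names are placeholders, not values from the dataset.

```python
from collections import Counter

# One entry per email; placeholder names, not real data.
sender_names = ["Sender A", "Sender B", "Sender A", "Sender C", "Sender A", "Sender B", None]


def top_senders(names, n=10):
    """Return the n most frequent senders as (name, email_count) pairs."""
    # Skip emails whose sender is unknown (None or empty string).
    return Counter(name for name in names if name).most_common(n)


ranking = top_senders(sender_names, n=2)
```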

Retrieve Information on States

The U.S. Secretary of State mainly deals with foreign policy. Hence, it would be interesting to tag emails by the countries they are talking about. I did so in a Python recipe. The main ingredients in this recipe were:
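The recipe itself is not reproduced here. As a minimal sketch of what such tagging could look like, assuming a plain list of country names and case-insensitive whole-word matching (both are assumptions of this example, and `COUNTRIES` and `tag_countries` are hypothetical names):

```python
import re

# Hypothetical short list; a real run would need a full list of countries.
COUNTRIES = ["Haiti", "Afghanistan", "Germany"]

# One compiled pattern per country, matching the name as a whole word.
PATTERNS = {
    country: re.compile(r"\b" + re.escape(country) + r"\b", re.IGNORECASE)
    for country in COUNTRIES
}


def tag_countries(text):
    """Return the countries mentioned in text, in the order of COUNTRIES."""
    return [country for country, pattern in PATTERNS.items() if pattern.search(text)]


tags = tag_countries("Relief flights to HAITI leave from Germany tonight")
```

A sketch this simple misses adjectives ("German", "Afghan"), capitals, and abbreviations, so a real tagger would probably need a richer lookup table.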

Once done tagging emails by country, I implemented a second Python recipe that computes the mutual information (using scikit-learn) between a word presence vector and the country presence vector. In plain English, I computed a score between word and country: the higher the score, the more a word appears in the same emails as a country. To explore the results, I created a web app with a country selector that displays a word cloud of relevant words for that country. The size of the word is a function of the mutual information between that word and the selected country.
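To make the score concrete, here is a small sketch of the mutual information between two presence vectors, written with the standard library only so the formula is visible. It is an illustration rather than the recipe used in this post; as far as I know, scikit-learn's `mutual_info_score` computes the same quantity (in nats) for two label vectors. The function names and the toy emails are assumptions of this example.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information, in nats, between two equal-length label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), expressed with raw counts
        mi += (count / n) * log(count * n / (px[x] * py[y]))
    return mi


def rank_words(token_sets, country_flags, vocabulary):
    """Score each word against one country's presence vector, best first."""
    scores = {}
    for word in vocabulary:
        presence = [int(word in tokens) for tokens in token_sets]
        scores[word] = mutual_information(presence, country_flags)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy corpus: each email is reduced to its set of tokens (made-up data).
token_sets = [
    {"earthquake", "relief"},
    {"earthquake", "aid"},
    {"meeting", "aid"},
    {"meeting", "schedule"},
]
mentions_country = [1, 1, 0, 0]  # 1 if the email was tagged with the country

ranking = rank_words(token_sets, mentions_country, ["earthquake", "aid", "meeting"])
```

One caveat worth keeping in mind: mutual information is symmetric in presence and absence, so a word that systematically avoids a country ("meeting" in the toy data) scores as high as one that always accompanies it. Filtering on positive co-occurrence before drawing the word cloud would avoid that.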

Here are a few examples:

[Word cloud: Haiti]
Haiti mainly retrieves words about the tragic January 12, 2010 earthquake and the humanitarian crisis that followed.

[Word cloud: Afghanistan]
Afghanistan words mainly deal with U.S. military presence in Afghanistan and terrorism.

The Germany word cloud held a few surprises. We retrieve expected words like "euro", but also a lot of intriguing ones. There are many references to Nazi Germany that don't make much sense in the 2010s. We also noticed some absolutely ludicrous words like "hardcore", "rite" and "pot" appearing in reference to Germany.

Let's look closer by retrieving the emails containing those words. I opened an IPython notebook and queried the dataset to explore the emails containing them:

[Screenshot: notebook query results]
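A query of this kind can be sketched as a filter over the tagged emails. The example below is a stand-in for the notebook session, not a copy of it: the row layout (`countries`, `TextBody`), the helper name `emails_mentioning`, and the toy rows are all assumptions.

```python
import re


def emails_mentioning(rows, country, words):
    """Keep emails tagged with country whose body contains any of words."""
    # Whole-word, case-insensitive match on any of the suspicious words.
    word_re = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    return [
        row for row in rows
        if country in row["countries"] and word_re.search(row["TextBody"])
    ]


# Made-up rows for illustration only.
rows = [
    {"Id": 1, "countries": ["Germany"], "TextBody": "The euro summit in Berlin"},
    {"Id": 2, "countries": ["Germany"], "TextBody": "A hardcore faction of the party"},
    {"Id": 3, "countries": ["Haiti"], "TextBody": "Hardcore logistics problems"},
    {"Id": 4, "countries": ["Germany"], "TextBody": "A spotlight on trade"},
]

hits = emails_mentioning(rows, "Germany", ["hardcore", "rite", "pot"])
hit_ids = [row["Id"] for row in hits]
```

Matching on whole words matters here: without the word boundaries, "pot" would also match "spotlight".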

Some of the WWII references come from an article about a Republican neo-Nazi candidate. Others come from a long piece written by Max Blumenthal about conservatism in the U.S. The confusion came from the Nazi references used by the extreme conservative political wing in the U.S.

By browsing through the emails, I noticed that, interestingly, a lot of them were used to share news articles or to deal with public opinion. It seems that nothing of major interest has been disclosed in these emails, but they give us a glimpse of how a U.S. Secretary of State works.
