One very important use case for predictive analytics and machine learning is security, both in the virtual and in the physical sense. Every day, organizations across the world are analyzing massive amounts of data from different sources to keep us safe.
We've talked about big data and security before - see our recent post on London crime. But this guest blog post from one of our partners illustrates yet another example of how predictive analytics can augment security operations by delving into Twitter data and asking the question: how much can we glean about people from their Tweets? Specifically, is it possible to predict, based only on the content of a Tweet, whether the author has connections to the Islamic State?
About the author: Isabela Castilho is a data scientist at Marionete, a Dataiku partner and UK-based consulting company specializing in the disruptive technologies of data science, big data and DevOps.
Isabela has a background in aerospace engineering, and her experience ranges from modeling rocket combustion to building recommendation systems and predicting mechanical failure in aircraft components with machine learning. She is passionate about modeling behavior, whether it’s human or machine behavior!
Detecting ISIS Members Based on Tweets
Here are a few real Tweets from different people talking about the Islamic State. Can you deduce who has proven connections to ISIS and who is just casually talking about a very current subject? Give it a try:
Aslm Please share our new account after the previous one was suspended. @[omitted]
ISIS influence on the decline as terrorists lose Twitter battles - CNET http://www.cnet.com/news/isis(...)
RT @TIME: ISIS has lost a quarter of its territory in the past 18 months, report says http://time.com/4400413/isis(...)
Media is an integral part in the functioning of a proper state. Islamic State has been doing exceptionally well.
Only the first and last Tweets were written by someone with known connections to ISIS. Did you get it right?
It shouldn't be too difficult to figure out - you wonder why the first user had his account suspended and why the last one is so excited about the success of the Islamic State. By contrast, the other Tweets focus on the failures of ISIS, which is something a person with connections to the organization would be unlikely to advertise.
But would a machine be able to see this?
I was curious to find out how far I could go with machine learning on detecting ISIS Tweets without spending too much time on it, so I used Dataiku Data Science Studio (DSS) to quickly clean my data and move on to the machine learning part. I used two data sets from Kaggle: one contained Tweets that mentioned ISIS, and the other contained Tweets from users with known connections to ISIS.
A challenge I decided to add was to only use text features, leaving any implicit personal characteristics (like the location of the Tweet) out of the analysis. As humans, we might consider things like the location or images associated with Tweets when making our decisions, and that is perfectly normal because we always try to collect as much information as possible when making any decision. But later, we might discard some of this information and end up basing our decision only on one factor, like the body of the Tweet. Algorithms don't have this way of thinking, so I wanted to see what results I could get by just sticking to the body of the Tweet itself.
Given the challenge I'd set for myself of using only text features, I trained the algorithm based only on the Tweet body and the hashtags used.
The body of the Tweet required the traditional text processing steps, such as removing stop words and normalizing the text, and in this case I decided to use TF-IDF vectorization. I also removed any mentions of other accounts so that the algorithm couldn't use what it knows about another account in the predictions it's trying to make.
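To make the cleaning steps concrete, here is a minimal sketch in plain Python. It is not the DSS recipe used for this project: the tiny stop-word list, the regular expressions, and the `clean_tweet` helper are all assumptions made up for illustration.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "to", "and", "on",
              "after", "was", "our"}

MENTION_RE = re.compile(r"@\w+")        # matches @username mentions
TOKEN_RE = re.compile(r"[a-z0-9#']+")   # keeps hashtags such as #news


def clean_tweet(text: str) -> List[str]:
    """Drop @mentions, lowercase, tokenize, and remove stop words."""
    without_mentions = MENTION_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(without_mentions.lower())
    return [t for t in tokens if t not in STOP_WORDS]


example = "Please share our new account after the previous one was suspended. @someone"
print(clean_tweet(example))
# ['please', 'share', 'new', 'account', 'previous', 'one', 'suspended']
```

Note that the mention itself disappears, but the words around it ("share", "account") survive and remain available to the model.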
Did it work?
In this case: yes.
My final model was a logistic regression using 1-grams, 2-grams, and 3-grams, and I obtained 2 percent false positives and 2 percent false negatives in detecting Tweets from ISIS members. This is very good, especially considering that I was evaluating individual Tweets rather than users as a whole.
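For readers who want to try a similar setup outside DSS, the sketch below uses scikit-learn as a stand-in. The handful of Tweets and labels are invented toy data, and the per-class rates it computes are only one possible reading of the percentages quoted above. It scores the training data purely to stay short; a real evaluation needs a held-out test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline

# Invented toy data: 1 = author with known connections, 0 = everyone else.
tweets = [
    "please share the new account after the suspension",
    "check out this account for video footage",
    "new photo and video from the soldiers",
    "news report isis loses territory in mosul http",
    "daesh loses another city says report http",
    "isil attack during ramadan condemned http",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF over 1-, 2-, and 3-grams feeding a logistic regression.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(tweets, labels)

# Scored on the training data only to keep the sketch short.
predicted = model.predict(tweets)
tn, fp, fn, tp = confusion_matrix(labels, predicted, labels=[0, 1]).ravel()
false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (fn + tp)
print(f"FP rate: {false_positive_rate:.2f}, FN rate: {false_negative_rate:.2f}")
```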
What Did the Algorithm See in ISIS Members' Tweets?
First, let's take a look at what the algorithm didn't see:
These words were mainly linked to Tweets from users with no connections to ISIS:
- Some of the words (like ISIS, daesh, and ISIL) are used mainly in the media to talk about the organization.
- We also see Mosul, a city that was controlled by ISIS and where there is an ongoing battle between ISIS and Iraqi forces; each ISIS loss in this city is heavily covered in the media.
- HTTP shows up mostly when people share a news article.
- Ramadan was also a very popular word that was especially used to talk about ISIS attacks during that time. These Tweets specifically focused on the fact that it's supposed to be a time of peace and reflection.
So if these words didn't help identify people with known ISIS connections, then what gave them away?
- Words like photo, video, and footage form a trend, possibly illustrating an intention to spread proof and details of shocking and gruesome attacks.
- Another trend is mentions of other accounts by ISIS members. Although, as you'll remember, I removed @mentions from the analysis, the algorithm still picked up on Tweets that said things like “check out this account.” Even though the actual mention of the account was removed, the fact that someone is talking about another account and mentioning ISIS ended up being a strong indicator of an ISIS member.
- Tweets mentioning "account suspended" give us a hint that the user has been behaving strangely on Twitter.
- There are also words associated with strength, like army and soldiers, and Tweets about the USA (something usually used as an example of what ISIS stands against).
- Lastly, there are references to Aleppo, which is a city in Syria where confrontations were very often tainted by attacks on civilians and where the international community was, for a long time, unable to peacefully resolve conflict.
What Can We Take Away From This?
I was very pleased to fulfill my original challenge of basing my algorithm only on text features, without using any indicator of personal characteristics and without sacrificing performance.
Additionally, not only was I able to detect ISIS members' Tweets with a high degree of accuracy, but the project also took me only one afternoon. Thanks to Dataiku's data preparation features, I was able to move very quickly from cleaning the data to casually satisfying my curiosity. But in the end, I also created something that I could easily turn into a bigger project.