Predicting the number and even the type of crimes being committed in the Greater London area each month is no easy task, but here’s how I cracked it, with Dataiku Data Science Studio (DSS).
This blog post was updated in February 2017 to include all 2016 data and make predictions for 2017.
an introduction to using data to predict crime
In 2014, London police started trialing software designed by Accenture to identify gang members who were likely to commit violent crimes or reoffend. This kicked off an unprecedented study drawing on five years of data, including previous crime rates and social media activity. Using big data to fight crime is clearly not entirely novel, but I wanted to take it further, especially with all the open data about crime that’s out there.
The Greater London police forces (the Metropolitan Police and the City of London Police) are doing a great job at fighting crime, providing data, and mapping the results, but what’s interesting is to try to make predictions rather than just looking at past data.
We might already have an idea of who is likely to commit a crime, but how many crimes would this result in, and what would be the nature of these crimes? This was the kind of information I hoped to predict, so I tried two different predictive models: crime month-by-month at the LSOA level and crime type (whether burglary, bicycle theft, arson, etc.) month-by-month at the LSOA level.
About LSOAs: an LSOA (Lower Layer Super Output Area) is a census area containing 1,000 to 3,000 people. Here’s the full definition from ONS.
A step-by-step guide to how I built the crime-prediction project
Inputting and enriching the data
So, where to begin? I sourced the data from the open data crime database on the UK police portal, selecting data from 2011 to 2016 pertaining to Greater London (central London and the surrounding metropolitan area).
The main source of data I used is available here - I selected the metropolitan and London areas. I also used UK census information, Points of Interest (POIs), and the geographical locations of police stations.
I enriched the dataset with various open data sources: I added the police station coordinates and postcodes, along with the POIs and the LSOA statistics.
Building the dataset
To prepare the dataset for training the machine learning models, I created a geohash based on latitude and longitude coordinates. I also cleaned the data (recoding values, filling empty cells, and restructuring), which is super simple in Dataiku DSS.
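To give an idea of what the geohashing step does, here is a minimal pure-Python sketch of the standard base32 geohash algorithm (this is an illustration of the technique, not the code of the DSS plugin mentioned below):

```python
# Standard geohash base32 alphabet (note: no "a", "i", "l", or "o")
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=6):
    """Encode a latitude/longitude pair into a geohash string.

    Bits alternate between longitude and latitude, halving the
    candidate interval each time; every 5 bits become one character.
    """
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    result, even = "", True
    bit_count, ch = 0, 0
    while len(result) < precision:
        if even:  # longitude bit
            mid = (lon_range[0] + lon_range[1]) / 2
            if lon >= mid:
                ch = (ch << 1) | 1
                lon_range[0] = mid
            else:
                ch = ch << 1
                lon_range[1] = mid
        else:     # latitude bit
            mid = (lat_range[0] + lat_range[1]) / 2
            if lat >= mid:
                ch = (ch << 1) | 1
                lat_range[0] = mid
            else:
                ch = ch << 1
                lat_range[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:
            result += BASE32[ch]
            bit_count, ch = 0, 0
    return result

# Central London falls in the "gcp..." geohash cells
print(geohash(51.5074, -0.1278, 6))
```

Nearby points share a geohash prefix, which is what makes the hash usable as a categorical feature for an area.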
I then clustered the LSOAs in order to define criminality profiles and their levels - three clusters and one outlier group were found. The different datasets could then be joined.
Building the predictive models
I built two models, the first for prediction per LSOA per month and the second for prediction per LSOA per month per crime type.
Blending the data
I collected the POIs, cleaned the data, created a geohash for each latitude/longitude coordinate, and then loaded it all into an HPE Vertica database. Then I was ready to collect the crimes from 2011 to 2016 and clean that data.
Here is an overview of the first data preparation step:
I developed a geohashing plugin for transforming the XY coordinates into categorical values. If you are not familiar with DSS plugins, you can find out more here - plugins are super useful for packaging a methodology and adding new functions to Dataiku DSS.
Let’s have a first look at the volume of crime data we collected. For this, I created a chart of the number of crimes by year with Dataiku DSS:
I decided to work with crime data from 2012 to 2015 and then predict for 2016. The second step was to predict the number of crimes in 2017 based on the 2016 model. The first pleasant surprise was seeing the number of crimes decreasing. But I was less surprised by the re-categorization of crimes. This is often the case in other industries when, for operational reasons, a category is split or merged. I decided to regroup some crimes to give the approach more stability:
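Such a regrouping can be expressed as a simple mapping. The sketch below uses a hypothetical grouping (the exact categories I merged may differ), with real category labels from the UK police open data:

```python
# Hypothetical regrouping map to stabilise categories that were
# split or merged over the years (illustrative, not my exact grouping)
REGROUP = {
    "Bicycle theft": "Theft",
    "Other theft": "Theft",
    "Theft from the person": "Theft",
    "Violent crime": "Violence",
    "Violence and sexual offences": "Violence",
}

def regroup(category):
    # Categories not listed keep their original label
    return REGROUP.get(category, category)

print(regroup("Bicycle theft"))  # Theft
print(regroup("Burglary"))       # Burglary
```

Applying this mapping before aggregation makes the monthly counts comparable across years, even when the police re-categorized offences.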
Regarding the POIs, let’s focus on the POI "subcategory," which is consistent and detailed enough for our predictive model:
Here’s a sample of the police stations that I collected:
I then collected the census statistics for each LSOA, added the police station directory, defined a geohash, computed the number of police stations per LSOA, and added restaurant counts from the POIs collected earlier.
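The stations-per-LSOA step is essentially a group-and-count after a spatial join. A minimal sketch with hypothetical station names and LSOA codes:

```python
from collections import Counter

# Hypothetical station-to-LSOA assignments; in practice each station is
# matched to its LSOA via the geohash / spatial join described above
stations = [
    {"name": "Charing Cross", "lsoa": "E01004736"},
    {"name": "West End Central", "lsoa": "E01004736"},
    {"name": "Bishopsgate", "lsoa": "E01000001"},
]

# Count stations per LSOA to use as a model feature
stations_per_lsoa = Counter(s["lsoa"] for s in stations)
print(stations_per_lsoa["E01004736"])  # 2
```

The same pattern (assign each point a zone, then count per zone) covers the restaurant-count feature as well.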
Next, I created an LSOA clustering model, clustering the LSOAs by crime level and type. In Dataiku DSS, you can set up and assess different algorithms to find the best-performing one; I selected a KMeans model with the best silhouette score.
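Outside of DSS, the same model selection can be reproduced with scikit-learn: fit KMeans for several values of k and keep the one with the highest silhouette score. The features below are synthetic stand-ins for per-LSOA crime profiles:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for LSOA crime-profile features:
# three well-separated groups of 50 areas, 3 features each
X = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(50, 3))
    for c in ([0, 0, 0], [5, 5, 5], [10, 0, 10])
])

# Try several cluster counts and keep the best silhouette score
best_k, best_score = None, -1.0
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

print(best_k)  # 3 clusters on this toy data
```

The silhouette score rewards tight, well-separated clusters, so it peaks at the natural number of groups in the data.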
Iterating the predictive models
I created three predictive models with the visual-guided machine learning of DSS.
- Number of crimes at the LSOA level for each month
- Number of crimes per LSOA and type of crimes for each month
- Number of crimes per LSOA per type of crimes
Here are the results for the crime prediction at the LSOA level for each month - contact me if you want the other models! Given the key metric R2, I decided not to tweak the parameters or re-engineer the features with custom code.
Here is the debriefing:
Let’s look at the variable importance:
The most contributive feature was, unsurprisingly, the LSOA area size (but this is a good feature). Then come the average numbers of crimes over the past four years, the average distance to the city centre, and so on. If we look at the relationships in more detail, there is generally a higher level of crime in the centre, most probably due to the concentration of shops, tourist attractions, pubs, etc. But that’s not the only explanatory feature.
Dataiku DSS provides useful debriefing reports. Our error distribution is pretty good for a first model! Keep in mind that the model slightly underestimates the number of crimes (the distribution leans to the left).
Metrics related to the test sample (20%).
The Pearson coefficient gives the global correlation between the observed crimes in 2016 and the predictions.
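As a reminder of what that metric measures, here is the Pearson coefficient computed on a handful of hypothetical observed/predicted monthly counts (the numbers are illustrative, not my actual test data):

```python
import numpy as np

# Hypothetical monthly crime counts for one LSOA
observed = np.array([19, 22, 17, 25, 30])
predicted = np.array([19.24, 21.1, 18.0, 24.5, 28.7])

# Pearson correlation between observations and predictions:
# 1.0 would mean a perfect linear relationship
r = np.corrcoef(observed, predicted)[0, 1]
print(round(r, 3))
```

A coefficient close to 1 means the predictions rank and scale the areas almost exactly like the observed counts.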
Deploying the predictive models
I applied the predictive models to the 2016 data. For example, where 19 crimes were observed for the E01000001 area in January 2016, the model predicted 19.24.
Here, we have the real data to the far left and the predicted data to the far right:
Here is an overview of the two final output datasets:
the results of my crime prediction project
I was expecting a direct relationship between the number of shops, restaurants, tourist attractions, etc., and the number of crimes committed, so I wasn’t surprised to see that the density of POIs correlated in part with a higher number of crimes.
Here’s a data visualization that offers a clearer explanation of the link between POIs and the number of crimes:
So what was I surprised by? Well, creating well-performing predictive models with a basic dataset proved remarkably straightforward and didn’t require a huge amount of work! What’s more, I’ve shown that anyone can use open data sources to build predictive models for valuable causes in just a few hours, provided they have the right tools. You don’t have to be an expert to do this!
It’s time to apply our model to the data from 2013 to 2016 and discover the predictions. Keep in mind that this model is mainly built on trends and POIs. As a result, I don’t have control over the predictions; in other words, if something significantly changes in police strategy, the law, area boundaries, etc., the model could be really wrong.
For the LSOA E01000001, the crime count was 19 in January 2016 and is expected to be 20 in January 2017. Remember that the model slightly underestimates the number of crimes: in this area in 2016, 268 crimes were recorded and we forecast 248.
Visual overview of the predictions
Here's the number of crimes predicted for 2017 per LSOA centroid:
No surprise that this density map highlights the city centre. Bubble size represents the volume of crimes and the color is the % change 2017 vs 2016. There is no specific increase for the area with the highest number of crimes.
Here's a reminder of the initial map with 2016 January to August data on LSOA Polygons:
As an LSOA is an ONS statistical unit, it is never easy to figure out the correspondence between an LSOA and a real neighbourhood. Let’s focus on the LSOA with the highest level of crime. By setting a high transparency on the map, we can see that the highest number of crimes is in an area crossing Soho, Leicester Square, and Charing Cross.
This is the initial map with the 2016 January to August data:
Let’s draw the crime prediction distribution by LSOA name in Dataiku DSS: The "Westminster 018A" is the LSOA with the largest number of crimes. This LSOA is roughly crossing Leicester Square and Charing Cross (on the initial map with 2016 January to August data).
Finally, let’s map the residuals’ standard deviation (a residual is the difference between the predicted and actual values). This gives us an idea of the model fit across the geographical dimension and can help improve our model quality. Colours show the residual values for 2016 and size shows the absolute value.
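The per-area quantities behind this map are simple to compute. A sketch with hypothetical monthly residuals for one LSOA:

```python
import statistics

# Hypothetical monthly residuals (actual - predicted) for one LSOA in 2016
residuals = [1.2, -0.8, 0.5, 2.1, -1.4, 0.9]

# Spread of the residuals: high values flag areas the model fits poorly
spread = statistics.stdev(residuals)

# Absolute mean residual, used here as the bubble size on the map
bubble_size = abs(statistics.mean(residuals))

print(round(spread, 2), round(bubble_size, 2))
```

Repeating this per LSOA and plotting spread as colour and the absolute residual as size reproduces the map described above.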
We can see that our model has a bigger deviation in specific areas, especially when approaching the city centre, probably because we mainly take trends into account and not enough event data.
To improve the approach, we would need to build a predictive model at a finer geographic level and over a shorter time frame. Adding some news feeds would also enable us to detect results related to certain trends. We could also improve performance by trying other kinds of algorithms and machine learning approaches.
If you found this interesting, feel free to install the latest version of Dataiku DSS below so you can start straight away with building your own predictive models! And be sure to check out the interactive map I created with Esri ArcGIS Online.
Get in touch with me if you have any questions, comments, or suggestions.
Edited by Cleo Pollard.