Predicting the number and even the type of crimes committed in the Greater London area each month is no easy task, but here’s how I cracked it with Dataiku Data Science Studio (DSS).
An introduction to using data to predict crime
In 2014, London police started trialling software designed by Accenture to identify gang members who were likely to commit violent crimes or reoffend. It began an unprecedented study drawing on five years of data, including previous crime rates and social media activity. Using big data to fight crime is clearly not entirely novel, but I wanted to take this further, especially given all the open data about crime that’s out there.
The Greater London police forces (the Metropolitan Police and the City of London Police) are doing a great job of fighting crime, providing data, and mapping the results, but what’s interesting is trying to make predictions rather than just looking back at past data.
We might already have an idea of who is likely to commit a crime, but how many crimes would this result in, and what would be the nature of these crimes? This was the kind of information I hoped to predict, so I tried two different predictive models: total crime month-by-month at the LSOA level, and crime by type (burglary, bicycle theft, arson, etc.) month-by-month at the LSOA level.
About LSOAs: a Lower Layer Super Output Area (LSOA) is a census area containing 1,000 to 3,000 people. Here’s the full definition from the ONS.
A step-by-step guide to how I built the crime-prediction project
Inputting and enriching the data
So, where to begin? I sourced the data from the open crime database on the UK police data portal, selecting data from 2011 to 2016 pertaining to Greater London (central London and the surrounding metropolitan area).
The main source of data I used is available here - I selected the Metropolitan and City of London areas. I also used UK census information, Points of Interest (POIs), and the geographical locations of police stations.
I enriched the dataset with various open data sources, adding the police station coordinates and postcodes, and imported the POIs and the LSOA statistics.
Building the dataset
To prepare the dataset for training the machine learning models, I created a geohash based on latitude and longitude coordinates. I then cleaned the data (recoding values, filling empty cells, and restructuring), which is super simple in Dataiku DSS.
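To make the geohashing step concrete, here is a minimal, self-contained encoder; the project itself used a DSS plugin for this (described later in the post), so this is only an illustration of the idea, with central London coordinates as an example.

```python
# Minimal geohash encoder, shown only to illustrate what the geohashing
# step does; the actual project used a custom Dataiku DSS plugin.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=7):
    """Interleave longitude/latitude bisection bits into a base-32 string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    even = True          # a geohash starts with a longitude bit
    bit_count, ch = 0, 0
    while len(chars) < precision:
        rng = lon_range if even else lat_range
        val = lon if even else lat
        mid = (rng[0] + rng[1]) / 2
        if val > mid:
            ch = (ch << 1) | 1
            rng[0] = mid
        else:
            ch = ch << 1
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:  # 5 bits per base-32 character
            chars.append(BASE32[ch])
            bit_count, ch = 0, 0
    return "".join(chars)

# Central London (Trafalgar Square area) falls in the "gcp..." cell.
print(geohash_encode(51.5074, -0.1278))
```

Because nearby points share a common prefix, the resulting strings work well as categorical features at a chosen precision.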
I then clustered the LSOAs to define criminality profiles and their levels; three clusters and one outlier group were found. The different datasets could then be joined.
Building the predictive models
I built two models, the first for prediction per LSOA per month and the second for prediction per LSOA per month per crime type.
Blending the data
I collected the POIs, cleaned the data, created a geohash for each latitude/longitude coordinate, and then loaded it into an HPE Vertica database. I was then ready to collect the crime records from 2011 to 2016 and clean that data.
Here is an overview of the first data preparation step:
I developed a geohashing plugin for transforming latitude/longitude coordinates into categorical values. If you are not familiar with DSS plugins, you can find out more here - plugins are super useful for packaging a methodology and adding new functions to Dataiku DSS.
Let’s have a first look at the volume of crime data we collected. For this, I created an SQL notebook with Dataiku DSS:
I decided to work with crime data from 2012 to 2015 for training and then predict for 2016. The first pleasant surprise was seeing the number of crimes decreasing over time.
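Outside of DSS, the same yearly sanity check can be sketched in pandas (the real query ran in an SQL notebook against Vertica; the column names and sample rows below are hypothetical):

```python
import pandas as pd

# Hypothetical extract of the crimes table; the real data has one row
# per recorded crime with a "YYYY-MM" month field, as on the police portal.
crimes = pd.DataFrame({
    "month": ["2012-01", "2012-02", "2013-01", "2014-01", "2015-01"],
    "crime_type": ["Burglary", "Bicycle theft", "Burglary", "Arson", "Burglary"],
})

# Derive the year and count crimes per year to eyeball the trend.
crimes["year"] = crimes["month"].str[:4]
per_year = crimes.groupby("year").size()
print(per_year)
```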
Regarding the POIs, let’s focus on the POI "subcategory" field, which is consistent and detailed enough for our predictive model:
Here’s a sample of the police stations that I collected:
I then collected the census statistics for each LSOA, added the police station directory, defined a geohash, computed the number of police stations per LSOA, and joined in the restaurant POIs collected earlier.
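As a sketch of the station-count step, here is how the number of police stations per LSOA could be computed and joined onto the census table with pandas. All column names and sample rows here are hypothetical, not the project's actual schema (E01000001 is a real LSOA code mentioned later in the post; the populations are illustrative).

```python
import pandas as pd

# Hypothetical sample: police stations already geocoded to an LSOA code.
stations = pd.DataFrame({
    "station_name": ["Station A", "Station B", "Station C"],
    "lsoa_code": ["E01004736", "E01004736", "E01000001"],
})

# Count stations per LSOA.
station_counts = (stations.groupby("lsoa_code")
                          .size()
                          .rename("n_police_stations")
                          .reset_index())

# Hypothetical LSOA-level census table.
census = pd.DataFrame({
    "lsoa_code": ["E01000001", "E01004736", "E01000002"],
    "population": [1465, 2098, 1436],
})

# Left join so LSOAs without a station keep a count of zero.
enriched = census.merge(station_counts, on="lsoa_code", how="left")
enriched["n_police_stations"] = (enriched["n_police_stations"]
                                 .fillna(0).astype(int))
```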
Next, I created an LSOA clustering model, clustering the LSOAs by crime level and type. In Dataiku DSS, you can set up and assess different algorithms to find the best-performing one; I selected k-means.
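The clustering step could be sketched like this with scikit-learn's KMeans. The post doesn't detail the features used, so the per-LSOA crime-type counts below are synthetic, generated only to show the mechanics of profiling areas by crime level and mix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic per-LSOA features: average monthly counts for three crime
# types, drawn from low-, mid-, and high-crime profiles (100 LSOAs each).
low = rng.poisson(lam=[2, 1, 1], size=(100, 3))
mid = rng.poisson(lam=[8, 4, 3], size=(100, 3))
high = rng.poisson(lam=[25, 12, 9], size=(100, 3))
X = np.vstack([low, mid, high]).astype(float)

# Standardise so no single crime type dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Three clusters, matching the criminality profiles found in the post.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)
```

The fitted labels can then be joined back onto the LSOA table as a categorical "criminality profile" feature for the downstream models.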
Iterating the predictive models
I created two predictive models with the visual guided machine learning in Dataiku DSS. Since the key metric, R², was already high, I decided not to tweak the parameters or re-engineer the features with custom code.
With the R² metric so high, it was then necessary to check that the model wasn’t overfitting. I checked it against a validation partition, and the performance proved stable enough to keep the model (for the second model, predicting by crime type, the results were less stable, but it shouldn’t be hard to stabilize them).
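The post doesn't name the algorithm behind the DSS models, so here is the overfitting check sketched with a random forest as a stand-in, on synthetic LSOA-month features: compare R² on the training data with R² on a held-out validation partition, and keep the model only if the two stay close.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic features per LSOA-month (e.g. POI density, distance to
# centre, population, station count); target = monthly crime count.
n = 1500
X = rng.uniform(0, 1, size=(n, 4))
y = 30 * X[:, 0] - 20 * X[:, 1] + 10 * X[:, 2] + rng.normal(0, 2, size=n)

# Hold out a validation partition for the overfitting check.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

r2_train = r2_score(y_train, model.predict(X_train))
r2_val = r2_score(y_val, model.predict(X_val))

# A large gap between r2_train and r2_val would signal overfitting.
print(f"train R2: {r2_train:.3f}, validation R2: {r2_val:.3f}")
```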
Let’s look at the variable importance:
The most contributive feature is the average distance to the city centre. Looking at the relationship in more detail, there is generally a higher level of crime in the centre, most probably due to the concentration of shops, tourist attractions, pubs, etc. But that’s not the only explanatory feature.
Dataiku DSS provides useful debriefing reports. Our error distribution is pretty good for a first model!
Deploying the predictive models
I applied the predictive models to the 2016 data and saw that, with both models, the R² metric (on validation data) was high enough to conclude that the predictions were very accurate. For example, where 19 crimes were observed for the E01000001 area in January 2016, the model predicted 17.
Here, we have the real data to the far left and the predicted data to the far right:
Here is an overview of the two final output datasets:
The results of my crime prediction project
I was expecting a direct relationship between the number of shops, restaurants, tourist attractions, etc., and the number of crimes committed, so I wasn’t surprised to see that the density of POIs correlated in part with a higher number of crimes.
Here’s a data visualization that offers a clearer explanation of the link between POIs and the number of crimes:
So what was I surprised by? Well, creating well-performing predictive models with a basic dataset proved remarkably straightforward and didn’t require a huge amount of work! What’s more, I’ve shown that anyone can use open data sources to build predictive models for valuable causes in just a few hours, provided they have the right tools. You don’t have to be an expert to do this!
Visual overview of the predictions
Let’s take the total predicted number of crimes in 2016 (January to August) and put it on a map. Since LSOAs are specific polygons, I decided to draw a map in ArcGIS Online (Dataiku is an Esri Silver Partner):
As the LSOA is an ONS statistical unit, it is never easy to figure out the correspondence between an LSOA and a real neighbourhood. Let’s focus on the LSOA with the highest level of crime. By setting a high transparency on the map, we can see that the highest number of crimes is in an area crossing Soho, Leicester Square, and Charing Cross.
Let’s draw the crime prediction distribution by LSOA name in Dataiku DSS: "Westminster 018A" is the LSOA with the largest number of crimes, roughly spanning Leicester Square and Charing Cross.
Now that we have a better idea of the total number of crimes, let’s create a density map of the total number of crimes per LSOA. No surprise that this density map highlights the city centre:
Finally, let’s map the standard deviation of the residuals (a residual is the difference between the predicted and actual values). This gives us an idea of how the model fits across the geographical dimension and can help improve model quality.
We can see that our model has a bigger deviation in specific areas, especially approaching the city centre, probably because we are mainly capturing trends and not enough event-level data.
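This residual analysis can be sketched in pandas; the scored rows below are hypothetical, not the project's actual outputs:

```python
import pandas as pd

# Hypothetical scored output: actual vs predicted crimes per LSOA-month.
scored = pd.DataFrame({
    "lsoa_code": ["E01000001"] * 3 + ["E01000002"] * 3,
    "actual":    [19, 22, 18, 5, 6, 4],
    "predicted": [17.0, 20.5, 19.0, 5.2, 5.8, 4.3],
})

scored["residual"] = scored["actual"] - scored["predicted"]

# Standard deviation of residuals per LSOA: areas with a high value are
# where the model fits worst, and are worth inspecting on the map.
residual_std = scored.groupby("lsoa_code")["residual"].std()
print(residual_std.sort_values(ascending=False))
```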
To improve the approach, we would also need to create a predictive model at a finer geographical level and over a shorter time frame. Adding news feeds would also enable us to detect results related to certain trends. We could also envisage improving performance by using other kinds of algorithms and machine learning approaches.
If you found this interesting, feel free to install the latest version of Dataiku DSS below so you can start building your own predictive models straight away! And be sure to check out the interactive map I created with Esri ArcGIS Online.
Get in touch with me if you have any questions, comments, or suggestions.