Hi, I am Hanna, a girl data scientist (the rarest unicorn) at Dataiku, and I am going to tell you how much fun I had exploring San Francisco open data with Data Science Studio (DSS) version 2.0!
San Francisco has a very progressive and efficient open data policy. The city provides clean and fascinating datasets on various subjects: SFPD crime incidents, business locations and even film locations. This policy is fruitful and leads to numerous fun and insightful uses of the data.
Among these datasets, the SFPD Crime Incident dataset is particularly fascinating. All incidents that happened from 1/1/2003 up to two weeks ago are reported with the type of incident, date, hour, latitude and longitude. A real treat for a data scientist!
I retrieved the dataset, and uploading it to DSS was as simple as dragging the file into a drop zone. A small preparation script later, I had parsed the date and extracted date components such as day of the week, month and week of the year, and I was ready to explore the dataset graphically. I had hardly worked!
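For readers who want to reproduce that preparation step outside DSS, here is a minimal sketch in plain Python. It is not the DSS preparation script itself. The function name `extract_date_parts` and the input formats (a date like "1/1/2003" and a 24-hour "HH:MM" time) are assumptions made for illustration.

```python
from datetime import datetime

# Fixed list of names so the result does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def extract_date_parts(date_text, time_text):
    """Parse a date and a time, then return the derived components."""
    # Assumed formats: month/day/year dates and 24-hour HH:MM times.
    moment = datetime.strptime(f"{date_text} {time_text}", "%m/%d/%Y %H:%M")
    return {
        "day_of_week": DAY_NAMES[moment.weekday()],  # weekday(): Monday is 0
        "month": moment.month,
        "week_of_year": moment.isocalendar()[1],     # ISO 8601 week number
        "hour": moment.hour,
    }

parts = extract_date_parts("1/1/2003", "18:30")
print(parts)
```

Note that the week number used here is the ISO week, which is only one of several possible conventions.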
Graphical exploration with DSS charts
Before going any further, I had to search for patterns and trends in my dataset. Again, it was really easy with the DSS charts. I just dragged in the features I wanted to explore and made several instructive bar charts in no time.
Notably, I noticed that:
- Crime incidents are clearly not uniformly scattered across the city districts.
- It seems that each district has its "specialty". This suggests that the spatial distribution of crimes depends on the category of the crime.
- Also, crimes follow a very strong pattern across the day. There are far fewer crimes early in the morning, which makes sense: everybody is sleeping. More surprisingly, the most dangerous hour is 6 p.m. Who would have guessed?
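The "specialty" observation boils down to finding the most frequent category within each district. The sketch below shows one way to compute that in plain Python. The rows are made-up sample values, not figures from the real dataset, and `top_category_by_district` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Made-up sample rows for illustration: (district, category).
incidents = [
    ("SOUTHERN", "LARCENY/THEFT"),
    ("SOUTHERN", "LARCENY/THEFT"),
    ("SOUTHERN", "ASSAULT"),
    ("TENDERLOIN", "DRUG/NARCOTIC"),
    ("TENDERLOIN", "DRUG/NARCOTIC"),
    ("TENDERLOIN", "LARCENY/THEFT"),
]

def top_category_by_district(rows):
    """Return the most frequent crime category for each district."""
    counts = defaultdict(Counter)
    for district, category in rows:
        counts[district][category] += 1
    # most_common(1) yields a one-element list of (category, count) pairs.
    return {d: c.most_common(1)[0][0] for d, c in counts.items()}

specialties = top_category_by_district(incidents)
print(specialties)
```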
OK, I now know for sure that crimes in San Francisco follow strong temporal and spatial patterns. I also know that different types of crime don't all follow the same spatial distribution. They may also not happen at the same time.
What could I do to explore these possibilities further? An interactive dashboard, of course!
Making a beautiful interactive dashboard in DSS
I would like to see the spatial and temporal distribution of crimes in San Francisco by category. In other words, I want to display four dimensions on a two-dimensional layout. I will achieve this by programming an interactive dashboard that includes a map, a bar chart of the number of crimes by hour, and a tool to filter by crime category.
I used the neat DSS web app editor:
First, I drew the map using a JS code snippet provided in our web app editor. I added two sliders to select the degree of spatial aggregation and the year, and here is my map:
The redder the area, the higher the number of crimes. I can now explore the spatial distribution of crime in San Francisco.
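The web app's own aggregation code is not shown here, but the idea behind a "degree of spatial aggregation" slider can be sketched as binning points into square grid cells. The following plain Python version is an assumption about how such binning could work, with a hypothetical `grid_counts` helper and made-up coordinates.

```python
import math
from collections import Counter

def grid_counts(points, cell_size_deg):
    """Count (lat, lon) points per square grid cell of the given size in degrees."""
    counts = Counter()
    for lat, lon in points:
        # floor() gives a stable integer cell index, including for negative longitudes.
        cell = (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))
        counts[cell] += 1
    return counts

# Made-up coordinates for illustration only.
points = [(37.7749, -122.4194), (37.7751, -122.4190), (37.7599, -122.5100)]

coarse = grid_counts(points, 0.1)    # large cells: nearby points merge
fine = grid_counts(points, 0.001)    # small cells: points separate
print(len(coarse), len(fine))
```

A larger cell size merges nearby incidents into one cell, which is what moving the aggregation slider would do. Each cell's count can then drive the colour intensity on the map.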
Crime category filter
Let's add the possibility to filter by crime category. I could have added a plain selector. But I am a data scientist, so I told myself, "Hey, wouldn't it be cool if the element used to filter by category also carried information?" Instead of a selector, I used a D3.js icicle to display the proportion of each category and subcategory.
A few tricks later, clicking on the icicle filters by category or subcategory:
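An icicle chart needs a tree of categories and subcategories with a size attached to each node. As a rough sketch of how such a tree could be prepared on the Python side, the code below builds nested dictionaries with each node's share of all incidents. The `name`/`children` layout mirrors what D3 hierarchy layouts commonly consume, but the `share` field, the `build_hierarchy` helper and the sample rows are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_hierarchy(rows):
    """Build a category -> subcategory tree with each node's share of all rows."""
    counts = defaultdict(Counter)
    for category, subcategory in rows:
        counts[category][subcategory] += 1
    total = sum(sum(subs.values()) for subs in counts.values())
    children = []
    for category, subs in sorted(counts.items()):
        children.append({
            "name": category,
            "share": sum(subs.values()) / total,
            "children": [
                {"name": name, "share": n / total}
                for name, n in sorted(subs.items())
            ],
        })
    return {"name": "all crimes", "children": children}

# Made-up sample rows: (category, subcategory).
rows = [
    ("THEFT", "FROM PERSON"),
    ("THEFT", "FROM VEHICLE"),
    ("THEFT", "FROM PERSON"),
    ("VANDALISM", "GRAFFITI"),
]
tree = build_hierarchy(rows)
print(tree)
```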
The time distribution
The last element I added is a bar chart of crimes by hour. When I change the year or filter by a new category, the bar chart is updated accordingly.
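The update logic amounts to filtering the incidents by year and, optionally, by category, then recounting them per hour. Here is a minimal plain Python sketch of that step. The field names `year`, `hour` and `category`, the `hourly_counts` helper and the sample rows are all hypothetical.

```python
from collections import Counter

def hourly_counts(rows, year, category=None):
    """Return a 24-element list of incident counts per hour for the given filters."""
    hours = Counter(
        row["hour"]
        for row in rows
        if row["year"] == year
        and (category is None or row["category"] == category)
    )
    # Include hours with no incidents so the bar chart always has 24 bars.
    return [hours.get(h, 0) for h in range(24)]

# Made-up sample rows for illustration only.
rows = [
    {"year": 2013, "hour": 18, "category": "VANDALISM"},
    {"year": 2013, "hour": 18, "category": "THEFT"},
    {"year": 2013, "hour": 15, "category": "THEFT"},
    {"year": 2012, "hour": 18, "category": "THEFT"},
]

all_2013 = hourly_counts(rows, 2013)
theft_2013 = hourly_counts(rows, 2013, category="THEFT")
print(all_2013[18], theft_2013[18])
```

Calling the function again whenever a slider or the category filter changes gives the new bar heights.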
Let's filter by thefts from a person: we see that they happen mostly downtown and during the afternoon:
Whereas vandalism occurs during the evening and is more scattered across the city:
Fascinating! I am going to spend time exploring the San Francisco crime dataset to search for other patterns. Jealous? Don't be. If you want to try the app, you can download the free DSS community edition and email me to retrieve my project (firstname.lastname@example.org).
In a future post, we will try to predict when crimes will happen in a specified area of San Francisco with DSS. Stay tuned!