My initial intention was to use data from the tax administration as a proxy for wealth, and to build some cool graphs from there. Sounds easy?
Not exactly... For instance:
- The site data.gouv.fr does not provide any API to query and fetch its data
- Data comes in a wide variety of formats (doc, xls, pdf)
- Data is often split into a large number of small files
- Data is not standardized
So the point here is not to criticize the open data initiative (which is great), but just to make sure we are ready for a great data analysis and data preparation party.
The process to build the data set is described here:
You get the idea: grab the data from different sources, clean it up, merge it, create new attributes, visualize. This will sound pretty familiar to anyone who has worked with real-world data.
For this project, I used only open-source or free tools:
- Python as my general purpose scripting language
- Google Refine for some data cleansing tasks
- Tableau Public for the actual visualization
Get the data
Let's start with grabbing the tax data.
The site data.gouv.fr offers 350,000+ files, most of them coming from large datasets split into smaller pieces, say for different geographic levels (see there for a more complete description), which means that you'll probably have a bit of work to get a consolidated view.
In the case of tax data, the search engine returns something like this:
There is one file per "département" and "région" (~county/state), each containing data at city level. That's more than 120 Excel files to download. Downloading each file and consolidating its content manually into one dataset would take too long, so let's have some Python fun instead:
This class simply loops through the search page results, loads their content using BeautifulSoup, looks for the href link of the Excel files, and actually downloads them locally.
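The original class is not reproduced in this chunk, but its core logic can be sketched as follows. This is a minimal illustration, not the exact script: the search-page URL and file layout on data.gouv.fr are assumptions for the example.

```python
# Sketch of the scraping step: find the Excel links on a search-results
# page and download them locally. URLs here are illustrative only.
import os
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def extract_xls_links(html, base_url):
    """Parse one results page and return absolute URLs of linked Excel files."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"])
            for a in soup.find_all("a", href=True)
            if a["href"].lower().endswith((".xls", ".xlsx"))]

def download_all(urls, dest_dir="xls"):
    """Save each workbook locally, keeping its original file name."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in urls:
        name = os.path.basename(urlparse(url).path)
        urlretrieve(url, os.path.join(dest_dir, name))
```

The real script loops over every page of search results and feeds each page's HTML to something like `extract_xls_links` before downloading.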
Extract and consolidate the data
Once downloaded, we are going to extract and consolidate the content of each spreadsheet into a single file. The main problem here is the formatting: it is great for printing or reading spreadsheets, but a nightmare to deal with for analysis.
Merged cells, filters, blank lines, summary and detailed data... The perfect combo, all in one sheet! Well, luckily enough, all the files have the same layout, so let's go Python again:
Please see the comments in the code for more details, but what it basically does is extract the content of each file according to the settings passed to the __init__ method and push it to a tab-separated values file.
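The extraction step can be sketched like this. The header size and the use of the first sheet are assumptions for the example; the real script is driven by the settings passed to its `__init__` method.

```python
# Sketch of the consolidation step: strip each sheet's header block and
# blank lines, then append the remaining rows to one TSV file.
import csv

def clean_rows(rows, header_rows=5, n_cols=None):
    """Drop the header block and blank separator lines from one sheet's rows.
    Cell values are stringified and trimmed, ready for TSV output."""
    out = []
    for row in rows[header_rows:]:
        if not any(str(c).strip() for c in row):
            continue  # skip fully blank lines between data blocks
        out.append([str(c).strip() for c in (row[:n_cols] if n_cols else row)])
    return out

def consolidate(xls_paths, out_path, header_rows=5):
    """Extract the first sheet of each workbook and append it to one TSV."""
    import xlrd  # third-party: pip install xlrd (reads legacy .xls files)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for path in xls_paths:
            sheet = xlrd.open_workbook(path).sheet_by_index(0)
            rows = [sheet.row_values(i) for i in range(sheet.nrows)]
            writer.writerows(clean_rows(rows, header_rows))
```

Keeping `clean_rows` separate from the file I/O makes the messy part (header and blank-line handling) easy to test on plain lists.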
Enrich and transform the data
Once the tax data is consolidated, the next step is to enrich it with additional features coming from the French census bureau (INSEE). This provides population data as well as a standardized geographic hierarchy (in France, from top to bottom, we have Country > "Régions" > "Départements" > "Communes", i.e. cities).
Instead of writing a Python script, I'll use Google Refine to process the data. I won't go into details of the process, but a summary is as follows:
Google Refine is very handy for dealing with messy data. Among other things, it can read a wide variety of data sources, has some nice faceting features that help in consolidating similar records, and offers the powerful GREL language to perform transformations. The main UI looks like this:
Here is a sample of the features I used:
- Read data from a URL
- Use faceting to filter out some data
- Use GREL to standardize values
- Use GREL to merge data sets
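For readers without Google Refine at hand, the last two steps (standardizing values and merging data sets) can be sketched in plain Python. The column and key names below are made up for the example, not taken from the actual project:

```python
# Sketch of the Refine-style cleanup: normalize commune names so near-duplicates
# match, then left-join the tax rows with INSEE rows on a shared commune code.
import unicodedata

def standardize_name(name):
    """Trim, uppercase, and strip accents so e.g. ' Orléans ' and 'ORLEANS'
    collapse to the same value (the kind of step GREL faceting helps with)."""
    nfkd = unicodedata.normalize("NFKD", name.strip().upper())
    return "".join(c for c in nfkd if not unicodedata.combining(c))

def merge_on_code(tax_rows, insee_rows, key="code"):
    """Left-join two lists of dicts on a shared commune code."""
    insee = {r[key]: r for r in insee_rows}
    return [{**r, **insee.get(r[key], {})} for r in tax_rows]
```

Joining on the INSEE commune code rather than the name is the safer choice: city names are not unique in France, while the code is.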
After spending some time cleaning up, standardizing and augmenting the input data, the last step is to append geographic coordinates to each city, which will be useful to plot maps. The process of transforming a place name or address into a pair of longitude/latitude coordinates is known as geocoding. Many APIs provide such a service (including Google Maps, Bing, OpenStreetMap, etc.), but I chose to go with Yahoo! PlaceFinder, notably because of its high rate limit (50,000 API calls per day). To start using it, the first thing needed is an API key from Yahoo. Then the following Python script will take the clean data set as input, call the geocoding service on each city name, and output the results to a new file:
Please note that this script should be generic enough to be adapted to new data sets; the only requirements are that the input file has column names, and that you specify the delimiter and the ordered list of fields that form the address to geocode.
The main trick here is multi-threading. Since calling the geocoding service sequentially, row by row, can take a while, this script takes advantage of Python's multi-threading facilities to speed things up considerably. Geocoding 36K addresses takes about one hour with this method.
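The multi-threading pattern at the heart of the script can be sketched as follows. The actual Yahoo! PlaceFinder request (API key, URL, response parsing) is not reproduced here; `geocode` stands in for any callable that turns one address into a coordinate pair.

```python
# Sketch of the multi-threaded geocoding loop. The calls are I/O-bound:
# while one thread waits on the network, others run, so throughput scales
# roughly with the pool size rather than with per-call latency.
from concurrent.futures import ThreadPoolExecutor

def geocode_all(addresses, geocode, max_workers=20):
    """Run geocode() over many addresses concurrently.
    `geocode` is any callable: address -> (lat, lon), or None on failure.
    pool.map preserves input order, so results line up with the input rows."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(geocode, addresses))
```

In practice you would plug in a `geocode` function that performs the HTTP request against the geocoding service, and keep `max_workers` well below the service's rate limit.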
Visualize the data
We're done. The data set now has all the info required and is ready to be pushed to Tableau. If you are interested in the scripts and the data set, everything is in this GitHub repo. If you're also interested in adding to your toolbox by making data visualizations in Dataiku Data Science Studio (DSS), check out our free training.