The Data Science Studio makes it possible to create an end-to-end analytical workflow using APIs and services from different providers.
Think about it. How many times have you had to:
- Pull data from a marketing services solution provider or a web analytics service?
- Fetch weather forecasts or geocode addresses using an API?
- Use datasets from open data providers, or even create new ones by crawling the web?
As Dataiku Data Science Studio (DSS) users encounter this kind of scenario very frequently, I'm going to take you through a complete example of how to create an automated workflow that mixes external and internal data.
Our use case
We are going to show, on a single map, incident and crime data for San Francisco, along with graffiti and business locations (because, hey, who knows, there might be some correlations!).
In DSS, I created a new project called "Openness" which will hold all the work. The complete data processing workflow for this project, once completed, will look like this:
This example is interesting because it is very similar to the various use-cases described above:
- the incidents dataset will be pulled from the City of San Francisco's open data portal (think "this is just like when I need to get files from my email solution provider's FTP!")
- the businesses and graffiti datasets are available from the enigma.io API (think "this is just like when I need to pull data from my web tracking provider's API!")
- the data will be pushed to CartoDB to create maps that will be embedded in DSS (think "this is just like when I need to push data to my dashboarding system!")
It sounds like we have it all, so let's dig into the details!
Using external data in Data Science Studio
There are basically two ways to get external data into DSS.
The first one is to use HTTP Datasets, which let you download and work with any CSV or simple JSON file, as long as it is reachable via a URL.
One example is the "SFPD Incidents" dataset, visible on the left of the diagram above. The Dataset's "Settings" panel shows just what to do: enter the URL of the file, then hit Download. Once the download completes, DSS will sniff its format and let you use it like any other dataset:
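To get a feel for what that format sniffing involves, here is a minimal sketch using Python's standard library. DSS's own detection is internal and more sophisticated; the sample rows below are illustrative, not real SFPD data.

```python
import csv
import io

# A small sample of what a downloaded incidents CSV might look like
# (made-up rows for illustration).
sample = (
    "IncidntNum,Category,PdDistrict,Date\n"
    "140001,LARCENY/THEFT,MISSION,01/01/2014\n"
    "140002,ASSAULT,SOUTHERN,01/02/2014\n"
)

# csv.Sniffer guesses the dialect (delimiter, quoting) from a sample,
# which is the same idea behind DSS detecting the file's format.
dialect = csv.Sniffer().sniff(sample)
rows = list(csv.reader(io.StringIO(sample), dialect))

print(dialect.delimiter)  # the detected separator
print(rows[0])            # the header row
```

Once the format is known, every row can be parsed into typed columns, which is exactly what makes the downloaded file usable "as any other dataset".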
The second way is for when you need access to more complex datasets, or want to write custom logic. In this case, you can use a Python recipe with no input dataset. Let's look, for instance, at the "get graffittis" recipe, still on the left of the diagram:
This recipe has no input but calls external APIs, which makes it a bit special but very powerful when it comes to using external data. Basically, as long as you can write your code in Python, you can access virtually any kind of data.
What we do here:
- find the dataset ID on Enigma
- make the required calls to their API and get a JSON response (looping through the paginated results), using Requests here
- load the 'result' section of the response into a Pandas dataframe (and let Pandas infer the structure)
- create a DSS dataset using the Dataiku API, and let it generate its schema automatically when written
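The pagination-and-load part of those steps can be sketched as below. The actual Enigma endpoint and parameters aren't shown here, so `fetch_page` is a stand-in for the real `requests.get` call, and the stubbed response shape (`result` list plus `total_results` count) is an assumption for illustration.

```python
import pandas as pd

def fetch_all_results(fetch_page):
    """Loop through a paginated JSON API, accumulating the 'result' rows.

    `fetch_page` stands in for the real HTTP call (e.g. requests.get
    against the Enigma API); it must return a parsed JSON dict with a
    'result' list and a 'total_results' count.
    """
    rows, page = [], 1
    while True:
        payload = fetch_page(page)
        rows.extend(payload["result"])
        if not payload["result"] or len(rows) >= payload["total_results"]:
            break
        page += 1
    # Load everything into a DataFrame and let Pandas infer the structure.
    return pd.DataFrame(rows)

# Illustrative stub simulating two pages of a graffiti dataset.
def fake_fetch(page):
    pages = {
        1: [{"address": "100 MARKET ST", "status": "Open"},
            {"address": "5 MISSION ST", "status": "Closed"}],
        2: [{"address": "42 VALENCIA ST", "status": "Open"}],
    }
    return {"result": pages.get(page, []), "total_results": 3}

df = fetch_all_results(fake_fetch)
print(df.shape)  # (3, 2)
```

In the actual recipe, the resulting DataFrame would then be handed to the Dataiku API to create the output dataset and its schema.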
At this stage, we have fetched our raw datasets, and we are now ready to move further.
Putting it all together
We now need to stack all the datasets together. The objective is to produce a new dataset with the structure required to push to CartoDB.
Without going too much into detail, building a map often requires the dataset to have "real" geographic attributes. CartoDB, for instance, works well with coordinates encoded as a Point, per the PostGIS definition. To make sure everything works smoothly, we'll use a PostGIS-enabled PostgreSQL database to create our dataset.
Moving datasets around can be done in several ways in DSS, but here we are going to use "Preparation Scripts". These scripts will:
- let you visually enrich and clean data
- allow you to write the processed dataset to any kind of target system, here a Postgres table (note that this feature is common to all kinds of recipes; DSS lets you move data from one system to another in one click).
For instance, if we look at the "compute incidents t" recipe, we start from a local text file, create a preparation script interactively, apply it to the whole dataset, and store the output in PostgreSQL:
That's it: a few clicks to enrich and clean your dataset, and one click to push it to a PostgreSQL database. Nice, right?
Now that the datasets are stored in PostgreSQL, the last step is to create the proper data structure for CartoDB. This is done via the last SQL recipe:
This simply stacks the 3 sources together and, more importantly, creates a new Point column, stored as a proper geographic attribute (a "geom") using a well-defined geographic projection. This step produces the "map data" dataset, which we'll use to create our map.
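To make the idea concrete, here is a sketch of how such a query could be generated. The table and column names are assumptions (the real project's names may differ), and the SRID 4326 (plain WGS84 lon/lat) stands in for whatever projection the actual recipe used; the PostGIS calls `ST_MakePoint` and `ST_SetSRID` are the standard way to build a geom from coordinate columns.

```python
def build_map_sql(sources, target="map_data", srid=4326):
    """Build a CREATE TABLE statement that stacks several source tables
    with UNION ALL and turns lon/lat columns into a PostGIS Point.

    Table/column names and the SRID are illustrative assumptions.
    """
    select = (
        "SELECT category, "
        "ST_SetSRID(ST_MakePoint(longitude, latitude), {srid}) AS geom "
        "FROM {table}"
    )
    union = "\nUNION ALL\n".join(
        select.format(table=t, srid=srid) for t in sources
    )
    return "CREATE TABLE {0} AS\n{1};".format(target, union)

sql = build_map_sql(["incidents_t", "graffitis_t", "businesses_t"])
print(sql)
```

In practice you would write the SQL directly in the recipe; generating it is just a compact way to show the shape of the query here.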
Publishing and visualizing
Our dataset is ready to be published! So how do we do it? Once again, a Python recipe will do the job:
This one is a bit hackish, but who doesn't like a tool that can be hacked a little bit? :) Schematically, here's what we do:
- read the input map data
- store a sample of it (to make a lighter file) in a temporary location
- push it to CartoDB using curl
- store the curl response in a dataset
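The curl call can be sketched like this. The account name, API key, file path, and the exact endpoint shape below are assumptions based on CartoDB's v1 Import API; check the current CARTO docs before relying on it.

```python
def cartodb_import_command(account, api_key, csv_path):
    """Build the curl command that uploads a file to CartoDB's Import API.

    The endpoint format here is an assumption (CartoDB v1 Import API);
    account, key, and path are placeholders.
    """
    url = "https://{0}.cartodb.com/api/v1/imports/?api_key={1}".format(
        account, api_key)
    return ["curl", "-s", "-F", "file=@{0}".format(csv_path), url]

cmd = cartodb_import_command("myaccount", "SECRET", "/tmp/map_data_sample.csv")
# In the recipe this would be run with subprocess.check_output(cmd),
# and the JSON response written to the output dataset.
print(" ".join(cmd))
```

Storing the raw response in a dataset is a cheap way to keep an audit trail of each upload.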
The dataset is now ready to be used:
You can now play around and create several visualizations. Interestingly, once a map is created in CartoDB, it can be embedded somewhere else using an iframe. This is great, since it can then be displayed directly within DSS!
The Studio offers the ability to create your own web apps in the "Insights" section. So to display our maps, we just need to create new web apps and copy/paste the CartoDB iframe link inside the body of the app:
Lastly, these Insights can be published to the Pinboard (using the Publish button), which will let your users play with the viz you just created!
That's a wrap! You should now have a better understanding of how to use external services from within DSS.
Beyond being able to produce the visual outputs, using DSS to perform such a project offers several advantages:
- Automation: the workflow shown at the beginning of this post is more than just a graph; it is a smart execution engine relying on a DAG that computes the dependencies between datasets and knows when to update what. By scheduling the update of the last dataset (did I mention the built-in scheduler?), DSS will automatically rebuild the upstream datasets it depends on.
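The dependency logic behind that engine can be illustrated with a toy version of this project's graph, using the standard library's topological sort. The dataset names below mirror the flow described in this post but are simplified stand-ins, not DSS internals.

```python
from graphlib import TopologicalSorter

# Each dataset maps to the set of datasets it is built from
# (simplified names echoing the flow in this post).
deps = {
    "incidents_t": {"sfpd_incidents"},
    "graffitis_t": {"graffitis"},
    "businesses_t": {"businesses"},
    "map_data": {"incidents_t", "graffitis_t", "businesses_t"},
    "cartodb_response": {"map_data"},
}

# A topological order guarantees every upstream dataset is refreshed
# before its dependents -- the same idea a DAG-based execution engine
# applies when a scheduled build runs.
order = list(TopologicalSorter(deps).static_order())
print(order.index("map_data") > order.index("incidents_t"))  # True
```

This is why scheduling only the final dataset is enough: the engine can always derive what must be rebuilt first.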
- Sharing and transparency: the whole process is accessible in one place, which makes it easy to share among different people, and the code and scripts are fully open.
- End-to-end solution: all the required components are assembled seamlessly: write Python scripts to fetch data from APIs, use Dataiku's own components to clean up and move data, run a SQL query to create new datasets... No more need for multiple tools and client applications; the whole data science workflow can be done right from DSS.
If you want to play with the code yourself, get a free version of DSS in its Community Edition, download a copy of the project, create a new project in your DSS, and import the archive. Note that it won't work out of the box unless you have a PostGIS database available.
Can you think of a comparable use-case in your organization? Feel free to get in touch with us to discuss how the Data Science Studio can help!