For Dataiku's First Free Training, Kurt Muehmel, business engineer by day, data enthusiast by night, showed attendees how businesses are using Dataiku Data Science Studio (DSS) to predict future revenue by combining raw web logs with CRM data.
During the 45-minute session, our guests learned how business analysts with little or no technical background use Dataiku to clean, enrich, analyze, model, and finally deploy predictive analytics applications on their own. To receive the full webinar video recording, please email Pauline. Otherwise, you can read the webinar's transcript below, and follow along by downloading Dataiku:
A Brief Introduction: the Speaker, the Company, the Software
So I'll start with a brief intro about myself: my name is Kurt, I’m a business engineer at Dataiku, and let's say that I have a diverse semi-technical background which, for the purposes of this webinar, means that I’m not a data scientist or a developer. I’m therefore not going to go into the details of the IT aspects or of the data science aspects. This webinar is really intended as an introductory webinar.
Dataiku is the company that I have the pleasure of working for. It was founded in 2013 by four founders. We are currently based in Paris and very proud to be a part of the French tech movement. We are also expanding to the US and opening offices in NYC shortly. We have 24 employees and recently closed our first round of funding with two top level French venture capital firms. Dataiku has over 30 customers in numerous industries.
Our software is an end-to-end platform for going from raw data to predictive applications, with every required step available in one place. In just a minute, we’ll look at each step in a very hands-on way: loading and enriching your data, analyzing it and predicting values, and finally publishing and running your predictive applications.
What Is This Webinar About and To What Industry Does It Apply?
We are assuming that you are in a CRM or e-commerce context where you have a lot of data about your customers. So how can you go from that data to actual value? Everything we are talking about today relates to an e-commerce context but can be applied to other industries as well.
How Do You Build Something Meaningful From All This Data and From All These Different Sources?
Imagine what you could do if you could predict the revenue of your site visitors! A customer arrives on your site and you don’t know whether they are a regular customer, the fancier type (maybe the kind that wears a tie to shop online), or a top-level customer. If you could identify what type of customer they are by predicting how much they are worth, maybe you could follow up differently in your email marketing or adjust your pricing for these different customer profiles. These are of course decisions that you’d make based on your business strategy.
In this webinar, we aren’t going to make those decisions for you, but we are going to help you get one step closer by showing you the basics of predictions, how to set up predictions in DSS, and how you can integrate these predictions into your business strategy.
The Three Steps to Predicting Revenue
- The first step in computing predictions is identifying what you want to predict. In this case, it is revenue. We want to know how much a customer is going to spend on our site.
- Then, we are going to use all the different data that will be useful in identifying the different variables that will help us predict that revenue.
- Finally, we'll create a model that will predict the future revenue based on those variables from the past data. Basically, we are linking the past to the future.
What Will You Need?
- As much user data as possible (website logs, purchase history, …) because, out of all this data, we don’t yet know what has predictive value.
- Next, you need the ability to prepare, enrich, and combine the data to create a model that will compute your predictions.
- Finally, depending on your business case, maybe you’ll need a business content management system that will allow you to change your site content based on these predictions, maybe a mailing system to adapt your marketing campaigns' contact strategy, or maybe an e-commerce platform that will allow you to adjust pricing dynamically based on predictions.
Today, we are talking about this middle step. We are assuming that you have the data and that other solutions could provide the final step – i.e. deciding what to do with these predictions. But the real hard part is finding the link between these various steps.
In fact, that’s basically the core of data science. It’s a combination of different areas of expertise: statistical expertise, development expertise, IT expertise, and understanding business. Having all these in one person is difficult and it’s one of the things we set out to solve with DSS.
In today’s training, we have a .csv file with data from our CRM system, which has info about our users including birth dates, gender, etc. It is important to note that this is information from the past that we are going to use to train our models. Similarly, we have web logs (which have been aggregated to a single customer id, so basically one line per customer) that will give us IP, pages visited, whether or not the person arrived via advertising, etc. And finally, we have our new visitors data. This last file has the same structure and format as our CRM system data but with new information. So basically, we want to know what we can learn from the visitors we know very well and from the ones that have just arrived.
Let's Get To It
Reminder: Please email Pauline to receive a free recording of this webinar... which will no doubt add a little color to the following transcript.
Combining Data in DSS
Now, I am switching over to Data Science Studio (click here to download the DSS Community Edition for free). I’m running DSS on Chrome. This here is the main interface for DSS. These are my different projects: projects have users and have different access rights. Now, I’m going to create a new project. Let’s call it “Predict Revenue.”
The first thing we need is data. Connecting to data in DSS is simple. I’m going to import those three .csv files that we talked about earlier.
Here are the different data storage systems we can connect to. For today’s demo we are just going to upload my .csv files because I’m running locally on my MacBook Air. But once we’ve established a connection with the different data storages, DSS will treat them equally. This means that everything we do to our little .csv file we could also do to a Hadoop distribution, or in the cloud with AWS, etc.
But today we are just uploading the files. Now, I am going to pick up my data. Here is the CRM from last month. I can preview it to make sure it picked it up correctly. It was compressed which is common on web servers so I am going to rename it like this.
As you can see, DSS has identified the customer id and so on. If I wanted to, I could adjust the schema but typically DSS does that for you. I can browse through the data here. Now, I’m adding the web data from last month. And now the third and final dataset which contains the new visitors.
Now I can go and get an overview of what I’ve done. The data I’ve connected to appears as blue cylinders, the universal representation of data.
Preparing Data in DSS
Now, I’m going to go into my “web last month dataset” like this. I can look at a few different things there. I see I have an IP address and other information that I’d like to make more useful. So I am going to create a preparation script. This is one of the core concepts in DSS: the ability to manage and prepare your data in a manner that can be easily repeated and updated over time.
Now, I’m in my data prep script view. I can click on a column and DSS is going to suggest that I resolve geoIP. It is going to look at the IP address and propose corresponding variables such as country, area code, etc. For our purposes, I’m going to de-select country code and geo-point and leave just the city and country variables because I don’t think I’ll need all that other information.
But as you saw in the introductory presentation, there was a customer id variable in both CRM and in web logs. It would be interesting to combine those two. This can be done with a join. I just clicked on the “add a step manually" button that opens up this dialogue with all these different processors. A processor is a little script we’ve prepared that lets you prepare your data easily and quickly. Today, DSS has 66 such processors and we add more frequently.
Let's go into join and just do the standard memory-based join. I’m telling DSS to look at customer ID. I want to join it with CRM data, take customer ids, and bring over the revenue information from that dataset. If I scroll over here, I see a new yellow column, which shows me the new data I’ve connected to. I’m going to rename it “Revenue”. I can go ahead and save it as a recipe. In DSS, a recipe is a preparation step that you can repeat in the future. This is important if you are updating new data.
Now, I’m telling DSS that I want to create this new dataset directly. If I go into this new dataset, I can see this new revenue column. If I go back to my flow diagram, I can see that I’ve combined two datasets through the preparation script and the output is this new dataset.
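For readers who want to see what this join does outside the interface, here is a minimal pandas sketch. It is not what DSS runs internally. The column names (`customer_id`, `revenue`, `ip`, `pages_visited`) and the sample rows are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets; the real files have more columns.
crm_last_month = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "gender": ["F", "M", "F"],
    "revenue": [120.0, 35.5, 0.0],
})
web_last_month = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "ip": ["203.0.113.7", "198.51.100.23", "192.0.2.55"],
    "pages_visited": [14, 3, 8],
})

# Left join: keep every web-log row and bring over only the revenue column
# from the CRM data, matching on the customer id.
web_last_month_prepared = web_last_month.merge(
    crm_last_month[["customer_id", "revenue"]],
    on="customer_id",
    how="left",
)

print(web_last_month_prepared)
```

A left join keeps web-log rows that have no CRM match (customer 4 here) and leaves their revenue empty, so unmatched rows stay visible rather than being dropped silently.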
Working on Your Predictive Model
Let’s get to work on our model. In our model I can go to the models tab and select “new”. I can choose from clustering or prediction. These are two fundamentally different approaches. For today’s purposes, we are going to click on "prediction."
I’m telling DSS to look at “web last month prepared” and telling it to predict “revenue”. Now, I could go into the different tabs and personalize this model. For example, if I thought that different features (column variables) should be included, I’d just use this interface to pick and choose. I could also go and modify the algorithms that DSS has selected for me. But as I’m not an expert, I’m just going to click on ‘train now’ and see what happens.
Training Your Predictive Model and Visualizing the Output
Right now, DSS is training these two different algorithms. Here we go! You can see the first results: two algorithms with slightly different results. My Pearson correlation is fair in ridge regression and marginally better in random forest. I can go see the variable importance here. This tells me which variables are going to have the greatest importance for the prediction’s accuracy. In this case, we can see that if the users come from China or from the US it will impact the predicted revenue significantly.
I can now go back to the summary tab and select all the algorithms and get an overview of their different performances over different metrics. As you can see, in this case, DSS is saying that random forest is best over all the metrics. Though it’s clear that this model isn’t that good, it’s my first time so I’m just going to go ahead and use it.
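To make the comparison concrete, here is a rough scikit-learn sketch of training a ridge regression and a random forest on the same data and comparing them by Pearson correlation on held-out rows. The data is synthetic, and the feature layout and hyperparameters are assumptions. The models DSS trains and the metrics it reports may be configured differently.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared dataset (assumption):
# two numeric features and a revenue target with a nonlinear component.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 2))
y = 5.0 * X[:, 0] + 20.0 * (X[:, 1] > 5) + rng.normal(0, 5, size=400)

# Hold out a quarter of the rows to evaluate on data the models never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

pearson = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    # Pearson correlation between predicted and actual revenue.
    pearson[name] = float(np.corrcoef(predicted, y_test)[0, 1])
    print(f"{name}: Pearson correlation = {pearson[name]:.3f}")

# Variable importance as reported by the random forest (values sum to 1).
importances = models["random_forest"].feature_importances_
print(importances)
```

Evaluating on rows held out from training is what makes the metric meaningful: a model scored on the rows it was trained on will usually look better than it really is.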
Using the Predictive Model
To use the model, I have three options:
- compute predictions
- create a recipe to periodically retrain this model
- or create an IPython notebook: basically, with DSS, you can get right into the code and leave the interface at any time.
This is my model for the random forest in Python. If I scroll down in the notebook, I can find the section that actually creates the model. If I wanted, I could tweak this to get exactly what I want.
I’m going to go to my results page and run this model into production (see more on why going into production matters!). I’m choosing to create a recipe to periodically retrain my model. And so my input dataset will be “web last month.” In my flow, I now have a trained model. Let’s go into preparation mode again. Remember how I enriched my IP data with country information? Well I’m going to do that here as well for an apples-to-apples comparison. I’m going to quickly enrich this with IP data.
I now have a prepared dataset. Now I need to link these two steps in the chain. I want to score this data based on the model we’ve prepared down here. To do this, I will click on the “more” button and select “compute predictions.” I want to score “web new visitors prepared”: the model to use is “model web last month prepared” and we are going to create a new output dataset, which I am going to store in my file system. Remember what I’ve said before: data storage locations are irrelevant when you are working in DSS. If I were working in Hadoop I could save it all there as well.
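The "apples-to-apples" point matters in code too: the new data has to go through the same preparation as the training data, and its columns must line up with what the model was trained on. Here is a minimal sketch using pandas and scikit-learn. The column names, the sample rows, and the choice of a ridge model are all assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import Ridge

# Hypothetical prepared datasets: country was derived from the IP address
# in both, so the training and scoring data share the same structure.
train = pd.DataFrame({
    "pages_visited": [14, 3, 8, 21, 5],
    "country": ["US", "FR", "CN", "US", "FR"],
    "revenue": [120.0, 35.5, 80.0, 150.0, 20.0],
})
new_visitors = pd.DataFrame({
    "customer_id": [101, 102],
    "pages_visited": [10, 2],
    "country": ["CN", "DE"],  # "DE" never appeared in the training data
})

# One-hot encode the categorical column for training.
X_train = pd.get_dummies(
    train[["pages_visited", "country"]], columns=["country"], dtype=float
)
model = Ridge(alpha=1.0).fit(X_train, train["revenue"])

# Apply the same encoding to the new data, then align its columns with the
# training columns: unseen categories are dropped, missing ones filled with 0.
X_new = pd.get_dummies(
    new_visitors[["pages_visited", "country"]], columns=["country"], dtype=float
)
X_new = X_new.reindex(columns=X_train.columns, fill_value=0)

scored = new_visitors.assign(predicted_revenue=model.predict(X_new))
print(scored)
```

Without the `reindex` step, the new data would have a different set of columns than the training data and the model could not score it.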
Now, I have the ability to get my scored data. Here, I am going to click on “build”. Now, I am telling DSS to force rebuild all different dependencies. If you’ve ever tried to do this outside of an integrated platform, it is often quite a headache to make sure that all the dependencies are updated frequently.
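To illustrate why dependency handling is the hard part, here is a small sketch that computes a valid rebuild order for a flow with Python's standard `graphlib` module (Python 3.9 or later). The dataset names are assumptions loosely based on this demo. This shows the general idea of ordering by dependencies, not how DSS itself schedules builds.

```python
from graphlib import TopologicalSorter

# Each item maps to the set of items it is built from (its dependencies).
# The names are assumptions loosely based on the datasets in this demo.
flow = {
    "web_last_month_prepared": {"web_last_month", "crm_last_month"},
    "model": {"web_last_month_prepared"},
    "web_new_visitors_prepared": {"web_new_visitors"},
    "web_new_visitors_scored": {"web_new_visitors_prepared", "model"},
}

# static_order() yields every item only after all of its dependencies.
build_order = list(TopologicalSorter(flow).static_order())
print(build_order)
```

Several orders are valid, but in all of them the scored dataset comes last because everything else feeds into it.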
Basically, DSS is doing all those preliminary steps one last time to make sure we get the most accurate results. There we go, the job has completed! If I go back to my flow like this, I can go to my scored dataset and see the predicted revenue for all of my users.
Finally, I'm going to share the visualizations of these different steps as insights with other members of our team like this.
Questions & Answers
Q: Where is the data transformation happening?
A: The answer is that it really depends. In this case, we are running locally. Everything in this specific case is happening on my MacBook Air. But imagine I was connected to a Hadoop cluster (which is a distributed storage technology used to store large amounts of data across many servers): if we were doing a data transformation, this step would be prepared here in DSS. Then, it would be translated into MapReduce jobs before being sent out to the cluster. For those of you who use Hadoop, you can understand that this would save you a lot of time. Whenever possible, DSS is going to push as much work as possible to the underlying infrastructure and do as little as possible on the server that is running DSS. This lets you use more modestly sized machines for that server.
Q: Is it possible to schedule different jobs in a flow?
A: Yes. You can schedule these jobs at any point in the workflow. Imagine you wanted to run this job at specific times. To do that, we would go to our administration panel, select the project and the output dataset, and then define at what intervals we want the job to run (every Tuesday at 3am, for example). Then, I’m going to choose the “build mode”: since this is a small dataset, I’ll rebuild the whole flow. If I were dealing with more data, I’d probably want to pick these rebuild jobs more intelligently from these different build options. This helps you industrialize these processes to make sure your data is always up to date, where you need it, and when you need it.
Q: Different languages (Pig, Hive, Python, R, SQL): what are these there for and can I use them in my workflow?
A: Yes! You could use them to replace any one of the steps we went through. So imagine you wanted to do this join in SQL? I could go ahead and add a new SQL step here. Then, I would need to define input and output and code my SQL. Same goes for Python, Hive or Pig. It’s the same principle for the modeling aspects of the studio. Numerous clients prefer to develop their models with their own languages. DSS is totally based on a white box approach.
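As a rough illustration of what a weekly schedule like "every Tuesday at 3am" means, here is a standard-library sketch that computes the next run time. The helper name `next_weekly_run` is hypothetical and has nothing to do with how DSS stores or evaluates schedules.

```python
from datetime import datetime, timedelta


def next_weekly_run(now: datetime, weekday: int = 1, hour: int = 3) -> datetime:
    """Return the next run strictly after `now` (weekday: Monday=0, Tuesday=1)."""
    # Start from today at the target hour, then move forward to the target weekday.
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    # If that moment has already passed (or is right now), wait one more week.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


# The same schedule in standard cron syntax would be: 0 3 * * 2
print(next_weekly_run(datetime(2015, 6, 3, 12, 0)))
```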
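As an example of what "coding my SQL" for the join might look like, here is a self-contained sketch that runs the query against an in-memory SQLite database from Python. The table and column names are assumptions, and the SQL dialect on your own database may differ slightly.

```python
import sqlite3

# In-memory database with tiny, made-up versions of the two input datasets.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crm_last_month (customer_id INTEGER PRIMARY KEY, revenue REAL);
    CREATE TABLE web_last_month (customer_id INTEGER, pages_visited INTEGER);
    INSERT INTO crm_last_month VALUES (1, 120.0), (2, 35.5);
    INSERT INTO web_last_month VALUES (1, 14), (2, 3), (4, 8);
""")

# Keep every web-log row and bring over revenue where the customer id matches.
rows = conn.execute("""
    SELECT w.customer_id, w.pages_visited, c.revenue
    FROM web_last_month AS w
    LEFT JOIN crm_last_month AS c ON c.customer_id = w.customer_id
    ORDER BY w.customer_id
""").fetchall()
conn.close()

print(rows)
```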
For our next Free Training (which we will announce shortly), we will improve the model we built today. We will look at the ways you can work on the different features that you want to include in order to obtain better conclusions. Stay tuned!
If you wish to receive invites to Dataiku's Free Trainings (limited spaces available), please email Pauline.