The retail industry has been data centric for a while. With the rise of loyalty programs and digital touch points, retailers have been able to collect more and more data about their customers over time, opening up the ability to create better personalized marketing offers and promotions.
I was pretty excited to read that our partner HPE recently added machine learning capabilities to its Vertica database . As MPP (massively parallel processing) systems are fairly frequent in the retail world, it is a good occasion to try building the kind of application described above with a best-of-breed technology combined with Dataiku DSS!
So What Did I Build?
Actually, a predictive application managed by Dataiku DSS and running entirely on top of Vertica. This is a end-to-end data workflow starting from acquiring raw data acquisition to training a machine learning model, aiming at predicting whether or not a customer is going to make a purchase in a specific product category:
The Supporting Data
a file storing 2 years of transactions, with information about customers, products, time, store, and amount
a file storing demographics about the customers
a file storing product attributes : product hierarchy, packaging...
The Data Science process:
1. Data ingestion
As we are going to use Vertica both for storing and processing our data, the very step is, of course, to push these data into Vertica. That's not rocket science with DSS: just create the DSS datasets from your raw data files, make them looking good using a Visual Data Preparation script if you wish, create a new output dataset stored in Vertica, and run your recipe. That's it. Your data are in Vertica. No coding, no SQL, no worries about how to move data around or how to handle schemas and data types: Dataiku DSS does it all automatically for you. Great, isn't it ?
2. Data preparation
The files are disparate, have unnecessary data... With Dataiku DSS, we can easily join, filter, split datasets together using visual recipes, with no need to code. So what do we do now ? We first merge the 3 initial datasets (transactions, products, and customers) all together to get an holistic view of the purchase behavior (with the mean of a Join recipe), then we split the newly created datasets in 2, with a Split recipe: one part will be used to create the features or variables for our predictive model, based on historical transactions data; and the other one will be used to create the "labels", a simple new column indicating whether or not the person will purchase specific products in coming weeks. All these data management tasks will be performed by the underlying Vertica database whenever it is possible to avoid data movement and benefit from its speed.
3. Features engineering
The data are still too raw to be used by machine learning algorithms directly. We need to switch from the detailed "transactional" view to an "analytical" view, where the data will be aggregated to form a large set of attributes ("features", or variables) describing each customer. This step is crucial and one of the most difficult when creating a machine learning model, known as features engineering. Technically speaking, when working with a SQL database, this will be essentially a combination of group by / pivot operations ... . Fortunately enough, with Dataiku DSS you can either use "visual grouping" recipes to generate your SQL statements, or programmatically create much more complex queries using helper Python functions. So let's move ahead and generate this aggregated dataset (which ends up with more than 340 columns !) by combining simple counts-based features and the money spent for each type of product:
4. Data preprocessing for machine learning
We're getting closer to our machine learning model. Actually, there is one remaining step: many ML algorithm implementations expect numeric only features with no missing values. This is unfortunately very uncommon in practice as we often need to deal with not-so-clean, mixed data types. So in this step, we are going to impute missing values with a sensible value (averages) - even if this is always questionable, and transforming each categorical (text) feature into a set of "dummy" variables (if a feature has n possible values, we will create n-1 new columns filled with 0/1 depending on the absence/presence of the value). Again, Dataiku DSS helps automating this process by offering the ability to generate dynamically the corresponding SQL statements.
5. Model creation
We can finally do ML now :) And test the new Machine Learning for Predictive Analytics functionality released with the latest 7.2.2 version of Vertica. Training, storing and getting statistics from your machine learning model is as easy as 2 queries:
As we are predicting whether or not someone will buy from a specific product category, we have a classification problem (with a binary outcome), hence the choice of logistic regression, which is available out of the box for Vertica. Training the model is done entirely in-database, with no need to move data around!
6. Model assessment & deployment
This is our last step. Using the statistics created by Vertica, we can easily see and analyze the coefficients and associated p-values for each feature of our model. As this is a regular DSS dataset, you are also free to keep on building your workflow from there (for instance by doing Checks / Metrics analysis). Vertica provides also with a set of handy functions to assess model performances via ROC or lift curves, which can be leveraged via DSS SQL Notebooks. Finally, the results can also be operationalized by using the model to score new records on a regular basis, done again fully in-database.
This is it, we now have a functional workflow orchestrated by DSS and running on top of Vertica to produce the purchase predictions, even if rather simplistic.
Creating such predictive applications is much more than training machine learning algorithms - which is in fact just a fairly small amount of work in the entire process. Data preparation, features engineering, preprocessing... : these are some of the many things to do before being able to move to the "real fun" of ML. Even it may look like a sequential process through this post, it is in practice highly iterative. __Data preparation and machine learning are deeply coupled, and DSS offers the immense benefit to be able to create the entire data science workflow in an integrated environment, with no need to go back and forth between several tools.
As Big Data arise, and best-of-breed technologies to manage them are made available (this what we call Technoslavia at Dataiku), Dataiku DSS offers integrations with third-party solutions to make sure they can be properly leveraged. A key feature is the ability to push down calculations and processing to underlying, high performance system such as HPE Vertica, which will avoid unnecessary data movements between systems and tools (one of the plagues in the Big Data world). Vertica is also perfectly suited for the kind of workloads on structured data created by DSS : data scientists can join, filter, group or pivot their data without worrying about tuning the system for performance and model tables properly, it will work fast out-of-the-box. The association is powerful: Vertica as a high performance backend to predictive applications developed in Dataiku DSS.
Working with our ecosystem is key to us, but even more for our clients as they can fully leverage their technological investments by creating high value predictive applications more efficiently. Blablacar is a great example, as you may read about here (in French) or here. Did you notice? It's Vertica + Dataiku DSS, but also Tableau Tableau for data visualization and dashboarding, and a Cloudera Hadoop cluster for very large scale storage and data processing. Dataiku is also a technology partner of all of these great vendors, so we can ensure maximum value for our customers.
We strive to keep on being more and more integrated with our ecosystem, but also to develop our partnerships with system integrators and consulting companies that can implement end-to-end data solutions on top of our technologies, so if you are interested to learn more, connect with me on LinkedIn or ping our team at firstname.lastname@example.org!