This article was written by a guest author, Peter White. Peter White is a veteran full-stack software developer with 15 years of experience under his belt. Though the bulk of his time was spent working with all aspects of the .NET environment, he has recently focused more on Python and JavaScript, mainly with Angular. Outside of coding, Peter is also a stand-up comedian who has performed in over 30 countries and appeared on numerous television programs including Just For Laughs.
In my time as a data scientist I’ve used a lot of different tools — frankly too many. I came into data science from a programming background and, as such, my default approach was always to build things for myself. However, I quickly learned that if I wanted to get anything done in the world of data science, I was going to need some help. The volume of data that needs to be prepared, the number of models that need to be trained and analyzed, the amount of information that needs to be explained — it’s all too much for a DIY approach. So, I adopted different tools for different tasks and tried my best to stay organized.
While I tried to keep up, the number of tools kept growing and changing. It felt like every week I’d have to learn something new. There was a ton of bloat in the process, and it seemed to always get worse. Add to that the ever-changing goals of the business side of management, and the massive amount of work required to make these changes and explain the results to non-technical users, and it eventually became too much.
I, like many others before me, burned out and switched career paths. However, I still have a huge interest in data science, and regularly play around with machine learning (ML) for personal projects, which is how I first stumbled upon Dataiku.
My Impression of Dataiku
Dataiku promises to vastly increase the efficiency of ML projects through all stages of development. It seems like a lofty goal, but, as that is exactly what I am looking for, I figured I’d give it a spin.
First Impressions
As I dove into the Dataiku platform, I have to say that I was initially very skeptical. My honest first take was: This seems too easy. The UI is very clean and clear, and very intuitive. From the outset, it’s clear that you can manage just about every aspect of ML from this single platform. While that sounds great, as a data scientist it made me a little suspicious. What are the odds that it can do all of this, and do it well? It used to take me a ton of tools to accomplish this.
Regardless, I dove in. Dataiku prides itself on the speed of moving projects forward, and true to its word I was able to upload a dataset and create a model in minutes. Sure, it wasn’t optimized and was fairly trivial, but the minuscule amount of time and effort it took was impressive. A quick glance at the other features showed that they were similarly easy to use. While I was still convinced that something was amiss, I couldn’t help but start to consider the possibilities. If this platform is anywhere near as accurate as it is easy to use, this could really be useful.
If I can create, analyze, and explain models at this rate, I could try a number of different approaches with minimal effort. I could compare approaches easily, and avoid all the headache work that goes into model interpretation. I was still skeptical, but already I couldn’t help but get a bit excited.
Setup
I took a step back and set to work creating a much more involved project. I wanted to really see if the platform would give good results in a realistic scenario. I began with writing the code that would access the correct data source, something I’ve done manually many times before. It’s always a chore, perhaps the least fun part of any project. As I started working with Dataiku’s visual flow, my skepticism was at an all-time high. I am a strong opponent of visual programming, so their visual approach to data pipeline construction had me muttering awful things under my breath. “There’s no way this works,” I said, as I connected my Snowflake database in a single click. I carried on like this through my whole pipeline setup, as it felt like I was drawing a diagram of the thing I would have to build, as opposed to actually building it. In fact, I was almost mad when it worked. It was too easy. It felt like I had wasted so much time constructing manual pipelines. I nearly shed a tear.
The entire setup was a breeze, from the pipeline to the model training and model interpretation. Typically, my issue with visual setups is that there’s a complete lack of control. I always find myself trying to do something and being unable to because of limited visual controls, but with Dataiku the way forward was always obvious. Anything I wanted to do was right at my fingertips.
Data Preparation
It’s a common sentiment in data science, and as someone who comes from a programming background it’s even more so: Preparing and cleaning data is the worst. It’s tedious, it’s time-consuming, and it’s stressful. Any mistakes you make with your data echo throughout your entire model. Not only does it take a ton of effort to build, it takes a ton of time to test and verify as you dig through the combined data to make sure everything looks right. I was very unsure how Dataiku was going to be able to tackle this from a visual perspective.
But I was wrong to doubt them. Their UI for data preparation is unbelievable. I feel stupid for not thinking of it, but of course it’s easier to prep and clean data when you can see what you’re doing. Dataiku’s interface allows you to catch and fix errors on the spot! You can also join and aggregate data in so many different ways, and the platform records all your changes for reproducibility. Plus the system has built-in transformers for pretty much any data manipulation you’d need. I’ve never spent less time and had more confidence in my data preparation in my life. Knowing your data is clean and properly combined before moving into the modeling phase is a wonderful feeling.
Explainability
Easily one of the most difficult things to do in data science is to explain ML results to someone who doesn’t have a background in ML. While models aren’t necessarily the black boxes that they used to be, you still can’t easily peek under the hood and see how it all works. Every data scientist knows the pain of calculating partial dependence and ad hoc row-level explanations, so I was pleasantly surprised to learn that Dataiku has all the latest explainability approaches built in. While it still may take your expert translation and interpretation to ensure managers will understand your model results, generating your analysis and creating supporting charts will require far less effort.
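For anyone who hasn’t had the pleasure of computing partial dependence by hand, here is a minimal sketch of the idea in plain Python. It is not how Dataiku does it; it only illustrates the general technique: force one feature to a fixed value for every row, average the model’s predictions, and repeat across a grid of values. The names `partial_dependence` and `toy_model`, and the `age` and `income` features, are hypothetical and exist only for illustration.

```python
def partial_dependence(predict, rows, feature, grid):
    """Return (value, average prediction) pairs for one feature.

    For each grid value, every row gets `feature` overwritten with that
    value, and the model's predictions are averaged across all rows.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original row is untouched
            modified[feature] = value  # force the feature to the grid value
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve


def toy_model(row):
    # Hypothetical stand-in for a trained model's predict function.
    return 2.0 * row["age"] + 0.5 * row["income"]


rows = [
    {"age": 30, "income": 40},
    {"age": 50, "income": 80},
]

# How does the average prediction move as "age" changes?
curve = partial_dependence(toy_model, rows, "age", [20, 40])
for value, avg in curve:
    print(f"age={value}: average prediction {avg}")
```

Even this toy version shows where the pain comes from: a real model, a real dataset, and a plot per feature quickly turn a dozen lines into a maintenance burden.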
Insight
Understanding your data is really the name of the game in data science, and the more information you can have about the raw data, the better. I’d constantly have a page of code kicking around full of small data analysis helpers for things like finding outliers and calculating data statistics. Again, Dataiku has done all of this for me. They automatically calculate and visualize any sort of data statistic you need, without you having to scrape together a script. Plus, they include stats that I wouldn’t have ever bothered to calculate myself, meaning I have more insight into my data than I did when I did everything manually. And with way less effort.
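To give a sense of the kind of throwaway script described above, here is a minimal sketch using only Python’s standard library. It is an assumption about what such a helper might look like, not code from any real project or from Dataiku: `summarize` and `iqr_outliers` are hypothetical names, the sample values are made up, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


def iqr_outliers(values, k=1.5):
    """Flag values further than k * IQR outside the first/third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up sample column with one suspicious reading.
readings = [10, 12, 11, 13, 12, 11, 95]

print(summarize(readings))
print("outliers:", iqr_outliers(readings))
```

Multiply this by every column, every dataset, and every new project, and it is easy to see why having it done automatically is appealing.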
Surprise and Delight
I thought working with Dataiku was going to be a limiting, somewhat frustrating experience. I couldn’t fathom that a visual-based platform could achieve the same sort of results that I could with my more traditional approach. But in truth, the results with Dataiku were easily equivalent to what I was able to achieve on my own. That is because Dataiku doesn’t figure anything out for you. You need to know what you’re doing, and it simply makes it easier and faster to do what you want.
The platform removes a lot of the tedious parts of the job and allows you to focus on what you want to make as opposed to focusing on actually making it. To me, Dataiku is an enablement tool that allows me to accomplish what I was already accomplishing quicker and more efficiently.
Having built-in visualization tools available every step of the way without needing to switch platforms made it easy for me to interrogate my data and gain insights that influenced my next decision. I was able to train models that were as accurate as those I coded on my own, but I was able to gain more transparency and reproducibility using the abundance of analytic tools. This made the iterative process of model experimentation and evaluation much easier and faster than any approach I’ve seen, and — if I’m honest — it would undoubtedly lead to more accurate results. Not because the algorithms are more accurate, but because feedback is more thorough and alterations are easier to make. I can easily change my pipeline, features, and weightings and build a new model.
At the end of the day, Dataiku lets me do more with the same amount of time. It cuts out a lot of the tedious parts of data science without sacrificing accuracy or control. It organizes everything, and gives you unprecedented insight into your data and your models.
The best review I can give to the Dataiku platform is this: if I had found this earlier, I would probably still be a data scientist today.