This year, we were invited to Gartner’s Data & Analytics Summit in both London and Orlando, where we participated in the data science and machine learning (ML) bake-off. What is a bake-off, you might ask? If you are imagining certain beloved cooking shows centered around stressfully baking and judging cakes, then surprisingly, this event was not much different!
Dataiku was invited to participate in this unique event where vendors create and showcase a custom demo, using the same data, to build an end-to-end data science solution. How did we prepare for this challenge, what environment-saving project did we end up building, and how can you leverage this bake-off template for your own organization if you’re looking for a new data science vendor? Read on to find out!
Let Them Eat Cake
The bake-off experiment is simple: Give the same recipe and ingredients to contestants with different cooking techniques and appliances, and a variety of delicious creations comes out, each with its own strong and weak points. Because these can be judged from different angles, it may be impossible to say whether structural integrity, decorations, or taste matters most; that depends on the cake fan. Personally, I’m a fan of all cakes.
Now imagine bringing the same ideas to data science showdowns, and you get the data science and ML bake-off. As an organization searching for a suitable data science solution, hosting a structured bake-off can streamline the process and lead to a more informed decision. As a participating vendor, or any data science team rising to the challenge of a custom scripted request, the bake-off setup can be as exciting as a hackathon and a great learning experience. Here’s how our experience with the event went (Spoiler: We enjoyed it enough to do it twice!).
How to Choose a Data Science Vendor
Most companies looking for a data science and ML solution start this process with a plan. However, these plans can vary a lot and not all approaches are created equal, whether in terms of cultural fit or even success rate.
What we’ve seen lead to great results, and what Gartner also used as a prerequisite for this bake-off, is outlining success criteria and requirements, weighted by importance, before reaching out to vendors. With clear demands and expectations, you can compare platforms on the features that matter most to you. You can also research current trends and developments in the field that you want to watch out for.
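As a rough illustration of weighting criteria by importance, here is a minimal sketch in Python. The criteria names, weights, and the 1-to-5 rating scale are all hypothetical placeholders, not Gartner’s actual scorecard; swap in whatever requirements your own team has agreed on.

```python
# Hypothetical criteria and weights (assumptions for illustration only).
WEIGHTS = {
    "data_preparation": 0.30,
    "model_building": 0.30,
    "model_operations": 0.25,
    "business_insights": 0.15,
}


def weighted_score(ratings, weights=WEIGHTS):
    """Return the weighted average of per-criterion ratings (e.g. 1 to 5)."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    # Dividing by the total weight keeps the result on the rating scale
    # even if the weights do not sum exactly to 1.
    return sum(ratings[name] * w for name, w in weights.items()) / total_weight


def rank_vendors(all_ratings, weights=WEIGHTS):
    """Return vendor names sorted from highest to lowest weighted score."""
    return sorted(
        all_ratings,
        key=lambda vendor: weighted_score(all_ratings[vendor], weights),
        reverse=True,
    )
```

Agreeing on the weights before seeing any demo is the point of the exercise: it keeps a flashy feature from outweighing a requirement you had already decided mattered more.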
The bake-off as part of the Gartner conference is a great practical example of how this process could work, with Gartner playing the “customer looking for a data science solution” and Dataiku, along with the two other invited vendors, answering the call as potential solution providers.
Following this process, Gartner highlighted a few key trends for data science and ML platforms in 2022 (dynamism, democratization, and data-centricity) and used them as a guide for the type of demos and solutions they wanted the vendors to create.
Now, depending on the type of talent and resources in your data science teams (from citizen data scientists and analysts to expert coders) and what specifically you’re looking for in a platform to serve these teams, you might set different priorities and criteria when comparing tools. Or, you might find that the focus of this particular bake-off is exactly what you’d find relevant as well! In either case, the bake-off concept is still an exciting and innovative way to form an opinion on the state of the market.
In case you missed the Gartner conferences, we’ll be sharing a peek at our bake-off experience, from preparing for such a unique request to showcasing the Dataiku solution our team built to “help save the environment.”
Dataiku for Positive Environmental Impact
Gartner issued a challenge for us to use UN environmental data and build a project to help organizations and governments understand the impact of sustainability goals on economic, environmental, and social outcomes, and make recommendations on how to best effect a positive influence in these areas.
To fit the format, Gartner asked us to build a scripted demo that dedicates five minutes to each of four stages: data access, management, and preparation; model building, training, and validation; model delivery and management; and business insights.
To respond to this challenge, we built an end-to-end data science project that goes from data to insights to production, showcasing how Everyday AI can be leveraged to improve the environment. Even more challenging, we had to figure out how to demo all of this in under 20 minutes.
However, before the demo, we first had to assemble our own data science team and build both a business case as well as a solution strategy for the challenge.
With some help from our business solutions team, we chose to focus on which factors best predict a decrease in CO2 emissions in different countries. This would allow us to support Goal 13 of the UN’s 2030 Agenda for Sustainable Development: taking urgent action to combat climate change. We wanted to come up with historical insights and prescriptive recommendations for organizations to follow, using extensive analytics and expert model interpretability. This would enable countries to understand their own emissions drivers, determine the actions they can take to make a change, and estimate how big a difference those actions can make.
Let's Take a Look at Our Solution
Data Access, Preparation, Feature Selection
Our data engineers started by collecting and exploring the provided UN SDG (Sustainable Development Goals) and World Bank datasets on topics like sustainable energy practices, resilient infrastructure, and sustainable industrialization. Leveraging multiple connectors, they centralized the different sources into a Snowflake database, making the data available in our project flow for the rest of the team.
From there, our analysts and citizen data scientists used Dataiku’s visual transformations, like the prepare and join recipes, as well as the built-in charting and interactive statistics functionalities to clean, shape, and explore our data. They published the most relevant insights to dashboards for our data scientists to review. As our team had contributors with different backgrounds, we also used code-first features, such as SQL commands for data exploration and Python notebooks for feature engineering.
Model Building, Training, and Validation
As for the ML stage, our data scientists experimented with a few models in the lab, building estimators with our AutoML functionalities. We iterated through almost 20 versions of our classification model that would determine if a specific country will increase or decrease its CO2 emissions. To save some time, we leveraged a lot of the out-of-the-box functionalities like data sampling, feature handling, diagnostics, and model interpretability.
This allowed our citizen data scientists (as well as our business experts) to gauge which features ended up being most influential in our model (renewable energy consumption and access to electricity!) or explore how the results differ for subpopulations of our dataset (e.g., how different countries had a wider possible range for changes in their emissions). We also got to experience firsthand one of the pitfalls we often caution against: data science teams that aren’t connected closely enough to the business or subject matter experts (SMEs). Our analysts pointed to a partial dependence plot that showed air pollution as a main driver of CO2 emissions. However, one of our SMEs explained that air pollution is a consequence of emissions and should therefore not be used as an input to our prediction model. This led us to adjust our model and focus on other important features that were less collinear.
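For readers who want to run a similar sanity check in code, the sketch below flags pairs of highly correlated (collinear) features using a plain Pearson correlation. It is a generic illustration, not the check we ran in Dataiku; the function names and the 0.9 threshold are assumptions. Note that correlation alone cannot tell you that a feature is a consequence of the target, as in the air pollution case; that still takes a subject matter expert.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / sqrt(var_x * var_y)


def flag_collinear(features, threshold=0.9):
    """Return (name_a, name_b, r) for feature pairs with |r| >= threshold.

    `features` maps a feature name to its list of values; the threshold
    is an arbitrary starting point, not a universal rule.
    """
    flagged = []
    for a, b in combinations(sorted(features), 2):
        r = pearson(features[a], features[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged
```

A flagged pair is a prompt for a conversation with the business side rather than an automatic reason to drop a column.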
Model Delivery and Model Management
After the extensive experimentation phase, our data science team chose the best-performing model version and deployed it as an API for real-time emission monitoring. They also published it back to our flow and bundled our project for Dataiku’s automation node.
We then took the project further by having our ML engineers and architects set up scenarios to automate our flow, such as retraining the model when new data arrives or when its accuracy drops as economic or environmental conditions change.
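The logic behind such a retraining trigger can be sketched in a few lines of plain Python. This is not Dataiku’s scenario API; `RetrainPolicy`, `should_retrain`, and the default thresholds are hypothetical names and values chosen only to illustrate the two conditions described above.

```python
from dataclasses import dataclass


@dataclass
class RetrainPolicy:
    """Hypothetical thresholds; tune them to your own data and model."""
    baseline_accuracy: float          # accuracy measured at deployment time
    max_accuracy_drop: float = 0.05   # tolerated drop before retraining
    min_new_rows: int = 1000          # new rows that justify a retrain


def should_retrain(policy, current_accuracy, new_rows):
    """Return (decision, reason) for the two triggers: new data or drift."""
    if new_rows >= policy.min_new_rows:
        return True, "new data"
    if policy.baseline_accuracy - current_accuracy > policy.max_accuracy_drop:
        return True, "accuracy drop"
    return False, "no trigger"


# Example usage with made-up numbers.
policy = RetrainPolicy(baseline_accuracy=0.85)
decision, reason = should_retrain(policy, current_accuracy=0.78, new_rows=120)
print(decision, reason)
```

Returning the reason alongside the decision makes it easier to log why a retrain happened, which helps when reviewing model versions later.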
Business Results and Recommendations
Finally, we gathered all the insights from our model, from dashboards exploring variable importance and model performance to an interactive scoring tool. Called the “what-if analysis,” this tool lets governments interactively change the model’s input values to see which policy changes can reduce emissions, and to what extent. All these visuals were then made available to stakeholders in a dedicated Workspace.
And with that, we had covered Gartner’s requirements and were ready to showcase our solution in the bake-off. Almost. Unlike most customer demos we’re familiar with, this event had a strict time limit, so we also had to script our storyline and choose the details most worth showing. This part fell to our product marketing team, with me presenting at the London conference and Christina Hsiao at the Orlando conference.
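To give a feel for what a what-if comparison does under the hood, here is a toy sketch in Python. The logistic scoring function and its coefficients are made up for illustration and are not taken from our bake-off model; only the two feature names echo the influential features mentioned earlier. In practice you would pass in the scoring function of your own trained model.

```python
from math import exp

# Made-up coefficients for illustration only; NOT the bake-off model.
TOY_COEFFICIENTS = {
    "renewable_energy_share": 0.04,
    "access_to_electricity": 0.01,
}
TOY_INTERCEPT = -2.0


def probability_of_decrease(inputs):
    """Toy logistic score: probability that a country's emissions decrease."""
    z = TOY_INTERCEPT + sum(
        coef * inputs[name] for name, coef in TOY_COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + exp(-z))


def what_if(baseline, changes, score=probability_of_decrease):
    """Score a baseline and a modified scenario, and report the difference."""
    scenario = {**baseline, **changes}  # copy, so the baseline is not mutated
    before, after = score(baseline), score(scenario)
    return {"before": before, "after": after, "delta": after - before}


# Example usage: what happens if the renewable share doubles?
baseline = {"renewable_energy_share": 20, "access_to_electricity": 90}
result = what_if(baseline, {"renewable_energy_share": 40})
print(result)
```

Keeping the scoring function as a parameter means the same comparison works unchanged once a real model replaces the toy one.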
It was a balancing act between getting excited about the technical details of our data prep and model, the business value and insights we discovered about sustainable practices, and the greater architecture view or MLOps features.
In the end, we settled on a final 19 minutes in which we showed the audience at the Gartner event Dataiku’s vision of Everyday AI by simply telling the story of our team. We shared how six awesome women with different backgrounds collaborated on our project, pooling their collective expertise in a single, shared environment. We showed how they got from data to actionable insights using the plethora of Dataiku features, from visual and code-based tools to data and environment infrastructure and enterprise-grade functionalities.
We are really proud of our final project and had a great time participating in this bake-off! If you’re curious about our final solution, check out the video below to see our original demo of the end-to-end Dataiku data science process.
And look out for Gartner’s blog on the event as well to learn more about their take on bake-offs as a vendor selection process and their review of all solutions showcased at the event.