In 2024, I’m a data scientist and technical account manager working in the machine learning (ML) and AI software space, but that hasn't always been the case.
I began my career running numerical simulations for aircraft structural analysis. Back then, I relied on a workstation that had to scale with the complexity of our models. Then came a breakthrough: High-performance computing was no longer exclusive to large industries. Access to a Linux supercomputer with eight CPUs and 16 GB of memory transformed our workflows by enabling us to offload computations to a server and keep working without interruption. We could now dedicate our workstations to post-processing visualizations. Yet, this came with challenges: Transferring large datasets, queuing jobs with Unix commands, monitoring progress, troubleshooting errors, and retrieving massive results all involved a steep learning curve.
From Laptops to the Cloud: The Evolution of Data Science Scalability
Fast-forward 15 years, and my career has evolved into modern data science, where I saw a revolution that mirrored what numerical simulation had undergone years prior. Data scientists worked on powerful laptops and requested the latest GPUs to build bigger models. They became proficient at dual-booting Linux alongside their usual operating systems and spent countless hours navigating the infamous "dependency hell" of configuring NVIDIA environments. At the same time, cloud computing was reshaping the industry: We gained the ability to spin up high-performance instances on demand and run extensive data preprocessing tasks or lengthy model training in the cloud. This shift gave us unmatched flexibility and scalable power on demand.
Today, data teams have a wide range of options to scale their most demanding workloads. They can scale vertically by launching GPU instances to fine-tune the latest object detection models, or run parallel tasks to train thousands of time series models in minutes. Cloud providers simplify this process by managing data transfer between storage and compute. Cloud-native data warehouse technologies like Snowflake excel at SQL data transformation. For those with the right skills, there’s also the option to set up Kubernetes clusters and containerize code with Docker, or to build and manage Spark clusters for distributed computing.
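To make the "thousands of time series models" pattern concrete, here is a minimal, hypothetical sketch of task-parallel training on a single machine. The fake data, the trivial per-series model, and the choice of joblib are illustrative assumptions, not a prescription for any particular platform; the same fan-out pattern is what Spark tasks or Kubernetes jobs generalize to a cluster.

```python
# A minimal sketch of task parallelism for per-series model training.
# Data layout, model, and library choice are illustrative assumptions.
import numpy as np
import pandas as pd
from joblib import Parallel, delayed

def train_one_series(series_id: str, y: np.ndarray) -> dict:
    """Fit a deliberately simple per-series model (here, a linear trend)."""
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, deg=1)
    return {"series_id": series_id, "slope": slope, "intercept": intercept}

# Fake data: one column per series (a stand-in for real sales or sensor history).
rng = np.random.default_rng(0)
history = pd.DataFrame(
    {f"store_{i}": rng.normal(100 + i, 5, size=168).cumsum() for i in range(200)}
)

# Fan the independent fits out across local cores.
results = Parallel(n_jobs=-1)(
    delayed(train_one_series)(name, history[name].to_numpy())
    for name in history.columns
)
print(pd.DataFrame(results).head())
```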
Bringing Computation Closer to the Data Storage Layer
A key benefit of computation push-down is that it brings data transformation and model training closer to the data storage layer.
This minimizes data movement across networks, reducing bandwidth costs, and it lets teams use the best features of environments like Snowflake, which excels at SQL operations. A global pharmaceutical firm switched tens of thousands of Spark applications to Snowpark on Snowflake, leading to significant savings in both runtime and computation cost. For instance, one data processing job that used to take 10 minutes ran in 15 seconds after the switch. By offloading heavy processing to the warehouse, the platform’s core resources stay free to support day-to-day user collaboration.
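As an illustration of what this kind of push-down can look like in code (not taken from the firm's actual pipelines), here is a minimal Snowpark for Python sketch. The table names, column names, and connection parameters are placeholders; the point is that the transformation is compiled to SQL and executed inside Snowflake, so the raw rows never leave the warehouse.

```python
# Illustrative sketch only: table names and connection details are placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

# Credentials would normally come from a secrets manager, not inline literals.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# This builds a lazy query plan; no rows are pulled to the client.
orders = session.table("ORDERS")
daily_revenue = (
    orders.filter(col("STATUS") == "SHIPPED")
          .group_by(col("ORDER_DATE"))
          .agg(sum_(col("AMOUNT")).alias("REVENUE"))
)

# The aggregation runs inside the warehouse; only the small result is written back.
daily_revenue.write.save_as_table("DAILY_REVENUE", mode="overwrite")
```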
When I joined Dataiku, I learned that the lives of data engineers and data scientists could be simplified and improved even more through a flexible platform that abstracts away the complexities of connecting to all kinds of infrastructure. In this article, let’s see how! We’ll start by showing how Dataiku lets your data teams delegate heavy workloads to the right computation engine. Then, we will explore how these choices can impact your whole ML pipeline. In the third section, we’ll show how the platform simplifies managing a variety of computing systems. Decoupling computation from pipeline logic also future-proofs AI systems by avoiding dependence on short-lived technologies; our final section demonstrates how Dataiku helps you enforce this separation.
Dataiku Allows You to Delegate Computation to the Right Engine
Dataiku is both the lubricant and the glue ... Making things faster while holding them together.
-SVP, Global Pharmaceutical Company
Computation comes with a cost, and Dataiku offers various strategies to help manage it cost-effectively. To this end, Dataiku can use its own application server, or it can delegate the computation to external engines. This process is known as computation push-down.
In Dataiku, you work with a sample of your data for most of your dataset transformations. Once you finish designing these transformations, you apply the steps to the entire dataset, and at that point you can choose the computation engine best matched to where the data is stored and the operations you need to perform. In some cases, admins can set default computation engines for you, letting you focus on designing and running transformations rather than on resource selection.
Computation push-down in Dataiku can take four main forms:
1. You can run computations in-memory or stream them on your Dataiku engine. You can use this strategy to execute Python recipes (a minimal sketch follows this list).
2. You can use the in-database strategy to translate a visual recipe into a SQL query, which a SQL server or a cloud-native data warehouse such as Snowflake will then run.
3. You can also use a Spark cluster or Databricks to push down computation with Spark SQL queries.
4. Docker and Kubernetes clusters offer an alternative to Dataiku's host server, enabling containerized in-memory execution.
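As a rough illustration of the first form, here is what a simple in-memory Python recipe can look like. The dataset names ("orders" and "orders_prepared") and columns are hypothetical; the point is that the pandas logic stays the same whether it runs on the Dataiku server or inside a container.

```python
# Minimal sketch of form 1: a Python recipe executed in memory.
# Dataset and column names are hypothetical; in a real recipe they come from the Flow.
import dataiku
import pandas as pd

# Read the input dataset into a pandas dataframe (in-memory execution).
orders = dataiku.Dataset("orders").get_dataframe()

# Plain pandas transformation; where this code actually runs is an engine choice,
# not a code change.
orders["order_month"] = (
    pd.to_datetime(orders["order_date"]).dt.to_period("M").astype(str)
)
monthly = orders.groupby("order_month", as_index=False)["amount"].sum()

# Write the result back; Dataiku infers the output schema from the dataframe.
dataiku.Dataset("orders_prepared").write_with_schema(monthly)
```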
Your Dataiku platform administrators can configure and manage these computation engines through advanced configuration and permission settings. This allows organizations to secure infrastructure access, control their costs, and allocate resources efficiently.
So, Dataiku allows you to push computation down to your infrastructure of choice. But how broadly can you apply this across your data preparation and ML pipeline?
Your Whole ML Pipeline Can Benefit From Computation Push-Down
Data engineers have long used distributed computing, tackling big data challenges since the early days of Hadoop and Spark clusters. Today, there are even more tools tailored for distributed processing: Spark and PySpark remain foundational for many workloads, and Kubernetes has become a key player in orchestrating containerized applications. This flexibility lets data engineers choose the best tools for specific data processing needs, from large-scale ETL pipelines to real-time streaming applications.
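For instance, a large-scale ETL step in PySpark might look like the sketch below. The storage paths, column names, and schema are placeholders rather than a reference implementation; the same code scales out simply by running on a larger cluster.

```python
# Hypothetical PySpark ETL step: paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("events-etl").getOrCreate()

# Read raw events from object storage and keep only well-formed rows.
events = (
    spark.read.json("s3://example-bucket/raw/events/")
         .filter(F.col("event_type").isNotNull())
)

# Aggregate per user per day; Spark distributes the work across the cluster.
daily_counts = (
    events.withColumn("event_day", F.to_date("event_ts"))
          .groupBy("user_id", "event_day")
          .agg(F.count("*").alias("event_count"))
)

# Write the curated table back, partitioned for downstream consumers.
daily_counts.write.mode("overwrite").partitionBy("event_day").parquet(
    "s3://example-bucket/curated/daily_event_counts/"
)
```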
The landscape has been equally exciting for data scientists. They can now take advantage of cost-effective GPU resources and leverage various parallelization techniques to speed up cloud-based model training. These capabilities let them train larger models, design complex distributed feature engineering workflows, and, all in all, process and analyze data at unprecedented scales. The shift toward cloud-based parallel computing has enabled faster experimentation and more sophisticated model development, so data scientists can focus on maximizing model performance while being far less constrained by infrastructure.
Speed is also crucial during inference, where low-latency predictions can drive competitive advantage in real-time applications. Distributed systems enable the fast, responsive models that power recommendation engines, fraud detection systems, and dynamic pricing applications. In the era of large language models (LLMs), Dataiku supports running custom Hugging Face models on GPU Kubernetes clusters, enabling companies to deploy powerful natural language applications within their own infrastructure. This modern ecosystem provides data professionals with a set of scalable, high-performance tools for model development and deployment, making advanced analytics accessible and operationally feasible at any scale.
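To give a feel for the building blocks involved (setting any Dataiku-specific integration aside), here is a generic sketch of loading a Hugging Face model onto a GPU with the transformers library. The sentiment model named below is just a stand-in for whatever custom model you would actually serve.

```python
# Generic sketch: the model name is an example, not Dataiku-specific integration code.
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1  # first GPU if one is attached, else CPU

# Sentiment analysis stands in for any custom model pulled from the Hub.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=device,
)

print(classifier([
    "The rollout was smooth and latency stayed low.",
    "Inference costs doubled after the last release.",
]))
```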
Computation push-down can become as pervasive as you wish in your data pipeline. However, this usually comes at a high cost: a steep learning curve and many new skills to build across your teams. Let’s see how Dataiku addresses this challenge.
Abstracting Away the Complexities for an Enhanced User Experience
Dataiku enables teams to use their preferred tools and technologies without needing to understand the intricacies of the underlying computation engines. Whether you are a full-coder, a low-coder, or a Dataiku visual workflow user, you do not need to know the details of computation push-down; they are fully abstracted away from you:
This streamlined experience means both business teams and developers can tap into advanced, compute-intensive AI capabilities to build and deploy analytics solutions, whether through Dataiku’s intuitive visual interface or its full-code options. Data scientists can focus on delivering business value rather than spending time learning to administer Kubernetes clusters.
Standard Chartered Bank values Dataiku’s ability to leverage existing compute capacity across its ecosystem. As they put it:
Dataiku can do single operations on the amount of compute capacity you have somewhere in your ecosystem — doesn’t matter where it is. It’s amazing technology. Sometimes the data is small, sometimes it’s very big. Sometimes it’s a graphic problem. At some point you’ll want to go from just processing lots of data to adding value. You just need that flexibility, and Dataiku provides it.
-Craig Turrell, Head of Plan to Perform (P2P) Data Strategy & Delivery at Standard Chartered Bank
The bank encountered challenges with a specific KPI calculation that needed to scale from 10 million to over 400 million rows of data. The team assumed their limitations were due to a lack of compute resources, yet they found that the real bottleneck was how their data processing scaled: They were underutilizing the large warehousing capacity already available to them. With Dataiku, they were able to tap into this existing ecosystem with ease, using the platform's flexibility to optimize data processing and gain valuable insights from their data at scale.
You can push computation to any distributed infrastructure in your ML pipeline. End users can actually choose the best underlying infrastructure and even experiment away. But isn’t there a risk that these choices become a burden at the organizational level?
Engine Selection That Evolves With You
Dataiku makes optimizing compute resources straightforward. Within complex workflows, you can see which computation engines are in use at each step, enabling you to make informed decisions about where to switch engines and use resources efficiently. End users can then optimize performance and costs in their data projects:
In the flow above, you might notice that most recipes are currently running on the Dataiku server. Dataiku denotes this with teal building blocks. As your datasets grow, it may be more efficient to transfer computation to a Spark cluster. For SQL-based datasets, you could choose to push computation to your Snowflake environment. By choosing this over a Kubernetes cluster running on AWS, you can optimize the use of your existing infrastructure.
Dataiku makes these choices reversible by decoupling the computation layer from the pipeline logic. If your needs change, you can adjust your setup without heavy administrative work or disruption to projects. Often, right-clicking on a recipe within the flow allows you to select a new recipe engine, so with little effort you can reconfigure where computation runs based on project demands:
For this Dataiku recipe, the user can choose to run locally on the Dataiku instance, in the database the SQL dataset comes from, or on a Spark cluster.
Spend More Time Building Models and Less Time Setting up Resources
This article showed you how to use Dataiku to decouple your computation layer from your pipeline logic. You also learned about the benefits of pushing down computation to external engines. Finally, you gained insight into how Dataiku abstracts away the choice of external engines from users.
Dataiku is a flexible, end-to-end platform that empowers organizations to process data wherever it's stored. In early AI initiatives, data availability and quality often take center stage; as these use cases evolve and move into production, they demand solutions that can scale and operate efficiently. Dataiku’s open, flexible architecture lets teams use the best engines for each task, so neither business users nor developers run into limits that could slow innovation. Dataiku helps organizations scale AI while letting business teams focus on insights and value, not on infrastructure complexity.