There are many ways to achieve scale in AI and machine learning (ML) — scale up, scale out, elastic scale. But taking a more granular approach to scaling your AI/ML projects can pay dividends. The best way to understand scale for an AI and ML platform is to look at each step in the lifecycle of a project. Each stage, from design to production, involves different types of workloads, and there are different ways of scaling to meet user needs.
While highly scalable infrastructure is critical, the secret to scaling AI lies in the design of your AI platform and how it scales at each of the major steps in the AI/ML lifecycle, notably Prepare, Build, Deploy, and Monitor (discussed in more detail below). For IT/cloud architects and analytics leaders, scaling at the right time, in the right way, is critical to enabling your data science teams.
In Dataiku, each of these stages has dedicated environments that help it scale at each step in the lifecycle. Scaling in this way equips teams — business analysts, data engineers, data scientists, ML engineers, and IT operations — with a highly performant, elastic AI stack for their individual stage in the process, and it also optimizes the entire AI/ML lifecycle.
Prepare: Scaling Data Preparation
The data preparation phase in AI/ML includes data acquisition, discovery, visualization, data shaping, transformations, feature selection, pre-processing, and exploratory analysis. It is the stage at which data is made ready for modeling.
Scalability in this context begins with the ability to work with larger datasets and more types of datasets. With access to massive amounts of data in cloud data warehouses and data lakes, maintaining performance becomes even more challenging. Dataiku has pre-built connections to all the major cloud storage options at petabyte and exabyte scale, including databases, cloud data warehouses, and blob storage, and it takes an innovative approach to preparing these massive data volumes for modeling, whether structured, semi-structured, or unstructured.
In the data preparation step, performance is critical when applying multiple transformations to large volumes of data. Dataiku uses pushdown execution for data preparation tasks, and users can choose the most effective engine based on the data sources and recipes (local, in-database, or Spark). Pushdown workloads can be executed efficiently in elastic cloud compute clusters or in-database. More importantly, Dataiku keeps your data accessible no matter the size of your dataset, so you view and work with it through the same interface.
Pushdown helps optimize the runtime for Dataiku projects. When running on Spark or a SQL database, a long chain of recipes (data preparation tasks) can be executed without building intermediate outputs. This helps teams avoid cycles of reading and writing data at each intermediate step and reduces or eliminates data movement over the network. Business analysts, data engineers, and others involved in the data preparation phase benefit both from the speed of processing and from a host of data preparation tools and techniques in the platform.
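To see why avoiding intermediate outputs matters, here is a minimal, generic sketch of the pushdown idea, with SQLite standing in for a real warehouse: three logical preparation steps are composed into a single query and executed entirely in-database, so no intermediate dataset is ever materialized or pulled over the network. The table and transformations are invented for illustration and do not reflect Dataiku's recipe API.

```python
import sqlite3

# Illustrative only: a chain of "recipes" expressed as one nested SQL query
# and executed entirely in-database, so no intermediate results are written
# out or pulled back into the client between steps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 120.0), (2, "EU", 80.0), (3, "US", 200.0), (4, "US", 50.0)],
)

# Three logical steps (filter -> derive -> aggregate) composed into one query:
query = """
WITH filtered AS (
    SELECT * FROM orders WHERE amount > 60          -- step 1: filter
),
derived AS (
    SELECT region, amount + 10 AS amount_with_fee   -- step 2: derive a column
    FROM filtered
)
SELECT region, SUM(amount_with_fee) AS total        -- step 3: aggregate
FROM derived GROUP BY region ORDER BY region
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('EU', 220.0), ('US', 210.0)]
```

The same shape applies to Spark, where a chain of lazy transformations is optimized and executed as one job rather than step by step.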
Build: Scaling Experimentation and Model Training
Scale takes on another dimension when we move into the experimentation and model-building phase of an AI/ML project, which is inherently iterative. It is essential to reduce the time and expense of model experimentation and training by leveraging scalable infrastructure for these tasks, with both flexible AutoML and the training of custom models. Scaling training depends mostly on computational power and model parallelization/distributed compute. For data scientists, the ability to experiment and iterate is crucial, but it has to be managed with cost in mind, so speed and scalability are also critical.
Dataiku contains a powerful AutoML engine that allows you to get highly optimized models with minimal intervention. Data scientists will appreciate the added speed for model development with a distributed hyperparameter search on Kubernetes for performing grid search. When grid search is not the best strategy, you can use advanced search strategies like random search and Bayesian optimization. In addition to AutoML, Dataiku also enables users to write custom models using Python or Scala.
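To make grid search vs. random search concrete, here is a small, framework-agnostic sketch of a parallel hyperparameter search. Threads stand in for distributed Kubernetes workers, and the objective function is a toy stand-in for training and validating a model; none of the names here are Dataiku's API.

```python
import itertools
import random
from concurrent.futures import ThreadPoolExecutor

# Toy objective standing in for "train a model, return validation score";
# in a real setup each call would run on its own worker (e.g., a pod).
def evaluate(params):
    lr, depth = params["lr"], params["depth"]
    return -(lr - 0.1) ** 2 - (depth - 5) ** 2  # higher is better

grid = {"lr": [0.01, 0.05, 0.1, 0.5], "depth": [3, 5, 7]}
candidates = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

# Grid search: evaluate every combination, in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, candidates))
best = candidates[scores.index(max(scores))]
print(best)  # {'lr': 0.1, 'depth': 5}

# Random search: when the full grid is too large, sample a subset instead.
random.seed(0)
sampled = random.sample(candidates, k=5)
```

Bayesian optimization goes one step further, using past scores to choose the next candidate rather than sampling blindly.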
For training deep learning models, Dataiku provides a dedicated coding environment for defining a network architecture with Keras code. The training and scoring of Keras models can be run on CPU or GPU(s), and Dataiku natively supports GPUs for training: if the Dataiku instance has access to GPU(s), locally or on your infrastructure (via containers), you can choose to use them to train the model. To train or score on GPU, at least one CUDA-compatible NVIDIA GPU is required. We can’t address GPUs in full here, but read this Dataiku blog for a discussion of GPUs and when to use them.
Deploy: Scaling for Production and Monitoring
Production model workloads tend to be transient and highly burstable. Real-time API calls will peak because of increased demand from a marketing campaign or based on the time of day, while batch scoring happens on a trigger or schedule and must complete within a specific timeframe. Each of these has different requirements for infrastructure (e.g., GPU vs. CPU), highly customized execution environments, and large-scale (big data) compute, so ML engineers and IT operations require a scalable deployment architecture.
When dealing with operationalization, a multi-node architecture is a must-have. In addition to the Design node, where data prep and model training workloads are managed, Dataiku has an Automation node and an API node for operationalizing models.
Batch scoring is managed by the Automation node, and real-time scoring is supported by the Dataiku architecture for processing predictions in real time via API endpoints. Adding additional Automation nodes is made simple with the cloud stack accelerator, a clickable interface for deploying Dataiku instances in the cloud. API nodes are horizontally scalable, highly available web servers, often deployed on auto-scaling Kubernetes clusters. Customers can scale out by deploying as many API nodes as required to meet their scaling requirements and SLAs for both batch and real-time predictions.
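Because scoring nodes of this kind are stateless and replicated, any replica can serve any request, which is what makes horizontal scale work. The sketch below captures that idea with a simple round-robin dispatcher; the node URLs and port are hypothetical, not Dataiku's actual topology.

```python
import itertools

# Hypothetical endpoint URLs for replicated, stateless scoring nodes.
# A load balancer can dispatch round-robin because every replica is
# interchangeable: no request depends on state held by a particular node.
replicas = [
    "http://api-node-1:12000/predict",
    "http://api-node-2:12000/predict",
    "http://api-node-3:12000/predict",
]
dispatcher = itertools.cycle(replicas)

# Six incoming requests spread evenly across the three replicas:
assignments = [next(dispatcher) for _ in range(6)]
print(assignments)
```

Scaling out is then just adding entries to the pool; no request routing logic has to change.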
Not only does Dataiku’s architecture enable scale with self-service deployment, but it also speeds model inference, monitoring, replacement, and redeployment within its node architecture. Dataiku leverages a fully managed Kubernetes solution and can offload production workloads to compute clusters for elastically scaling to meet the needs of both batch and real-time scoring. Enterprises can scale workloads across multiple clusters, automatically scaling up to meet the requirements of new load conditions and scaling down to optimize costs.
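The scale-up/scale-down decision can be sketched with the same rule the Kubernetes Horizontal Pod Autoscaler documents: desired replicas = ceil(current replicas × current metric / target metric). The utilization figures below are made up for illustration.

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization):
    # Same shape as the Kubernetes HPA scaling rule:
    #   desired = ceil(current * currentMetric / targetMetric)
    return math.ceil(current_replicas * current_utilization / target_utilization)

# Burst of scoring traffic: utilization doubles, so replicas double too.
print(desired_replicas(4, 0.90, 0.45))  # 8
# Quiet period: utilization falls, cluster scales back down to cut cost.
print(desired_replicas(4, 0.20, 0.45))  # 2
```

The same feedback loop works for batch clusters, just driven by queue depth or job deadlines instead of request-level utilization.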
The speed of scoring models can be increased significantly by pushing work down into Snowflake using Java UDF batch scoring, for example. In one recent example, leveraging Snowflake’s UDF capabilities delivered an 8x increase in scoring speed over environments like Spark.
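The idea behind UDF pushdown scoring can be sketched generically: register a scoring function inside the database so predictions are computed next to the data, and only the results travel back. Here SQLite and a Python function stand in for Snowflake and a Java UDF; the table and the trivial "model" are invented for illustration and are not Dataiku's or Snowflake's API.

```python
import sqlite3

# Trivial stand-in for a trained model's scoring function.
def score(amount):
    return 1.0 if amount > 100 else 0.0

conn = sqlite3.connect(":memory:")
# Register the function as an in-database UDF, playing the role a
# Java UDF plays in Snowflake: scoring runs where the data lives.
conn.create_function("score", 1, score)

conn.execute("CREATE TABLE tx (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO tx VALUES (?, ?)",
                 [(1, 50.0), (2, 150.0), (3, 300.0)])

# Batch scoring as a single SQL statement; only predictions come back.
preds = conn.execute("SELECT id, score(amount) FROM tx ORDER BY id").fetchall()
print(preds)  # [(1, 0.0), (2, 1.0), (3, 1.0)]
```

Eliminating the round trip of raw rows out of the warehouse is where most of the speedup comes from.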
Finally, Dataiku supports containerization and is also compatible with all of the major cloud container services: Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), and Azure Kubernetes Service (AKS), as well as with on-premises Kubernetes/Docker clusters.
It's Not Just "Speeds and Feeds"
Getting value from AI requires an AI/ML platform that offers the right scaling options and the right design elements for each stage of the lifecycle. Dataiku enables various approaches to technical scale at each stage, but hides much of the complexity of technical architecture to enable Everyday AI and to systematize data science for broader adoption among non-coders working in visual ML or coders working in notebooks.
Business analysts, data engineers, data scientists, ML engineers, and IT operations can all achieve the scalability they require to manage their part in the AI/ML lifecycle. And IT and cloud architects can enable their data science teams with a flexible, scalable elastic AI stack spanning these teams so they scale independently, but share a common, collaborative platform.
In our next blog, we’ll look at the impact of architecture, tools, and features on another dimension, scaling AI organizationally and enabling teams to collaborate without having to manage infrastructure.