The goal of the data preparation phase of the AI life cycle is to wrangle and enrich data as input for model building. Data prep is key to good machine learning models: models generally become more accurate when they are trained on more, and better-prepared, data. By preparing data, both features and labels, in an efficient way, teams can build models that perform better, increasing the business value of the output.
Essentially, the data prep process is where data is cleaned, reshaped, aggregated, or classified, helping teams move from raw material to insights and unearthing value with each data prep task. In this blog, we’ll share some common data flow patterns in the data life cycle, as well as tidbits on how Dataiku can reduce friction in the data prep process, removing associated pains and accelerating time to value.
Common Data Flow Patterns in the Data Life Cycle
The following data flow patterns are commonly used across the enterprise:
Batch Data Flows:
- ETL, or Extract, Transform, Load (a minimal sketch follows this list):
  - Extract data from the source system
  - Prep the data to fit the associated schema
  - Data engineers load the data into a data warehouse, which is easily accessible by analysts
- ELT, or Extract, Load, Transform:
  - Take data from cloud-native data stores (e.g., Twitter)
  - Stream and load the raw data into a data lake (which can store more data more easily than a warehouse)
  - Transform the data within the data lake
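To make the ETL pattern concrete, here is a minimal sketch in Python using pandas and SQLite; the file name, column names, and warehouse table are illustrative assumptions rather than a prescribed stack.

```python
# Minimal ETL sketch: extract from a source export, transform, load into a warehouse.
# File names, columns, and the SQLite "warehouse" are assumptions for illustration.
import sqlite3

import pandas as pd

# Extract: pull raw data from a source system (here, a CSV export)
raw = pd.read_csv("orders_export.csv")

# Transform: clean and reshape before loading, enforcing the target schema
clean = (
    raw.dropna(subset=["order_id"])
       .assign(order_date=lambda df: pd.to_datetime(df["order_date"]))
       [["order_id", "customer_id", "order_date", "amount"]]
)

# Load: write the prepared data into the warehouse for analysts to query
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)
```

In the ELT variant, the raw extract would be loaded into the data lake first and transformed there, after the load rather than before it.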
Data Stream Processing:
A more modern data architecture treats data as a continuous stream of events, rather than as content processed in discrete batches; a minimal sketch of this record-at-a-time approach follows.
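As a rough illustration of the stream mindset, the sketch below processes events one at a time as they arrive; the simulated event source and the fields it emits are assumptions made for the example.

```python
# Sketch of record-at-a-time (stream) processing over a simulated event source.
import itertools
import json
import time

def event_stream():
    """Simulate an unbounded stream of incoming JSON events."""
    while True:
        yield json.dumps({"user_id": 42, "action": "click", "ts": time.time()})

def process(raw_event):
    """Transform a single event as it arrives: parse, enrich, and return it."""
    event = json.loads(raw_event)
    event["processed_at"] = time.time()
    return event

# Consume a handful of events; in production the loop would run continuously
# and write each record to a sink such as a message queue or a table.
for raw in itertools.islice(event_stream(), 5):
    print(process(raw))
```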
Further, we know that the output of data work varies by project and use case, but it typically falls into one of these three buckets:
- Data: Here, data is transformed and the output is more data (such as when an analyst sends a clean Excel file to a boss to review).
- Reports: Think of the same analyst, who sends their boss a bar chart of last quarter’s performance.
- Models: Data is used to build algorithmic models to help the organization make future predictions and improve business outcomes.
How Dataiku Can Help
Regardless of the data’s underlying storage system (or whether a team’s output is data, reports, or models), data preparation in Dataiku is accomplished by running recipes. A Dataiku recipe (whether visual, code-based, or from a plugin) is a repeatable set of actions to perform on one or more input datasets, resulting in one or more output datasets; a minimal example of a code recipe follows the list below. Dataiku helps address data prep pain points associated with:
- Volume of data, as the platform is built for big data (while spreadsheets are limited in their worksheet size, number of worksheets, column width, and number of characters in a single cell)
- Governance, as it offers one centralized place from which all data work happens. Teams can document data sources that contain sensitive or proprietary information, view what data is used for a given project and who “owns” that data, see what has been done on a specific project, and track the history of the data, such as where it came from and where it is being used
- Machine learning, as it puts data prep in the same place that machine learning happens so projects can be more easily expanded and developed
- Collaboration, as Dataiku is focused on collaboration within and across teams (so that analysts and non-technical users can work with data on their own, but follow the best practices laid out by data scientists and other data experts in the organization)
- Reuse and automation, so teams can perform data prep tasks once, then reuse recipes to automate them, saving time and resources
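For illustration, here is a minimal sketch of what a Python code recipe might look like inside a Dataiku project; the dataset names and the specific cleaning steps are hypothetical, and the script assumes it runs within Dataiku, where the `dataiku` package is available.

```python
# Minimal sketch of a Dataiku Python code recipe (runs inside a Dataiku project);
# the dataset names "raw_customers" and "customers_prepared" are hypothetical.
import dataiku

# Read the recipe's input dataset into a pandas DataFrame
raw = dataiku.Dataset("raw_customers")
df = raw.get_dataframe()

# Prepare the data: drop rows with missing emails, normalize the country column
df = df.dropna(subset=["email"])
df["country"] = df["country"].str.upper()

# Write the prepared data (and its schema) to the recipe's output dataset
prepared = dataiku.Dataset("customers_prepared")
prepared.write_with_schema(df)
```

Saved as a step in the project flow, a recipe like this can be rerun whenever the input data changes, which is the reuse-and-automation benefit noted above.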