This is a guest article from Lewis Gavin. Gavin is a data architect who has been working with data for seven years and has also been blogging about skills within the data community for five years on a personal blog and Medium. During his computer science degree, he worked for the Airbus Helicopter team in Munich enhancing simulator software for military helicopters. He then went on to work for Capgemini where he helped the U.K. government move into the world of big data. He is currently using this experience to help transform the data landscape at easyfundraising.org.uk, an online charity cashback site, where he is helping to shape their data warehousing and reporting capability from the ground up.
Data comes in many different formats and from disparate sources. Often, we need to bring data into a centralized location to fulfill business reporting requirements, analysis, and data science. However, not all data is created equal, and bringing it into a single location like a data lake or data warehouse is only a small part of the challenge. Data often requires some form of transformation before it is useful for downstream consumers, such as analysts and data scientists.
Data pipelines are used to manage data through each step in this lifecycle, from ingestion to utilization. Data pipelines are incredibly important, especially for data science. Models are only as good as the data they consume, and data pipelines ensure data is up to date, cleansed, and ready for use.
This article will break data pipelines into their core components and discuss best practices for building them.
Data Pipelines in Action
There are many different tools and applications available to build and run data pipelines. However, the core architecture remains the same. The goal of a data pipeline is to ingest some data, transform and join it with other datasets as required, and then store the data in a suitable format and location for the intended consumer.
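As a concrete (if simplified) illustration, the skeleton below sketches that flow in Python. The file paths, the order_id column, and the pandas-based steps are illustrative assumptions rather than a recommendation for any particular tool.

```python
import pandas as pd

def ingest(source_path: str) -> pd.DataFrame:
    # Origin: read raw data from a source file (it could equally be a
    # database table or a continuous stream).
    return pd.read_csv(source_path)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Transformation: clean and standardize before anyone consumes the data.
    return raw.drop_duplicates().dropna(subset=["order_id"])

def load(prepared: pd.DataFrame, destination_path: str) -> None:
    # Destination: store the prepared data where downstream consumers expect it.
    prepared.to_parquet(destination_path, index=False)

if __name__ == "__main__":
    load(transform(ingest("orders.csv")), "warehouse/orders.parquet")
```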
Regardless of the tools used, all data pipelines can be broken down into the same core components: origin, destination, transformation, dependencies, and monitoring. Let’s explore each of them in detail.
Data Pipeline Components
Origin
The origin component is where the data enters the pipeline. This is the original source of the data: anything from a text file to a database table. It may also be a continuous stream of data. The purpose of the origin is to obtain data from the source; often, a secondary aim is to pick up any new or updated data so that the data at the destination stays fresh. This is usually achieved by establishing criteria that trigger a pipeline run when met, such as a scheduled execution time or a detected change in the source system. Automatic pipeline updates are crucial to keeping downstream systems in sync.
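One common way to implement that incremental refresh is a high-water-mark query that only pulls rows changed since the last run. The sketch below assumes a hypothetical orders table with an updated_at column; it illustrates the pattern rather than any specific tool.

```python
import sqlite3
from datetime import datetime

def ingest_new_rows(conn: sqlite3.Connection, last_run: datetime) -> list[tuple]:
    # Pull only the rows created or updated since the previous pipeline run.
    cursor = conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_run.isoformat(),),
    )
    return cursor.fetchall()

# A scheduler or change-detection trigger would call ingest_new_rows on each
# run and then persist the new high-water mark for the next execution.
```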
Destination
The destination, as its name implies, is the final endpoint for the data. This is typically a data storage location such as a data lake, data warehouse, or business intelligence (BI) tool. Data from one origin can have multiple destinations, depending on the end users of the data. A data pipeline could ingest some data from an origin, transform and join it with other data sources, and then store the results in the data warehouse for analysis. It might also export the same results to a different database, for example one used to serve a website.
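To make the idea of multiple destinations concrete, here is a minimal sketch that writes one prepared dataset to both an analytics warehouse and an operational database. The connection strings and table names are placeholders, not real endpoints.

```python
import pandas as pd
from sqlalchemy import create_engine

def publish(prepared: pd.DataFrame) -> None:
    # One destination for analysts: a table in the data warehouse.
    warehouse = create_engine("postgresql://user:password@warehouse-host/analytics")
    prepared.to_sql("daily_sales", warehouse, if_exists="replace", index=False)

    # A second destination for the application that serves the website.
    app_db = create_engine("mysql+pymysql://user:password@app-host/webapp")
    prepared.to_sql("sales_summary", app_db, if_exists="replace", index=False)
```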
Transformation
As we have discussed, data often comes from numerous different origins. Each origin will have its own standards and formats for storing data, making it difficult to combine these datasets in a meaningful way. Clean, standardized datasets are especially important for producing accurate models, such as machine learning (ML) models.
This is the goal of the transformation stage of a data pipeline: to standardize, normalize, validate, and clean the incoming data, so it is ready for downstream consumers. This might involve standardizing date and time fields, converting numbers and currencies to the same format, and removing or filling empty values. Performing all of these transformations up front is a boon to analysts and data scientists. The data is already prepared and ready for use, and the transformation all occurs in one place, which avoids repetition across each analysis activity.
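A transformation step of this kind might look something like the following pandas sketch. The column names and the chosen rules (reporting currency, fill values) are assumptions for illustration only.

```python
import pandas as pd

def standardize(raw: pd.DataFrame, fx_rates: dict[str, float]) -> pd.DataFrame:
    df = raw.copy()
    # Standardize date fields to a single format.
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce").dt.date
    # Convert all amounts to one reporting currency using supplied exchange rates.
    df["amount_gbp"] = df["amount"] * df["currency"].map(fx_rates)
    # Fill or remove empty values according to agreed rules.
    df["channel"] = df["channel"].fillna("unknown")
    return df.dropna(subset=["order_date", "amount_gbp"])
```

Because this logic runs once, inside the pipeline, every analyst and data scientist downstream works from the same cleaned version of the data.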
Dependencies
It is not unusual for downstream users to move data to other locations, depending on where it is needed. Sometimes this is built into the data pipeline itself or can be managed separately. For example, the data pipeline might ingest some data into a data warehouse, prepare it, and store the final output in a table in the data warehouse. This might be the data pipeline’s final destination, but an external BI tool could have a workflow to import this data to perform additional data manipulation.
Although on the surface this might appear to be completely separate from the original data pipeline, it is important to consider how you establish and maintain these dependencies, as changes in the data pipeline could impact this downstream consumer of the data.
Monitoring
The final component of a data pipeline is monitoring. Monitoring is a key part of any core process, especially when dependencies are involved. Data pipelines have downstream users who depend on data being refreshed, cleaned, and updated at agreed intervals, so it’s important to use monitoring tools to identify issues. Monitoring the health of the data pipeline is essential, and it’s common practice to alert engineers or support colleagues if the pipeline fails. It may also be useful to monitor whether any new data has come in from the origin and check the output for things like duplicates.
Monitoring ensures that the data pipelines stay healthy and that you can actively rectify any issues before they cause bigger problems for downstream data consumers.
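In practice, these checks can be as simple as a few assertions run after each pipeline execution. The sketch below is a generic illustration, with logging standing in for a real alerting integration.

```python
import logging
import pandas as pd

logger = logging.getLogger("pipeline.monitoring")

def monitor_output(previous: pd.DataFrame, current: pd.DataFrame, key: str) -> None:
    # Check that new data actually arrived from the origin since the last run.
    if len(current) <= len(previous):
        logger.warning("No new rows arrived from the origin since the last run.")

    # Check the output for duplicates on the agreed key column.
    duplicates = int(current[key].duplicated().sum())
    if duplicates:
        # In a real pipeline this would alert an engineer (email, Slack, pager).
        logger.error("Found %d duplicate keys in the pipeline output.", duplicates)
```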
Best Practices for Data Pipelines
Understanding the components is essential but is not enough by itself to build production-ready data pipelines. You should follow these best practices to ensure the successful deployment of your data pipelines.
1. Data pipelines should be predictable. It should be obvious where data comes from and how it flows through each stage of the pipeline. To enable this, tools like Dataiku automatically generate a visual representation of the data pipeline, giving the original builders and any future data practitioners insight into what the pipeline is doing.
2. Your data pipeline should scale to meet any growth in data throughput. Pipelines can be manually reconfigured over time to keep up with increasing data volumes, but it is much more convenient to build the pipeline once in Dataiku and have it scale automatically, paying special attention to where data is stored at each step and to the scalability of the compute engine.
3. Reliability is also crucial. It should be easy to maintain a data pipeline and diagnose any issues. Dataiku provides cues on which data pipeline stages are causing bottlenecks or have failed completely. You can then drill down into the logs of each step to help diagnose and fix the issue.
4. Data pipelines should be testable. Continuous integration and continuous delivery (CI/CD) has been a key component of software engineering for several years, and data science pipelines should be no different.
Dataiku uses the concepts of metrics and checks. Metrics can be used for things like determining model accuracy or calculating the number of missing values in a dataset. Checks can be used on these metrics to automate further actions based on whether the check passes or fails. A failure might trigger retraining the model if its accuracy is lower than expected. There are also tools to generate data compliance reports, notify when schema changes have downstream impacts, and send alerts about pipeline failures.
Each of these features ensures that pipelines are reliable, future changes are validated, and any issues are reported.
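Outside of any specific tool, the metric-and-check idea boils down to computing a value and acting on the result. The generic Python sketch below illustrates the pattern; it is not Dataiku's API, and the threshold and retraining hook are placeholders.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    metric: Callable[[], float]        # computes a metric, e.g. model accuracy
    passes: Callable[[float], bool]    # decides whether the metric is acceptable
    on_failure: Callable[[], None]     # follow-up action, e.g. trigger retraining

def run_checks(checks: list[Check]) -> None:
    for check in checks:
        value = check.metric()
        if check.passes(value):
            print(f"{check.name}: OK ({value:.3f})")
        else:
            print(f"{check.name}: FAILED ({value:.3f})")
            check.on_failure()

# Example: retrain when accuracy drops below an agreed threshold.
accuracy_check = Check(
    name="model_accuracy",
    metric=lambda: 0.87,                       # placeholder for a real evaluation
    passes=lambda value: value >= 0.90,
    on_failure=lambda: print("Triggering model retraining..."),
)
run_checks([accuracy_check])
```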
Don't Overlook Data Pipelines
Data pipelines can be complex, so tools like Dataiku provide visual aids for both building and monitoring data pipelines. They also ensure your pipelines are built for the future with automatic scaling and offer monitoring, diagnostics, and testing capabilities.