This is a guest article from our friends at Starburst. Starburst is a full-featured data lake platform that includes all the capabilities needed to discover, organize, and consume data without the need for time-consuming and costly data migrations.
In today's data-driven business environment, organizations that can effectively leverage and analyze large volumes of data gain a competitive edge. However, ensuring data is of the highest quality before making it available for analytics and data science workloads becomes increasingly difficult as the volume and variety of data grow. Sure, cleansing 100GB of transaction data coming from your OLTP system is a relatively simple task these days, but what about a petabyte of data that now includes streaming data as well as third-party data from vendors you haven’t worked with before?
Adding new sources to your existing data quality workflows and extending those workflows to accommodate changes in the volume and variety of data requires time and resources, delaying actionable insights and making them more costly to achieve. To minimize the delays and additional costs associated with keeping your data clean at scale, you need a solution that simplifies combining data from multiple sources while maintaining high performance as the volume of data in those sources grows.
Dataiku’s intuitive interface makes it easy to define detailed data quality metrics and checks, but the queries they generate need to be sent to the data source for processing. Using a cloud data warehouse as the source for these data quality tasks may help with scalability (albeit at a high cost), but it presents data access challenges. Conversely, data lakes can address some of those data access challenges, but their performance can suffer at scale. Join me as I explore how those limitations can jeopardize data quality at scale, and how using Starburst Galaxy as the single source for your data quality workflows in Dataiku helps organizations overcome them.
The Bigger the Lake, the Harder It Is to Keep Clean
It’s inevitable that an organization’s data lake will expand as the organization starts to see more and more value from it. Whether it’s landing data from new external sources, ingesting more data from existing external sources, or migrating data from internal systems, data lakes tend to get bigger the more they are used. And the faster the data lake grows, the more difficult it becomes for organizations to scale their data quality practices and maintain the health of their data.
Landing new sources of external data in the lake often requires creating new metrics and checks because the new data may have a structure or format you haven’t worked with before. If the new source is delivering streaming data, you may have to switch to running quality checks on samples of the data instead of the entire data set. It’s also likely that you’ll have to redesign your metrics and checks as your data volumes grow in order to maintain acceptable performance. These kinds of adjustments are usually unavoidable, and they take time, money, and resources. Unfortunately, traditional data lake engines tend to deliver suboptimal performance, so those adjustments end up costing you even more.
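To make that trade-off concrete, here is a minimal sketch of a null-rate check run once over a full table and once over a small sample, using the open-source trino Python client. The host, catalog, schema, table, and column names are hypothetical placeholders; the connection details will depend on your own cluster.

```python
import trino

# Connect to a Trino-compatible engine; host, user, catalog, and schema
# below are placeholders for illustration only.
conn = trino.dbapi.connect(
    host="your-cluster.example.com",
    port=443,
    http_scheme="https",
    user="analyst",
    catalog="lake",
    schema="raw",
)
cur = conn.cursor()

# Full-table null-rate metric: exact, but its cost grows with data volume.
cur.execute("""
    SELECT count_if(customer_id IS NULL) * 1.0 / count(*) AS null_rate
    FROM clickstream_events
""")
print("full-table null rate:", cur.fetchone()[0])

# The same metric over a 1% Bernoulli sample: much cheaper on a slow engine,
# but it only estimates the true null rate.
cur.execute("""
    SELECT count_if(customer_id IS NULL) * 1.0 / count(*) AS null_rate
    FROM clickstream_events TABLESAMPLE BERNOULLI (1)
""")
print("sampled null rate:", cur.fetchone()[0])
```

The faster the engine, the less often you are forced into the second, approximate form of the check.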
Data Silos Hinder Data Cleansing
Most organizations have their data spread across a variety of systems. Keeping data clean typically involves comparing it with other data, which may be stored in a separate system. One option is to copy all of the data you need into a single location, but that requires building and maintaining new data pipelines and duplicating your data, which adds unnecessary costs to your data quality initiatives. A better option is to leave the data where it is and use federated queries in your metrics and checks. You’ll avoid the additional cost of duplicating data while also saving the time and resources it takes to create and support new data pipelines.
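As a rough illustration of what that can look like, the sketch below uses the open-source trino Python client to run a single federated query that validates orders stored in a data lake against a customer table living in PostgreSQL. The host, catalog, schema, and table names are placeholders, and the example assumes a lake catalog and a PostgreSQL catalog have already been configured in the engine.

```python
import trino

# Connection details are placeholders; adjust them for your own cluster.
conn = trino.dbapi.connect(
    host="your-cluster.example.com",
    port=443,
    http_scheme="https",
    user="analyst",
)
cur = conn.cursor()

# Referential-integrity check that spans two systems in one query:
# order records live in the data lake, the customer master lives in
# PostgreSQL, and neither data set has to be copied anywhere.
cur.execute("""
    SELECT count(*) AS orphaned_orders
    FROM lake.sales.orders o
    LEFT JOIN postgres.crm.customers c
           ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL
""")
print("orders without a matching customer:", cur.fetchone()[0])
```

A metric like orphaned_orders can then back a check in Dataiku that flags the dataset whenever the count rises above zero.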
Query federation is becoming a more common capability in query engines these days, but few engines deliver everything a mature, scalable data quality strategy needs. Some lack data lake connectivity, so your quality checks are restricted to the data you can fit in the warehouse and can afford to store there. Others don’t provide cross-cloud support, limiting what you can do with your data quality checks or forcing you to duplicate your data. And query engines that do support federation often struggle with performance at scale, especially when a data lake is one of the sources.
Data Lakehouses to the Rescue?
Transitioning your data lake into a data lakehouse can help you scale your data quality efforts. Data lakehouses bring warehouse-like features to your data lake, which can promote consistency and accuracy in your data and reduce the number of checks your data quality workflows have to perform. They can deliver better performance at scale than traditional data lakes, so your data quality jobs may continue to perform reasonably well as your data volumes grow. You can also provide more transparency around data quality by utilizing a three-layer architecture (land, structure, consume), making it easier for users to distinguish raw data from clean data. All that said, data lakehouses, just like data lakes, are only as good as the query engine that powers them.
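Before moving on, here is a rough sketch of how promotion between those layers might look in practice: a data quality workflow copies only validated rows from the land (raw) layer into the structure layer. The schema, table, and column names are hypothetical, and the two rules shown stand in for whatever checks your data actually needs.

```python
import trino

# Connection details are placeholders; adjust them for your own cluster.
conn = trino.dbapi.connect(
    host="your-cluster.example.com",
    port=443,
    http_scheme="https",
    user="analyst",
    catalog="lake",
)
cur = conn.cursor()

# Promote rows from the land (raw) layer into the structure layer, keeping
# only records that pass basic quality rules. Consumers query the structure
# and consume layers, so they never see unvalidated data.
cur.execute("""
    INSERT INTO structure.orders_clean
    SELECT order_id,
           customer_id,
           CAST(order_ts AS timestamp) AS order_ts,
           amount
    FROM land.orders_raw
    WHERE order_id IS NOT NULL
      AND amount >= 0
""")
```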
Cleaning Your Data Lakehouse With Starburst Galaxy and Dataiku
By using a data lakehouse powered by Starburst Galaxy as the source for data quality metrics and checks in Dataiku, customers unlock several benefits, including:
Fast and interactive queries, giving Dataiku users the performance they need for their data quality queries without having to go through a cloud data warehouse. The performance lift Starburst Galaxy provides scales with your lakehouse as it grows, so your data quality queries can continue to leverage all of your data instead of only getting a partial picture of your data health through samples. The result is fast access to trusted data that has gone through comprehensive checks, which leads to faster and more confident decision-making.
Performant data federation, enabling customers to access data from multiple sources through a single connection in Dataiku. This gives users the ability to create data quality metrics and checks that span multiple data sources while minimizing the connections that need to be managed in Dataiku. By breaking down these data silos, you eliminate unnecessary ETL and data duplication, saving you time and money while ensuring you still have fast access to all the reference data you need for cleansing regardless of where it is stored.
Comprehensive security and compliance, addressing these critical architectural considerations with robust features like role-based access control, authentication engine integration, data encryption, and query auditing. Customer data remains protected and compliant with industry standards for data governance no matter where it is stored. For Dataiku users, this means access controls are applied consistently across all of the data their data quality checks touch, while the way those controls are enforced stays simple.
With Starburst Galaxy's fast and interactive queries, performant data federation, and comprehensive security, organizations can confidently scale their data quality workflows in a cost-efficient manner. By implementing a data lakehouse architecture with Starburst Galaxy as the query engine and combining it with the powerful data quality workflow capabilities of Dataiku, organizations can meet their data quality standards as the business grows without delaying actionable insights.
Get started with Starburst Academy, an upcoming workshop, or a free trial of Starburst Galaxy today.