Building a Data Pipeline to Clean Dirty Data

Data Basics, Use Cases & Projects Lynn Heidmann

A data pipeline is a series of steps that your data moves through. The output of one step in the process becomes the input of the next. Data (typically raw data) goes in one side, goes through a series of steps, and then pops out the other end ready for use or already analyzed.

gas pipeline near mountain

The steps of a data pipeline can include cleaning, transforming, merging, modeling, and more, in any combination. It seems simple; however, anyone that's ever worked with data knows that data pipelines can get highly complex. This blog post will simplify the process and show how you can build a data pipeline in Dataiku to help prep a dataset for analysis.

Common Data Cleaning Problems

While not a comprehensive list of problems I've encountered with datasets I've received, here are the most common:

  • Missing data
  • Multiple fields in a single column
  • Non-unique column headers
  • Non-standardized data: column headers, names, dates
  • Extra white space around text

Let's see how we can clean these up.

Pipeline Overview

Here's what we're going to build using Dataiku:

Dataiku DSS flow data pipeline screenshot

For this example, I've created a fake dataset containing 10,000 records made to mimic a partial database dump of patient information for a healthcare provider. You can get the code to generate a similar dataset, or grab the dataset itself, and the Dataiku project I created on GitHub.

Alright. Let's clean up this data!

Pipeline Build-Out

First, create a new dataset and view the data. During this step, we aren't going to do any manipulation of the column names, only import and preview the dataset.

screenshot dirty data in Dataiku DSS sample pipeline project

Next, create a new recipe to split the full name into first and last names and rename the columns. For the zip code field, change its meaning to Text, as we aren't going to do math on this field. The output of this step is a new dataset with the cleaned names.

screenshot data cleaning recipe split name in Dataiku DSS sample pipeline project

After that, clean up the home address fields. Do this by creating a new recipe for the cleaned names dataset and:

  1. Trim the address.
  2. Use regex to extract out the apartment or suite number.
  3. Rename all of the address columns.
  4. Replace the apartment or suite number in the original address field so that only the line one portion remains.
  5. Rename the rest of the fields.

Now turn your attention to the first phone column. This column has phone numbers in many formats. Extracting the parts of each format is beyond the scope of this post. For now, let's simply rename this column to home phone and remove what appears to be an erroneous extension. Do this by creating a preparation recipe that renames the column, and uses a regex to find and replace an extension.

screenshot phone data cleaning in Dataiku DSS sample pipeline project

Great work so far! We're almost finished cleaning this dataset, so let's keep going.

With all of the home information cleaned, we can now finish by cleaning up the work information and the account creation date.

The fields for the work data are, perhaps not surprisingly, similar to the home information fields. This allows us to leverage the preparation scripts we've already created. Let's first clean up the work address.

View your current flow diagram and double-click on the script you created during the home address step. Under the actions menu in the top right, select Copy, and be sure to select the right input dataset and name the output:

Dataiku Pipeline Project copy preparation recipe screenshot

Now we have a complete copy of all the work we did for the home address. All that's left to do is to update all of the column names to apply our transformations to the work columns.

To clean up the work phone, do the same thing; however, this time, use the dataset you just created as the input, and create a new preparation script to operate on the cleaned work address dataset. In that recipe, we'll do the following:

  1. Extract the extension using regex and put it into its own column.
  2. Replace the found extension in the main phone number.

Only one more step to go! Let's clean up the account created on field.

Do that by going back to the flow view, selecting the cleaned work phone dataset and click on the Prepare icon on the right side menu. Name this final, cleaned dataset account_data.

Once you're on the Script page, scroll over so you can see the account created on field. Click on the column header, and from the drop down, click Parse date... in the Script section. This will cause Dataiku to automatically determine the date formats:

Dataiku DSS pipeline project screenshot account created

And we're done!

Summary

Phew! It took me longer to write this post than to perform the work. That's because Dataiku makes it easy to create data pipelines, especially for preparing data.

What we've created here is a very simple example of what can be accomplished. I suggest you take this a step further and cluster the data by attributes such as their city or state. If you had payment information, you could combine the two datasets,  which could later be used to create a predictive model. There's also a lot of cool visualization stuff you could do as well.

You May Also Like

Moving Beyond Guesswork: How to Evaluate LLM Quality

Read More

A Tour of Popular Open Source Frameworks for LLM-Powered Agents

Read More

Custom Labeling and Quality Control With Free-Text Annotation

Read More

Get to Know NYC and Paris From the Point of View of an Algorithm

Read More