Addressing Data Protection and Deletion With Dataiku

Use Cases & Projects, Dataiku Product Larissa Lieberson

It’s no secret that data malpractice and the release of confidential information have been making headlines in recent years. It seems that every few months there’s a more innovative way to hack into data, from falsifying customer metrics to interfering with election results. It’s time for this to end. 

Around the world, penalties for data misuse have been getting stricter, as seen in the adoption of laws such as the GDPR. And since we know you don’t want to be the next person involved in a catfish situation, Dataiku has followed the path carved out by these laws to create many features that can help protect your organization. Recently, we even developed a plugin that documents data sources with sensitive information and restricts access to them. Apart from these explicit features, there are also ways to comply with the laws through simple flows in Dataiku.

When potential or existing customers ask to have their data wiped from your company’s systems, it’s hard to find a quick and easy way to handle the requests. But don’t fear: we found a simple solution that could save your business time and resources. This article focuses on a data protection flow, built in Dataiku, that lets a user delete a customer’s data from all databases by running a single script.

Building a Data Protection and Deletion Solution: The Need

The process of deleting customer data can seem like a game of telephone followed by a relay race, as teams maintain different data sources and information must be passed between them.

In the past, when we received a data deletion request, it took an analyst a long time to complete the task. They had to manually search each database to determine whether it contained the user’s data, and then delete the information from each source separately. Now, a simple flow can accomplish this.

Building a Data Protection and Deletion Solution: The Final Flow

Once the flow is built, the only step required to delete a user’s data from all systems is to update a variable to a user’s email and run a single Python script. Who knew it could be so simple?

In Figure 1, we can see that only 10 recipes are used in this project.

final deletion flow in Dataiku

With our new solution using Dataiku, we were able to reduce the time an employee needs to delete a customer’s data from around 20 minutes to only two.

In the next section, I will provide step-by-step instructions on how to recreate a data deletion flow in Dataiku.

Building a Data Protection and Deletion Solution: The Process

1. Create a project called ‘Data Protection.’

2. Share raw datasets containing user information from each database/tool your company uses to the project.

raw data gets added to Dataiku

3. Create a stack recipe that outputs a single dataset containing one line per user per source. Make sure to include the user’s email address and their unique ID in that data source, since the API deletion requests will need both later.

stack recipe output

If you want to view which data sources contain a user’s information, simply filter the output of the stack recipe to the email of interest.

filter out stack recipe
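The stack-and-filter logic can also be sketched in plain Python to make the expected output shape concrete. This is only an illustration, not what the visual recipe runs internally: the source names (`crm`, `newsletter`), the column names, and the sample rows are all hypothetical, and matching emails case-insensitively is our own assumption.

```python
# Hypothetical raw datasets, one per source (names and columns are assumptions).
crm_rows = [
    {"email": "ana@example.com", "crm_id": "C-101"},
    {"email": "bo@example.com", "crm_id": "C-102"},
]
newsletter_rows = [
    {"email": "ana@example.com", "subscriber_id": "N-7"},
]


def stack_sources(sources):
    """Return one row per user per source, all sharing the same three columns."""
    stacked = []
    for source_name, (rows, id_column) in sources.items():
        for row in rows:
            stacked.append({
                # Normalize emails so the later filter is case-insensitive.
                "email": row["email"].strip().lower(),
                "source": source_name,
                "source_user_id": row[id_column],
            })
    return stacked


def filter_by_email(stacked, email):
    """Keep only the stacked rows that belong to the given email address."""
    target = email.strip().lower()
    return [row for row in stacked if row["email"] == target]


stacked = stack_sources({
    "crm": (crm_rows, "crm_id"),
    "newsletter": (newsletter_rows, "subscriber_id"),
})
matches = filter_by_email(stacked, "Ana@Example.com")
```

Here `matches` holds one row for each source that knows about the user, which is exactly the list of places a deletion request has to reach.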

4. Create a Python recipe that defines one function per source. Each function takes a user ID as input and deletes that customer’s data via an API request.

5. In the same Python recipe, create a variable that is set to the user’s email address and filter the stacked dataset down to include only rows that contain that user.

create a variable in Python

6. Write a for-loop that iterates through each row of the filtered dataset and uses if-then statements to call the matching API deletion function defined earlier.

write a for-loop statement
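Steps 4 through 6 can be sketched together as follows. This is a minimal sketch under several assumptions: the two sources, their endpoint URLs, and the column names are hypothetical, and each real tool will have its own deletion API and authentication. The HTTP call is passed in as a parameter so the loop can be dry-run without deleting anything.

```python
import urllib.request

# Hypothetical endpoints; replace with each tool's real deletion API.
CRM_URL = "https://crm.example.com/api/users/{user_id}"
NEWSLETTER_URL = "https://newsletter.example.com/api/subscribers/{user_id}"


def send_delete(url):
    """Issue an HTTP DELETE and return the status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    request = urllib.request.Request(url, method="DELETE")
    with urllib.request.urlopen(request) as response:
        return response.status


def delete_from_crm(user_id, send=send_delete):
    return send(CRM_URL.format(user_id=user_id))


def delete_from_newsletter(user_id, send=send_delete):
    return send(NEWSLETTER_URL.format(user_id=user_id))


def delete_user(stacked_rows, user_email, send=send_delete):
    """Delete one user's data from every source; return (source, status) pairs."""
    results = []
    for row in stacked_rows:
        if row["email"] != user_email:
            continue  # keep only the rows for the requested user
        if row["source"] == "crm":
            status = delete_from_crm(row["source_user_id"], send)
        elif row["source"] == "newsletter":
            status = delete_from_newsletter(row["source_user_id"], send)
        else:
            raise ValueError(f"No deletion function for source {row['source']!r}")
        results.append((row["source"], status))
    return results


# Dry run with a fake sender, so no request leaves the machine.
calls = []


def fake_send(url):
    calls.append(url)
    return 200


user_email = "ana@example.com"  # the variable to update for each request
rows = [
    {"email": "ana@example.com", "source": "crm", "source_user_id": "C-101"},
    {"email": "bo@example.com", "source": "crm", "source_user_id": "C-102"},
    {"email": "ana@example.com", "source": "newsletter", "source_user_id": "N-7"},
]
results = delete_user(rows, user_email, send=fake_send)
```

Raising an error for an unknown source is a deliberate choice in this sketch: silently skipping a source would leave customer data behind without anyone noticing.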

7. Write logs to keep track of which customers you have deleted. Make sure to add a ‘Date Deleted’ column, an ‘Email sent’ column that is set to ‘no,’ and a ‘Status’ column that records the status code returned by each API request. Check that every status is ‘200,’ which means the request succeeded.

write logs
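A log row for each deletion could be built along these lines. The ‘Date Deleted,’ ‘Email sent,’ and ‘Status’ columns come from the step above; the ‘Email’ and ‘Source’ columns, the function names, and the sample results are assumptions made for this sketch.

```python
from datetime import date


def build_log_rows(user_email, results, deleted_on=None):
    """Turn (source, status_code) pairs into one log row per source."""
    deleted_on = deleted_on or date.today()
    return [
        {
            "Email": user_email,
            "Source": source,
            "Date Deleted": deleted_on.isoformat(),
            "Email sent": "no",  # flipped to 'yes' once the user is notified
            "Status": status,
        }
        for source, status in results
    ]


def failed_rows(log_rows):
    """Return the log rows whose API request did not come back with 200."""
    return [row for row in log_rows if row["Status"] != 200]


# Hypothetical results from two deletion requests, one of which failed.
log_rows = build_log_rows(
    "ana@example.com",
    [("crm", 200), ("newsletter", 500)],
    deleted_on=date(2024, 1, 15),
)
failures = failed_rows(log_rows)
```

Checking `failures` right after the run makes it obvious when a source needs a second attempt, rather than relying on someone reading the ‘Status’ column by eye.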

8. Once per month, use a prepare recipe to filter the logs to include only users that have been deleted in the past month.

prepare recipe in Dataiku
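The prepare recipe is configured visually, but the date filter it applies can be sketched in Python. We assume here that ‘past month’ means the last 30 days and that ‘Date Deleted’ is stored as an ISO date string; both are choices made for this example, not something the recipe requires.

```python
from datetime import date, timedelta


def deleted_in_past_month(log_rows, today=None, window_days=30):
    """Keep log rows whose 'Date Deleted' falls within the last window_days days."""
    today = today or date.today()
    cutoff = today - timedelta(days=window_days)
    return [
        row for row in log_rows
        if cutoff <= date.fromisoformat(row["Date Deleted"]) <= today
    ]


# Hypothetical log rows: one recent deletion and one older one.
log_rows = [
    {"Email": "ana@example.com", "Date Deleted": "2024-01-20"},
    {"Email": "bo@example.com", "Date Deleted": "2023-11-01"},
]
recent = deleted_in_past_month(log_rows, today=date(2024, 2, 1))
```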

9. Using a Python recipe, send an email to each user notifying them that their data has been deleted from the systems. Then, update the ‘Email sent’ column to ‘yes.’

Python recipe to set up emails to users

The emails will look like this:

data deletion email example
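The notification step might look like the sketch below, which uses Python’s standard `email` package. The sender address, subject, and message wording are hypothetical, as is the ‘Email’ column name. The `send` argument stands in for whatever delivers the message, for example the `send_message` method of an `smtplib.SMTP` connection.

```python
from email.message import EmailMessage

SENDER = "privacy@example.com"  # hypothetical sender address


def build_notification(recipient):
    """Compose the confirmation email for one user (wording is an assumption)."""
    message = EmailMessage()
    message["From"] = SENDER
    message["To"] = recipient
    message["Subject"] = "Your data has been deleted"
    message.set_content(
        "Hello,\n\nAs you requested, your data has been deleted from our systems.\n"
    )
    return message


def notify_users(log_rows, send):
    """Email each not-yet-notified user once, then mark their rows as sent."""
    notified = set()
    for row in log_rows:
        if row["Email sent"] == "yes":
            continue
        if row["Email"] not in notified:
            send(build_notification(row["Email"]))
            notified.add(row["Email"])
        row["Email sent"] = "yes"
    return log_rows


# Dry run: collect the messages in a list instead of sending them.
outbox = []
log_rows = [
    {"Email": "ana@example.com", "Source": "crm", "Email sent": "no"},
    {"Email": "ana@example.com", "Source": "newsletter", "Email sent": "no"},
    {"Email": "bo@example.com", "Source": "crm", "Email sent": "yes"},
]
notify_users(log_rows, send=outbox.append)
```

Because the logs hold one row per user per source, the sketch tracks which addresses have already been emailed so that a user deleted from several sources receives a single message.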

10. Set up a scenario that runs once a month, after the notification emails have been sent, to clear the logs dataset so that the deleted users’ information is not stored anywhere in the system.

Building a Data Protection and Deletion Solution: Benefits and Business Value

Dataiku aims to simplify processes, and data protection and compliance with local laws and regulations are also a top priority. With both goals in mind, you can build a flow for your organization that reliably tracks deletion requests from start to finish. By streamlining the process for deleting customer data, you can save the various teams in your organization time, resources, and stress.