Myths & Truths of Working With Personal Data in Analytics & AI Projects

Dataiku Product, Scaling AI | Lynn Heidmann

The use of data across roles and industries will only become more restricted, but that doesn't have to mean a pause (or outright paralysis) in data use. This blog breaks down a few myths and realities of working directly with personal data in analytics and AI projects.

Myth #1: Data Teams Can No Longer Use Personal Data at All

Thanks in part to the many sensational headlines that followed the official GDPR compliance deadline, dramatically declaring that AI and GDPR simply cannot coexist, many data teams recoiled in fear and stopped using personal data altogether, even data that had been obtained lawfully, fairly, and transparently.

The idea that no data team can use or process personal data in any way, shape, or form following GDPR (or the implementation of other, similar data regulations) is a myth. In fact, Article 5 of the GDPR lays out the principles under which personal data may be processed, and Article 6 lists the lawful bases for doing so. So yes, the ways in which personal data can be obtained and used are more limited than before, and organizations certainly need to be more prudent at every level. But it's simply not true that data teams (analysts as well as data scientists) are completely blocked from using personal data.

The Reality

Under GDPR, those authorized can still process personal data provided they meet certain conditions. These conditions are laid out in Article 6 and, in short, cover processing that is necessary for a contract, required by law, or, of course, based on the consent of the data subject.

So the most important step for an organization working with personal data is ensuring that the way consent is obtained for collecting that data is in line with regulations. From there, the challenge is clearly restricting and controlling access: being able to separate by team (and, realistically, more granularly, down to individual users), by topic, and by purpose to ensure proper use and access.
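As a rough illustration of what separating access by team and purpose can mean in practice, here is a minimal sketch of a purpose- and group-based access check. The group names, purposes, and `can_access` helper are hypothetical; in a real deployment this logic would be enforced by the platform or the data warehouse rather than ad hoc code:

```python
# Hypothetical access policy: which groups may use which data source, and for what purpose.
ACCESS_POLICY = {
    "crm_customers_2024": {
        "marketing-analytics": {"campaign_analysis"},
        "support-team": {"ticket_triage"},
    },
}

def can_access(user_groups: list[str], source: str, purpose: str) -> bool:
    """Return True only if one of the user's groups may use this source for this purpose."""
    allowed = ACCESS_POLICY.get(source, {})
    return any(purpose in allowed.get(group, set()) for group in user_groups)

# A marketing analyst can run campaign analysis, but not support ticket triage.
print(can_access(["marketing-analytics"], "crm_customers_2024", "campaign_analysis"))  # True
print(can_access(["marketing-analytics"], "crm_customers_2024", "ticket_triage"))      # False
```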

And therein lies the deeper reality of personal data use under data regulations: many data teams rule out working with personal data entirely because making these separations is logistically difficult. That is to say, they simply don't have the processes or systems in place to do so.

Suggested Data Team Processes to Aid Compliance

Before working directly with personal data in a compliant environment, it's mandatory to complete a data audit. Note that while the business has likely already completed an overall data audit (especially around the time of the GDPR compliance deadline), completing one specifically for personal data is important for identifying gaps and potential issues specific to this type of data work.

Document the 5 Ws (+ 1 H) of Personal Data:

  • Who: Who are the data subjects, and who at the company can access their data?
  • Where: Where is personal data kept, and where can it be transferred?
  • Why: Why is personal data kept (for what legitimate purpose)?
  • When: Until when are we keeping the data? When (i.e., under what conditions) do we share personal data with others (e.g., third parties)?
  • What: What mechanisms are in place to safeguard the personal data?
  • How: How is data processed, and how long should it be kept?

The answers to these questions should be documented in writing and readily available in case of audit. They should also be regularly updated to reflect changes in policies or procedures. 
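As an illustration of what that documentation might look like in machine-readable form, here is a minimal sketch; the `PersonalDataRecord` structure and all field names and values are hypothetical, not a standard or a specific product feature:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class PersonalDataRecord:
    """One audit entry answering the 5 Ws (+ 1 H) for a single data source."""
    source_name: str
    who_subjects: str              # Who are the data subjects?
    who_has_access: list[str]      # Who at the company can access their data?
    where_stored: str              # Where is the personal data kept?
    where_transferred: list[str]   # Where can it be transferred?
    why_purpose: str               # For what legitimate purpose is it kept?
    when_retention_ends: date      # Until when are we keeping it?
    when_shared_with: list[str]    # With whom, and under what conditions, is it shared?
    what_safeguards: list[str]     # What mechanisms safeguard the data?
    how_processed: str             # How is the data processed?

# Example entry for a hypothetical CRM export.
crm_audit = PersonalDataRecord(
    source_name="crm_customers_2024",
    who_subjects="EU customers who opted in to marketing communications",
    who_has_access=["marketing-analytics team"],
    where_stored="EU-hosted data warehouse",
    where_transferred=["no transfers outside the EU"],
    why_purpose="campaign performance analysis, based on consent",
    when_retention_ends=date(2026, 12, 31),
    when_shared_with=["email service provider, under a data processing agreement"],
    what_safeguards=["encryption at rest", "role-based access control"],
    how_processed="aggregated weekly; no individual-level reporting",
)
```

Keeping records like this under version control next to the related projects makes them easier to update and to produce if an audit does happen.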

Laying out this information clearly should allow for data minimization. That is, only personal data that is necessary for each specific purpose gets processed thanks to proper separation of projects, anonymization, and pseudonymization where necessary. Of course, in order to fully achieve data minimization, it’s critically important to know at a glance which data sources are used in each data project, and in particular, which of those contain personal data. Day-to-day, this means more documentation on consent, purpose, retention, etc. Clear tagging and project organization can be done easily in data science platforms.
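To show what data minimization can look like day to day, here is a minimal sketch that assumes a hypothetical column-level tagging scheme, where each column is labeled with the purposes it supports; the column names, tags, and `minimize` helper are all illustrative:

```python
import pandas as pd

# Hypothetical column-level tags: which declared purposes each column supports.
COLUMN_PURPOSES = {
    "customer_id": {"billing", "support"},
    "email": {"marketing"},
    "birth_date": {"marketing"},
    "order_total": {"billing", "analytics"},
    "order_date": {"billing", "analytics"},
}

def minimize(df: pd.DataFrame, purpose: str) -> pd.DataFrame:
    """Return only the columns that the declared purpose actually requires."""
    allowed = [col for col in df.columns if purpose in COLUMN_PURPOSES.get(col, set())]
    return df[allowed]

orders = pd.DataFrame({
    "customer_id": [1, 2],
    "email": ["a@example.com", "b@example.com"],
    "birth_date": ["1990-01-01", "1985-06-15"],
    "order_total": [42.0, 17.5],
    "order_date": ["2024-03-01", "2024-03-02"],
})

# An analytics project never sees email or birth_date.
analytics_view = minimize(orders, "analytics")
print(analytics_view.columns.tolist())  # ['order_total', 'order_date']
```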

Myth #2: Anonymized Data Can Be Used Freely, With No Restrictions

It’s easy to believe that personal data can be simply anonymized and then freely used, absolving the data team and company of the specific restrictions surrounding personal data. What’s the myth? Well, it lies in “simple” anonymization. Data teams (as well as business owners and leaders) need to be careful about using this as an “escape” from good, solid data policy and careful use.

The Reality

It is true that there are specific provisions under GDPR for anonymized data and that, if done correctly, anonymization can provide more flexibility in working with data, because truly anonymous data falls outside the scope of GDPR and other data regulations.

Anonymizing personal data is a good way to allow lines of business as well as data teams (from analysts to data scientists) to work freely with data. However, true anonymization is extremely difficult to achieve, and companies looking to use anonymization as a solution should be aware of this and ensure that data is actually, completely anonymous before allowing it to be used freely across the business.

It is so difficult to completely anonymize data that even big companies with plenty of resources (like Netflix) make mistakes. When Netflix first introduced the Netflix Prize, a competition to design the best recommendation engine, the company released 100 million “anonymized” ratings, each with a unique subscriber ID, movie title, year of release, and date of rating. Just a few weeks later, researchers at the University of Texas were able to re-identify some of these users.

Suggested Data Team Processes to Aid Compliance

There are several techniques for anonymization, but note that not all of them work for every case; the optimal approach should be chosen and executed on a case-by-case basis, depending on the type and source of the data.

In general, the right solution will probably involve some combination of these techniques. The ultimate goal is to anonymize the data in such a way that there is no possible way to re-identify individuals afterward (a short code sketch illustrating a few of these techniques follows the list):

  • Hash function: This is likely what most people think of when they hear “anonymization,” though by itself it is not an especially effective technique. A hash function is a mathematical function that, given an input, produces an output to replace that input. While the function itself cannot be reversed, hashes of low-entropy values (such as email addresses) can often be recovered through lookup tables or brute force, which is why hashing alone does not qualify as anonymization.
  • Aggregation: A method of replacing personal data whereby specific values are removed and replaced with, for example, median values.
  • k-anonymity: Similar to aggregation, but instead of replacing values with medians, replacing them with categories (e.g., age ranges instead of specific ages). For a dataset to qualify as k-anonymous, the information for each person it contains cannot be distinguished from that of at least k - 1 other individuals who also appear in the dataset.
  • l-diversity: An extension of k-anonymity that addresses some of its weaknesses by also requiring diversity in the sensitive fields within each group.
  • Differential privacy: A technique drawn from cryptography that introduces randomness into data or query results, allowing questions to be asked of the data without revealing the characteristics of any specific individual.
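To make a few of these techniques more concrete, here is a minimal sketch in Python on a toy dataset: a salted hash replaces the identifier, exact ages are generalized into ranges and checked against k, and a count query is answered with Laplace noise in the spirit of differential privacy. The dataset, salt, bucket boundaries, and epsilon value are all placeholder choices, and this is an illustration of the ideas rather than a production-grade anonymization pipeline.

```python
import hashlib

import numpy as np
import pandas as pd

# Toy dataset with a direct identifier (user_id) and quasi-identifiers (age, city).
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4", "u5", "u6"],
    "age":     [23, 27, 31, 34, 45, 49],
    "city":    ["Paris", "Paris", "Lyon", "Lyon", "Paris", "Lyon"],
})

# 1. Hash function: replace the identifier with a salted SHA-256 digest.
#    The function is one-way, but low-entropy inputs can still be recovered
#    by brute force, so hashing alone is not anonymization.
SALT = "replace-with-a-secret-salt"  # placeholder value
df["user_hash"] = df["user_id"].apply(
    lambda value: hashlib.sha256((SALT + value).encode()).hexdigest()
)
df = df.drop(columns=["user_id"])

# 2. Generalization toward k-anonymity: swap exact ages for age ranges, then
#    check that every (age_range, city) combination covers at least k people.
df["age_range"] = pd.cut(df["age"], bins=[0, 30, 50, 120], labels=["0-30", "31-50", "51+"])
df = df.drop(columns=["age"])
k = 2
group_sizes = df.groupby(["age_range", "city"], observed=True).size()
# Here the check fails: only one person falls in the (31-50, Paris) group,
# which is exactly the kind of re-identification risk the check surfaces.
print("k-anonymous with k=2:", bool((group_sizes >= k).all()))

# 3. Differential-privacy-style noise: answer an aggregate query with Laplace
#    noise instead of the exact count (epsilon below is a placeholder setting).
epsilon = 1.0
true_count = int((df["city"] == "Paris").sum())
noisy_count = true_count + np.random.laplace(scale=1.0 / epsilon)
print("Noisy count of Paris users:", round(noisy_count, 1))
```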

So, How Can a Data Science & AI Platform Help?

Under data privacy regulations, working directly with personal data is heavily restricted, and working with anonymized data, while an interesting option when it can be done effectively, is incredibly difficult (not to mention resource-intensive) to actually do correctly. So what other options are there for working with data in an increasingly regulated world?

In talking about data protection by design, the GDPR specifically mentions the use of pseudonymization, which it later defines as “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately…”

While this clearly means that pseudonymized data is still personal data (since it is not anonymized), it does provide some additional freedom for data teams to work with data provided that they have specific, defined projects with controlled access and a clear data retention policy. 
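As an illustration, here is a minimal sketch of pseudonymization in Python: direct identifiers are replaced with random pseudonyms, and the lookup table that would allow re-identification is returned separately so it can be stored with restricted access. The column names and the `pseudonymize` helper are hypothetical, and a real setup would manage the lookup table in a secured, access-controlled store.

```python
import uuid

import pandas as pd

customers = pd.DataFrame({
    "email":    ["a@example.com", "b@example.com", "a@example.com"],
    "purchase": [42.0, 17.5, 8.0],
})

def pseudonymize(df: pd.DataFrame, column: str):
    """Replace values in `column` with random pseudonyms.

    Returns the pseudonymized frame and the lookup table that would allow
    re-identification. For this to count as pseudonymization under GDPR,
    the lookup table must be kept separately, with restricted access.
    """
    lookup = {value: str(uuid.uuid4()) for value in df[column].unique()}
    out = df.copy()
    out[column] = out[column].map(lookup)
    return out, lookup

pseudo_customers, lookup_table = pseudonymize(customers, "email")
# The data team works with pseudo_customers; lookup_table lives elsewhere,
# under its own access controls and retention policy.
print(pseudo_customers)
```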

Again, this sounds simple enough, yet most data teams don’t have the proper infrastructure in place to centrally:

  1. Define the project’s purpose in a clear, visible place.
  2. Document data retention policies specific to that project.
  3. Control project access.
  4. Ensure all project work is contained in this restricted project.
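As a rough illustration, even a lightweight, centrally stored project descriptor can cover these four points. The structure below is hypothetical (not a Dataiku format); in practice, a platform would capture and enforce this metadata rather than relying on a hand-maintained file:

```python
from datetime import date

# Hypothetical, centrally stored project descriptor covering the four points above.
CHURN_ANALYSIS_PROJECT = {
    "name": "customer-churn-analysis",
    # 1. Purpose, stated in a clear, visible place
    "purpose": "Identify churn drivers for subscribers who consented to analytics",
    # 2. Data retention policy specific to this project
    "retention": {
        "personal_data_deleted_by": date(2026, 6, 30),
        "review_cadence_months": 6,
    },
    # 3. Access control: who may open the project at all
    "access": {
        "allowed_groups": ["churn-analytics-team"],
        "allowed_roles": ["data_scientist", "analyst"],
    },
    # 4. Containment: every dataset, notebook, and model used must be registered
    "registered_datasets": ["crm_customers_2024_pseudonymized"],
}
```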

That’s where a data science and AI platform (like Dataiku) can come in. In general, data teams need a data science & AI platform for a variety of reasons, including pure efficiency. But critically, one of the biggest advantages is compliance with data regulations. Data science and AI platforms allow for:

  • Personal data identification, documentation, and clear data lineage: that is, they allow data teams and leaders to trace (and often see at a glance) which data sources are used in each project.
  • Access restriction and control: including separation by team, by role, and by the purpose of the analysis and data use.
  • Easier data minimization: given clear separation into projects as well as some built-in help for anonymization and pseudonymization, only data relevant to the specific purpose is processed, minimizing risk.

Centralizing data efforts in one place or tool allows for simple governance of larger projects as well as of individual datasets and sources. Further, developing straightforward processes for working with personal data (and, of course, properly training staff on those policies and procedures) should be top of mind. Finally, organizations need to monitor and enforce their personal data processes in order to get value out of data without compromising individuals' privacy.
