Sarus & Dataiku: Balancing Insights and Privacy Protection With LLMs

Scaling AI | Kelci Miclaus (Dataiku) & Nicolas Grislain (Sarus Technologies)

In the ever-evolving landscape of pharmaceuticals, the potential of patient-level data to drive actionable insights for population health economic outcomes and patient clinical care decisions is immense. But this potential comes with a catch: the need to tap into this exploding wealth of data while ensuring strict adherence to privacy regulations and security standards.

In this blog post, we’ll delve into the high-level business problems faced by pharmaceutical companies looking to balance these considerations, the current pain points and challenges they encounter, and two possible paths to solutions that aim to strike the delicate balance between data utilization and privacy protection.

As we’ll discuss, platforms like Sarus and Dataiku can help data teams across the life sciences industry make cutting-edge use of their patient data without compromising on privacy, security, and control. In particular, they can fine-tune large language models (LLMs) with strong privacy guarantees, opening the door to countless applications, such as private synthetic data generation. And they hold the promise of extracting the full value of a firm’s data while keeping protected health information (PHI) secure.

The High-Level Business Problem

Pharmaceutical companies are on a quest for insights that can transform patient care and drive innovation in the industry. Data and business teams across the industry seek to harness patient-level data at increasing degrees of granularity to make informed decisions. The insights they’re after could range from understanding population health economic outcomes to tailoring individual patient care plans. 

But this journey is riddled with complex challenges that demand creative and secure solutions. Where is data stored, and who has access to it? How can it be integrated into models without compromising the privacy of the individuals whose details make up that data? 

Current Challenges and Pain Points

1. Data Silos and Security Concerns

The most significant hurdle lies in the siloed nature of healthcare data. PHI is subject to regulations like HIPAA and GDPR, which require stringent security measures, above all detailed protocols for how data is collected, coded, and stored. Integrating data from diverse sources without compromising patient privacy is a significant pain point. As data gets more granular, the risk of exposing PHI increases.

2. Balancing Data Integration with Privacy Protection

Integrating longitudinal and fine-grained patient data with new technologies like AI and machine learning has immense potential. But if not done carefully, with the help of sophisticated, well-governed, and secure software, this integration can inadvertently expose sensitive information around PHI, raising further concerns about compliance and patient privacy.

As we’ll now discuss, the traditional approach of de-identifying patient data, while great at securing PHI, has significant drawbacks when it comes to preserving the fullness of that data.

Possible Paths Forward

De-Identification: Preserving Privacy Through Anonymization

One potential solution to the thorny insight-privacy balancing problem is de-identification, a process that unlinks patient IDs from their identifying characteristics (a simple code sketch of this approach follows the list below). This helps protect patient privacy while still allowing data utilization. That said, properly achieving de-identification comes with its own set of challenges:

  1. Rigid Rules of Use: De-identification often entails adherence to strict rules that prevent re-identification, or “unblinding.” While this is essential for privacy protection, it can lead to data being locked behind sub-par analytic systems, limiting its accessibility and usability.
  2. Manual Operations: Deciding which characteristics are identifying and which are not is a manual task that must be repeated for each new dataset, and it may involve long and costly validation by a committee of experts. Because the answer depends on what data is publicly available, it also needs to be reconsidered and updated regularly.
  3. Loss of Data Signals: De-identification may remove or dilute crucial data signals — including ever-important contextual information — necessary for evidence generation. This can compromise the quality of the insights that can be derived from the data.
  4. No Guarantee: At the end of the process, there is no rigorous guarantee that patient privacy is actually protected.
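
For contrast, here is a naive, rule-based de-identification sketch in pandas: the direct identifier is pseudonymized with a one-way hash and other identifying columns are simply dropped, along with whatever signal they carried. The column names are hypothetical, and a real de-identification workflow is considerably more involved.

```python
# A naive, rule-based de-identification sketch in pandas, for contrast.
# Column names ("patient_id", "name", "zip_code", "date_of_birth") are
# hypothetical; a real de-identification workflow is far more involved.
import hashlib
import pandas as pd

def deidentify(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Replace the direct identifier with a one-way hash (pseudonymization).
    out["patient_id"] = out["patient_id"].astype(str).map(
        lambda pid: hashlib.sha256(pid.encode()).hexdigest()[:12]
    )
    # Drop other identifying characteristics outright, losing their signal.
    out = out.drop(columns=["name", "zip_code", "date_of_birth"])
    return out
```

Note how the dropped columns disappear entirely: any analytic value they held is lost, which is exactly the loss-of-signal drawback described above.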

Privacy-Preserving Generation: Balancing Insights and Security

An alternative approach, pioneered by Sarus and platforms like Dataiku, involves privacy-preserving data generation: patient-level data is synthetically generated with the help of LLMs and differential privacy theory, all while retaining the key signals required for analysis. (More specifically, it preserves population-level signals while removing individual-level signals.) 

This approach seeks to offer the best of both worlds — actionable insights without exposing PHI. And because differential privacy does not require us to make any judgment on what is identifying and what is not, the process is fully automatic, cutting manual operations and reducing time-to-data.

But how does this work, exactly? Dataiku and Sarus have worked hard to make it easy for you. Here are the steps involved. First, of course, the PHI is fed into a secured database. Then, an LLM is fine-tuned on that data with a procedure called differentially private stochastic gradient descent (DP-SGD). If you read that sentence and wondered what, exactly, DP-SGD is, you’re not alone. The short version is that DP-SGD bounds and blurs the contribution of each individual to the LLM, so that private information becomes undetectable and the model never breaches the sanctity of patient privacy.
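
To make that more concrete, here is a minimal, conceptual sketch of a single DP-SGD step in PyTorch: each patient’s gradient is clipped to a fixed norm before calibrated Gaussian noise is added to the aggregate. This illustrates the general technique only; it is not the Sarus implementation, and the model, batch, and hyperparameters are hypothetical placeholders.

```python
# A minimal, conceptual sketch of one DP-SGD step in PyTorch.
# This is NOT the Sarus implementation; the model, batch, and the
# hyperparameters below are hypothetical placeholders.
import torch

MAX_GRAD_NORM = 1.0       # per-example clipping bound
NOISE_MULTIPLIER = 1.1    # Gaussian noise scale relative to the bound

def dp_sgd_step(model, loss_fn, batch_inputs, batch_targets, lr=1e-4):
    params = [p for p in model.parameters() if p.requires_grad]
    summed_grads = [torch.zeros_like(p) for p in params]

    # 1. Compute and clip each example's gradient separately, so no single
    #    patient can move the model by more than MAX_GRAD_NORM.
    for x, y in zip(batch_inputs, batch_targets):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.norm() ** 2 for g in grads))
        clip_coef = min(1.0, MAX_GRAD_NORM / (total_norm + 1e-12))
        for acc, g in zip(summed_grads, grads):
            acc += g * clip_coef

    # 2. Add calibrated Gaussian noise, then average and apply the update.
    batch_size = len(batch_inputs)
    with torch.no_grad():
        for p, g in zip(params, summed_grads):
            noise = torch.normal(
                mean=0.0,
                std=NOISE_MULTIPLIER * MAX_GRAD_NORM,
                size=p.shape,
            )
            p -= lr * (g + noise) / batch_size
```

The two knobs, the clipping bound and the noise multiplier, are what determine the formal privacy budget of the resulting model.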

Once trained, the LLM can be used to generate a new synthetic dataset that preserves the population patterns of the original dataset, and can therefore be used for annotation or machine learning, but with the mathematical guarantee that nothing private will ever come out of the LLM.
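
To illustrate what that generation step might look like, here is a minimal sketch of sampling synthetic records from a fine-tuned causal LLM with the Hugging Face transformers library. The model name, prompt format, and number of records are hypothetical placeholders, not the actual Sarus pipeline.

```python
# A minimal sketch of sampling synthetic patient records from a
# fine-tuned causal LLM. The model name "my-org/dp-finetuned-model"
# and the record delimiter are hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-org/dp-finetuned-model")
model = AutoModelForCausalLM.from_pretrained("my-org/dp-finetuned-model")

# Prompt the model with a record delimiter and sample new rows.
prompt = "<patient_record>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,           # sampling, not greedy decoding
    temperature=1.0,
    max_new_tokens=128,
    num_return_sequences=10,  # ten synthetic records in one call
    pad_token_id=tokenizer.eos_token_id,
)
synthetic_records = [
    tokenizer.decode(o, skip_special_tokens=True) for o in outputs
]
```

Because the model was fine-tuned with DP-SGD, the records it samples reflect population-level patterns rather than any single patient’s data.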

The risk that an LLM may output private information from its training dataset may seem abstract and remote, but it is very concrete and real. When prompted with "John Doe suffers", an LLM trained without differential privacy could produce a completion revealing the disease of the actual "John Doe" from the private training dataset: "John Doe suffers from a severe form of pancreatic cancer".

The generation of privacy-preserving synthetic data therefore carries enormous benefits relative to de-identification. It allows data teams to preserve a much higher level of detail, specificity, and richness than simple de-identification. And the differential privacy guarantee means that sensitive PHI is kept under lock and key.

Striking the Right Balance

The road to integrating privacy-preserving technologies into pharmaceutical operations is complex but essential. Sarus and Dataiku can help pharma firms strike the right balance between data utilization and privacy protection — allowing them to adhere not only to regulatory requirements but also to their moral obligation to patients. 

The pioneering PHI data synthesis methods developed by Sarus can be integrated and automated into Dataiku to enable data teams to do much more than simply secure their PHI without compromising its quality: they can benefit from Dataiku’s recipe structure and machine learning capabilities to build complex and automatable models with that data. As a result, health and life sciences firms can harness advanced data analytics and kick their use of patient data into the highest gear.
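
As an illustration, here is a minimal sketch of what a downstream Dataiku Python recipe might look like once a synthetic dataset is available in the Flow. The dataset names, columns, and model are hypothetical; a real project would typically also draw on Dataiku’s visual recipes and built-in machine learning capabilities.

```python
# A minimal sketch of a Dataiku Python recipe that consumes a synthetic
# dataset produced upstream in the Flow. The dataset names
# ("synthetic_patients", "risk_scores") and column names are hypothetical.
import dataiku
from sklearn.linear_model import LogisticRegression

# Read the privacy-preserving synthetic dataset from the Flow.
synthetic = dataiku.Dataset("synthetic_patients")
df = synthetic.get_dataframe()

# Train a simple model on the synthetic records (binary target assumed).
features = df[["age", "bmi", "num_prior_visits"]]
target = df["readmitted_within_30_days"]
model = LogisticRegression(max_iter=1000).fit(features, target)

# Score the records and write the results to an output dataset.
df["readmission_risk"] = model.predict_proba(features)[:, 1]
output = dataiku.Dataset("risk_scores")
output.write_with_schema(df)
```

Because the recipe only ever touches synthetic data, it can be shared, automated, and iterated on without exposing the original PHI.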
