From Bedside to Backend: Making Sense of Real-World Health Data

By Li-Heng Fu

The world of healthcare is experiencing a data explosion of unprecedented proportions: healthcare data has been projected to grow at a compound annual growth rate (CAGR) of 36% through 2025. Yet today we use only a fraction of this data to inform clinical trials, personalized medicine, and regulatory decisions.

The digitalization of healthcare data began in the latter half of the twentieth century, with the development of electronic health record (EHR) systems in the 1990s and government policies incentivizing EHR adoption in the 2000s. The digital transformation has not only changed the daily practices of healthcare professionals but also created a fertile ground for the secondary use of health data collected from patient care, medical claims, and patient registries. 

This field of study bears many names across various domains in the healthcare and life sciences sector: observational health research, real-world evidence (RWE), cost-effectiveness analysis, population health analysis, etc. The idea is intuitive: The data points collected during the healthcare journey reflect how healthcare is delivered in the real world. 

If the healthcare sector is sitting on a gold mine of health data, why has it lagged behind in AI adoption?

From Treating Patients to Treating Data: Why I Ventured Into Health Tech

As a licensed medical doctor, I witnessed firsthand the stark contrast between the rapid technological advancements in other sectors and the slow adoption within healthcare. While smartphones and applications revolutionized daily life and business through data, doctors were still relying on fax machines and outdated methods for patient charts. This disparity ignited a critical question for me: Why was healthcare, a field so dependent on information, so far behind?

My AI journey began over seven years ago, and today, I sense a genuine eagerness within the healthcare community to embrace AI's potential. However, to truly transform healthcare with this powerful technology, we must first understand how healthcare professionals currently collect and utilize data in their daily practices.

Healthcare Is (Mostly) Data-Driven

The current gold standard of patient care dictates that medical decisions be backed by robust statistical evidence from clinical studies. The randomized controlled trial (RCT), where participants are randomly assigned to different groups to test the effects of an intervention, has been considered the most robust method to generate medical evidence since its debut in the post-war era. Every few years, the most prominent clinical experts review new evidence and revise the clinical guidelines accordingly. 

Evidence-Based Medicine in Practice

Most practitioners would agree that what they’ve observed in their daily patient care sometimes deviates from clinical guidelines. There are many potential reasons for this discrepancy. First, the guidelines are still a patchwork of medical evidence that varies in quality. Clinical trials are very expensive and time-consuming to run, and not all clinical questions are suitable or worthy of trials. For example, there are strict regulations regarding recruiting vulnerable populations like pregnant women and children for clinical trials. Clinical experts often turn to lower-grade evidence, like case reports or expert opinion, to form a clinical consensus for these populations. 

Another potential reason exposes a limitation of RCTs: It is challenging to generalize statistical results from a few thousand participants to the whole human population. During my hospital rotations as a medical student, our teachers reminded us to read clinical guidelines published in the U.S. or Europe with a grain of salt. They argued that the East Asian population was often underrepresented in the trials used to support these guidelines.

The Promise of Real-World Data

Leading organizations in the healthcare and life sciences sector have been investing in observational health data (or real-world data) to fill the gaps left by traditional trial-based medicine. While randomized controlled trials often generate evidence from highly selected patient groups, real-world data collected in daily medical practice can reveal discrepancies and strengthen the evidence generation process. Adverse drug event reporting is a good example of such a gap: The traditional reporting system relies heavily on individual submissions, which leads to noisy data and makes it hard to draw valid conclusions.

Regulatory bodies in the U.S. and the EU have proposed frameworks that aim to incorporate real-world data into decisions about a drug’s indications, providing a more complete picture of drug safety and effectiveness. Some startups have gone further and released algorithms that they claim can generate “digital twins” from real-world data as a substitute for control arms in clinical trials.

Healthcare Data Is a Mess

I have painted a rosy picture of real-world data so far; however, the reality is far from simple. During my career in health tech, I heard many complaints that healthcare data is full of duplications and errors. I have to confess that I might have contributed to that mess during my years in clinical work. 

The reality is that patient data is skewed toward reimbursement purposes and regulatory requirements. Doctors often complain that, for both insurers and judicial authorities, what is logged in the patient chart seems to matter more than what was actually done to treat the patient. Studies have shown that clinician burnout is associated with documentation burden. To save their marriages, healthcare professionals working through their “pajama time” resort to copying and pasting old patient charts without verifying them. Thus, the errors in patient charts propagate.

Healthcare data is not only messy but also siloed and fragmented. Because of data safety and security compliance requirements in many countries, legacy EHR systems were not designed to communicate with each other. There was no common data schema and no standardized vocabulary to represent medical concepts, which made it very difficult to integrate health data from heterogeneous sources and regions into real-world data.

The Quest for Data Infrastructure in Healthcare

Various communities have contributed solutions for standardizing and harmonizing observational health data. The OHDSI community, one of the largest international collaborations between academia, industry, and government agencies, has proposed the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM), standardized vocabularies, and best practices in observational health research methodology to encourage cross-region, multi-institutional collaboration.
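To make the idea of harmonization concrete, here is a minimal sketch of mapping site-specific diagnosis codes onto one shared layout. It is an illustration under stated assumptions, not OHDSI tooling: the hospital names, local codes, and concept IDs 1001 and 1002 are invented, and the function names are hypothetical. The output field names (person_id, condition_concept_id, condition_source_value) and the use of concept ID 0 for "no matching concept" follow OMOP CDM conventions.

```python
# Hypothetical lookup table: (source system, local code) -> shared concept ID.
# These IDs are invented for illustration; real OMOP concept IDs come from
# the OHDSI standardized vocabularies.
LOCAL_TO_CONCEPT = {
    ("hospital_a", "DM2"): 1001,
    ("hospital_b", "T2-DIABETES"): 1001,
    ("hospital_a", "HTN"): 1002,
}

UNMAPPED = 0  # OMOP convention: concept ID 0 means "no matching concept"


def harmonize(records):
    """Convert site-specific diagnosis records into one shared layout."""
    rows = []
    for rec in records:
        key = (rec["source"], rec["code"])
        rows.append({
            "person_id": rec["patient"],
            "condition_concept_id": LOCAL_TO_CONCEPT.get(key, UNMAPPED),
            # Keep the original code so the mapping can be audited later.
            "condition_source_value": rec["code"],
        })
    return rows


def unmapped_share(rows):
    """Fraction of rows that could not be mapped to a shared concept."""
    if not rows:
        return 0.0
    unmapped = sum(1 for r in rows if r["condition_concept_id"] == UNMAPPED)
    return unmapped / len(rows)
```

Two hospitals that record the same diagnosis under different local codes end up with the same concept ID, while codes missing from the lookup stay visible as unmapped rows instead of being silently dropped.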

Healthcare Models Need a Practitioner in the Loop

In the early years of my career in health tech, I was involved in developing a clinical model to predict kidney function. While exploring the claims data from a kidney practice, I found that most of the patients had late-stage chronic kidney disease (CKD). I wondered how we could identify high-risk patients in early-stage CKD if that was our only source for training a predictive model. When I presented this finding to the director of that kidney practice, he explained that their nephrologists only accepted referrals for patients over CKD stage 4 due to limited capacity. I wouldn’t have been able to explain this particularity in the dataset without knowledge of local practices. This encounter taught me an invaluable lesson: Early involvement of first-line practitioners is crucial in the journey of AI adoption in healthcare.
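A skew like the one in this anecdote can be surfaced before any model is trained. The sketch below is a hypothetical check, not the analysis from that project: it assumes CKD stage is available as an integer from 1 to 5 (a simplification), and the function name stage_coverage and the 5% threshold are arbitrary choices made for illustration.

```python
from collections import Counter


def stage_coverage(stages, all_stages=(1, 2, 3, 4, 5), min_share=0.05):
    """Report each stage's share of the dataset and flag sparse stages.

    `min_share` is an arbitrary illustrative threshold, not a clinical
    or statistical standard.
    """
    counts = Counter(stages)
    total = len(stages)
    report = {}
    for stage in all_stages:
        share = counts.get(stage, 0) / total if total else 0.0
        report[stage] = {
            "share": share,
            "underrepresented": share < min_share,
        }
    return report
```

A report that flags stages 1 to 3 as underrepresented does not explain why they are missing. That explanation, such as a referral policy, still has to come from the practitioners who know how the data was generated.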

Dataiku's Solution in Real-World Data

The stakes for effective RWD analytics are incredibly high; one market analysis suggests a leading pharmaceutical company could achieve an average annual value increase of $300 million by integrating advanced RWE analytics. 

Our AI solutions team has added a new Dataiku Solution, RWD Cohort Discovery, to our healthcare and life sciences catalog. This Solution helps the RWE analytics team establish an RWD data pipeline that connects to their OMOP patient datasets and create a centralized repository to store and manage the SQL scripts for their existing cohorts. The Solution then turns the scripts into generalizable, reusable clinical electronic phenotypes for future advanced analytics. Its cohort dashboard offers a comprehensive view of a saved cohort to facilitate communication between the analytics team and clinical experts.

Key features of the Solution include:

  • Adopts industry standards: compatible with patient data and cohort SQL scripts in OMOP CDM
  • Provides a one-stop shop for storing and managing existing cohort scripts and query results, which can be easily shared and reused in future analysis
  • Visualizes the cohort characteristics in a comprehensive dashboard to facilitate collaboration between the analytics and clinical teams
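For readers who have not seen one, here is a minimal sketch of what a cohort SQL script over OMOP-style tables might look like. It is not taken from the Dataiku Solution. The table and column names (person, condition_occurrence, person_id, year_of_birth, condition_concept_id, condition_start_date) follow the OMOP CDM, but the tables are reduced to a few columns, the concept IDs 9001 and 9002 and the demo rows are invented, and an in-memory SQLite database stands in for a real data warehouse so the example is self-contained.

```python
import sqlite3

# Hypothetical concept ID standing in for the condition of interest; a real
# script would use IDs from the OHDSI standardized vocabularies.
TARGET_CONCEPT_ID = 9001

# Cohort definition: people born in or before a given year who have at least
# one record of the target condition. The cohort start date is the earliest
# such record.
COHORT_SQL = """
SELECT p.person_id, MIN(co.condition_start_date) AS cohort_start_date
FROM person AS p
JOIN condition_occurrence AS co ON co.person_id = p.person_id
WHERE co.condition_concept_id = ?
  AND p.year_of_birth <= ?
GROUP BY p.person_id
ORDER BY p.person_id
"""


def build_demo_db():
    """Create a tiny in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE person (
            person_id INTEGER PRIMARY KEY,
            year_of_birth INTEGER
        );
        CREATE TABLE condition_occurrence (
            condition_occurrence_id INTEGER PRIMARY KEY,
            person_id INTEGER,
            condition_concept_id INTEGER,
            condition_start_date TEXT
        );
    """)
    conn.executemany(
        "INSERT INTO person VALUES (?, ?)",
        [(1, 1950), (2, 1990), (3, 1945)],
    )
    conn.executemany(
        "INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)",
        [
            (10, 1, 9001, "2020-03-01"),
            (11, 1, 9001, "2019-06-15"),
            (12, 2, 9001, "2021-01-10"),
            (13, 3, 9002, "2018-09-30"),
        ],
    )
    return conn


def run_cohort(conn, concept_id, max_birth_year):
    """Return (person_id, cohort_start_date) rows for the cohort."""
    return conn.execute(COHORT_SQL, (concept_id, max_birth_year)).fetchall()
```

Because the query only refers to standard table and column names, the same script can in principle run against any dataset mapped to the same model, which is what makes stored cohort scripts shareable and reusable.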

Real-World Data: The Foundation of Personalized Medicine

Advanced RWE analytics is accelerating clinical studies and the delivery of new therapeutics. However, its potential to transform healthcare delivery is even greater. Cancer care serves as a prime example of precision (or personalized) medicine. By analyzing the genetic and genomic expression of cancer cells, oncologists can further stratify patient populations into smaller cohorts, enabling the development of optimal treatment regimens for each individual.

The more health data we gather, the better we can understand the global population. The vision of personalized medicine cannot be fully realized without globally harmonized and diverse clinical data. Ultimately, a model is only as good as the training data it is built upon.
