Accelerating Biopharma R&D With Dataiku Solutions

Kelci Miclaus

While the masses of blogs, webinars, ebooks, and LinkedIn posts continue to center on GenAI and/or AI Governance, I realized I have been remiss in not celebrating and highlighting two new Dataiku Solutions dedicated to life sciences R&D that we released earlier this year!


Don’t get me wrong, GenAI and AI Governance are incredibly important topics (and rest assured, I include some stellar GenAI clickbait in the latter half of this blog), but analytics and machine learning (ML) applications continue to see rapid acceleration and mainstream use in life sciences. 

About two years ago, I wrote a blog post on ML in drug discovery, translational sciences, and clinical trials, enumerating how predictive analytics can:

  • Accelerate drug/therapeutic discovery and development timelines.
  • Innovate with new methodologies in translational sciences.
  • Optimize and improve both success outcomes and operations in clinical trials.

Our AI solutions team has added two new Dataiku Solutions to our healthcare & life sciences catalog that exemplify these points, and I’m overdue (and excited) to share them!

Molecular Property Prediction

The Dataiku Solution for Molecular Property Prediction allows computational chemists to query previously studied small molecules (from public chemical databases) with known bioactivity for a given protein target. These studied molecules are used to train and build predictive models for bioactivity based on their compound structure. Novel compounds are then scored by the model for their predicted bioactivity. 

This in-silico discovery process can improve the success and efficiency of the drug development cycle by prioritizing which lead candidates to take into experiments, helping bring a stable compound to preclinical and clinical testing quickly. The solution also serves as a template for extending predictive analytics to further molecular properties (such as ADMET: absorption, distribution, metabolism, excretion, and toxicity), using key features of the molecules' structures to accelerate discovery and development of new compounds in your pipeline.

Key Features:

  • Easily query public chemical databases, including ChEMBL and PubChem, via API.
  • Create molecular descriptors and fingerprint features from SMILES chemical structures using RDKit and transformer language models like ChemBERTa.
  • Explore the chemical space of previously studied and novel molecular structures for drug candidates.
  • Train ML models for predicting bioactivity for a given protein target using small molecule descriptors and structure features.
  • Allow chemists to screen lead development candidates in-silico with dynamic dashboards.
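
To make the first steps of that workflow a little more concrete, here is a minimal sketch (not the Solution's actual code) of how one might build a query against the public ChEMBL REST API and turn raw IC50 measurements into a modeling target. The target ID `CHEMBL203`, the helper names, and the pIC50 cutoff of 6.0 for calling a compound "active" are illustrative assumptions; the sketch only builds the URL and does not call the API.

```python
import math
from urllib.parse import urlencode

# Public ChEMBL web services endpoint for bioactivity records.
CHEMBL_ACTIVITY_URL = "https://www.ebi.ac.uk/chembl/api/data/activity.json"


def build_activity_query(target_chembl_id: str,
                         standard_type: str = "IC50",
                         limit: int = 100) -> str:
    """Build a URL listing activities measured against one protein target."""
    params = {
        "target_chembl_id": target_chembl_id,
        "standard_type": standard_type,
        "limit": limit,
    }
    return f"{CHEMBL_ACTIVITY_URL}?{urlencode(params)}"


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def label_active(pic50: float, threshold: float = 6.0) -> bool:
    """Binary label for classification; the threshold is an assumption."""
    return pic50 >= threshold


# Hypothetical example target; swap in the ChEMBL ID of your own target.
url = build_activity_query("CHEMBL203")
print(url)
print(label_active(ic50_nm_to_pic50(10.0)))
```

The resulting pIC50 values (or active/inactive labels) would then be paired with descriptors and fingerprints computed from each compound's SMILES string to train the bioactivity model.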

[Image: Molecular Property Prediction dashboard in Dataiku]

Clinical Site Intelligence

The Dataiku Solution for Clinical Site Intelligence leverages the ClinicalTrials.gov database of nearly 500,000 global studies to uncover similar and/or competitive studies and the clinical sites they used, guide site review and selection for novel studies, and provide sponsor overviews. Together, these equip clinical trial planning and operations teams with comprehensive analytics and modeling insights. The solution is also a powerful starting point: integrating further internal and external data sources opens up more opportunities to improve protocol design, clinical feasibility, and clinical site selection strategies.

Key Features:

  • Easily build custom queries and streamlined data flows via API from ClinicalTrials.gov.
  • Leverage ML models for study enrollment rate prediction and study similarity analyses based on an existing protocol or novel study synopsis.
  • Scan through studies and associated clinical sites for a given lead sponsor and gain competitive intelligence of site utilization by other sponsors.
  • Augment your site metrics by cross-leveraging processed Census and CDC datasets from our Dataiku Solution for SDOH (social determinants of health), geo-locating community social factors and disease prevalence around site locations to improve diversity considerations in clinical feasibility studies.
  • Use a dynamic web application to enter a novel study synopsis, explore similar studies and their sites, and assess potential candidate sites with individual site review cards.
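
As a rough illustration of the first two features, the sketch below builds a query URL for the ClinicalTrials.gov API and computes a simple enrollment rate. It is not the Solution's implementation: the helper names are hypothetical, the query parameter names should be checked against the current API documentation, and "participants per site per month" is just one common way to define enrollment rate.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode

CTGOV_STUDIES_URL = "https://clinicaltrials.gov/api/v2/studies"


def build_studies_query(condition: str,
                        sponsor: Optional[str] = None,
                        page_size: int = 50) -> str:
    """Build a study search URL filtered by condition and, optionally, sponsor."""
    params = {"query.cond": condition, "pageSize": page_size}
    if sponsor:
        params["query.spons"] = sponsor
    return f"{CTGOV_STUDIES_URL}?{urlencode(params)}"


def enrollment_rate(enrolled: int, n_sites: int, start: date, end: date) -> float:
    """Participants per site per month (assumed definition, for illustration)."""
    months = (end - start).days / 30.4375  # average month length in days
    if n_sites <= 0 or months <= 0:
        raise ValueError("need at least one site and a positive duration")
    return enrolled / n_sites / months


query_url = build_studies_query("breast cancer")
# Made-up numbers: 240 participants across 10 sites over one year.
rate = enrollment_rate(240, 10, date(2022, 1, 1), date(2023, 1, 1))
print(query_url)
print(round(rate, 2))
```

A historical rate like this, computed per study, is the kind of target a model could learn to predict from protocol and site features.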

[Image: Clinical Site Intelligence Solution in Dataiku]

Extending Dataiku Solutions With Large Language Models (LLMs)

If you have made it this far through the “traditional analytics and ML” section and are waiting for that GenAI cherry — congrats, you’ve arrived!!

Dataiku has worked hard to match the rapid technology development around GenAI with our LLM Mesh capabilities. LLMs can extend traditional analytic and ML pipelines quite beautifully to increase their business reach and value. Common usage patterns we see include using LLMs to create structured data points from documents or text (which can feed into and improve predictive models), generating content or self-service insights from both input data and analytic models, and creating conversational AI-powered chatbots with retrieval-augmented generation (RAG) frameworks on dedicated structured data and/or document stores.

Naturally, we jumped right on the bandwagon by extending many of the Dataiku Solutions in our catalog with GenAI features! We’ve built out extensions leveraging LLMs with Dataiku Answers for both our Molecular Property Prediction and Clinical Site Intelligence Solutions.

The clinical trials database naturally contains a large amount of text summarizing and describing trials, contact and site information, and patient eligibility criteria. Using our site intelligence solution, we used the extracted query data along with the analytics around study enrollment rate predictions to create a trial assistant chatbot.

[Image: Dataiku flow]

These data points were embedded into a knowledge bank that fuels Answers, our no-code visual chatbot web application, and voilà! We now have an extension, which we are evaluating for performance, that offers a conversational experience for trial design, competitive intelligence, and site planning or selection, sourced from both ClinicalTrials.gov and the analytic insights provided by the Dataiku Solution for Clinical Site Intelligence.
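
For readers new to the retrieval pattern behind a knowledge bank, here is a toy sketch of the idea, not how Dataiku Answers is implemented. A real knowledge bank would use a learned embedding model and a vector store; this stand-in uses plain word counts and cosine similarity, and the three study snippets are invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: list, k: int = 2) -> list:
    """Return the k documents most similar to the question."""
    q = embed(question)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(question: str, context: list) -> str:
    """Assemble the text an LLM would receive: retrieved context plus question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


# Invented snippets standing in for extracted trial data and model outputs.
documents = [
    "Study A: phase 3 breast cancer trial, 40 sites, predicted enrollment "
    "rate 1.8 participants per site per month.",
    "Study B: phase 2 asthma trial, 12 sites in Europe, eligibility "
    "requires age 18 to 65.",
    "Study C: phase 1 healthy volunteer pharmacokinetics study at a single site.",
]
question = "Which breast cancer trial has the highest predicted enrollment rate?"
top = retrieve(question, documents, k=1)
print(build_prompt(question, top))
```

The key design point is that the model only sees the retrieved snippets, which keeps answers grounded in the trial data and analytic outputs rather than in the model's general knowledge.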

[Image: Clinical trial insights assistant in Dataiku]

→ Catch a further sneak peek of this application in our joint AWS Bedrock and Dataiku webinar!

Moving on to Molecular Property Prediction, here we had only a very well-structured dataset (but note all those column descriptions stored in the Dataiku metadata).

[Image: Structured dataset]

There’s no reason we can’t have a conversation with a well-described dataset, is there? New capabilities we’re releasing with Dataiku Answers now allow you to do exactly that!

Instead of a knowledge bank, let’s use an SQL table for retrieval. 

[Image: SQL table retrieval]
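
Conceptually, retrieval over a table works because the column descriptions tell the LLM what it can ask for. The sketch below illustrates that idea with Python's built-in `sqlite3`; it is not how Dataiku Answers works internally. The table, column names, and pIC50 values are made-up placeholders, and the SQL string is hard-coded where a real system would have the LLM generate it from the question and the schema description.

```python
import sqlite3

# Hypothetical column metadata, similar in spirit to dataset column descriptions.
COLUMN_DESCRIPTIONS = {
    "molecule_id": "Identifier of the compound",
    "smiles": "SMILES string of the compound structure",
    "pic50": "Measured bioactivity against the target, expressed as pIC50",
}


def schema_prompt(table: str, descriptions: dict) -> str:
    """Describe the table to the LLM so it can write a valid query."""
    lines = [f"Table {table} has the following columns:"]
    lines += [f"- {name}: {desc}" for name, desc in descriptions.items()]
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bioactivity (molecule_id TEXT, smiles TEXT, pic50 REAL)")
conn.executemany(
    "INSERT INTO bioactivity VALUES (?, ?, ?)",
    [  # placeholder rows, not real measurements
        ("mol-1", "CCO", 4.2),
        ("mol-2", "c1ccccc1", 6.8),
        ("mol-3", "CC(=O)O", 5.1),
    ],
)

question = "Which studied molecule is most active against the target?"
prompt = schema_prompt("bioactivity", COLUMN_DESCRIPTIONS) + f"\n\nQuestion: {question}"

# Stand-in for the LLM's output: in practice the model would write this SQL.
generated_sql = "SELECT molecule_id, pic50 FROM bioactivity ORDER BY pic50 DESC LIMIT 1"
rows = conn.execute(generated_sql).fetchall()
print(rows)
```

The query result, rather than the whole table, is what gets handed back to the LLM to phrase the final answer, which is why clear column descriptions matter so much.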

Now, let’s ask some questions and discover key details about the bioactivity of previously studied molecules for our protein target of interest!

In my experience working across many healthcare and life sciences organizations, building AI-powered chatbots can be one of the most pragmatic ways to begin to deliver real business value in your GenAI initiatives and strategies. Examples like the applications above immediately expand the reach and utility of insights generated from valuable scientific research and clinical documents. 

Of course, there are many further ways these solutions could be extended with GenAI. For example, with Clinical Site Intelligence, you could generate clinical site summary reports for site selection review. With Molecular Property Prediction, scientists and biotech firms are already pushing the boundaries of GenAI to create recommendation engines for potential new molecular structures as lead candidates. When we begin to holistically embed these new “shiny” tools into our existing toolbox of robust data rules, analytics, and ML, the possibilities truly become limitless.
