Building Generative AI Success With Dataiku & Databricks

By Catie Grasso

As Generative AI accelerates the path to democratizing data and AI in the enterprise, providing tooling for more people and encouraging collaboration between teams becomes paramount.

With Dataiku and Databricks, everyone from data professionals to business experts has what they need to collaborate and develop successful data and AI projects at scale. In September at our Everyday AI New York conference, there was a thought-provoking fireside chat between Jed Dougherty, VP of Platform Strategy at Dataiku, and Barry Daubner, VP of Sales and Business Development at Databricks. We’ve highlighted some key takeaways from the talk here.


We Can Go Faster Together

To kick off the talk, Amanda Milberg, Sr. Partner Sales Engineer at Dataiku, cleverly equated the Dataiku and Databricks partnership to the cronut: a combination of a croissant and a donut that became a hit almost immediately after its invention and one of the most talked-about desserts in recent memory. Just like Dataiku and Databricks, the two pastries (products) are great on their own, but magic can happen when you mix them together.


Organizations use Dataiku as the best-in-class visual interface on top of their infrastructure of choice, in this case the Databricks Data Intelligence Platform. Dataiku and Databricks bring AI capabilities to business teams, enabling all profiles in an organization to scale analytics initiatives with secure and safe access to the Databricks Data Intelligence Platform and the easy-to-use visual interface of Dataiku. 

Now, let’s zoom in on some of the feature integrations that empower users with a full suite of data and AI capabilities closely knit with an organization’s toolkit: 

  1. Teams can read and write data directly to the Databricks Platform. They can also load data from cloud storage into Databricks. 
  2. Data engineers can write SQL code to execute using the computational power of Databricks.
  3. For business analysts, all visual recipes in Dataiku push down computation to Databricks.
  4. Data scientists can write Python code to execute in Databricks with Databricks Connect.
  5. Machine learning engineers can train models in Dataiku and the model inference is pushed down to Databricks.
  6. Last but not least, Dataiku is fully integrated with Databricks Unity Catalog, meaning that security, governance, and lineage are preserved.
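To make the "pushdown" idea in the list above concrete, here is a minimal sketch of how a visual recipe's settings could be compiled into SQL that a Databricks SQL warehouse executes, rather than pulling the data out. All function and table names are hypothetical illustrations, not Dataiku's actual internals or API.

```python
# Illustrative only: a toy "compiler" showing the pushdown idea --
# a visual group-by recipe's settings become a SQL statement that the
# warehouse runs. Names here are invented, not Dataiku's real API.

def compile_groupby_recipe(table: str, keys: list[str], aggs: dict[str, str]) -> str:
    """Turn a visual group-by recipe spec into SQL that could be
    sent to a Databricks SQL warehouse for execution."""
    select_keys = ", ".join(keys)
    select_aggs = ", ".join(
        f"{func.upper()}({col}) AS {col}_{func}" for col, func in aggs.items()
    )
    return (
        f"SELECT {select_keys}, {select_aggs} "
        f"FROM {table} GROUP BY {select_keys}"
    )

sql = compile_groupby_recipe(
    table="main.sales.orders",  # hypothetical Unity Catalog three-level name
    keys=["region"],
    aggs={"amount": "sum", "order_id": "count"},
)
print(sql)
```

The point of the design is that only the small recipe spec and the final aggregates cross the wire; the heavy computation stays on Databricks.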

Since Dataiku was named Databricks AI Partner of the Year at the Databricks Partner Summit during the June 2023 Data + AI Summit in San Francisco, we’ve continued our momentum by extending into Generative AI.

Training vs. Using an Off-the-Shelf LLM

After addressing the Databricks acquisition of MosaicML — which made sense for a number of reasons (namely, technology fit and cultural fit) — Jed asked Barry about when a data scientist needs to train versus when they can use an off-the-shelf LLM. At Dataiku, we’ve worked with many customers who are in the early to middle stages of rolling out LLMs inside their organizations and that’s one of the most frequently asked questions. 

Most often, Jed said, we hear of four main priorities when trying to navigate LLMs:

  • The simplest is getting an API key from a provider, sending text to it, and getting responses back. This can be done via an LLM visual recipe inside Dataiku, for example. 
  • The next is running a private LLM (e.g., one from Hugging Face) on their own servers or on Databricks, without retraining anything. 
  • Next, they might want to build a Retrieval-Augmented Generation (RAG) pipeline that adds their own data as context to the LLM and lets them query it.
  • Finally, fine-tuning a model or actually training one from scratch. 
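The third option, RAG, can be sketched end to end in a few lines: retrieve the documents most similar to a question, then prepend them to the prompt as context. Real pipelines use embedding models and a vector store; the bag-of-words cosine scoring below is a deliberately simple stand-in so the flow is visible, and all names are invented for illustration.

```python
# Minimal RAG sketch: retrieve top-k similar docs, build an augmented prompt.
# Bag-of-words similarity stands in for real embeddings; illustrative only.
import math
from collections import Counter

def vectorize(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    q = vectorize(question)
    return sorted(docs, key=lambda d: cosine(q, vectorize(d)), reverse=True)[:k]

def build_prompt(question: str, docs: list[str]) -> str:
    context = "\n".join(retrieve(question, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

docs = [
    "Unity Catalog governs tables and lineage in Databricks.",
    "Croissants are a laminated French pastry.",
    "Dataiku visual recipes push computation down to Databricks.",
]
print(build_prompt("How do visual recipes run on Databricks?", docs))
```

The assembled prompt would then be sent to whichever LLM the team uses — hosted API or private model — which is exactly why RAG sits between the "just an API key" and "fine-tuning" options: the model itself is unchanged.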

According to Barry, the number of use cases that actually fall into that final bucket varies with the size of the organization, the datasets it has, and what it is trying to accomplish with a given use case. He gave the example of a model trained with the Stanford Center for Research on Foundation Models using the PubMed dataset, a life sciences publication repository owned by the U.S. National Library of Medicine that holds decades of content. 

Most of it is now behind a paywall, but the team found all 10 million full-text documents and trained a model from scratch over the course of a few days. The initial goal was for the model to pass the U.S. medical licensing exam. It didn’t pass, but it had been trained only on publications related to biotech and life sciences. 

So, What Does All of That Mean in Practice?

The TL;DR is that organizations need to analyze use case by use case and choose the right model or serving mechanism for each specific need: it is absolutely not one size fits all! No single LLM will solve all the problems in an organization, especially in large enterprises with dozens of teams working with LLMs. They will (and should!) have many deployed: some will just be APIs, others will be their own trained or fine-tuned models, and they will run across a wide range of different platforms.

Having a cohesive service layer on top of those — the Dataiku LLM Mesh — lets teams maintain governance over these different LLMs across use cases, with access control and cost control. With the LLM Mesh sitting between LLM service providers and end-user applications, companies have the agility to choose the most cost-effective models for their needs both today and tomorrow, ensure the safety of their data and responses, and create reusable components for scalable application development.
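Conceptually, a service layer like this is a gateway: one entry point that routes each request to a registered backend, enforces which teams may use it, and tracks spend. The sketch below illustrates that shape only; the class and method names are invented for illustration and are not the Dataiku LLM Mesh API.

```python
# Hedged sketch of a gateway/service layer for multiple LLM backends:
# routing, per-team access control, and cost tracking in one place.
# All names are hypothetical; this is not Dataiku's actual API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Backend:
    name: str
    cost_per_call: float
    allowed_teams: set[str]
    call: Callable[[str], str]  # the underlying model/API invocation

@dataclass
class LLMGateway:
    backends: dict[str, Backend] = field(default_factory=dict)
    spend: dict[str, float] = field(default_factory=dict)  # per-team totals

    def register(self, backend: Backend) -> None:
        self.backends[backend.name] = backend

    def query(self, team: str, backend_name: str, prompt: str) -> str:
        backend = self.backends[backend_name]
        if team not in backend.allowed_teams:  # access control
            raise PermissionError(f"{team} may not use {backend_name}")
        self.spend[team] = self.spend.get(team, 0.0) + backend.cost_per_call
        return backend.call(prompt)

gateway = LLMGateway()
# Stub backends stand in for a hosted API and a private model.
gateway.register(Backend("hosted-api", 0.002, {"marketing"}, lambda p: f"[hosted] {p}"))
gateway.register(Backend("private-llm", 0.0005, {"marketing", "research"}, lambda p: f"[private] {p}"))

print(gateway.query("research", "private-llm", "Summarize Q3 churn."))
```

Because every call flows through one `query` method, swapping a hosted API for a cheaper private model is a registry change, not an application rewrite — which is the agility argument made above.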
