How Synthetic Data Enhances Generative AI for Enterprises

By Murtaza Khomusi

This article was written by our friends at Gretel. Gretel is the global leader in high-quality, privacy-preserving synthetic data generation, providing secure, customizable, and scalable data solutions for AI development.

The race to embed Generative AI applications across enterprise domains is more intense than ever. According to a recent IBM report, 64% of CEOs are under pressure to integrate Generative AI into their enterprise functions. However, the key to adapting generalist large language models (LLMs) for specialized tasks lies in data.

Leading LLMs are already trained on the vast majority of the public web — this makes them good generalists. For these models to master specialized tasks — like detecting a specific type of fraud pattern unique to a regional bank — they need to be exposed to new, specialized data sources. This is where enterprises encounter a “data wall,” an impasse where the necessary data either does not exist or is too sensitive to be exposed to a model.

To overcome these data scarcity and privacy challenges, organizations are increasingly turning to synthetic data: data generated by an AI model rather than collected from the real world. Synthetic data offers a viable way to push these models the last mile toward production. Below, we delve into both data bottlenecks and how enterprises are successfully leveraging synthetic data to overcome them and drive innovation.


Data Scarcity

In many cases, the data needed to fine-tune an LLM, whether to teach it something new or to improve its performance on something it already knows, simply does not exist. Data collection efforts can be costly, require large user bases, and take months to complete. Surprisingly, this problem is felt at small organizations and at large, relatively data-rich ones alike.

With synthetic data, users can create custom datasets from a simple prompt, telling a model to generate the specific categories of data they need in seconds. Users can also dictate the exact conditions they would like the data to simulate, giving an unparalleled level of control over data generation. Better yet, they can iteratively refine the dataset with follow-up prompts until it meets their needs.

For example, a North American e-commerce store might already excel in selling equipment for popular sports in the region, such as football, baseball, and basketball. This data allows it to fine-tune an LLM-powered customer service chatbot to answer questions about these sports. But what if this retailer now wants to expand into popular European sports like rugby or cricket?

Yes, the retailer can stock these items in its brick-and-mortar or online stores, but how will it ensure the chatbot can answer questions about these new topic areas on day one and provide a consistent customer experience? With synthetic data, the organization can generate customer interaction and sales data for these new areas, instantly expanding the domain expertise of its customer service chatbot and ensuring a seamless digital experience. Similar uses for synthetic data exist in virtually every industry.
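As a rough illustration, the prompt-driven workflow described above might look like the sketch below. It assumes a generic OpenAI-compatible chat completions client; the model name, prompt, and JSON fields are placeholders, and a dedicated synthetic data platform such as Gretel would expose its own SDK for the same task.

```python
import json

from openai import OpenAI  # assumption: any OpenAI-compatible endpoint

client = OpenAI()  # reads the API key from the environment

# Describe the exact slice of data that is missing: support chats about cricket gear.
prompt = """Generate 5 synthetic customer-support conversations for a sporting-goods
retailer expanding into cricket equipment. Return a JSON object with a "conversations"
key holding an array of items, each with: "customer_question", "product_category",
"agent_answer", and "resolution" (either "resolved" or "escalated").
Vary the tone, the product (bats, pads, balls), and the issue (sizing, shipping, returns)."""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)

records = json.loads(response.choices[0].message.content)["conversations"]
print(f"Generated {len(records)} synthetic support conversations")

# Iterative refinement: a follow-up prompt can reshape the dataset, for example
# "Add 5 more examples focused on warranty questions about cricket bats."
```

Each refinement round can be reviewed and appended to the training set until the chatbot's new domain is well covered.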

Data Privacy 

In other instances, an organization might not want to risk exposing its sensitive data to a model. Sharing sensitive data with public Generative AI services like ChatGPT has become akin to dropping it into the public cloud. Even when organizations attempt to fine-tune the underlying models through public APIs, a growing number of incidents have occurred in which LLMs unintentionally expose personally identifiable information (PII) to unauthorized users. This risk, along with the rise of increasingly sophisticated prompt injection attacks, leads many organizations to exercise caution before exposing their most sensitive customer or employee data to any LLM or public AI service.

In these cases, enterprises often turn to fine-tuning a local, self-hosted LLM with their own data. Here, an organization might feel comfortable exposing some of its raw data to a locally hosted model, but still opt to create synthetic, private versions of its most sensitive information so that the model can learn from the data's insights while retaining no direct link to the sensitive entities in the original source. This is especially relevant in highly regulated industries like healthcare, finance, and the public sector, where strict data use policies, even internally within an organization, significantly delay or entirely prevent key digital initiatives.
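To make this concrete, here is a minimal sketch of fine-tuning a locally hosted open-weights model on synthetic records, assuming the Hugging Face transformers, peft, and datasets libraries. The model name, example records, and hyperparameters are placeholders rather than a prescription.

```python
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE_MODEL = "your-org/local-base-model"  # placeholder: any self-hosted open-weights LLM

# Synthetic, de-identified records stand in for the raw sensitive data.
synthetic_records = [
    {"text": "Customer asked about a declined card; agent verified identity and reissued it."},
    {"text": "Customer reported a duplicate charge; agent refunded the transaction."},
]

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # many causal LMs ship without a pad token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# LoRA keeps fine-tuning cheap: only small adapter matrices are trained, and the
# base weights never leave the local environment. Module names vary by architecture.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

dataset = Dataset.from_list(synthetic_records).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="local-llm-finetuned",
                           per_device_train_batch_size=2, num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```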

For example, in the healthcare industry, sensitive clinical records containing protected health information (PHI), such as the records of every patient treated for a disease like pancreatic cancer, usually cannot be used as training data for a Generative AI model. When researchers plan to use such data to train a model, their intention is not to teach the model about specific patients, but to teach it the general statistical properties and patterns of a disease and its treatment. That is, clinical practitioners are looking to teach a model about the disease, not the patient.

When generating synthetic data, a user can train a model to learn the statistical properties of such a dataset and create a statistically similar one that carries all the analytic value but none of the original sensitive patient information. This enables medical researchers to leverage the latest technologies in the pursuit of cures while preserving the privacy of individual patients. For these regulated sectors, innovation with the latest AI technologies would otherwise be out of reach.
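One way to check the "statistically similar" claim is to compare the real and synthetic tables directly. The sketch below assumes both are available as CSV files with matching numeric columns (file names are illustrative); commercial synthetic data platforms generate far richer quality and privacy reports, but the idea is the same.

```python
import pandas as pd

# Illustrative file names: the real records stay inside the secure environment,
# while the synthetic stand-in can be shared with the research team.
real = pd.read_csv("real_patient_records.csv")
synthetic = pd.read_csv("synthetic_patient_records.csv")

numeric_cols = real.select_dtypes("number").columns

# 1. Do per-column distributions line up? Compare means and standard deviations.
summary = pd.DataFrame({
    "real_mean": real[numeric_cols].mean(),
    "synth_mean": synthetic[numeric_cols].mean(),
    "real_std": real[numeric_cols].std(),
    "synth_std": synthetic[numeric_cols].std(),
})
print(summary)

# 2. Are relationships between variables preserved? Compare correlation matrices.
corr_gap = (real[numeric_cols].corr() - synthetic[numeric_cols].corr()).abs()
print("Largest pairwise correlation gap:", corr_gap.to_numpy().max())
```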

Final Thoughts

Synthetic data plays a key role as a data source in the MLOps journey. Organizations often combine synthetic and real-world data to create powerful data foundations capable of customizing the LLMs that power leading-edge chatbots, co-pilots, and agents.
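A lightweight way to keep that combination auditable is to tag each record with its provenance before blending the two sources, as in this hypothetical pandas sketch (file and column names are assumptions):

```python
import pandas as pd

# Assumption: the real and synthetic extracts share the same schema.
real = pd.read_csv("real_interactions.csv").assign(source="real")
synthetic = pd.read_csv("synthetic_interactions.csv").assign(source="synthetic")

# The provenance column travels with the data, so downstream users (and the data
# catalog) can always see which rows were synthetically generated.
training_data = pd.concat([real, synthetic], ignore_index=True)
print(training_data["source"].value_counts())
```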

An end-to-end analytics and AI platform like Dataiku, combined with a synthetic data platform like Gretel, provides a powerful solution for rapid LLM adoption in enterprises. Users can generate synthetic datasets with Gretel and seamlessly integrate them into dedicated collections in Dataiku's data catalog, adding descriptions and tags that let others see which fields or features are synthetically generated. This improves data stewardship and provides end-to-end visibility into data lineage, enabling users to create a trusted, auditable data foundation and build responsible, interpretable AI. Dataiku also simplifies fine-tuning and self-hosting local LLMs if data privacy concerns prevent you from using commercial AI services.

Feel free to explore these workflows and more with the Dataiku LLM Mesh and Gretel’s synthetic data platform today and accelerate your journey to launching your next Generative AI service. 
