We’ve previously covered what large language models (LLMs) are, how large language models work, and how to use LLMs in the enterprise. We’ve also delved a bit into use cases for ChatGPT and LLMs in an enterprise context. In this blog post, we’ll talk about how you can leverage LLM technology in the Dataiku platform.
Recap: LLMs in the Enterprise
As a quick recap, there are two ways to use a large language model (LLM) in an enterprise context:
- By making an API call to a model provided as a service, such as the models provided by OpenAI, including the gpt-3.5-turbo and gpt-4-0314 models that power ChatGPT. Generally, these services are provided by specialized companies like OpenAI, or by large cloud computing companies like Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure.
- Downloading and running an open-source model in an environment that you manage. Platforms like Hugging Face aggregate a wide range of such models.
Each approach has advantages and drawbacks. But in general, both options allow you to choose from smaller and larger models, with tradeoffs in terms of:
- The breadth of their potential applications
- The sophistication of the language generated
- The cost and complexity of using the model
Dataiku & LLMs
Dataiku is the platform for Everyday AI, enabling data experts and domain experts to work together to build artificial intelligence (AI) into their daily operations. Together, they design, develop and deploy new AI capabilities, at all scales and in all industries. Organizations that use Dataiku enable their people to be extraordinary, creating the AI that will power their company into the future.
More than 500 companies worldwide use Dataiku, driving diverse use cases from predictive maintenance and supply chain optimization, to quality control in precision engineering, to marketing optimization, and everything in between.
Dataiku supports both of these approaches for using LLMs in the enterprise. Beyond this particular use case, Dataiku also provides the broader framework necessary for properly deploying any AI use case, including data connectors, data preparation capabilities, automation, and governance features.
Our core vision has always been to allow enterprises to quickly integrate the latest machine learning models and neural networks into their operations. This vision continues with LLMs.
Using Public Models via API in Dataiku
Dataiku provides a direct interface with the popular OpenAI text completion APIs, including the powerful GPT-3 family of models. Using this integration is simple:
- Prepare your input dataset. The Dataiku integration supports use cases where you provide both instructions and input text (“change the following text from first-person to third-person,” “I went for a long walk one day…”) and where only instructions are provided (“write a 50-word email addressed to Mr. Smith”). Like all data preparation steps in Dataiku, this may be done using the visual data preparation tools or using the programming language of your choice.
- Configure your integration with the OpenAI text completion API, adding the API keys obtained from OpenAI when signing up for their service.
- Select the model that you want to call, referring to the OpenAI documentation to choose the right one for your particular task, as well as the additional parameters offered by the OpenAI API.
- Specify the output dataset to which the completed text will be written.
To use other models available via API, replicate Steps 2-4 above with a Python or R recipe that makes the API call directly.
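As an illustration, a Python recipe making such an API call might look like the sketch below. It targets OpenAI's public chat completions endpoint; the prompt, model choice, and the idea of reading your API key from an environment variable are assumptions for the example, not a prescribed Dataiku pattern.

```python
# Hypothetical sketch: calling the OpenAI chat completions API from a
# Python recipe. Endpoint and payload shape follow OpenAI's public API;
# the example prompt and environment variable name are made up.
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"

def build_request(prompt: str, model: str = "gpt-3.5-turbo",
                  max_tokens: int = 256) -> dict:
    """Assemble the JSON payload for one completion call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def complete(prompt: str, api_key: str, model: str = "gpt-3.5-turbo") -> str:
    """POST the payload and return the generated text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(prompt, model)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # In a real recipe you would read rows from the input dataset,
    # call complete() per row, and write results to the output dataset.
    key = os.environ.get("OPENAI_API_KEY", "")
    if key:
        print(complete("Write a 50-word email addressed to Mr. Smith", key))
```

In practice, the recipe would loop over the rows of the input dataset and write the completions column back to the output dataset, just as the visual integration does.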
Using an Open-Source LLM in Dataiku
One of Dataiku’s core strengths and design principles is extensibility. If a particular feature or function is not natively available in Dataiku, the platform allows users to build that integration themselves.
This is the case for open-source LLMs, as it is for other machine learning models not natively available in Dataiku. A high-level overview of the process, which would typically be carried out in several Python recipes, is as follows:
- Download and install a Python package, such as the Hugging Face transformers package.
- Download the pretrained model.
- If you have relevant labeled data and out-of-the-box performance is not satisfactory, consider retraining or fine-tuning the model.
- Perform the text generation step (called inference), and write the results back to the desired data store.
- (Optional) The model could then be packaged into a visual plugin, allowing non-coding users to apply the model to their tasks.
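The steps above can be sketched in a single Python recipe using the Hugging Face transformers package. The model name ("gpt2") and the batch size are assumptions chosen for illustration; any text-generation model from the Hugging Face hub could be substituted.

```python
# Hypothetical sketch of the open-source workflow: download a pretrained
# model via the "transformers" package and run inference over input texts.
# The model name and batch size are illustrative, not prescriptive.

def batch(texts, size):
    """Yield fixed-size batches so large inputs don't exhaust memory."""
    for i in range(0, len(texts), size):
        yield texts[i:i + size]

def generate(texts, model_name="gpt2", max_new_tokens=50):
    """Download the pretrained model and run the inference step."""
    from transformers import pipeline  # pip install transformers
    generator = pipeline("text-generation", model=model_name)
    results = []
    for chunk in batch(texts, 8):
        for out in generator(chunk, max_new_tokens=max_new_tokens):
            results.append(out[0]["generated_text"])
    return results

# Usage inside a recipe (sketch):
#   completions = generate(input_df["text"].tolist())
#   then write the completions back to the output dataset.
```

Wrapping `generate()` in a visual plugin is then mostly a matter of exposing the model name and generation parameters as plugin settings.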
Dataiku also provides additional Natural Language Processing (NLP) capabilities built on LLMs to assist in building machine learning models that take natural language text as an input. For example, sentence embedding converts text variables into numerical vectors that can then be easily used in the visual machine learning interface for classification and regression tasks.
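To make the sentence-embedding idea concrete, here is a minimal sketch. The `sentence-transformers` package and the "all-MiniLM-L6-v2" model name are assumptions for illustration; Dataiku's built-in embedding feature does not require you to write this code. The cosine-similarity helper shows how the resulting vectors capture meaning: semantically similar sentences produce vectors whose similarity is close to 1.0.

```python
# Hypothetical sketch: turning sentences into fixed-length numeric
# vectors (embeddings) that any classifier or regressor can consume.
# The package and model name are assumptions for this example.
import math

def embed(texts, model_name="all-MiniLM-L6-v2"):
    """Encode each text as a fixed-length vector of floats."""
    from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
    model = SentenceTransformer(model_name)
    return model.encode(texts)  # shape: (len(texts), embedding_dim)

def cosine(u, v):
    """Compare two embeddings; values near 1.0 mean similar meaning."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Usage (sketch):
#   vectors = embed(["great product", "terrible service"])
#   cosine(vectors[0], vectors[1])  # low similarity for opposite sentiments
```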
See LLMs With Dataiku In Action
See Dataiku in action as we walk through several use cases for large language models in the platform: