In the fast-paced world of business and technology, few innovations have sparked as much intrigue as Large Language Models (LLMs), which have become the hottest topic of discussion across boardrooms and kitchen tables alike. Thanks to their ability to comprehend and generate human language, LLMs are rewriting the rules of human-machine interaction and paving the way for a new era of possibilities. While this technology is undoubtedly exciting, striking a balance between harnessing the power of LLMs for innovation and safeguarding sensitive information has become a critical challenge for organizations.
While organizations value the extensive knowledge that proprietary SaaS LLMs like ChatGPT provide, using these LLMs out of the box doesn't always meet the needs of the business, because the most interesting business use cases require taking non-public documents into account.
This challenge comes in two forms: first, the LLM has not been trained on these non-public documents; and second, the question prompts and the necessary contextual information are sensitive and can't be shared with a SaaS LLM provider. On top of that, SaaS LLM API token rates can be expensive.
To address these challenges, this blog post will:
- Cover how to create a vector store from a corpus of niche documents, then smartly engineer user prompts with the appropriate context from this corpus.
- Provide an example of using a smaller, lower-cost, open source foundational model, specifically Dolly, an LLM trained by Databricks for less than $30, to exhibit ChatGPT-like human interactivity.
- Demonstrate how to build a full pipeline for data preparation, prompt engineering, and Q&A chatbot interface within Dataiku — fully contained on a company's infrastructure with no external data movement.
What Is Dolly?
Dolly, created by Databricks, is an open source instruction-following Large Language Model based on the EleutherAI Pythia model family. Lightweight and open source LLMs like Dolly or MPT-7B illustrate how organizations can use LLMs to deliver high-quality results quickly and economically. The accessibility of these models complements Dataiku's vision of Everyday AI, helping to democratize LLMs by transforming them from something very few companies can afford into an asset every company can own and customize to improve its products.
We will illustrate this concept in practice with a use case below, highlighting how the complementary technologies of Dataiku and Databricks can support enterprises in developing generative AI-powered applications while also addressing their broader data science and AI needs.
LLM Pipeline Development
Step 1. Preparing the Data
Building an effective LLM pipeline involves a holistic approach that goes beyond just deploying the LLM. Since customization requires clean, curated data, our Flow starts by pre-processing the raw text data to ensure the question-and-answer data is refined and ready for analysis.
In our case, we download the raw data from the Gardening & Landscaping Stack Exchange in the form of XML files. This is where the power of Dataiku comes into play: Dataiku can upload XML files and automatically detect their structure, letting you parse unstructured data files without writing code. With a few visual preparation steps and recipes, our data is transformed from unstructured to structured data.
Step 2. Build a Vector Store
Since LLMs only know as much as they have been shown during their training period, we combine the LLM with our gardening Q&A data source to provide relevant and up-to-date answers. We do this by building a vector store, indexing our embedded text into a fast, searchable database. We use FAISS and LangChain from inside a Python code recipe in our Dataiku Flow to populate our vector store. The Flow seamlessly combines visual and code-based preparation steps, making the entire process transparent and easy to understand.
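A minimal sketch of what that code recipe might look like, assuming a prepared dataset named gardening_qa_prepared with an answer_text column (both names are hypothetical, as is the choice of embedding model):

```python
import dataiku
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Read the prepared Q&A dataset from the Flow (dataset and column names are hypothetical)
df = dataiku.Dataset("gardening_qa_prepared").get_dataframe()

# Embed each answer with a small sentence-transformer model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Index the embedded texts in a FAISS vector store and persist it for later lookups
vector_store = FAISS.from_texts(df["answer_text"].tolist(), embeddings)
vector_store.save_local("faiss_index")
```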
When a new question is asked, we query our vector store to find answers to similar questions asked in the past. These relevant answers are passed as context to our LLM. This retrieve-then-read technique provides an efficient, low-cost way to customize our LLM without retraining the model to change its underlying parameters.
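Continuing the sketch above, the retrieval step might look like the following (the question text is illustrative, and k is a tunable parameter):

```python
# Retrieve the k most similar past answers and concatenate them into a context string
question = "How can I best identify poison ivy?"
docs = vector_store.similarity_search(question, k=3)
context = "\n\n".join(doc.page_content for doc in docs)
```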
Step 3. Prompt Engineering & Inference Q&A
We pass this context into our LLM through a prompt template. We can structure the prompt to provide instructions on how our LLM should respond to our question. An example of our prompt template is provided below.
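A representative sketch of such a template using LangChain's PromptTemplate (the exact wording of our template may differ):

```python
from langchain import PromptTemplate

# Instruct the model to answer from the retrieved context only (illustrative wording)
template = """You are a helpful gardening assistant. Use the following context
from past questions and answers to answer the question at the end. If the
context does not contain the answer, say that you don't know.

Context:
{context}

Question: {question}

Answer:"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])
```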
Once our prompt template is defined, we use built-in functions from LangChain to pass the template to our LLM and return a generated response. Dataiku makes it easy to incorporate open source LLMs from Hugging Face into your Flow: simply load the model into the code environment resource directory and select that code environment for the recipe that runs inference. While we know from quantitative benchmarks that open source LLMs underperform proprietary SaaS LLMs, the illustrative example in the table below shows how we still get a fairly good answer from "relatively small" models.
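As a sketch of that inference step, loading Dolly through a standard Hugging Face text-generation pipeline and chaining it with the template might look like this (the model variant and generation parameters are assumptions; in the Flow, the weights would come from the code environment resource directory rather than the Hub):

```python
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import LLMChain

# Load Dolly as a standard text-generation pipeline (a smaller variant keeps memory modest)
generate = pipeline(
    task="text-generation",
    model="databricks/dolly-v2-3b",
    device_map="auto",
    max_new_tokens=256,
)
llm = HuggingFacePipeline(pipeline=generate)

# Chain the prompt template with the LLM, then run a question with its retrieved context
chain = LLMChain(llm=llm, prompt=prompt)
answer = chain.run(context=context, question=question)
print(answer)
```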
| Example Question | Example Generated Answer |
| --- | --- |
| Every summer despite my best attempts to be careful, I end up developing a nasty rash due to an unfortunate encounter with poison ivy while doing routine yard work. How can I best identify poison ivy, poison oak, and poison sumac? | It is important to be careful during yard work and to wash your hands and clothing thoroughly after contact with the plants. One method to identify poison ivy, poison oak and poison sumac is to observe how they grow: the stems are often swollen and bumpy, they usually grow in clusters, and when they are crushed they give off a bitter-smelling oil. If you are working near trees it is important to inspect the trees from the ground to see if there are any symptoms or if insect bites are present. |
Step 4. Democratizing Our LLM to the Enterprise
Deploying functional data products to the business is pivotal, as it amplifies the value and impact of the LLM. Luckily, we can use no-code Dataiku Applications to package our project as a reusable application with predefined actions to build our Flow with new inputs and view the results. The application below shows how no-code users can simply ask a question or upload a batch of questions, run a scenario to submit these questions for inference, and view and/or download the generated answers from Dolly.
Packaging the project as an application maximizes the potential of the LLM and is key to the pipeline's overall success, since it lets business users run inference without writing any code.
Conclusion
While LLMs have emerged as a groundbreaking innovation with the potential to reshape the way we work, organizations face the challenge of harnessing their power while ensuring data security. Smaller, lower-cost foundational models, such as Dolly or MPT-7B, offer a solution to this challenge.
As open source LLMs become more accessible, this technology will become a commodity that every organization can use to improve the way they work. With the democratization of LLMs, we can expect to see the widespread adoption of this transformative technology and the realization of its full potential across various industries.