In the ever-evolving field of AI and natural language processing (NLP), large language models (LLMs) have emerged as powerful tools for a wide range of applications, from text generation to question answering. However, adapting these models to specific tasks and domains can be challenging. Several methods have gained traction, such as prompt engineering, Retrieval-Augmented Generation (RAG), fine-tuning, and continued pre-training. With multiple proven options available, developers often face the same dilemma: Which approach is best for their specific use case?
Since the answer is not black and white, in this blog post we will compare two approaches with a real-world use case: in-context learning and fine-tuning. Using Dataiku and models from Amazon Bedrock, we will analyze where one approach may be preferred over the other and provide insights into the trade-offs between flexibility, performance, and resource requirements. Whether you are a researcher, a developer, or simply curious about the latest advancements in AI, this analysis will show you how to compare and choose the right approach for your use case, equipping you to make informed decisions when working with LLMs.
Tech Primer: Understanding the Fundamentals
In-context learning is a technique where the model is provided with information (e.g., instructions, guidelines, examples) in the prompt that helps the LLM understand how best to perform the target task during inference. Context can also provide guardrails and moderation cues that help the LLM scope its answer to the task at hand, such as setting expectations to guide tone and style, or demonstrating a safe and appropriate response.
In particular, few-shot learning is a method commonly used for in-context learning, in which the model uses the context from a handful of good examples at inference time to learn what is expected in the output. Think of it like teaching a friend a new card game by playing through a few rounds with open hands, so they can figure out the rules and strategy based on the immediate context you’re giving them in the moment. This approach leverages the model's ability to learn from patterns and adapt to new tasks without explicit retraining, and doesn’t alter the model’s weights or parameters at all.
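To make this concrete, here's a minimal sketch of what a few-shot prompt might look like. The task and examples are invented purely for illustration:

```python
# A minimal few-shot prompt: two worked examples show the model the
# expected input/output pattern before the real query at the end.
few_shot_prompt = """Classify the sentiment of each operator note as Positive or Negative.

Note: The pump failed twice during the night shift.
Sentiment: Negative

Note: Installation finished ahead of schedule with no incidents.
Sentiment: Positive

Note: Crew reported smooth operations and zero downtime.
Sentiment:"""
```

The model completes the final line by following the pattern established in the examples, with no change to its underlying weights.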
On the other hand, fine-tuning involves further training a pre-trained LLM on a task-specific dataset, allowing the model to adjust its parameters and weights to capture the nuances of the new domain. To fine-tune a foundation model, you provide a dataset of sample prompts and model responses, then train the foundation model on that data so it learns to produce more specific responses. Although fine-tuning can lead to better performance on specific tasks, the process can be computationally expensive and requires large volumes of relevant, labeled training data. Note the key distinction: in-context learning leaves the model's weights unchanged, while fine-tuning updates them.
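To illustrate the data format, fine-tuning datasets for text models are commonly prepared as JSON Lines of prompt/completion pairs (this is also the general shape Amazon Bedrock expects for Titan text model customization). The records below are placeholders, not real drilling data:

```python
import json

# Hypothetical prompt/completion pairs; in practice these would come
# from labeled data, e.g., hourly notes paired with daily summaries.
examples = [
    {"prompt": "Summarize the following hourly drilling notes: ...",
     "completion": "Daily summary: ..."},
]

# Write one JSON object per line (the JSON Lines format).
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```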
Driving Operational Efficiency at the Oil Rig
In the oil and gas industry, rig operators are tasked with documenting hourly drilling notes throughout their shift and generating a summary at the end of each day. This process can be time consuming and monotonous, impacting operational efficiency and the quality of life for rig operators. By building an AI assistant that automatically generates daily summaries using the hourly notes, we can streamline this process, allowing rig operators to focus on more critical tasks.
This use case was inspired by an AWS blog on how to customize LLMs using Amazon Bedrock. But why customize the models at all?
With generalist models, producing high-quality, relevant responses can be a challenge when processing text from highly specialized domains. For example, drilling notes are full of oil industry jargon and abbreviations that are difficult for non-specialists to decipher. Popular and open-source foundation models were likely never exposed to this particular type of data or task during training, so they might not produce summaries that meet an acceptable standard.
This makes this use case a perfect example of when it's appropriate to use in-context learning and/or fine-tuning to generate more useful, domain-specific output (note: the two approaches aren't mutually exclusive!). We'll use the same dataset and LLM that AWS did, but apply these LLM customization approaches using Dataiku's no- and low-code visual interface to simplify and accelerate the solution-building process.
Understanding the Dataset and Tooling
The Norwegian multinational energy company Equinor has made a set of drilling reports, known as the Volve dataset, available for research, study, and development purposes. This labeled dataset contains 1,759 daily drilling reports from the Volve field in the North Sea, each including both hourly comments and a daily summary.
To build the project, we used two technologies: Dataiku and Amazon Bedrock.
- Dataiku - Dataiku includes visual tools like built-in data preparation recipes and a prompt engineering studio built atop Dataiku's LLM Mesh, so teams can safely harness the power of LLMs and create scalable business applications without needing to code.
- Amazon Bedrock - Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.
Thanks to the LLM Mesh, it was simple to connect Dataiku to Amazon Bedrock to access the LLMs we wanted to customize for our experiment. As seen in the figure below, we authenticate via an existing Dataiku-to-Amazon S3 connection, whose credentials are reused to connect to Amazon Bedrock.
Dataiku provides direct connections to LLM providers via a secure API gateway with built-in security and usage controls.
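The same connections are also available programmatically. As a rough sketch, querying a Bedrock model through the LLM Mesh with Dataiku's Python API looks something like this; the LLM ID below is a placeholder (list_llms() reveals the real IDs exposed by your connection):

```python
import dataiku

# Get a handle on the current project and an LLM Mesh model.
project = dataiku.api_client().get_default_project()
llm = project.get_llm("bedrock-connection:amazon.titan-text-lite-v1")  # placeholder ID

# Send a completion request through the Mesh's governed gateway.
completion = llm.new_completion()
completion.with_message("Summarize these hourly drilling notes: ...")
response = completion.execute()
print(response.text)
```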
Preparing the Data in Dataiku
The Volve source data is available as JSON files, which Dataiku easily read in and parsed into a more human-readable tabular dataset format.
Raw source JSON data, parsed into a tabular format.
With the group visual recipe, we aggregated the rig reports by concatenating all the hourly notes from one day into a single row. Using the prepare recipe, which has over 100 configurable processors for simple to advanced data prep tasks, we also added steps such as removing extraneous columns and creating new columns that would serve as labeled examples for the few-shot learning and prompt engineering we’d perform later on.
Dataiku Flow for data preparation using visual, no-code recipes.
A field constructed by concatenating LLM instructions with notes based on examples from the Volve dataset.
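For readers who prefer code, a rough pandas equivalent of the grouping step might look like the following; the column names are assumptions based on the dataset description:

```python
import pandas as pd

# Hypothetical schema: one row per hourly note, keyed by report date.
df = pd.read_json("volve_reports.json")

# Concatenate all hourly comments for a given day into a single row,
# mirroring what the Dataiku group recipe does visually.
daily = (
    df.groupby("report_date", as_index=False)
      .agg(hourly_notes=("comment", lambda s: "\n".join(s)),
           daily_summary=("summary", "first"))
)
```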
Applying In-Context Learning
Recall that in-context learning refers to the model's ability to learn and adapt at inference time based on the information provided within the context of a single input or prompt. We applied few-shot learning by constructing a few examples of what a good input-output pair would look like, so we could incorporate them alongside the explicit task instructions as part of our LLM prompt.
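In code form, the prompt we assembled followed roughly this shape (a simplified sketch; the exact instructions and examples in our project differ):

```python
def build_prompt(example_notes: str, example_summary: str, new_notes: str) -> str:
    """Assemble a one-shot prompt: task instructions, one worked
    example, then the notes to summarize. Repeat the example block
    with more pairs for few-shot prompting."""
    return (
        "You are an assistant for oil rig operators. Summarize the "
        "hourly drilling notes below into a concise daily summary, "
        "preserving domain terminology and abbreviations.\n\n"
        f"Example notes:\n{example_notes}\n"
        f"Example summary:\n{example_summary}\n\n"
        f"Notes:\n{new_notes}\nSummary:"
    )
```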
Dataiku Flow containing additional steps taken to prepare data with examples for in-context learning
For prompt engineering, we used Prompt Studios to design and iterate on the LLM prompt. In this visual interface, it’s simple to assign dataset columns as variables to include our examples in the prompt. We can select from the list of approved models to quickly assess how different LLMs perform on the same prompt and test cases, and also review the estimated cost of the query if we were to run it at scale.
As you can see in the figure below, we used the Amazon Titan Text G1 Lite model from Amazon Bedrock for our experiment.
Dataiku Prompt Studio: Advanced prompt mode, using examples for added context.
Fine-Tuning: An Alternative Approach for Customizing the LLM
For situations where few-shot learning and prompt engineering don’t quite close the gap on the desired quality of the generated text, fine-tuning the model itself might be an option. In Dataiku, we can use either a code-based or visual approach for fine-tuning; both methods allow the customized model to be saved to the LLM Mesh and have all the same control and governance as other connected LLMs.
For our project, we used the no-code recipe to fine-tune the Amazon Titan Text G1 Lite model from Amazon Bedrock.
Dataiku Flow for visual fine-tuning, outputting a new model object “titan_ft_v1”
Dataiku fine-tuning recipe configuration
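For reference, the equivalent operation against Amazon Bedrock directly (outside Dataiku's visual recipe) is a model customization job via the AWS SDK. In this sketch, the role ARN, S3 URIs, and hyperparameter values are all placeholders:

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Launch a fine-tuning (model customization) job on Titan Text G1 Lite.
bedrock.create_model_customization_job(
    jobName="volve-summaries-ft-v1",                         # placeholder
    customModelName="titan_ft_v1",
    roleArn="arn:aws:iam::123456789012:role/BedrockFTRole",  # placeholder
    baseModelIdentifier="amazon.titan-text-lite-v1",
    trainingDataConfig={"s3Uri": "s3://my-bucket/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bucket/output/"},
    hyperParameters={"epochCount": "2", "learningRate": "0.00001",
                     "batchSize": "1"},
)
```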
Note: For Amazon Bedrock, a fine-tuned custom model must be deployed in AWS before it can be used, since custom models are hosted and billed via Provisioned Throughput (by time provisioned, not per inference call). Deploy the model in Amazon Bedrock by purchasing provisioned throughput, then add the model to the list of available models in the Dataiku-Amazon Bedrock connection.
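In SDK terms, that deployment step corresponds to purchasing Provisioned Throughput for the custom model; again, the names and ARN below are placeholders:

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Host the fine-tuned model behind provisioned capacity so it can
# serve inference requests.
response = bedrock.create_provisioned_model_throughput(
    provisionedModelName="titan-ft-v1-serving",  # placeholder
    modelId="arn:aws:bedrock:us-east-1:123456789012:custom-model/...",  # placeholder
    modelUnits=1,
)
print(response["provisionedModelArn"])
```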
Comparing the Outputs of In-Context Learning and Model Fine-Tuning
Going back to the original question at hand, which approach worked better for our use case?
Side-by-side comparison of real daily summary data with synthetic outputs from LLMs using both the in-context learning and fine-tuning approaches.
When we manually reviewed the two generated outputs, the fine-tuned Titan model appeared to produce superior summaries compared to the baseline model supplied with additional context via few-shot learning. To confirm this assessment and automate LLM evaluation at scale, we leveraged LLM-as-a-judge metrics, evaluating each generated summary against the ground-truth summary.
We also used an evaluation recipe in Dataiku, which lets us store model evaluation results and performance metrics over time and compare models across dimensions like answer correctness and answer relevance. This analysis confirmed our initial assessment: the fine-tuned model produces summaries that are closer and more faithful to the daily summaries in our labeled training dataset.
Comparing LLM-as-a-Judge Metrics in Dataiku’s Model Evaluation Store
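To give a feel for how LLM-as-a-judge works under the hood, a judge prompt simply asks a strong model to score a candidate summary against the ground truth. Here's a minimal sketch, reusing the LLM Mesh handle from earlier; the scoring scale and wording are our own illustration, not Dataiku's built-in metric:

```python
def judge_correctness(llm, reference: str, candidate: str) -> int:
    """Ask a judge LLM to rate how factually consistent the candidate
    summary is with the reference, on a 1-5 scale."""
    completion = llm.new_completion()
    completion.with_message(
        "Rate from 1 to 5 how factually consistent the candidate "
        "summary is with the reference summary. Answer with the "
        "number only.\n\n"
        f"Reference:\n{reference}\n\nCandidate:\n{candidate}"
    )
    return int(completion.execute().text.strip())
```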
That said, keep in mind that there are other avenues we could've pursued to try to achieve better summaries before turning to fine-tuning. For example, we could go back to our Prompt Studio and select a different LLM from Amazon Bedrock; Claude 3, in particular, has been noted for its capacity to understand and utilize context effectively within a conversation or task, often performing well in few-shot learning scenarios.
Considerations for Selecting the Model-Customization Approach
Both methods we tried offer distinct advantages depending on the use case. In-context learning provides flexibility and rapid adaptability without the need for extensive retraining, making it ideal for scenarios where quick adjustments are required, large amounts of labeled training data aren’t readily available, or the computation and hardware costs of fine-tuning would be prohibitive.
On the other hand, fine-tuning can often enable more precise model behavior tailored to specific tasks or domains, but requires more resources and time.
Comparing the approaches for flexibility, resources required, and performance
Final Takeaways
The good news is: You don't have to pick just one method! By leveraging Dataiku's LLM Mesh and built-in GenAI components, powered by Amazon Bedrock APIs with a wide range of foundation models, organizations can choose the approach that best aligns with their operational needs, ensuring efficient and effective deployment of AI solutions. In fact, combining these methodologies might offer the most robust solution, allowing for both precision and adaptability in real-time applications.
Regardless of the chosen method, one thing is for certain: Implementing an AI assistant for generating daily summaries will achieve operational efficiencies and value gains, ultimately improving both productivity and overall employee satisfaction. Check out the Ørsted story to see some of these takeaways in practice.