Ever tried getting a quick answer from a 300-page financial 10-K report or a lengthy legal contract packed with fine print? It’s not easy! For many organizations, dealing with massive documents like these is just part of the daily routine. While they contain critical information, their sheer size poses a challenge for humans and large language models (LLMs) alike.
In the rapidly evolving era of Generative AI and LLMs, two primary methods have emerged for analyzing or extracting insights from large-scale documents: leveraging extended context windows or using Retrieval-Augmented Generation (RAG). Recently, a data scientist who works with legal documents and statements of work in a procurement context told us he always first tries to make his GenAI use cases work with a long-context prompt before using a RAG pipeline, even if the documents are a dozen pages long. This raises the question: With context windows expanding so rapidly, will RAG become obsolete?
The allure of feeding an entire document directly into a model with a huge context window is undeniable, because it reduces the need for complex prompt engineering and data pre-processing workflows. However, the RAG approach, which strategically selects relevant sections of text to inform a model's response, is a powerful and proven technique. Therefore, organizations need to carefully consider the strengths and limitations of each approach when working with these types of lengthy documents. In this blog, we’ll explore the implications of these two approaches, how they compare, and what it means for enterprise AI applications.
Understanding Context Windows
At the heart of the debate between long-context models and RAG lies the concept of the context window. In simple terms, a context window is the maximum amount of text (measured in tokens) that an LLM can process in one go. Early LLMs like the original GPT-1 model, released by OpenAI in June 2018, had a modest context window of just 512 tokens, roughly equivalent to one page of text.
Fast forward to today, and we're seeing models like Google’s Gemini 1.5 boasting context windows capable of processing over a million tokens, or approximately 1,500 pages of text. If the current pace of GenAI innovation continues, we’re likely to see context windows grow even larger, though the rate of expansion is driven by a combination of model optimization and hardware advancements.
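To make those token counts concrete, here’s a minimal sketch that estimates how much of a context window a given document would occupy. It assumes OpenAI’s tiktoken tokenizer purely for illustration (other model families use their own tokenizers, so counts will differ), and the file path is a placeholder.

```python
# Minimal sketch: estimate how much of a context window a document consumes.
# Assumes OpenAI's tiktoken tokenizer as an illustrative stand-in; other model
# families use different tokenizers, so treat the count as approximate.
import tiktoken

def estimate_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens in a piece of text with the given tiktoken encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

# Placeholder path: substitute your own 10-K, contract, or other long document.
with open("annual_report_10k.txt", encoding="utf-8") as f:
    document = f.read()

n_tokens = estimate_tokens(document)
print(f"~{n_tokens:,} tokens "
      f"({n_tokens / 1_000_000:.1%} of a 1M-token context window)")
```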
The Long-Context Approach: Pros and Cons
With such vast context windows available, the temptation to feed an entire document directly into the model is hard to resist, especially in use cases where inconsistent or complex formatting makes breaking the document into chunks a challenging task. Using whole documents is simply easier, which translates to less complex pipelines and thus faster development time. However, the long-context model approach comes with tradeoffs such as increased latency and processing costs, both of which can be significant.
Artificial Analysis LLM leaderboard: A comparison of GPT-4o, Llama 3, Mistral, Gemini, and over 30 models
Although pricing models and latency for hosted LLM services vary widely by provider, you can generally expect input token costs to scale linearly with the length of the text you provide in your prompt, which makes sense: more input tokens processed, more dollars charged. In the figure above, which focuses just on the Google Gemini family of LLMs, note that models with larger context windows may also charge more per token, so cost actually grows across two dimensions.
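As a back-of-the-envelope illustration of that linear scaling, the sketch below compares the input-side cost of prompting with a whole document versus a smaller retrieved context. The model names and per-token prices are hypothetical placeholders, not actual list prices.

```python
# Back-of-the-envelope prompt cost: input cost scales linearly with prompt length.
# Prices are hypothetical placeholders (USD per 1M input tokens), not list prices.
PRICE_PER_1M_INPUT_TOKENS = {
    "standard-context-model": 0.35,  # hypothetical smaller-context model
    "long-context-model": 1.25,      # hypothetical long-context model, pricier per token
}

def input_cost(n_input_tokens: int, model: str) -> float:
    """Estimate the input-side cost of a single prompt, in USD."""
    return n_input_tokens / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS[model]

# Whole 300-page report (~200k tokens) vs. retrieved chunks (~15k tokens)
print(f"Full document:    ${input_cost(200_000, 'long-context-model'):.4f}")
print(f"Retrieved chunks: ${input_cost(15_000, 'standard-context-model'):.4f}")
```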
Latency may also be a limitation of the long-context approach if your application requires very fast responses. For instance, you can see that Google Gemini 1.5 Flash has a context window of one million tokens and a median first-chunk response time of 0.39 seconds; compare this to Google Gemini 1.5 Pro, which can accept up to two million tokens but whose median response time is more than double that of the Flash model.
Additionally, including whole documents in the context window comes with other drawbacks, such as increased tendency for the model to get “distracted” by all the irrelevant information surrounding the key insights you’re seeking. This is related to the problem of position bias, where a model's accuracy can vary depending on where within the document the relevant information is located. For example, a model might perform better when key insights appear near the beginning or end of a document, but struggle to retrieve critical details buried in the middle, potentially leading to incomplete or skewed analysis in long, dense texts. Even cutting-edge reasoning models such as the newly announced OpenAI o1 series suffer from the distraction problem, which is why OpenAI recommends including “only the most relevant information to prevent the model from overcomplicating its response” in your prompt.
RAG: An Established Approach
RAG provides an alternative path by focusing on a targeted strike, rather than “boiling the ocean.” Instead of throwing everything into a model’s context window, RAG techniques involve an information retrieval step that identifies the most relevant sections of a document and includes them in the prompt as additional context to the model. This efficient and precise approach ensures that the LLM focuses only on the information most likely to accurately answer the query.
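As a rough illustration of that retrieval step, here’s a minimal sketch using the sentence-transformers library to embed pre-chunked text and select the most relevant pieces for the prompt. The `document_chunks` list and `call_llm` function are hypothetical stand-ins for your own chunking step and model call, not part of any specific framework.

```python
# Minimal RAG retrieval sketch: embed chunks, score them against the query,
# and keep only the most relevant ones as context for the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Return the top_k chunks most similar to the query (cosine similarity)."""
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ query_vec            # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:top_k]    # indices of the highest-scoring chunks
    return [chunks[i] for i in best]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Assemble a prompt that contains only the retrieved context."""
    context = "\n\n".join(retrieve(query, chunks))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

# answer = call_llm(build_prompt("What are the termination conditions?", document_chunks))
```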
The RAG method can be particularly effective in scenarios where the document is lengthy, but only a small portion is directly relevant to the task at hand. RAG not only reduces the computational load (and thereby cost and latency) but also can provide exact source citations and improve the accuracy of the model’s output. The ability to fact-check results from trusted knowledge repositories is a critical factor in mitigating the risk of hallucinations in enterprise applications where precision is key and user trust is paramount.
Evaluating Long-Context Models vs. RAG: Insights from Recent Research
To understand the practical implications of these approaches, let’s look at a recent study by Salesforce AI Research that compared the effectiveness of long-context models vs. RAG. The study designed a framework to evaluate how well each method could retrieve precise insights from large sets of documents, a task akin to finding a needle in a haystack.
The researchers created five distinct groups, which they called “haystacks,” each composed of about 100 documents containing multiple insights on various topics. For instance, one topic could be “how to manage stress,” and the insights associated with that topic could be “deep breathing,” “daily walk,” and “meditation.” They then evaluated 10 different models, comparing their performance in both long-context and RAG setups. For the long-context setup, they provided the full haystack in the prompt (~100,000 tokens), while for the RAG setup they selected relevant chunks totaling about 15,000 tokens using six different retrieval techniques (see figure below).
Key Findings
The study used two main metrics to evaluate the two methods against each other: coverage and citation accuracy.
- Coverage refers to the extent to which a model can retrieve and include relevant information from a document or dataset. A high coverage score indicates that the model can effectively capture a broad range of insights or data points.
- Citation accuracy, on the other hand, measures how accurately the model can reference or attribute specific pieces of information to their sources.
Retrieval setups compared in the figure:
- Full: no retrieval; the full context is provided
- KW: retrieval based on keywords
- Vect: retrieval based on sentence-transformers embeddings
- LongE: retrieval based on an advanced embedder for long contexts
- RR3: retrieval based on Cohere’s Rerank 3 model
- Rand: random retrieval (lower bound)
- Orac: artificially boosted retrieval (unrealistic, to provide an upper bound)
Our main takeaways from the research?
- Despite RAG’s renowned prowess at retrieving relevant information from a document, long-context approaches do not necessarily degrade coverage dramatically; certain long-context LLMs can produce comparable (or even better) coverage. For example, in this study’s controlled setting, Claude 3 Opus with full context achieved 76.2% coverage, better than many of the RAG-boosted models.
- The RAG approach significantly improves citation accuracy. With the exception of Gemini 1.5 Pro, which stands out with comparable citation scores in the RAG and long-context setups, RAG was consistently the winning approach when it came to identifying the precise reference insights to support the LLM’s response.
- Long-context suffers from position bias. In the long-context setup, the researchers evaluated the impact of positioning relevant insights in a particular place in the document (beginning, middle, end). All three of the models evaluated demonstrated position bias: GPT-4o and Claude 3 Opus performed better when insights were at the end of the document, while Gemini 1.5 Pro performed better when insights were at the beginning of the document.
For now, long-context models are ideal when understanding the entirety of a document provides a clear advantage, such as in summarization tasks. By contrast, for scenarios where accurately retrieving precise details is crucial and citation quality is important, RAG is still the way to go.
Pro Tip: Mitigate Position Bias With Reordering
As mentioned earlier, position bias occurs when a model’s performance depends on where information is located in the input text (e.g., a model might favor information at the beginning or end of a document). This can lead to underperformance when key details sit in a position the model tends to neglect, most often buried in the middle passages, an effect colloquially known as the “lost in the middle” phenomenon.
That said, it’s important to note that although long-context approaches may be more sensitive to position bias, this problem can also impact RAG pipelines. A technique called reordering can help mitigate this effect by reorganizing the retrieved chunks to strategically place the most relevant information at the “top and tail” of the provided context. For example, in a RAG system with reordering, the retriever might initially select multiple chunks of text, but instead of treating them equally or using them in their original order, the most relevant chunks are positioned at the beginning or end, and the least relevant chunks are positioned in the middle.
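Here’s a minimal sketch of that reordering idea, assuming the retrieved chunks arrive sorted from most to least relevant (for example, the output of a retriever like the one sketched earlier).

```python
# Reorder retrieved chunks to counter the "lost in the middle" effect:
# the strongest chunks go to the start and end of the context, the weakest
# end up in the middle. Assumes `chunks` is sorted most-relevant-first.
def reorder_for_position_bias(chunks: list[str]) -> list[str]:
    front, back = [], []
    for i, chunk in enumerate(chunks):
        # Alternate: even ranks to the front, odd ranks to the back.
        (front if i % 2 == 0 else back).append(chunk)
    # Reversing the back half pushes the least relevant chunks toward the middle.
    return front + back[::-1]

# Relevance ranks 1 (best) through 5 (worst) become [1, 3, 5, 4, 2]:
print(reorder_for_position_bias(["1", "2", "3", "4", "5"]))
```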
Final Takeaways
Both approaches have their place in enterprise AI applications, and the choice between them will depend on the specific use case. But you don’t need to guess! Here is a basic framework for deciding which method to pursue:
The good news is, with secure connections to the latest long-context models via the LLM Mesh and pre-built components to accelerate RAG workflows, Dataiku empowers teams to use both approaches with ease. As LLMs continue to evolve, the debate between long-context models and RAG will likely persist. But to answer the question posed at the beginning: As of today, no! RAG is not obsolete. Despite the growth of context windows, RAG remains a vital tool in the data scientist’s arsenal and continues to be an indispensable method for certain tasks.