Beyond Text: Taking Advantage of Rich Information Sources With Multimodal RAG

Vivien Tran-Thien

Retrieval augmented generation (RAG) has become a very popular approach for creating question-answering systems based on specific document collections. RAG consists of including relevant pieces of information in the prompt of large language models (LLMs) to compensate for their knowledge gaps.

However, standard RAG pipelines only handle text. This is a significant limitation because most important information sources involve multiple modalities. For example, documents often include charts, diagrams, photographs, or tables, while instant messages may contain voice messages, pictures, and videos.

This blog post shows how to leverage non-textual content. Specifically, it will:

  • Describe a straightforward multimodal RAG pipeline;
  • Present several variants of this pipeline;
  • Highlight practical challenges and their potential remedies;
  • Introduce a publicly available and reusable Dataiku project that illustrates many of the techniques mentioned here.

A Straightforward Multimodal RAG Pipeline

Let’s assume that the goal is to answer questions based on documents (e.g., PDF documents) containing both text and images. We begin with the simple multimodal RAG pipeline depicted in Figure 1.

Figure 1: A straightforward multimodal RAG pipeline

The pipeline is composed of the following steps:

  1. Content extraction: All input documents are broken down into a list of text chunks and a list of images.
  2. Embedding: The text chunks and images are mapped into a shared vector space using a multimodal embedding model (e.g., a joint text-image model like CLIP or SIGLIP). Such a model maps representations of a given entity in various modalities to similar vectors. The resulting vectors are then added to a vector store.
  3. Semantic retrieval: When the user asks a question, this question is converted into a vector by the multimodal embedding model. This vector is used to query the vector store and retrieve the text chunks and images that are most semantically similar to the question.
  4. Answer generation: The question, the retrieved text chunks, and the retrieved images are included in a prompt sent to a multimodal LLM, which generates the answer. A multimodal LLM (e.g., GPT-4V, GPT-4o, Gemini 1.5) is a model that accepts texts and images as inputs and generates some text.

Steps 1 and 2 are executed only once for a given collection of documents, while Steps 3 and 4 are repeated for each user question.
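
To make Steps 2 and 3 more concrete, here is a minimal sketch based on the CLIP wrapper of the sentence-transformers library. The text chunks, image files, and question are placeholders, and a production pipeline would add the vectors to a proper vector store rather than keep them in memory.

```python
import torch
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Load a joint text-image embedding model (CLIP).
model = SentenceTransformer("clip-ViT-B-32")

# Step 2: embed text chunks and images into the same vector space.
text_chunks = ["Revenue grew by 12% in 2023.", "The churn rate decreased slightly."]  # placeholder chunks
image_paths = ["chart_revenue.png", "diagram_architecture.png"]                       # placeholder images
text_embeddings = model.encode(text_chunks, convert_to_tensor=True)
image_embeddings = model.encode([Image.open(p) for p in image_paths], convert_to_tensor=True)

# In a real pipeline, these vectors would be added to a vector store.
all_embeddings = torch.cat([text_embeddings, image_embeddings])
all_elements = text_chunks + image_paths

# Step 3: embed the user question and retrieve the most similar elements.
question = "How did revenue evolve in 2023?"
question_embedding = model.encode(question, convert_to_tensor=True)
scores = util.cos_sim(question_embedding, all_embeddings)[0]
retrieved = [all_elements[i] for i in scores.topk(3).indices.tolist()]
```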

Potential Design Alternatives

Figure 2: possible variants for a multimodal RAG pipeline
Figure 2: Possible variants for a multimodal RAG pipeline

Figure 2 shows some variants of the previous pipeline. Steps 1, 2, 3, and 4 have already been described above, and we now discuss Steps 2a, 3a, 4a, 2b, and 4b as potential alternatives.

Instead of embedding all elements into a shared vector space with a multimodal embedding model (i.e., Step 2), an alternative is to use multiple embedding models (Step 2a). For example, we may use a CLIP model for text and images and a CLAP model for audio clips. At retrieval time, we embed the user question with each of the embedding models (this requires all of them to cover the text modality) and use the corresponding vectors to retrieve a list of similar elements from each vector space. We then need to merge these lists to determine which elements to include in the prompt (Step 3a). Merging may involve simply concatenating the lists or using a more sophisticated approach to rerank the combined list based on relevance scores or other criteria.
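
One simple way to implement the merging of Step 3a is reciprocal rank fusion, sketched below. The constant k and the example element identifiers are arbitrary illustrations, not values used in the pipeline described here.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of element identifiers into one list.

    Each element receives a score of 1 / (k + rank) in every list where it
    appears, and the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranked_list in ranked_lists:
        for rank, element_id in enumerate(ranked_list, start=1):
            scores[element_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical retrieval results from a text-image space and a text-audio space.
clip_results = ["image_12", "chunk_03", "chunk_07"]
clap_results = ["audio_05", "chunk_03", "audio_01"]
print(reciprocal_rank_fusion([clip_results, clap_results]))
```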

Alternatively, we can obtain a text representation of all modalities, enabling the use of a simple text embedding model (Step 2b). For example, we can automatically caption an image with an image captioning model or simply extract the text it contains with Optical Character Recognition (OCR) software. This is particularly relevant if most or all of the information contained in the non-text elements can be adequately summarized with text. At this stage, we do not need a comprehensive representation of the input elements, just one explicit enough to determine whether these elements are relevant to a given question.
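
As an illustration of Step 2b, the sketch below captions an image with the BLIP captioning model available in Hugging Face Transformers and extracts any embedded text with pytesseract (which requires a local Tesseract install). The checkpoint and file name are examples, not choices made in the project presented later.

```python
from PIL import Image
import pytesseract
from transformers import BlipProcessor, BlipForConditionalGeneration

# Image captioning with a pretrained BLIP checkpoint.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("extracted_figure.png").convert("RGB")  # placeholder file name
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.decode(generated_ids[0], skip_special_tokens=True)

# OCR for the text embedded in the image.
ocr_text = pytesseract.image_to_string(image)

# The concatenation becomes the text representation used for embedding.
text_representation = f"{caption}\n{ocr_text}".strip()
```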

When it comes to generating the answer, a multimodal LLM (Step 4) may be replaced with:

  • A regular LLM (Step 4a), if all non-text elements along with the user question have been processed by a question answering model (e.g., a visual question answering model if these elements are images);
  • A regular LLM (Step 4b), if all elements have previously been converted to text during the embedding step (Step 2b).

This is necessary, in particular, when the multimodal LLM does not cover all target modalities.
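
To illustrate Step 4, here is a sketch of an answer generation call to a multimodal LLM through the OpenAI API (GPT-4o in this example). The retrieved chunks, the image path, and the prompt wording are placeholders.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = "How did revenue evolve in 2023?"          # placeholder question
retrieved_chunks = ["Revenue grew by 12% in 2023."]   # placeholder retrieved text
retrieved_image_path = "chart_revenue.png"            # placeholder retrieved image

with open(retrieved_image_path, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

prompt = (
    "Answer the question using only the context below.\n"
    f"Question: {question}\n"
    "Context:\n" + "\n".join(retrieved_chunks)
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```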

Please note that the choice of variant can differ from one modality to another. For instance, at the embedding step, we can use SIGLIP as a text-image embedding model for both the text and image modalities while using a captioning model for the audio modality.

Practical Challenges With Real-Life Documents

Even if the underlying principles are quite simple, implementing an effective multimodal RAG pipeline can be difficult. Let’s review three practical challenges and their potential remedies when the source documents are PDF documents with both texts and images.

First, not all images are useful for answering plausible questions. To prevent false positives during retrieval, an initial step could be to discard certain images based on their dimensions, as very small images, like logos, might lack informative content. Additionally, we can remove images on the basis of their type (which can be identified through zero-shot classification) or because they do not contain any text. Indeed, photographs and pictures without text may be included in the documents solely for aesthetic purposes.
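
Here is a minimal sketch of such a filter, combining a size threshold with zero-shot image classification through the Transformers pipeline. The thresholds, candidate labels, and CLIP checkpoint are arbitrary choices for illustration.

```python
from PIL import Image
from transformers import pipeline

classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")
CANDIDATE_LABELS = ["chart", "diagram", "table", "photograph", "logo"]
MIN_WIDTH, MIN_HEIGHT = 100, 100  # arbitrary size thresholds (in pixels)

def keep_image(path):
    """Return True if the image is likely to carry useful information."""
    image = Image.open(path)
    # Discard very small images such as logos or decorative icons.
    if image.width < MIN_WIDTH or image.height < MIN_HEIGHT:
        return False
    # Discard images whose most likely type is purely aesthetic.
    predictions = classifier(image, candidate_labels=CANDIDATE_LABELS)
    return predictions[0]["label"] not in {"photograph", "logo"}

# Placeholder file names.
images_to_embed = [p for p in ["figure_1.png", "logo.png"] if keep_image(p)]
```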

Another potential problem is that images can be inadequately cropped. For example, with unstructured, an open source Python library for processing documents, tables are extracted with a table detection model that can occasionally return erroneous bounding boxes. If this occurs, a potential solution is to switch or fine-tune the underlying layout detection model.
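
For reference, content extraction with unstructured typically looks like the sketch below. The document name is a placeholder, and the exact parameter names can vary across versions of the library; the hi_res strategy is the one that relies on a layout detection model, which can be swapped out if cropping proves unreliable.

```python
from unstructured.partition.pdf import partition_pdf

# Break a PDF document down into text, table, and image elements.
# Parameter names may differ slightly depending on the unstructured version.
elements = partition_pdf(
    filename="report.pdf",                          # placeholder document
    strategy="hi_res",                              # layout-detection-based strategy
    infer_table_structure=True,                     # also infer table structure
    extract_image_block_types=["Image", "Table"],   # save images and tables as pictures
    extract_image_block_output_dir="extracted_images",
    # hi_res_model_name="yolox",                    # optionally swap the layout detection model
)

# Each element carries its type and its bounding box coordinates on the page.
for element in elements:
    print(element.category, element.metadata.page_number, element.metadata.coordinates)
```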

Additionally, the content extraction step, for instance with the unstructured library, returns a list of elements (images or text boxes) along with their corresponding bounding boxes, which represent their location on the page. Unfortunately, the caption of an image may be represented by a text box that is disjoint from the image, as illustrated in Figure 3(a). In this case, the caption will not be taken into account during the embedding step if we follow one of the approaches mentioned in the previous section. Since captions are often essential to properly interpret an image, a significant part of the image's information is then lost.

Figure 3. Left: A text box that corresponds to the caption of an image may not be identified as such. Middle: One way to rediscover the implicit link between the image and the caption is to use a heuristic rule based on the coordinates of the bounding boxes of the images and text boxes. Right: An alternative is to send the image and its surroundings to a multimodal LLM and prompt it to generate a caption.

To address this, we can attempt to rediscover the implicit link between the image and its caption. We can use a heuristic rule to relate text boxes and images on the basis of their respective bounding boxes. For instance, a text box can be interpreted as the caption of an image if its x-coordinates lie between the x-coordinates of the image and its y-coordinates are close enough to those of the image, as shown in Figure 3(b) and sketched in code after this list. Alternatively, as in Figure 3(c), we can automatically:

  • Capture the image and its surroundings in a new picture;
  • Outline the image in red;
  • Send the entire picture to a multimodal LLM;
  • Prompt the LLM to generate a caption for the highlighted image.

In this case, the multimodal LLM will hopefully take advantage of the existing caption to describe the image.
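
Here is a sketch of the heuristic rule of Figure 3(b), with bounding boxes represented as (x0, y0, x1, y1) tuples in page coordinates (y axis pointing downwards). The tolerance value and the example boxes are arbitrary.

```python
def is_probable_caption(text_box, image_box, max_vertical_gap=30):
    """Heuristically decide whether a text box is the caption of an image.

    Boxes are (x0, y0, x1, y1) tuples in page coordinates. The text box is
    considered a caption if it lies horizontally within the image and starts
    just below it.
    """
    tx0, ty0, tx1, ty1 = text_box
    ix0, iy0, ix1, iy1 = image_box
    horizontally_inside = tx0 >= ix0 and tx1 <= ix1
    just_below = 0 <= ty0 - iy1 <= max_vertical_gap
    return horizontally_inside and just_below

# Hypothetical bounding boxes returned by the content extraction step.
image_box = (100, 200, 500, 450)
caption_candidate = (120, 460, 480, 480)
print(is_probable_caption(caption_candidate, image_box))  # True
```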

Getting Started With Dataiku

If you are a Dataiku user, a simple way to get started is to reuse the sample project from the Dataiku project gallery. This project demonstrates how to:

  • Extract images and text chunks from PDF documents with unstructured;
  • Embed these elements with a multimodal embedding model (CLIP);
  • Caption the images and embed these captions and the text chunks with a text embedding model;
  • Filter out images that are too small or correspond to photographs;
  • Handle the case of images that have been rotated;
  • Retrieve the relevant images and text chunks given a user question and generate the answer with a multimodal LLM (GPT-4o or IDEFICS2);
  • Incorporate the question answering pipeline in a web application.

Screenshot of the web application included in the Dataiku example project

Conclusion

Multimodal RAG is a promising way to better take advantage of rich information sources and create effective question answering systems. Such an approach is more and more accessible thanks to the availability of powerful multimodal LLMs and multimodal embedding models.

However, implementing an effective multimodal RAG pipeline can be challenging, as it may require extensive data preparation steps specifically tailored to the format of the input documents. Considering the complexity of such a pipeline and the relatively recent emergence of multimodal models, it’s reasonable to expect a fair share of inconclusive answers, even when the relevant information is present in the input documents.

Thank you to Camille Cochener and François Phe who helped build the Dataiku example project.
