RAG is not just text

Retrieval Augmented Generation (RAG) is a common pattern in AI-powered apps that allows a model to generate completions using context data that might not have been available during training. It's especially useful with large language models because it lets the model make inferences on private data, such as customer records or details about your specific products and services[1].

A modern RAG stack[2] includes a vector database holding embeddings of chunked text, queried by a retrieval system usually powered by some approximate nearest neighbor (ANN) search algorithm[3]. The chunks returned by the retrieval step are then passed to the model as context to generate a completion.

[Diagram: user query → retrieval system (embedded document chunks) → LLM]
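To make the stack above concrete, here is a minimal sketch of the retrieval half in plain Python. The bag-of-words "embedding" and exact cosine-similarity search are toy stand-ins of my own; a real stack would use an embedding model and an ANN index instead.

```python
# Toy sketch of a text RAG retrieval step: chunk documents, "embed" them,
# and find the chunk nearest to the query by cosine similarity.
import math
from collections import Counter

def chunk(text: str, size: int = 80) -> list[str]:
    """Split a document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector.
    A real system would call an embedding model here."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str]) -> str:
    """Exact nearest neighbor; production stacks swap in ANN search."""
    return max(chunks, key=lambda c: cosine(embed(query), embed(c)))

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The premium plan includes priority support and a higher API quota.",
]
chunks = [c for d in docs for c in chunk(d)]
context = retrieve("what is the refund policy", chunks)
prompt = f"Context: {context}\n\nQuestion: what is the refund policy"
```

The `prompt` string is what finally reaches the LLM: the model never sees the whole database, only the retrieved context.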

I think RAG has gained a lot of popularity because of its simplicity and flexibility. It's not the most difficult system to get started with, and when implemented correctly the results can be quite impressive. However, the overwhelming majority of content I've seen around RAG focuses on text generation alone, so in this article I want to make the case that retrieval augmentation can be applied to a lot more than LLMs.

When broken down into its core elements, RAG has just three components.

  1. A query - This is usually the user's input but can be manipulated to improve the efficacy of the system. For example, a user may provide a document or image which is converted into text before being passed to the system.
  2. A retrieval system - This is the process of finding context data to be passed to the rest of the system. This doesn't have to be a vector database and also does not have to be just text. Your RAG application could perform an internet search or traditional keyword search on a data store you already own to find relevant context data.
  3. A model that generates output - This model should use the provided context data and user query to generate the desired output. I argue that the exact mechanism and format of the output do not matter. The output may or may not have additional post-processing applied before its intended use.
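Viewed this way, the three components compose into a pipeline that is agnostic to modality. The sketch below is my own illustration (the class and field names are invented, not from any library): any query transform, any retrieval system, and any generative model slot into the same shape.

```python
# Illustrative skeleton of the three RAG components, generalized beyond text.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class RAGPipeline:
    prepare_query: Callable[[Any], Any]        # 1. normalize the user's input
    retrieve: Callable[[Any], list[Any]]       # 2. any retrieval system
    generate: Callable[[Any, list[Any]], Any]  # 3. any generative model

    def run(self, user_input: Any) -> Any:
        query = self.prepare_query(user_input)
        context = self.retrieve(query)
        return self.generate(query, context)

# Toy instantiation: keyword retrieval feeding a template "model".
kb = ["returns accepted within 30 days", "support is available 24/7"]
pipeline = RAGPipeline(
    prepare_query=str.lower,
    retrieve=lambda q: [doc for doc in kb if any(w in doc for w in q.split())],
    generate=lambda q, ctx: f"Answer based on: {'; '.join(ctx)}",
)
answer = pipeline.run("When are RETURNS accepted?")
```

Swapping the lambdas for an image search engine and a diffusion model changes nothing about the structure, which is the point of the next example.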

Looking at RAG from this vantage point, we can imagine an image-based system where a user's prompt is used to search a database of images that then feed a Stable Diffusion pipeline. The output of the pipeline would be a new image generated from the user's prompt, with the retrieved images serving as context data through a conditioning system like ControlNet[4].

[Diagram: user query → search engine → image DB → ControlNet → Stable Diffusion]
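A hedged sketch of that image pipeline, with both stages stubbed out: the functions, paths, and captions below are all invented for illustration, and a real implementation would replace the stand-ins with an actual search index and a ControlNet-conditioned Stable Diffusion pipeline (e.g. diffusers' `StableDiffusionControlNetPipeline`).

```python
# Toy image-RAG sketch: retrieve conditioning images by caption search,
# then "generate" with a stand-in for ControlNet + Stable Diffusion.

def search_images(query: str, image_db: dict[str, str]) -> list[str]:
    """Retrieval step: keyword search over image captions, returning paths."""
    return [
        path for path, caption in image_db.items()
        if query.lower() in caption.lower()
    ]

def generate_with_conditioning(prompt: str, condition_images: list[str]) -> str:
    """Generation step: stand-in for ControlNet-conditioned diffusion.
    Here it just reports what would be generated."""
    return f"image from '{prompt}' conditioned on {condition_images}"

image_db = {
    "img/eiffel_sketch.png": "pencil sketch of the Eiffel Tower",
    "img/cat.png": "a cat on a sofa",
}
refs = search_images("eiffel", image_db)
out = generate_with_conditioning("Eiffel Tower at night, oil painting", refs)
```

Query, retrieval system, generative model: the same three components, none of them an LLM or a vector database.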

While this is just a single example, I found a few papers that described similar processes while doing research for this blog post.

Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models[5] describes a very familiar setup.

In RDMs, a set of nearest neighbors is retrieved from an external database during training for each training instance, and the diffusion model is conditioned on these informative samples. During inference (sampling), we replace the retrieval database with a more specialized database that contains, for example, only images of a particular visual style. This provides a novel way to “prompt” a general trained model after training and thereby specify a particular visual style.

The authors demonstrate the efficacy of an external database used to condition the model's output at inference time, which I think satisfies the "retrieval augmentation" portion of RAG.

Re-Imagen: Retrieval-Augmented Text-to-Image Generator[6] describes a similar process.

... we present the Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model that uses retrieved information to produce high-fidelity and faithful images, even for rare or unseen entities. Given a text prompt, Re-Imagen accesses an external multi-modal knowledge base to retrieve relevant (image, text) pairs and uses them as references to generate the image. With this retrieval step, Re-Imagen is augmented with the knowledge of high-level semantics and low-level visual details of the mentioned entities, and thus improves its accuracy in generating the entities’ visual appearances.

Conclusion

TL;DR: I think RAG is just any retrieval system designed to augment a model's ability to generate an output. Constraining the definition of a proper RAG system to a vector DB and chunked text limits its potential significantly. Hopefully I've made a good enough case that the next time someone tells you they built a RAG system, you feel inclined to ask "What kind?"[7].

👋
Thanks for reading. If you like my work, follow me on Mastodon. I write about stuff I work on, AI, and other random dev stuff.

  1. (no date) help.openai.com. Available at: https://help.openai.com/en/articles/8868588-retrieval-augmented-generation-rag-and-semantic-search-for-gpts (Accessed: 2024-9-18). ↩︎

  2. Mujtaba, H. (2024) Understanding the RAG architecture model: A deep dive into modern AI. medium.com. Available at: https://medium.com/@hamipirzada/understanding-the-rag-architecture-model-a-deep-dive-into-modern-ai-c81208afa391 (Accessed: 2024-9-28). ↩︎

  3. (2024) Understanding the approximate nearest neighbor (ANN) algorithm. www.elastic.co. Available at: https://www.elastic.co/blog/understanding-ann (Accessed: 2024-9-28). ↩︎

  4. (no date) Text-to-image generation with ControlNet conditioning. huggingface.co. Available at: https://huggingface.co/docs/diffusers/v0.15.0/en/api/pipelines/stable_diffusion/controlnet (Accessed: 2024-9-28). ↩︎

  5. Rombach, R. et al. (2022) Text-guided synthesis of artistic images with retrieval-augmented diffusion models. Available at: https://arxiv.org/pdf/2207.13038.pdf ↩︎

  6. Chen, W. et al. (2022) Re-Imagen: Retrieval-Augmented Text-to-Image Generator. Available at: https://arxiv.org/pdf/2209.14491.pdf ↩︎

  7. Thanks to my team for indulging my debates and ultimately inspiring this blog post. ↩︎
