Let's Be Honest About RAG
Sep 9, 2022
Large language models (LLMs) have emerged from the pages of CS papers to become the center of gravity in discussions about the future of the enterprise. This was not inevitable: developers have built an ecosystem of tools that augment LLM capabilities through the strategic modification of prompts. This blog focuses on one of these approaches: Retrieval Augmented Generation (RAG). While fine-tuning techniques can become costly under the burden of data upkeep, RAG reduces the cost of customization by remaining agnostic to the underlying model.
Nevertheless, RAG is not a silver bullet: it contains many dynamic constituent parts and processes, making implementation and scaling beyond a proof of concept a difficult task for organizations. Implementing customized retrieval methods for a particular use case and fine-tuning embedding models both require technical ML expertise and engineering hours that many organizations simply cannot spare.
Why you need more than just an LLM
While LLMs superficially seem like autonomous and powerful agents, they are surprisingly limited in the range of tasks they can robustly complete. LLMs often perform impressively on tasks that do not require information retrieval, such as logical reasoning tests, but without human support they fail when faced with tasks that are not captured in their training data. For instance, an LLM cannot complete any task that requires precise recall of events after its training cutoff (2021, for many current models). RAG and other methods solve this problem by efficiently supplying relevant, up-to-date contextual information to an LLM at query time. Just as humans lean on tools such as grocery lists and search engines to cope with our limited memories, RAG enables LLMs to access the right information at the right time in the right context. In practice, RAG lets LLMs interact with proprietary data and allows a product to be customized for diverse use cases. Examples abound wherever a model constantly needs new data in a structured fashion: search engines over internal docs (Confluence, Slack, Notion), customer support bots explaining complex insurance conditions, and assistants drafting follow-up emails based on meeting context.
However, one might opt to fine-tune their LLM instead. Fine-tuning involves gathering a dataset of prompt-completion pairs that mimic the kinds of tasks the LLM will perform in production. For example, if you want an LLM that is especially good at summarization, you would provide it with examples of long-form text paired with their respective summaries. After exposure to many examples of this specific task, the LLM will be substantially better at summarizing the kinds of texts it has seen, but we cannot be certain that its summarization skills in other contexts improve in tandem. Fine-tuning can also be used to encode new knowledge into the model's memory, in much the same way it is used to improve a particular skill.
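As a concrete illustration, a fine-tuning dataset is often a JSONL file of prompt-completion pairs, one JSON object per line. The file name and the placeholder texts below are hypothetical; the exact field names vary by provider:

```python
import json

# Hypothetical prompt-completion pairs for a summarization fine-tune.
examples = [
    {
        "prompt": "Summarize the following article:\n<long article text>\n\nSummary:",
        "completion": " <human-written summary of the article>",
    },
]

# Many fine-tuning APIs accept this one-object-per-line (JSONL) format.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

The prompts should mirror the production task as closely as possible, including any fixed instruction wording you plan to use at inference time.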
However, this is computationally expensive, and the process must be repeated on a regular basis as the data changes. Furthermore, fine-tuning presents enterprises with significant security risks: a fine-tuned LLM may regurgitate sensitive user information or trade secrets contained in its training data. This means one must maintain a separate LLM for every permission level of information, further increasing cost and the need for oversight. By contrast, it is far easier to create ACLs (Access Control Lists) and granular permissions at the data level, by tagging chunks as they are inserted into the vector database, than at the LLM level.
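A minimal sketch of what such data-level permissioning might look like, assuming a hypothetical `allowed_groups` metadata tag attached to each chunk when it is inserted into the vector database:

```python
def filter_by_acl(results: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only retrieved chunks whose ACL tags overlap the user's groups."""
    return [r for r in results if r["allowed_groups"] & user_groups]

# Hypothetical retrieval results with ACL metadata attached at insert time.
results = [
    {"text": "Q3 revenue figures...", "allowed_groups": {"finance", "exec"}},
    {"text": "Public press release...", "allowed_groups": {"everyone"}},
]
visible = filter_by_acl(results, user_groups={"everyone"})
```

In production, most vector databases can apply this kind of metadata filter inside the query itself, so restricted chunks are never retrieved at all.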
Finally, fine-tuning on proprietary data makes the LLM both the reasoning engine and the information-retrieval mechanism. By separating these roles, we can better debug and explain LLM responses, since we have a clear view of exactly what information is being retrieved.
When using LLMs with proprietary data, RAG's adaptability, its efficiency in both data and cost, and its security make it the optimal approach for making LLMs useful to your customers, even when combined with fine-tuning.
Overview of RAG Pipeline
Now we will lay out the parts of a RAG pipeline that allow an LLM to retrieve information from a corpus that is continually updated.
Data sources
This is where the data currently lives in an organization. It may be centralized in a single location such as a data lake, but it can also be spread across distributed sources such as GDrive, Sharepoint, and Confluence.
Text extraction
The text in each of these documents must be extracted before it can be processed. For some kinds of data, such as a plain-text Google Doc, this is straightforward, since the Google Drive API will hand you a string. For more complicated documents, such as an image or a PDF containing a table, Optical Character Recognition (OCR) may be needed.
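The extraction step usually amounts to routing each file to an appropriate extractor by type. A minimal sketch, handling only plain text and deferring OCR to a dedicated library:

```python
from pathlib import Path

def extract_text(path: Path) -> str:
    """Route a file to the right extractor based on its suffix (sketch)."""
    suffix = path.suffix.lower()
    if suffix in {".txt", ".md"}:
        # Plain text: the content is already a string.
        return path.read_text(encoding="utf-8")
    if suffix in {".pdf", ".png", ".jpg"}:
        # Scanned PDFs and images need a PDF parser or an OCR engine
        # (e.g. Tesseract); out of scope for this sketch.
        raise NotImplementedError(f"parser/OCR needed for {suffix}")
    raise ValueError(f"unsupported file type: {suffix}")
```

A real pipeline adds extractors per source system (Drive, Confluence, Slack) behind the same interface, so everything downstream sees plain strings.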
Chunking
Next, the information taken from the data source is cut into smaller pieces, or "chunks". There are two broad considerations to keep in mind while chunking, both stemming from the fact that information can be lost when these chunks are embedded. The first is preserving syntax: the grammatical structure of the text should survive chunking, which is essential for retaining contextual meaning during subsequent processing. The second is preserving semantics: the inherent meaning of the text should be maintained as it is divided, since semantics captures the deeper understanding and interpretation of the text. Done well, chunking keeps coherent, meaningful units of information together, often by overlapping adjacent chunks so that context spanning a boundary is not lost.
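The overlap idea can be sketched in a few lines: consecutive chunks share a fixed number of characters, so meaning that straddles a boundary survives intact in at least one chunk.

```python
def chunk_with_overlap(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size chunking where consecutive chunks share `overlap`
    characters of text."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Real chunkers split on token counts rather than characters, since embedding models have token limits, but the overlap mechanism is the same.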
One of the most popular general chunking methods is recursive chunking. It works by iteratively breaking a text down into smaller units based on linguistic cues: a text can be split at the level of paragraphs, sentences, phrases, and individual words. This hierarchical approach helps maintain the structural integrity of the text while allowing for various levels of detail in the chunks, balancing the tradeoff described above. Pinecone has a brief explanation of chunking strategies here.
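A minimal sketch of recursive chunking: try the coarsest separator first (paragraph breaks), and only recurse to finer cues (sentences, then words) for pieces that are still too long.

```python
def recursive_chunk(text: str, max_len: int = 200,
                    separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first, recursing to finer ones
    only for pieces that still exceed max_len."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No linguistic cue left: fall back to a hard character split.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    head, *rest = separators
    chunks = []
    for piece in text.split(head):
        chunks.extend(recursive_chunk(piece, max_len, tuple(rest)))
    return chunks
```

This sketch drops the separators themselves and never merges pieces back together; production splitters typically re-attach separators and pack small adjacent pieces up to the length limit.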
Embedding
Now these chunks of text are converted into vectors that represent the semantic meaning of the text. Most people currently use OpenAI's Ada model. In our experience, Ada does a good job on general tasks, but using an open-source model for more specialized tasks can lead to improvements. Hugging Face maintains a leaderboard ranking embedding models here.
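To make the interface concrete, here is a toy stand-in for an embedding model: a unit-normalized bag-of-words vector over a tiny fixed vocabulary. Real models like Ada produce dense, learned, context-sensitive vectors, but they expose the same text-in, vector-out shape:

```python
from collections import Counter
import math

# Tiny illustrative vocabulary; a real model has no such explicit list.
VOCAB = ["contract", "consideration", "value", "thought", "court"]

def toy_embed(text: str) -> list[float]:
    """Toy stand-in for a learned embedding model: a unit-normalized
    bag-of-words count vector over VOCAB."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in VOCAB]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```

The unit normalization matters downstream: with unit vectors, cosine similarity reduces to a dot product, which is what most vector databases compute.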
Large, dynamic organizations therefore require embedding models tailored to their text and their context. Firms operating in spaces with specialized jargon, such as medicine or law, need a fine-tuned embedding model, because the meaning of words depends on context. For example, the word "consideration" generally means careful thought, but in contract law it refers to something of value. Consequently, choosing and optimizing your embedding model more deliberately will improve RAG.
Vector database
These embedded vectors must be stored in a database so that they can be retrieved quickly. Popular vector databases include Pinecone, Weaviate, Chroma DB, and Amazon OpenSearch. Many of these build on FAISS, a library from Meta for fast approximate nearest-neighbor search. Furthermore, attaching as much metadata as possible to each vector helps with more advanced retrieval techniques.
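The question a vector database answers can be stated exactly with a brute-force version: rank every stored vector by cosine similarity to the query and return the top k. FAISS and the databases above answer the same question approximately, but fast enough for millions of vectors:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Exact k-nearest-neighbor search: scan every stored vector."""
    ranked = sorted(index, key=lambda cid: cosine(query, index[cid]), reverse=True)
    return ranked[:k]
```

The exact scan is O(n) per query, which is why approximate indexes (IVF, HNSW, and similar) exist at all.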
Retrieval for an LLM
Just like the source chunks, each query is converted to a vector using the same embedding model, which lets us efficiently look up its closest neighbors in the vector database. In other words, we identify and retrieve the vectors closest in meaning to the query. We then take the chunks corresponding to those vectors and insert them into the prompt, providing the right context for each query.
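The final step, assembling retrieved chunks into the prompt, might look like the following. The template wording is a hypothetical example; the key point is that the model is instructed to ground its answer in the supplied context:

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble retrieved chunks into the context portion of an LLM prompt."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the chunks also makes it easy to ask the model to cite which passage supported each claim, which helps with debugging retrieval quality.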
Managing this Pipeline
Once the vector database is set up, only a few lines of code stand between you and unleashing an LLM's productive potential for your business. For example, you can set up a chatbot that chats with your documents, drafts emails, and automates memo creation. However, this may only prove useful for Twitter demos: several factors complicate the real-world deployment of RAG-based pipelines.
First, handling file changes and updating vectors in the vector database is challenging to do efficiently. You need to determine quickly which text chunks have changed. One solution is to store a hash of every chunk in a document and compare hashes to identify what has changed.
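The hash-comparison idea is a few lines with the standard library: fingerprint each chunk, then diff the stored fingerprints against the document's current chunks so only the changed ones are re-embedded.

```python
import hashlib

def chunk_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def stale_chunks(stored: dict[str, str], current: dict[str, str]) -> set[str]:
    """Return ids of chunks that are new or whose text changed, i.e.
    the only ones that need re-embedding and upserting."""
    return {
        cid for cid, text in current.items()
        if stored.get(cid) != chunk_hash(text)
    }
```

Chunk ids stable across edits (for example, derived from document id plus position) matter here: if ids shift whenever a paragraph is inserted, every downstream chunk looks changed and the savings evaporate.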
Another challenge is maintaining the reliability and robustness of a system that crawls large file systems: keeping track of every file across different sources is an information-heavy and complex task. One needs to recover quickly when the crawler fails, and maintaining 24/7 uptime over rapidly changing sources is difficult.
Ordinarily this would not present a challenge, as Extract Transform Load (ETL) frameworks facilitate data movement and manage data upkeep. However, ETL platforms are not currently sufficient to implement RAG: while they can help with extracting the data, they cannot manage unstructured data or its transformation into vectors.
We have spoken to many companies that are implementing a pipeline themselves, and who have seen initial success using RAG. However, these early gains do not last. With thousands of data sources to manage, a firm’s limited number of software engineers can be quickly overwhelmed.
Our last concern is more a commentary on the Zeitgeist surrounding AI, rather than a technical problem. Since RAG architecture is relatively new, there are no established best practices, and companies may be overwhelmed by the myriad of new methods of retrieval and embedding that come out every day. As a result, organizations that are peripherally interested in commercial AI tools, but do not possess native knowledge of the technology, are disincentivized from exploring personalized LLMs.