How to Improve RAG Results

You are trying to build a smart Q&A system that can answer questions about specific documents or huge chunks of text. You want it to be as good as, or even better than, the top AI chat systems out there when it comes to giving accurate and relevant answers.

What’s the Challenge?

The tricky part is making a system that can:

  • Understand user queries accurately
  • Find relevant information in large texts quickly
  • Combine information from different document parts
  • Generate accurate and natural-sounding responses
  • Handle various question types, from fact-checks to overviews
  • Operate efficiently in terms of cost and speed

The easy solution would be to prompt an LLM with a large context window and provide the entire document as context.

However, this approach often isn’t practical due to cost constraints and long inference times.
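For contrast, here is a minimal sketch of that baseline using the OpenAI Python SDK; the model name and the way the document is passed in are illustrative assumptions, not a recommendation:

```python
from openai import OpenAI

client = OpenAI()

def answer_over_full_document(question: str, document: str) -> str:
    # Naive baseline: stuff the entire document into a single prompt.
    # Cost and latency grow with document size, which is exactly the
    # problem the techniques below try to avoid.
    response = client.chat.completions.create(
        model="gpt-4o",  # assumes a long-context model
        messages=[
            {"role": "system", "content": "Answer using only the provided document."},
            {"role": "user", "content": f"Document:\n{document}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```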

3 Major Scenarios We Need to Address

To address this challenge, we need techniques that reduce the amount of information in the context while maintaining answer quality.

To achieve this, we will address three major scenarios, each requiring different solutions:

  1. The information needed is in a single document fragment (e.g., “What is X?”)
  2. The information is spread across multiple fragments (e.g., “What are the differences between X and Y?”)
  3. The information requires understanding large portions of the document, or the entire document (e.g., “Give me an overview of X”)

Scenario 1: Information in a Single Document Fragment

When the information needed is in a single fragment, we can:

  • Improve retrieval accuracy by enhancing the search query, for example with LLM-generated rephrasing techniques like Hypothetical Document Embeddings (HyDE); see the sketch after this list.
  • Use search algorithms that perform better than plain top-k vector search, such as BM25 or hybrid search combining keyword and semantic matching.
  • Implement late interaction models such as ColBERT for more nuanced matching between the query and document fragments.
  • Rerank retrieved fragments to prioritize those most likely to contain the required information.
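To make the HyDE idea concrete, here is a minimal sketch: instead of embedding the raw question, we ask an LLM to write a hypothetical passage that would answer it and embed that passage instead. The model names and the `vector_store` object with its `search` method are assumptions standing in for your own stack:

```python
from openai import OpenAI

client = OpenAI()

def hyde_search(question: str, vector_store, k: int = 5):
    # 1. Generate a hypothetical document that could answer the question.
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a short passage that answers: {question}",
        }],
    ).choices[0].message.content

    # 2. Embed the hypothetical passage rather than the question itself;
    #    it tends to sit closer to real answer fragments in vector space.
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=draft,
    ).data[0].embedding

    # 3. Retrieve the fragments closest to the hypothetical answer.
    return vector_store.search(embedding, top_k=k)  # hypothetical store API
```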

Another thing to consider is preprocessing documents to create more coherent fragments, either by leveraging existing document structure (e.g., Markdown or Word sections) or by approximating sections based on sentence embedding similarity.
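A simple way to approximate sections is to split on sentence boundaries and start a new fragment whenever adjacent sentences stop being similar. Below is a sketch using sentence-transformers; the model choice and the similarity threshold are assumptions to tune on your own data:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences: list[str], threshold: float = 0.6) -> list[str]:
    if not sentences:
        return []
    # Normalized embeddings make the dot product equal to cosine similarity.
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for prev, nxt, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        if float(np.dot(prev, nxt)) < threshold:
            # Low similarity between neighbors suggests a topic shift,
            # so close the current fragment and start a new one.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```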

Scenario 2: Information Spread Across Multiple Fragments

When the information is spread across multiple fragments, we can:

  • Use an LLM to split the user’s question into multiple subquestions, then apply the techniques from scenario 1 to each subquestion (see the sketch after this list).
  • During document indexing, extract a graph of concepts. When answering, traverse this graph to identify relevant or related concepts to include in the context.
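Here is a sketch of the subquestion split; the model name and the JSON reply format are illustrative assumptions:

```python
import json
from openai import OpenAI

client = OpenAI()

def decompose(question: str) -> list[str]:
    # Ask the LLM to break a multi-part question into self-contained
    # subquestions, each answerable from a single document fragment.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                "Split this question into self-contained subquestions. "
                'Reply as JSON: {"subquestions": ["..."]}\n\n' + question
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)["subquestions"]

# For example, "What are the differences between X and Y?" might become
# ["What is X?", "What is Y?"], each answered via scenario 1 retrieval
# before a final LLM call combines the partial answers.
```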

Scenario 3: Information Requires Understanding Large Portions or the Entire Document

For information requiring broader document understanding:

  • Generate and store progressively higher-level summaries of document fragments.
  • Locate relevant summaries using techniques from scenarios 1 and 2.
  • RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) is an example of this technique; a simplified sketch follows this list.
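The sketch below shows the core idea in simplified form: summarize groups of fragments into higher-level nodes until a single root summary remains, and index every level. Note that RAPTOR proper clusters fragments by embedding similarity; the adjacent grouping here and the model name are simplifying assumptions:

```python
from openai import OpenAI

client = OpenAI()

def summarize(text: str) -> str:
    # Any capable summarization LLM works here; gpt-4o-mini is an assumption.
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize concisely:\n{text}"}],
    ).choices[0].message.content

def build_summary_tree(fragments: list[str], group_size: int = 4) -> list[list[str]]:
    # Level 0 holds the raw fragments; each higher level summarizes groups
    # from the level below until one root summary covers the whole document.
    levels = [fragments]
    while len(levels[-1]) > 1:
        current = levels[-1]
        levels.append([
            summarize("\n\n".join(current[i:i + group_size]))
            for i in range(0, len(current), group_size)
        ])
    # Index every level: narrow questions match leaf fragments, while
    # overview questions match the upper-level summaries.
    return levels
```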

Benefits of Applying These Techniques

These techniques require more effort than the naive approach of generating the answer from the top N chunks retrieved from a vector database, but the benefits are significant:

  • Improved retrieval accuracy, ensuring the most relevant information is included in the context
  • Reduced context size, leading to lower costs and faster inference times
  • Better handling of complex queries that require information from multiple parts of a document
  • Improved ability to provide high-level overviews or summaries when needed

What Next?

It’s worth noting that retrieving the correct information is only the first part of building a well-performing RAG system.

Additional important steps include:

  • Validating the LLM-generated response. This can be done by checking whether the response contains the extracted document fragments, or by using a separate, powerful LLM (or a panel of LLMs) to judge the correctness of the answer; see the sketch after this list.
  • Gathering user feedback and leveraging known correct and incorrect responses to continually improve the system. This feedback can be used to optimize information retrieval processes and refine answer generation.
  • Optimizing prompts used in the RAG pipeline. Solutions like DSPy or TextGrad can be employed to systematically improve prompt effectiveness based on performance data.
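As an illustration of the first point, here is a minimal LLM-as-judge sketch that checks whether an answer is supported by the retrieved fragments; the model name and the YES/NO verdict format are assumptions:

```python
from openai import OpenAI

client = OpenAI()

def is_grounded(answer: str, fragments: list[str]) -> bool:
    # Ask a (preferably stronger) model to verify that every claim in the
    # answer is backed by the retrieved context.
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Context:\n" + "\n\n".join(fragments)
                + f"\n\nAnswer:\n{answer}\n\n"
                "Is every claim in the answer supported by the context? "
                "Reply with exactly YES or NO."
            ),
        }],
    ).choices[0].message.content
    return verdict.strip().upper().startswith("YES")
```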