
How to Optimize Your RAG System

A Guide to Measuring and Improving Retrieval-Augmented Generation

Picture this:

You’re a product manager at a publishing company. It’s late 2022 and suddenly everyone starts flipping out about ChatGPT. The media keep going on about how AI is about to revolutionise everything, and a month or so later there are rumblings at The Board level. They want to know “What we are doing about this AI stuff with all our precious company data?” Eventually the question is passed down the management chain and lands, plop, into your lap. What are we doing, you think? After some research and head-scratching, it becomes clear there is a new paradigm emerging whereby you can use a search system to assemble some data, and then present it to the AI to formulate an answer. Magic!

You present this idea to your management team, explaining it’s very much an emerging approach called (rather unfortunately) RAG. ‘Make it so!’ comes The Board response, worried about competitive advantage (and Klingons). And so off you go. Skunking about in your skunkworks project.

Fast forward 9 months and it’s clear that all the magic you expected to happen isn’t quite so magic. The R bit in RAG stands for Retrieval. Information retrieval has always been hard, and ChatGPT-like models provide some new tools to do it, but don’t magically solve the problem. And if the R bit in RAG is not right, then the G bit in RAG (Generation) will be based on the wrong things. And then the A bit (Augmented) isn’t really that much of an Augmentation. Retrieval Augmented Generation (RAG) is harder and takes much more to get right than it seemed, back when we all thought it was magic.

Meanwhile The Board are (g)rumbling again…


If this rings any bells, and it’s making you start to sweat, then you’ve come to the right place.

But let’s be clear, at Pureinsights we don’t have any magic to offer either. We don’t have an ‘AI whisperer’ up our sleeves or some advanced technology from the future (although we do have our Discovery platform which helps a lot).  What we do have is many decades as dedicated AI and search experts, working with hundreds of companies large and small, helping them solve these sorts of problems. We know the tricks, what works and what doesn’t, and we have the credibility to back you up when you have to tell the Board that it’s harder to get right than you (and they) first thought.

The quote “If you can’t measure it, you can’t improve it” has been attributed to various people (probably because it’s actually fairly self-evident – The Editor). And whilst it’s easy enough to know how to do that for a simple question like ‘How quickly can I solve a Rubik’s Cube?’, ‘How well is my RAG system working?’ is a good deal more complex. There’s no one-size-fits-all answer to improving RAG, so measuring, and putting in place a way to evaluate, is a good place to start.

Let’s start with what to measure, and then we’ll talk about some tools to help you do that.

What to measure?

Right, well we can split the metrics up into some broad categories:

  • Retrieval metrics
  • Generation metrics
  • Summarization metrics
  • Holistic metrics

Retrieval Metrics

In RAG, the answer or summary is generated from a set of search results, often using what’s called a vector search, which uses the vectors (aka embeddings) that are, umm… embedded in the Large Language Model (LLM).  This is just another type of search though, albeit based on semantic similarity rather than keywords, so all of the normal retrieval metrics apply.  Because RAG provides the context to an AI and asks it to use that as the basis for its answer, I will call them contexts, but they are really just search results.  The most common metrics are:

  • Precision – Measures the fraction of contexts retrieved that would be considered relevant/good
  • Recall – Measures what proportion of the available relevant/good contexts are retrieved
  • F score – A score that combines Precision and Recall into a single number (the common F1 variant is their harmonic mean)
  • Normalized Discounted Cumulative Gain (NDCG) – Evaluates the quality of the ranked contexts by considering the position of relevant documents and comparing to a manually curated target/optimum
  • Mean Reciprocal Rank (MRR) – The mean, across queries, of the reciprocal rank of the first relevant document in the retrieval list

All of these measures require some up-front knowledge of what a ‘good’ response is in order to calculate them.  This means someone has to work that out by knowing the dataset and developing a ‘ground truth’ dataset to measure against.  Sometimes we can have an LLM help with that, but human-curated is much better.  It doesn’t need to be a massive set though.
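To make that concrete, here is a minimal sketch (in Python, with made-up document IDs) of how these retrieval metrics can be computed against a small ground-truth set:

# A minimal sketch of the retrieval metrics above, computed against a small,
# human-curated ground truth set. The IDs and grades here are hypothetical.
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved contexts that are relevant."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant contexts that appear in the top-k results."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant context (0 if none); averaging this over
    all queries gives MRR."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevance_grades, k):
    """NDCG: discounted gain of the actual ranking vs. the ideal ranking."""
    def dcg(grades):
        return sum(g / math.log2(i + 2) for i, g in enumerate(grades))
    actual = [relevance_grades.get(doc_id, 0) for doc_id in retrieved[:k]]
    ideal = sorted(relevance_grades.values(), reverse=True)[:k]
    return dcg(actual) / dcg(ideal) if ideal else 0.0

# Example: the ground truth says doc2 and doc5 are relevant for this query
retrieved = ["doc1", "doc2", "doc3", "doc5"]
relevant = {"doc2", "doc5"}
grades = {"doc2": 2, "doc5": 1}  # graded relevance used by NDCG
print(precision_at_k(retrieved, relevant, k=4))   # 0.5
print(recall_at_k(retrieved, relevant, k=4))      # 1.0
print(reciprocal_rank(retrieved, relevant))       # 0.5
print(round(ndcg_at_k(retrieved, grades, k=4), 3))

In practice you would run this over your whole ground-truth query set and average the scores.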

Generation Metrics

The generation part typically compares the generated text / answer against some ground truth answer / texts, or asks an LLM to evaluate the answer.  This set of metrics is more novel for those of us who have been working in this area for a while, but basically boils down to various ways of assessing how ‘correct’ the generated answer is.

Faithfulness/Accuracy/Hallucinations – This is the main one: essentially, whether the answer is correct/true or not.  LLMs can struggle with numbers and dates, and can invent things, so this is a measure of how factually accurate a response is compared to ground truth.

Entity Recall – How many of the entities mentioned in the ground truth response appear in the generated response. In general, a measure of completeness is useful, especially if we are doing summarization (see next section).

Similarity – How similar the ground truth and generated text are, based on n-grams, tokens, synonyms, stemming, etc.  There are a few standard evaluations for this, including BLEU, ROUGE & METEOR (a rough code sketch follows at the end of this section).

Generation objectives – Depending on the use case there may be some other considerations to measure, such as safety, conciseness & absence of bias.  Working in a regulated industry will impose very different constraints to a blog, or even some marketing material.

Knowledge Retention – some tools allow a series of questions to be asked, and then a check back to an earlier one to test retention.  This is useful in conversational interfaces.
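To give a flavour of the Similarity and Entity Recall metrics above, here is a rough sketch using the rouge-score Python package for ROUGE-L, plus a deliberately naive entity-recall check. The example texts are invented, and the entity extraction is a crude stand-in (a real pipeline would use a proper NER model rather than capitalised tokens):

# A rough sketch of two of the generation metrics above. The rouge-score
# package (pip install rouge-score) provides ROUGE; the "entity" extraction
# is a naive placeholder for a proper NER step.
from rouge_score import rouge_scorer

ground_truth = "The contract was signed by Acme Corp in March 2021 for 2.5 million dollars."
generated = "Acme Corp signed the contract in March 2021."

# Similarity: ROUGE-L F-measure between ground truth and generated answer
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(ground_truth, generated)["rougeL"].fmeasure

# Entity recall: what fraction of the "entities" in the ground truth appear
# in the generated answer (crudely approximated by capitalised/numeric tokens)
def naive_entities(text):
    return {tok.strip(".,") for tok in text.split() if tok[0].isupper() or tok[0].isdigit()}

truth_entities = naive_entities(ground_truth)
entity_recall = len(truth_entities & naive_entities(generated)) / len(truth_entities)

print(f"ROUGE-L: {rouge_l:.2f}, entity recall: {entity_recall:.2f}")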

Summarization Metrics

As well as answering questions, summarization is a common use case for RAG-based systems, so as well as the Generation-based metrics (which also apply) we can add:

Content Overlap – Measures how much of the key information from the source text is preserved in the summary.

Compression Ratio – Ratio of the length of the original text to the length of the summary.

Coverage – Measures the proportion of important content in the source text that is included in the summary.
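As a back-of-the-envelope illustration (with invented texts), compression ratio and a crude token-overlap proxy for content overlap/coverage can be computed like this. In practice you would weight key phrases or facts rather than counting all words equally:

# A back-of-the-envelope sketch of the summarization metrics above. Plain
# token overlap is only a rough proxy: Content Overlap and Coverage both
# compare source content against the summary, differing mainly in how the
# "important" content is identified and weighted.
source = ("The committee met on 12 May and approved the new budget, "
          "allocating 40% to research, 35% to infrastructure and 25% to training.")
summary = "The committee approved a budget split across research, infrastructure and training."

def tokens(text):
    return {t.strip(".,%").lower() for t in text.split()}

compression_ratio = len(source.split()) / len(summary.split())
source_retained = len(tokens(source) & tokens(summary)) / len(tokens(source))

print(f"compression: {compression_ratio:.1f}x, source tokens retained: {source_retained:.2f}")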

Holistic Metrics

Finally, we can add some fuzzier or end-to-end metrics such as:

Human Evaluation – Involves human judges assessing the quality of retrieval, generation, and summarization.  Factors include relevance, coherence, fluency, and informativeness. A feedback mechanism is advisable in most systems.

End-to-End Task Success – Evaluates how well the entire RAG system performs in accomplishing a specific task.  Metrics could include task completion rate, accuracy, and time taken to complete the task using the RAG tool.  Whilst everyone can see this kind of interface is cool and fun, demonstrating that it improves efficiency is very important.

User Satisfaction – Measures end-user satisfaction with the overall system performance.  Stickiness, number of questions asked, or return/bounce rate may be good proxy indicators.

Latency – Measures the time taken for the system to retrieve, generate, and summarize responses.  This can be a little slow currently (but not prohibitively so); however, faster is always better.
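A minimal sketch of measuring per-stage latency might look like the following; retrieve and generate here are placeholder stubs standing in for your own pipeline calls:

import time

# Hypothetical placeholders: swap in your own retrieval and generation calls.
def retrieve(question):
    time.sleep(0.1)                      # stand-in for a vector/keyword search
    return ["context 1", "context 2"]

def generate(question, contexts):
    time.sleep(0.3)                      # stand-in for the LLM call
    return "A generated answer based on the contexts."

def timed(stage, fn, *args):
    """Run one pipeline stage and report how long it took."""
    start = time.perf_counter()
    result = fn(*args)
    print(f"{stage}: {time.perf_counter() - start:.2f}s")
    return result

question = "What is our refund policy?"
contexts = timed("retrieval", retrieve, question)
answer = timed("generation", generate, question, contexts)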

We wouldn’t necessarily recommend capturing all of these as some will be more important than others for a given scenario, but it will be important to have at least one or two in each category.

Tools and frameworks

All this measuring business sounds like a palaver, but luckily there are a few tools out there to help. Typically, these are written in Python and squarely aimed at data scientists or engineers who need to be able to easily implement some or all of the metrics listed above. They function a little like unit tests in software development, where the inputs and target outputs are defined, either by a human or sometimes with synthetically generated data, and they then allow iterative evaluations. Unlike unit tests, which are pass/fail, these often provide metrics expressed as a floating-point number or percentage. They often use a language model to help with the evaluation.
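In spirit, the ‘unit test’ analogy boils down to something like the following tool-agnostic sketch (all names here are made up; the frameworks below provide much richer versions of the same idea):

from dataclasses import dataclass, field

@dataclass
class EvalCase:
    question: str
    ground_truth: str
    retrieved_contexts: list = field(default_factory=list)
    generated_answer: str = ""

def run_eval(cases, metrics):
    """Apply each metric (a function returning a 0-1 score) to every case
    and report the average per metric."""
    results = {}
    for name, metric in metrics.items():
        scores = [metric(case) for case in cases]
        results[name] = sum(scores) / len(scores)
    return results

# Example metric: crude token overlap between generated answer and ground truth
def answer_overlap(case):
    truth = set(case.ground_truth.lower().split())
    answer = set(case.generated_answer.lower().split())
    return len(truth & answer) / len(truth) if truth else 0.0

cases = [EvalCase(question="Who wrote the style guide?",
                  ground_truth="The editorial team wrote the style guide in 2019.",
                  generated_answer="The style guide was written by the editorial team.")]
print(run_eval(cases, {"answer_overlap": answer_overlap}))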

DeepEval – This tool is very comprehensive, combining RAGAs and G-Eval with many other metrics and features. It includes a user feedback interface and robust dataset management. Integrations with Langchain and LlamaIndex add to its versatility. DeepEval offers a free/open-source version as well as a hosted cloud version for easier deployment and scaling, which includes real-time monitoring.
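For flavour, a DeepEval check looks roughly like this at the time of writing (the API evolves, so treat this as a sketch and check the current docs; the built-in metrics call an LLM under the hood, so an API key is typically needed):

# Sketch of a DeepEval test case; example texts are invented.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the refund window?",
    actual_output="Refunds are accepted within 30 days of purchase.",
    retrieval_context=["Our policy allows refunds within 30 days of purchase."],
)
evaluate([test_case], [AnswerRelevancyMetric(), FaithfulnessMetric()])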

RAGAs – Aimed at developers, and relatively popular, RAGAs has a focused set of metrics for continual learning in RAG applications. It is less flexible than DeepEval but offers straightforward implementation. The tool helps developers maintain and improve RAG system performance with minimal complexity.
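A rough RAGAs sketch, based on the Dataset-style API used in earlier releases (entry points and expected column names have shifted between versions, so check the docs for yours; example texts are invented):

# Sketch of a RAGAs evaluation over a one-row dataset.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

data = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of purchase."]],
})
print(evaluate(data, metrics=[faithfulness, answer_relevancy]))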

UpTrain – Another open-source platform to help evaluate and improve your LLM applications. It provides scores for 20+ pre-configured evals (and has 40+ operators to help create custom ones), performs root cause analysis on failure cases, and gives insights, including a dashboard for experiments.

TruLens – Similar in capability to RAGAs, TruLens uses feedback from ground truth, humans, or an LLM, or a mix of these, to guide evaluation. This flexibility allows it to be tailored to different needs and scenarios, making it useful for continuous model improvement.

Tonic Validate – An open-source tool from a commercial company, Tonic Validate offers a good range of metrics and a clean user interface. The tool is accessible and easy to navigate, making it practical for a wide range of users.

MLFlow – A comprehensive MLOps platform where RAG evaluation is just one part. It might be too much for those only needing RAG-specific features but ideal for those wanting a full MLOps solution. MLFlow relies mainly on LLMs for RAG evaluation and fits well into broader machine learning workflows.

OpenAI Evals – We should also mention that OpenAI has its own evaluation framework, but it does not target RAG specifically, so whilst it has a library of evaluation functions, they are not as high-level and targeted as the other tools here.  Again, if you have reason to think beyond RAG, this might be worth consideration.

Note: this is by no means a comprehensive list.  There is also another class of tools that are more about monitoring real-time workloads in production and providing checks as part of a CI/CD pipeline.

Conclusion: How to Optimize Your RAG System

So, wrapping up: if you want to improve RAG, your first step is implementing a way to gather metrics, measure, and evaluate the impact of changes. Quite often a change to address an issue in one area will have a knock-on effect elsewhere, so an array of pre-canned evaluations is essential. Then, once you have a way to understand the impact of changes, you can start to make them. We would recommend first reviewing the metrics and deciding which are most important for what you want to achieve, then taking a look at the tools and which cover what you need (they don’t all cover all of the metrics I mentioned), and going from there.

Some of the typical challenges are discussed in a previous blog post (see Related Resources).

And if all of this sounds complicated, don’t worry, we’re here to help. If you have any questions, please CONTACT US or drop me a note at info@pureinsights.com.  

Cheers,

Matt

Related Resources
