Matt Willsmore
In this post, we’ll explore five common challenges when implementing RAG (Retrieval Augmented Generation), and some possible solutions we are seeing out in the field as this new way of discovering knowledge evolves.
Here at Pureinsights, most of us have a deep and long heritage in search and have spent our careers designing and implementing search systems of all shapes and sizes. This is what we’re known for and why our customers seek out our expertise. Recently, though, we’ve found a new focus: customers who need help not so much with their search system (well, maybe a bit of help), but more specifically with their efforts to implement Retrieval Augmented Generation (RAG) systems end to end.
Before we get started, if you are new to this subject, you may want to start by reading our earlier blog post, *What is Retrieval Augmented Generation?* It will give you some background for this more detailed look at the common challenges of implementing RAG. Otherwise, let’s get started.
1. Don’t underestimate the Retrieval bit in RAG.
There’s sometimes a tendency for a bit of magical thinking when AI is involved. If we just use an LLM it will just know what we want, right…? Wrong. All the same information retrieval issues are still there. Vector search is just another way of searching based on similarity, and similarity is not always the best way to rank results. What about document authority, freshness, or specific important keywords in the text? They are all still really important too. When you ask about coffee production in Kenya during the first quarter of the year and the similarity model brings back figures for Ethiopia’s first quarter from two years ago, it can get a bit annoying. Especially when it turns out that the figures are from the interim report and not the final one where they had been updated. But to the AI model, perhaps African coffee-producing countries have very similar vector representations and the Ethiopian one was just a bit more ‘coffee-ish’ than the Kenyan one. And in any case, how is the model supposed to know which year you’re talking about, not to mention which version of the report is more authoritative?
Vector search does a great job, but nearly everyone we speak to has encountered these sorts of issues implementing RAG. So here are a few things you can do about it:
- Prompt engineering. Well yes – this is the obvious one. Crafting and refining the prompt to set your base instructions and handle exceptions will get you a long way, but that can bring its own set of problems. Changing a prompt to deal with a problem case can make it perform less well on previously good cases. So then you need some reference queries you can check against, to be confident you haven’t broken everything, and then maybe you want to automate that too, a bit like unit tests. Which brings me on to…
- Evaluation. How do you know if things are getting better if you don’t measure? There are a number of methodologies for evaluating the retrieval part, including our Relevance scoring methodology, as well as newer methodologies and frameworks for evaluating RAG end to end. Applying these and other tools to RAG is essential, but it’s too big a subject to cover here, so we’ll make it the topic of an upcoming blog post.
- Hybrid search. Nobody said we had to *just* use vector search, so let’s put good old BM25/TF-IDF to work as well. In fact, remember all that work we did tuning keyword relevance, boosting PDFs and prioritizing the latest news articles on our search system five years ago? That’s still pretty darned useful! Some search systems allow you to combine a mix of vector and keyword searches in one go; with others you need to run them separately and merge the results yourself (a minimal fusion sketch follows this list), but in either case you are likely to get a better overall retrieval experience.
- Limit the data. One approach to the risk of stale or outdated answers, or of answering from less authoritative sources, is simply to search only the latest and best versions of your data. Of course this is a limitation, but it can also be a good starting point.
- Feedback. Implementing a mechanism so that users can report poor responses, and then a way to address those issues manually or automatically, is absolutely essential. Just accept that it’s not going to be perfect on day one; gathering the information you need to improve it tackles the problem head on and creates a route to getting better over time.
- Re-ranking. This has been a technique in search for decades. Also known as Learning to Rank, it uses traditional keyword search to do an initial ranking, and then re-ranks the top N results with a model that has typically been trained on known ‘good’ query and result pairs or sets. In a RAG setup, vector search generates the initial ranking, and then a bi-encoder or cross-encoder re-ranks it down to a small set to send to the generative AI (see the re-ranking sketch after this list).
- Security filtering. Obviously, we don’t want our shiny new RAG system blurting out secrets to people who shouldn’t know them, so we need to make sure the content is appropriately security trimmed. Luckily this is a known and solved problem in search engines, so we just need to use the question asker’s identity to trim the data in the same way as in regular search.
- Context filtering. Similar to security filtering: if we know a person’s identity, then perhaps we can use their role to make an educated guess at what they are likely to be interested in. Answering a question from an R&D person using information held in the Marketing domain ahead of information in the R&D domain may not be optimal. We can also filter out sections of the content based on what works best.
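To make the hybrid search point above more concrete, here is a minimal sketch of Reciprocal Rank Fusion (RRF), one common way to merge a keyword result list with a vector result list when your search engine won’t do it for you. The two result lists here are hard-coded examples; in practice they would come from your BM25 and vector queries.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of document IDs using RRF.

    Each document scores 1 / (k + rank) in every list it appears in,
    so documents ranked highly by either retriever float to the top.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical retriever output -- swap in calls to your own search engine.
keyword_hits = ["doc_12", "doc_7", "doc_31"]   # e.g. BM25 results
vector_hits  = ["doc_7", "doc_45", "doc_12"]   # e.g. k-NN results

merged = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(merged)  # doc_7 and doc_12 rank highest because both lists agree
```

The constant k=60 comes from the original RRF paper and works well in practice; the useful property is that you can fuse rankings without needing the underlying scores to be comparable.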
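And for the re-ranking bullet, here is a hedged sketch using the CrossEncoder class from the sentence-transformers library to re-rank retrieved chunks against the query before passing the best few to the generative model. The model name is only an example; choose one that suits your domain and latency budget.

```python
from sentence_transformers import CrossEncoder

# Example model -- any cross-encoder trained for passage ranking will do.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks, top_k=5):
    """Score each (query, chunk) pair and keep the best top_k chunks."""
    pairs = [(query, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# 'retrieved' would be the candidate chunks from your hybrid or vector search.
retrieved = ["Kenya produced ...", "Ethiopia produced ...", "Coffee futures ..."]
best = rerank("Kenyan coffee production in Q1", retrieved, top_k=2)
```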
2. One prompt to rule them all
We spoke earlier about prompt engineering (everyone’s new favourite type of ‘engineering’ 😊), and how a carefully crafted prompt which handles exceptions, priorities and expected outputs can be tuned to handle unexpected responses. However, there may come a point where engineering a prompt one way to answer a certain class of questions means that a second or third class of questions is no longer answered well. For example, if someone asks “Do you have any accounting jobs in London?” it may be better to apply a contentType=”Jobs” filter, or at least a boost for that type of content. Alternatively, someone may ask “Which African country produces the most coffee?”, which might need to be broken down into sub-queries to get coffee production for each country, with the system then evaluating and combining the results.
At that point you need more than one prompt for your RAG system. And then you need an initial step that selects the most appropriate prompt or prompts, so now you have multi-step queries, and quickly you’re into intelligent agent territory.
And that’s fine. To scale and do a good job in the way we would like, maybe we should just acknowledge that and use a framework to orchestrate everything.
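Before reaching for a full agent framework, a lightweight first step is a routing stage that classifies the incoming question and picks a prompt template (and perhaps a filter) to go with it. Here is a minimal sketch assuming the OpenAI Python SDK; the model name, category labels and prompt templates are illustrative, not a prescription.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt templates, one per class of question.
PROMPTS = {
    "jobs":    "You answer questions about job vacancies using only the listings provided...",
    "reports": "You answer questions about company and market reports, citing the report year...",
    "general": "You answer questions using only the retrieved passages below...",
}

def route_question(question: str) -> str:
    """Ask a small model which class of question this is, falling back to 'general'."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[
            {"role": "system",
             "content": "Classify the user question as one of: jobs, reports, general. "
                        "Reply with the single word only."},
            {"role": "user", "content": question},
        ],
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in PROMPTS else "general"

question = "Do you have any accounting jobs in London?"
system_prompt = PROMPTS[route_question(question)]
# ...then run retrieval (e.g. with a contentType filter for 'jobs') and generation as usual.
```

The same idea extends naturally: once the router can also decide to decompose a question into sub-queries, you are effectively building the agent-style orchestration described above.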
3. In your own time GPT…
Whilst the R in RAG is generally very fast (if you’re doing it right) because search engines are designed to work fast at high scale, the G bit on the other hand is generally slower. And that’s just how things are. The end.
No, but seriously, there’s not a huge amount to be done about it, and that’s an important consideration, especially with the multi-step queries we were just talking about. Some models are faster than others, though. For example, Mistral report their model performing 6x faster than Llama 2[1] (they quote 30 tokens per second on an M1 GPU)[2]. The latest GPT-4 has been benchmarked at around 50 tokens per second[3]. But even 50 tokens a second means a few seconds in many cases. So don’t bother trying to get it super-fast, or to meet the SLAs you have in place for regular search. It’s not the same and probably won’t be for the foreseeable future. What you can do is use an AI model or service that implements streaming, which lets you receive the response a few tokens at a time and progressively output it, just like a chatbot would[4]. You’ll find your users are much happier to wait for a response as long as they can see it being visibly constructed in front of their eyes.
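The OpenAI cookbook article in the footnotes covers this in detail, but the core of a streaming call is only a few lines. A minimal sketch, assuming the current OpenAI Python SDK (the model name is just an example):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

stream = client.chat.completions.create(
    model="gpt-4o",  # example model name
    messages=[{"role": "user", "content": "Summarise Kenyan coffee production in Q1."}],
    stream=True,     # ask for tokens as they are generated
)

# Print each token as it arrives so the user sees the answer being built up.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```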
4. How chunky should a chunk be?
Unless your content is all in bitesize pieces, you’re going to need to implement a chunking strategy so that you can retrieve the right bits for the Generative AI to answer from. In many cases this equates to a paragraph, but it will vary from client to client. The key element is that it needs to be a semantic unit of text that will provide a full answer to a question. Another consideration is whether to add additional metadata such as title, year, author etc to the chunk either as an extra verbalization or as metadata fields. Playing around with chunk sizes is probably the best option for now, as the industry starts to dream up new ways of identifying useful semantic cut points via embedding time series or some other method. Don’t fall into the trap of thinking bigger chunks are better though – it’s better to try to align with the natural structure of the content. Additionally, higher k values don’t necessarily improve matters: generative models tend to favour the start and end of the context window, meaning answers can get lost in the middle, so smaller contexts can be better (if your retrieval precision is good enough!). Intuitively we might expect a larger context to give the generative AI more to work with, but not only is a smaller context window cheaper and quicker, it also reduces the chances of the model getting distracted or focusing on the wrong thing.
It’s worth also considering which model your embeddings are coming from. Models will differ in capability and will be better/worse at certain domains of questions. Fine tuning may also be an option, possibly using synthetic data also generated by an LLM.
Finally, we often recommend that the chunks overlap by a sentence or so as the link between text chunks can often provide additional context. This provides a good balance between contextualization and succinctness.
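To illustrate, here is a deliberately naive chunker that splits on paragraphs and carries the last sentence of each chunk over into the next one. In practice you would respect the real structure of the content (headings, sections) and attach metadata such as title, year and author to each chunk, but the overlap idea is the same.

```python
import re

def chunk_with_overlap(text, max_chars=1000):
    """Split text on paragraphs, emitting chunks that overlap by one sentence."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            # Overlap: start the next chunk with the last sentence of this one.
            sentences = re.split(r"(?<=[.!?])\s+", current.strip())
            current = sentences[-1] + " "
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

# Usage: each chunk begins with the final sentence of the previous one,
# preserving a little context across the boundary.
# chunks = chunk_with_overlap(document_text)
```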
5. Context is everything
Talking of context, a key pattern we are seeing emerge is that there is a need to provide a level of structure around the chunks to group them together, especially to operate at scale. This leverages the inherent structure and concepts in the source data to focus on which chunks to search. There are two ways of doing this:
- Document hierarchies – the structure of the hierarchy will vary, but at the very least chunks are part of a document and may need to be considered together. It’s easy to envisage more hierarchy layers whereby, for example, documents are part of a project and projects sit within a specific industry, so we end up with a hierarchy of Industry > Project > Document > Chunk. By using the multi-step approach we described earlier, it’s possible to narrow the search space at each level of the hierarchy, so that by the time we search the chunks, the search space has been substantially reduced.
- Knowledge graphs – this is an extension of the document hierarchy described above, but instead of organising the data into a hierarchy to narrow the scope, a knowledge graph organises the entities and relationships into a directed graph. By identifying the key entities and relationships required to answer a question, it’s again possible to reduce the search space for chunks by pre-filtering using the knowledge graph representation. This can be done by filtering the search space as well as by enriching the prompts with additional domain information, so the knowledge graph becomes another retrieval source to help answer questions.
Obviously these last two techniques are harder and more resource-intensive to get up and running, so we would recommend starting simpler and making sure you can measure the value of your experiments.
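To give a flavour of the hierarchy approach, here is a hedged sketch of two-step retrieval: find the most relevant documents first, then search only the chunks belonging to those documents. The `doc_index` and `chunk_index` objects and their `search`/filter syntax are hypothetical placeholders for whatever engine or vector store you use.

```python
def hierarchical_retrieve(question, doc_index, chunk_index, top_docs=5, top_chunks=10):
    """Narrow the search space level by level: documents first, then their chunks."""
    # Step 1: retrieve candidate documents (could itself be Industry > Project > Document).
    doc_hits = doc_index.search(question, top_k=top_docs)          # hypothetical API
    doc_ids = [hit["doc_id"] for hit in doc_hits]

    # Step 2: retrieve chunks, but only from the shortlisted documents.
    chunk_hits = chunk_index.search(
        question,
        top_k=top_chunks,
        filter={"doc_id": {"$in": doc_ids}},                       # hypothetical filter syntax
    )
    return chunk_hits
```

A knowledge-graph pre-filter works the same way, except that the first step resolves entities and relationships rather than documents before constraining the chunk search.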
Wrapping up on the challenges of implementing RAG
So, there we are, then, a dispatch from the frontiers of AI enhanced search systems. Hopefully these pointers and thoughts are useful to help you tackle common challenges when implementing RAG. This is a fast-moving area of focus in the industry right now, and we expect new techniques to evolve rapidly. And if all of this sounds complicated, don’t worry, we’re here to help. With decades of experience as consultants in search and our Pureinsights Discovery Platform™, which takes advantage of the latest advances in search, including RAG and Knowledge Graphs, Pureinsights is perfectly positioned to support you.
So, as always, feel free to CONTACT US with any comments or questions or to request a complimentary consultation to discuss your ongoing search and AI projects.
Cheers,
– Matt
Footnotes
[1] https://arstechnica.com/information-technology/2023/12/new-french-ai-model-makes-waves-by-matching-gpt-3-5-on-benchmarks/
[2] https://mistral.ai/news/mixtral-of-experts/
[3] https://www.taivo.ai/__a-wild-speed-up-from-openai-dev-day/
[4] https://cookbook.openai.com/examples/how_to_stream_completions