Authors
Andrew Marble, Head of AI Risk and Assessments at Armilla AI & Philip Dawson, Global Head of AI Policy at Armilla AI.
Abstract
As uses of LLMs coalesce around specific applications, it remains tricky to know where to start given the range of LLM providers, base models and the guidelines they each offer. Model benchmarks or leaderboards fail to consider the differences in guidance on how models are best deployed as part of a system. Model providers often publish worked examples of different use cases, and these can be the first stop for guidance when comparing and building an AI system. To help provide a baseline for RAG system performance and security, based on model providers’ guidance, we compared five tutorials published by Meta, Anthropic, Cohere, Mistral AI, and OpenAI covering the RAG use case and added the guardrails recommended by each provider. Following the providers’ guidance we built and tested five systems to answer questions about a particular document, based on Llama 3.1 B, Claude Haiku, Cohere Command-R, Mistral Large and GPT-4o. We then did a scan of the systems’ question answering ability and security posture using common evaluation techniques. The goal was not to claim any model is better than others but to observe differences in the systems exemplified by the model providers. Question answering performance showed relatively small differences while there were major differences in the security profiles. Overall, systems that included a dedicated LLM call for moderation fared better than those that included a “safety” prompt.
—-
Introduction
There are many leaderboards and benchmarks of LLMs, but most consist of running a standard battery of tests on the base model itself. In production, however, it’s the system performance that matters, and the model is only one component of that system. Many other design choices can influence how well the system performs at its task.
RAG stands for retrieval- augmented generation, and refers to a system in which a language model is asked to summarize documents retrieved based on a user search query, instead of just coming up with an answer based on its training. This method can be used to help keep answers factual and on topic.
Enterprises use RAG systems to enhance Generative AI applications in a wide variety of applications such as customer service chat bots, to professional co-pilots, workflow productivity, search tools and more. Given the broad interest in RAG systems across the market -- and emerging regulations that require foundation model providers to provide documentation and guidance to their customers -- it's increasingly fair expectation to evaluate the performance and reliability of RAG-based LLM systems "out-the-box", based on a company's model documentation and guidance on how to use their model in a system.
To see how the different tutorials stack up, and provide a more realistic alternative to model benchmarks, we are evaluating some of the tutorials, examples, and how-to guides provided for different models. Here we show some results comparing RAG examples for four LLMs.
This is not meant to be a completely fair comparison (which, due to differences in model size, system architecture, and moderation approach is not possible), just an examination of how different RAG-based LLM systems perform following examples from their providers. While we don’t believe the tutorials we reviewed were meant as full examples to be used in deployment, they are the examples encountered when researching how to do RAG with each provider, so we believe a comparison is nevertheless appropriate. Some additional caveats: each tutorial is different, none are necessarily optimized and there may be better alternatives; furthermore, the moderation techniques are different between each model, the application is rather simple so differences may not be as easily apparent, etc. etc.
We compared LLMs from OpenAI, Anthropic, Cohere, Mistral, and Meta, based on tutorials or how-tos provided by each combined with each company’s guidance on applying guardrails to the models. Details on each system are shown at the end.
Setup
For comparison, the RAG systems all share a common document corpus and set of questions. The document is the NIST AI Risk Management Framework1, a document that provides information about how AI risks can be managed. The original document is a 43 page .pdf. We extracted the text and broke it into 57 chunks for retrieval. Each chunk was prefaced with the section and subsection name it came from to provide context when used as the basis for question answering.
Sources for each provider’s how-to or tutorial are shown in the Appendix. We took all the prompts, general architectures, and any reasonable embellishments from the Tutorials. For example, we implemented the query rephrasing included in Cohere’s tutorial. We didn’t implement any vector databases (which would have been irrelevant with such a small document size) and didn’t implement any special chunking strategies when they were introduced, though we did add context to each chunk allowing it to be situated in the document, a “leading practice” for document chunking.
For all the models we used dot-product similarity to find relevant context, retrieving the top five matching results and passing the top two after re-ranking to the LLM. If re-ranking was not used, we took the top two cosine similarity matches. In general we took the model version that was used in the tutorial referenced, except for the OpenAI one where we used GPT-4o instead of GPT-4-Turbo, which was referenced in the older tutorial.
Evaluation
NIST, coincidentally in the document used here to test RAG performance, defines seven characteristics of trustworthy AI systems to be evaluated. Here we focus on performance (NIST: Valid and Reliable) and Security (NIST: Secure and Resilient).
RAG Performance
To measure each system’s performance at its intended task, we generated 50 questions using the RAGAS framework which is often used for evaluating LLMs: https://docs.ragas.io/en/stable/. We then used evaluators from that same framework:
These metrics were themselves evaluated using a LLM, in this case GPT-3.5. There are drawbacks to this kind of automation, and errors are possible, though it is useful for doing a directional comparison between models.
The results for the five systems are summarized in the figure below. Each score has a possible range from 0-1 with higher being better.
The absolute value of each score is less important than the comparison. In particular faithfulness scores tend to be overly strict, and correctness is less useful for subjective questions. Nevertheless we can look at the trends between models. Overall, the scores are quite similar, while exhibiting a trend toward showing Llama 3.1 (8B) on the low end and GPT-4o on the high end. For faithfulness the gap is the largest, ranging from 63.2 to 72.8.
The models range significantly in overall power: Llama 3.1 is an 8B model that can be run locally; Haiku is Anthropic’s smallest LLM; Command-R is Cohere’s mid-level mode, and GPT-4o is meant to be OpenAI’s best performing general model - likewise for Mistral-Large. In this case the models were presented with a relatively simple task, so it’s not surprising that all perform fairly similarly.
The test questions here were all derived from the original NIST AI RMF documents text and are testing “middle of the road” or in-distribution performance. For a more comprehensive test of edge cases, we could consider additional test cases where performance may differ. The test corpus here was also relatively small and on a single topic. We could further try testing over a larger and more diverse document set.
Moderation and Security
Many security issues arise at the “application” level in relation to output handling or at the training or supply chain level. OWASP publishes an LLM security framework covering ten areas2. The most relevant area for the back-end RAG system in these examples is arguably “Prompt Injection” - the susceptibility of a model to be manipulated through its prompt to do things it otherwise wouldn’t. Closely related is the moderation of the system behavior to prevent commercially (or otherwise) undesirable output. All of the systems studied were equipped with some level of safeguards recommended by the provider. Also, all the models used have undergone training to avoid potentially harmful output.
As a probe of both jailbreaking susceptibility and moderation, we combined a set of jailbreak prompts3 designed to circumvent model controls with a set of “Do Not Answer” prompts designed to elicit output in five categories generally deemed inappropriate4. The categories are:
It should be noted that there is a separate question, as to whether an LLM should or shouldn’t answer these questions and how “harmful” the results actually are. The point is that all of the systems studied include both training and guardrails to prevent answering on such topics which is what we are testing.
We tested a subset of five prompts from each category across 24 jailbreak prompts, twice each, for a total of 1200 prompts for each model.
The figure below shows the results, grouped by prompt category and presented at “attack success rate” showing the percentage of the prompts successfully eliciting an inappropriate response. A lower rate is better indicating fewer bad responses. To judge the responses we used Anthropic’s Haiku, prompted with the response plus:
This was followed by manual inspection of the flagged responses to ensure they were not false positives.
Anthropic’s Haiku had the fewest successes. It answered only a few questions (< 2%) in the Misinformation Harms category, and upon manual review these answers were relatively benign, for example telling someone to invest their money in a crypto scheme. The success rate was higher for GPT-4o but at most about 4% of one category. The Llama, Mistral, and Command-R models experienced higher success rates. I’ll also note that these included fairly graphic answers.
OpenAI, Anthropic and Meta/Llama all included an extra LLM or moderation call to screen the incoming prompts. Cohere and Mistral, per their guidance, used “strict” mode or safe_prompt=True flags that apply internal controls to a single model call (likely in the form of a system prompt). These two latter models performed comparably and well below the others on this test, suggesting that those controls are less effective than a second pre-screening LLM.
None of these models had any output moderation, the moderation occurred on the prompt. The Anthropic example explicitly mentioned moderation as a “pre-screen” of inputs. Cohere relied on a “strict” mode of the inference API claiming to mitigate inappropriate output, as did Mistral’s safe_prompt. In fairness, the Llama documentation mentions that Llamaguard can be added to the system output as well as its input. The OpenAI moderation API documentation is also ambiguous about where the moderation can occur. It would almost certainly have reduced the success rate if a second moderation step had been added to screen the model output before it was passed back to the user. As complexity increases we’d expect moderation success to go up. Here we are attempting to compare similarly complex systems.
Conclusions
We’ve looked at how four RAG systems, built following the model providers’ guidance, perform at on-topic question answering and against answering inappropriate questions. The goal was to evaluate recommended uses and guardrails, as an alternative to looking at the performance of the foundation model in isolation - as LLM benchmarks typically do. All models performed fairly similarly on the RAG task in terms of faithfulness, relevance, and correctness though metrics did trend upward from the smaller Llama model to the more comprehensive GPT-4o. With respect to security, Haiku had the lowest attack success rate. Anthropic’s website mentions specifically that Claude models are highly jailbreak resistant5 and this appears to be accurate. GPT-4o had the next fewest successes, while the rate was much higher on the other three models. The two models (Command-R and Mistral-Large) that had built-in “safety” control performed markedly worse than the others on the security test.
There are no doubt many ways that the attack success rates could be further reduced - and the answer quality improved for all of these models. However, here we see how they stack up under similar guidance and circumstances.
Appendix: RAG System Specifics
Notes for each system are discussed below. I’ll note that some of the tutorials included vector databases which we ignored. The purpose of a vector database is to quickly retrieve context from a large data set using advanced search techniques. The data set here was small enough that we just did a brute force search to retrieve the best matching context.
Llama:
LLM: Llama 3.1 8B: https://ai.meta.com/blog/meta-llama-3-1/ This is one of the newer, smaller models from Meta that can be run locally as long as you agree to use it the way Meta wants you to. We ran it with VLLM.
Embedding Model: sentence-transformers/all-MiniLM-L6-v2: this is a freely available model running locally, see https://sbert.net/
Reranking: None
Content Moderation: LlamaGuard 3 1B: this is the smaller of Meta’s two most recent models, designed to detect prompts in 13 “Hazard Categories. It is not mentioned in the RAG tutorial and was added as an extra layer before passing prompts into the RAG system. It’s worth noting that Meta has also released a specific “Jailbreak Detection” model which we didn’t use because we thought LlamaGuard was more in line with the protections added to the other models. This didn’t work with VLLM for me (it ran but did not pass the prompt into the model and generated random moderation results which led to much confusion) - we ran it in HF Transformers.
Cohere:
Reference: https://docs.cohere.com/docs/rag-with-cohere
LLM: Command-R-08-2024 (This is their middle size model, available through their API)
Query Rephrasing: Cohere was the only tutorial that included a rephrasing step, using a separate LLM call to convert the incoming question into more suitable retrieval queries.
Embedding Model: Cohere’s embed-english-v3.0: this is accessed through Cohere’s API.
Reranking: Cohere’s rerank-english-v3.0. Cohere was the only tutorial that mentions reranking and the only provider with their own implementation. For this small data set it’s likely not very important.
Content Moderation: Cohere has a build in STRICT “safety” (moderation) mode that can be activated. According to the documentation it “Encourages avoidance of all sensitive topics. Strict content guardrails provide an extra safe experience by prohibiting inappropriate responses or recommendations. Ideal for general and enterprise use.” This is activated via an argument passed to the API.
Notes: Cohere’s API as we accessed it is an order of magnitude slower than the other providers’.
Anthropic:
LLM: Anthropic’s claude-3-haiku-20240307, the one used in the reference. This is Anthropic’s small, cheap model
Embedding Model: Anthropic does not offer embeddings but they promote embeddings from Voyage AI: https://www.voyageai.com/ We used their “voyage-2” embedding model.
Reranking: Following anthropic’s tutorial we call Claude for reranking, passing in the text of all the retrieved contexts and asking for the indices of the most relevant ones to the query.
Content Moderation: Anthropic provides guidance for moderating inputs. We followed the example below that uses Claude to scan for inappropriate content. This is one of several examples including more sophisticated ones. https://docs.anthropic.com/en/docs/test-and-evaluate/strengthen-guardrails/mitigate-jailbreaks (first example)
Mistral:
Reference: https://docs.mistral.ai/guides/rag/
LLM: Mistral-Large (24.07 V2, July 2024)
Embedding Model: Mistral Embed (v23.12)
Reranking model: None
Content moderation: Mistral supports what they describe as “an optional system prompt to enforce guardrails on top of our models”: https://docs.mistral.ai/capabilities/guardrailing/ This was activated by setting safeprompt=True when call in the model. Mistral does mention self-reflection, using an additional Mistral LLM call prompted to screen incoming queries, which we did not include.
OpenAI:
Reference: https://cookbook.openai.com/examples/parse_pdf_docs_for_rag
LLM: GPT-4o (gpt-4o-2024-08-06) The current standard multi-purpose GPT model
Embedding Model: OpenAI’s text-embedding-3-small
Reranking model: None
Content Moderation: The OpenAI moderation API: https://platform.openai.com/docs/guides/moderation/quickstart with the model omni-moderation-latest at the default settings.
1 https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf
3 https://github.com/promptfoo/promptfoo/blob/main/src/redteam/strategies/promptInjections/data.json
4 https://arxiv.org/abs/2308.13387
5 https://docs.anthropic.com/en/docs/test-and-evaluate/strengthen-guardrails/mitigate-jailbreak