Using synthetic data to bootstrap your RAG system evals

Author

Dylan Castillo
Published

August 7, 2025

Modified

August 7, 2025

I’ve been catching up on Hamel Husain and Shreya Shankar’s AI Evals for Engineers & PMs course. If you’re thinking of taking it, I highly recommend it. There’s no better material on evaluating AI systems out there.

One part that I found particularly interesting was the section on generating synthetic data for bootstrapping the evaluation of RAG systems. I’ve worked on several AI projects, but in all of them, we had real data to work with, so I hadn’t had the chance to generate synthetic data for this purpose.

I decided to work through a simple example and write some notes about it. In this article, I’ll walk you through the process of bootstrapping your RAG system evals using synthetic data.

Prerequisites

If you plan to follow along, you’ll need to:

  1. Sign up and generate OpenAI and LangSmith API keys.
  2. Create a .env file in the root directory of your project and add the following lines:

OPENAI_API_KEY=your_openai_api_key
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=your_langchain_project_name
LANGSMITH_API_KEY=your_langsmith_api_key

  3. Create a virtual environment in Python and install the required packages (pandas and openpyxl are needed later for saving the generated QA pairs to Excel):

uv venv
uv add langchain langchain-openai langchain-community jupyter chromadb python-dotenv nest_asyncio sentence-transformers pandas openpyxl

  4. Download the People Group section from GitLab's handbook.

I’m also assuming you’re familiar with the basics of RAG systems and how to use vector databases. If you need a refresher, you can check out my RAG tutorial.

Then, you’ll be able to run the code from this article. If you don’t want to copy and paste the code, you can download this notebook.

How to generate synthetic data for RAG evals

The process is simple. Here’s how it works:

  1. Split your source document into chunks and store them in a vector database.
  2. Sample a few chunks from the vector database.
  3. For each sampled chunk, extract a fact from it and generate a question that is unambiguously answered by the fact.
  4. Define evaluation metrics for your RAG system.
  5. Optionally, filter the generated questions to remove the ones that don’t seem realistic.
  6. Measure the performance of your RAG system.

In the next sections, I’ll help you implement this process step by step, providing code snippets you can run in a Jupyter notebook.

Setup

You’ll use asyncio in some of the code snippets, so you must enable nest_asyncio to run the code:

import nest_asyncio

nest_asyncio.apply()

Then, you can proceed as usual, importing the required packages:

import asyncio
import os
import random
from textwrap import dedent

import chromadb
import pandas as pd
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
from dotenv import load_dotenv
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_text_splitters import MarkdownTextSplitter
from langsmith import Client, traceable
from pydantic import BaseModel
from sentence_transformers import CrossEncoder

load_dotenv()

These are the most important libraries you’ll use in this article:

  • chromadb: Vector database for storing and retrieving document embeddings
  • langchain: Framework for building LLM applications
  • langchain-openai: Wrapper for OpenAI’s API, providing access to LLMs and embeddings
  • pydantic: Provides models for generating structured data and validating types
  • sentence-transformers: In the last section of the article, you’ll use this library to rerank the retrieved documents.

The rest of the libraries will handle typical Python tasks, such as reading files, managing environment variables, etc.

For this tutorial, you’ll be building a RAG system that feeds an internal chatbot that helps employees of a company answer questions about company policies.

I chose this topic because there’s a pretty good source of data we can use for this: The GitLab Handbook. We’ll just use the People Group section of the handbook, to keep costs manageable.

Your next step is to load the data from the handbook:

loader = DirectoryLoader(
    "../data/synthetic-data-rag/people-group/", glob="**/*.md", loader_cls=TextLoader
)
docs = loader.load()

len(docs)
99

If you set everything up correctly, the cell should output the number of documents in the docs variable. Depending on when you read this article, the number may change, as the handbook is updated regularly. On the date I downloaded the data, there were 99 documents in the People Group section.

Then, you must add the data to the vector database.

Index data

A vector database is a database designed to efficiently store and query data as vector embeddings (numerical representations). Given a user query, it's the engine you use to find the most similar data in your database.

For this tutorial, you’ll use ChromaDB. Let’s set it up:

openai_ef = OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_API_KEY"))
client = chromadb.PersistentClient(
    path="../data/synthetic-data-rag/chroma",
)
collection = client.get_or_create_collection(
    "gitlab-handbook", embedding_function=openai_ef
)

This code snippet:

  1. Defines an embedding function that uses OpenAI’s API to generate embeddings for the documents.
  2. Creates a ChromaDB client to interact with the vector database.
  3. Creates a collection in the vector database to store the document embeddings and sets the embedding function to use the OpenAI embedding model.

Next, you should generate embeddings for the documents and store them in the vector database. But there’s a little catch: some documents are longer than the maximum token limit of the embedding model (8192 tokens). This will break the indexing process. To avoid this, you must split the documents into smaller chunks.

You can do this with the following code snippet:

text_splitter = MarkdownTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4o",
    chunk_size=400,
    chunk_overlap=0,
)
splits = text_splitter.split_documents(docs)

The handbook is written using Markdown, so you can use the MarkdownTextSplitter to split the documents. This will use the headings in the files for the splitting in addition to the number of tokens. This generally results in better chunks, as they will be more likely to contain complete thoughts or sections of the document.

The from_tiktoken_encoder method lets you do the splits based on the number of tokens rather than characters, which is the default behavior. You've set a chunk size of 400 tokens, with no overlap.
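If you want to verify that no chunk exceeds the embedding model's token limit, a quick check with tiktoken (which should already be installed as a dependency of langchain-openai) looks like this:

import tiktoken

# Same tokenizer the splitter was configured with (model_name="gpt-4o")
encoding = tiktoken.encoding_for_model("gpt-4o")

chunk_token_counts = [len(encoding.encode(split.page_content)) for split in splits]
print(f"Largest chunk: {max(chunk_token_counts)} tokens")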

After running the splitting code, you can check the number of chunks created by running this:

len(splits)
999

Now you have another problem: you have too many chunks. If you try to add all of them to the vector database at once—which also generates their embeddings—you’ll likely hit the OpenAI API rate limits.

To solve this, let’s define a utility function that adds the chunks to the vector database in batches:

def create_batches(ids, documents, metadatas, batch_size=100):
    batches = []
    for i in range(0, len(ids), batch_size):
        batch_ids = ids[i : i + batch_size]
        batch_documents = documents[i : i + batch_size]
        batch_metadatas = metadatas[i : i + batch_size]
        batches.append((batch_ids, batch_metadatas, batch_documents))
    return batches

Then, you can apply this function to the chunks you created earlier:

ids = [f"{str(i)}" for i in range(len(splits))]
documents = [doc.page_content for doc in splits]
metadatas = [doc.metadata for doc in splits]


if collection.count() > 0:
    print("Collection already exists, skipping creation.")
else:
    print("Adding documents...")
    batches = create_batches(ids=ids, documents=documents, metadatas=metadatas)
    for i, batch in enumerate(batches):
        print(f"Adding batch {i} of size {len(batch[0])}")
        collection.add(ids=batch[0], metadatas=batch[1], documents=batch[2])

Feel free to adjust the batch size according to your needs. This should take a few seconds to run. Once it’s done, your vector database should be ready to use.

Next, you’ll define a couple of functions to interact with the vector database and a data model to represent the retrieved documents you’ll be working with.

class RetrievedDoc(BaseModel):
    id: str
    path: str
    page_content: str


def get_similar_docs(text: str, top_k: int = 5) -> list[RetrievedDoc]:
    results = collection.query(query_texts=[text], n_results=top_k)
    docs = [results["documents"][0][i] for i in range(top_k)]
    metadatas = [results["metadatas"][0][i] for i in range(top_k)]
    ids = [results["ids"][0][i] for i in range(top_k)]
    return [
        RetrievedDoc(id=id_, path=m["source"], page_content=d)
        for d, m, id_ in zip(docs, metadatas, ids)
    ]


def get_doc_by_id(doc_id: str) -> RetrievedDoc:
    results = collection.get(ids=[doc_id])
    doc = results["documents"][0]
    metadata = results["metadatas"][0]
    return RetrievedDoc(id=doc_id, path=metadata["source"], page_content=doc)

In this code snippet, you define two functions:

  1. get_similar_docs will let you retrieve the most similar documents to a user query.
  2. get_doc_by_id will let you retrieve a document by its ID.

RetrievedDoc is a Pydantic model that represents a retrieved document. It includes the document’s ID, file path, and the page content. This model will help you structure the data you retrieve from the vector database.
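Before generating any data, it's worth a quick smoke test of the retrieval helpers. The query below is just an example I made up; any handbook-related question will do:

retrieved = get_similar_docs("How does the 360 feedback process work at GitLab?", top_k=3)
for doc in retrieved:
    print(doc.id, doc.path)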

Next, you can sample documents for the synthetic data generation:

golden_docs_idx = random.sample(range(len(splits)), 200)
golden_docs = [get_doc_by_id(str(i)) for i in golden_docs_idx]

This will result in 200 documents that you’ll use to generate the synthetic data.

Generate QA Pairs

Using the documents you just sampled, you can generate synthetic data. For each document, you’ll extract a fact and generate a question from it.

This is simple but it has a big issue: it may produce questions that are too easy to answer and result in an overly optimistic evaluation of your RAG system.

To create more challenging synthetic queries, Hamel and Shreya recommend adding similar confounding chunks to the generation process, so that it can generate questions in an adversarial manner. The generator will create a question that is uniquely answered by the target chunk but also include themes or keywords that are present in other chunks.

Here’s an example of how this works:

Target chunk: “George Orwell’s masterpiece, Nineteen Eighty-Four, was published in June 1949 and introduced the concept of ‘Big Brother’ to a global audience.”

Similar chunks:

  1. “Aldous Huxley’s Brave New World, another influential work of dystopian fiction, was first released in 1932 and explores themes of social conditioning and control.”
  2. “Ray Bradbury’s Fahrenheit 451, published in 1953, depicts a future society where books are banned and ‘firemen’ burn any that are found.”

Synthetic Question: “In what year was the dystopian novel that introduced the concept of ‘Big Brother’ published?”

The target chunk helps the generator come up with a synthetic question. The similar chunks provide distractors that help the generator include themes or keywords that are also present in other chunks (e.g., dystopian fiction), making the question more challenging.

To do this, you can use the following prompts:

system_prompt_generate = dedent(
    """
    You are a helpful assistant generating synthetic QA pairs for retrieval evaluation.

    Given a target chunk of text and a set of confounding chunks, you must extract a specific, self-contained fact from the target chunk that is not included in the confounding chunks. Then write a question that is directly and unambiguously answered by that fact. The question should only be answered by the fact extracted from the target chunk (and not by any of the confounding chunks) but it should also use themes or terminology that is present in the confounding chunks.

    Always respond with a JSON object with the following keys (in that exact order):

    1. "fact": "<the fact extracted from the target chunk>",
    2. "confounding_terms": "<a list of terms or themes from the confounding chunks that are relevant to the question>",
    3. "question": "<the question that is directly and unambiguously answered by the fact>",
    
    You should write the questions as if you're an employee looking for information in the handbook. The question should be as realistic and natural as possible, reflecting the kind of queries an employee might actually make when searching for information in the handbook.
    """
)

user_prompt_generate = dedent(
    """
    TARGET CHUNK:
    {target_chunk}

    CONFOUNDING CHUNKS:
    {confounding_chunks} 
    """
)

These prompts will be used to generate the synthetic data. The system_prompt_generate defines the generation process, explaining how to extract facts and generate questions, and user_prompt_generate provides the required context: target and confounding chunks.

Then, you initialize the LLM, set up the response model, and define a function to format the documents for the LLM:

class Response(BaseModel):
    fact: str
    confounding_terms: list[str] = []
    question: str


llm = ChatOpenAI(model="gpt-4.1-mini", temperature=1)
llm_with_structured_output = llm.with_structured_output(Response)

messages = ChatPromptTemplate.from_messages(
    [("system", system_prompt_generate), ("user", user_prompt_generate)]
)

def format_docs(chunks: list[RetrievedDoc]) -> str:
    return "\n".join(
        [f"*** Filepath: {chunk.path} ***\n{chunk.page_content}\n" for chunk in chunks]
    )

Finally, you can define a function to generate the synthetic data. This function will take a target chunk, retrieve the most similar chunks from the vector database, and generate a question from the target chunk using the similar chunks as distractors:

async def generate_qa_pair(chunk):
    # The top result is (almost always) the chunk itself, since its own content
    # is used as the query; it becomes the target and the rest act as confounders
    similar_chunks = get_similar_docs(chunk.page_content)
    compiled_messages = await messages.ainvoke(
        {
            "target_chunk": format_docs([similar_chunks[0]]),
            "confounding_chunks": format_docs(similar_chunks[1:]),
        }
    )
    output = await llm_with_structured_output.ainvoke(compiled_messages)
    return output

To speed up question generation, you can run this concurrently using asyncio:

tasks = [generate_qa_pair(random_split) for random_split in golden_docs]
qa_pairs = await asyncio.gather(*tasks)

df = pd.DataFrame([qa_pair.model_dump() for qa_pair in qa_pairs])
df.to_excel("../data/synthetic-data-rag/files/qa_pairs.xlsx", index=False)
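Note that this kicks off 200 requests at once. If you hit OpenAI rate limits, a simple workaround (sketched here with an arbitrary limit of 20 concurrent requests) is to cap concurrency with an asyncio.Semaphore:

semaphore = asyncio.Semaphore(20)  # cap concurrent OpenAI requests; adjust to your rate limits

async def generate_qa_pair_limited(chunk):
    async with semaphore:
        return await generate_qa_pair(chunk)

tasks = [generate_qa_pair_limited(chunk) for chunk in golden_docs]
qa_pairs = await asyncio.gather(*tasks)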

Here are some of the resulting QA pairs:

  1. Example 1:
    • Question: How soon should managers send the results after the 360 feedback cycle closes to prepare for the feedback meeting?
    • Answer: Managers should send the results of 360 feedback within 48 hours of the feedback cycle closing so they can prepare and come to the meeting with questions and discussion points.
  2. Example 2:
    • Question: At what point in the hiring process must candidates disclose outside employment or side projects for GitLab to assess potential conflicts with their job obligations?
    • Answer: Candidates at a certain stage in the recruiting process are asked to disclose outside employment, side projects, or other activities so GitLab can determine if a conflict exists with their ability to fulfill obligations to GitLab.

Even though the questions seem relevant, it’s not entirely clear if they are truly the type of questions real users ask.

To improve that, you can iterate a bit more on the prompt, including few-shot examples of real or adjusted queries. Or, you can take the lazy way out and generate a filter that will help you remove the questions that don’t seem realistic enough. Let’s do that!

Filter QA pairs

For that, you should open the Excel file, manually review some of the generated QA pairs, and rate how realistic they seem. Then, you should provide those questions in a system prompt as examples for the LLM, asking it to assign a score to each question based on your evaluation criteria.

Here’s an example of how to do that:

system_prompt_rate = dedent(
    """
    You are an AI assistant helping us curate a high-quality dataset of questions for evaluating a company's internal handbook. We have generated synthetic questions and need to filter out those that are unrealistic or not representative of typical user queries.

    Here are examples of realistic and unrealistic user queries we have manually rated:

    ### Realistic Queries (Good Examples)

    * **Query:** "What is the required process for creating a new learning hub for your team in Level Up at GitLab?"
        * **Explanation:** Very realistic user query. It's concise, information-seeking, and process-oriented.
        * **Rating:** 5
    * **Query:** "Where is the People Operations internal handbook hosted, and how can someone gain access to it?"
        * **Explanation:** Realistic query but might be a bit too detailed for a typical user.
        * **Rating:** 4
    * **Query:** "Who controls access to People Data in the data warehouse at GitLab, and what approvals are required for Analytics Engineers and Data Analysts to obtain access?"
        * **Explanation:** Seems reasonable but too lengthy for a typical user query. 
        * **Rating:** 3

    ### Unrealistic Queries (Bad Examples)

    * **Query:** "If a GitLab team member has been with the company for over 3 months and is interested in participating in the Onboarding Buddy Program, what should they do to express their interest?"
        * **Explanation:** Overly specific and unnatural. No real user would ask this.
        * **Rating:** 1
    * **Query:** "On what date did the 'Managing Burnout with Time Off with John Fitch' session occur as part of the FY21 Learning Speaker Series?"
        * **Explanation:** Irrelevant and overly specific. Not a typical user query. 
        * **Rating:** 2

    ### Your Task

    For the following generated question, please:

    1.  Rate its realism as a typical user query for an internal handbook application on a scale of 1 to 5 (1 = Very Unrealistic, 3 = Neutral/Somewhat Realistic, 5 = Very Realistic).
    2.  Provide a brief explanation for your rating, comparing it to the examples above if helpful.

    ### Output Format

    **Explanation:** `[Your brief explanation]`
    **Rating:** `[Your 1–5 rating]`
    """
)

user_prompt_rate = dedent(
    """
    **Generated Question to Evaluate:**
    `{question_to_evaluate}`
    """
)

Hamel and Shreya generally discourage using Likert-type 1-5 scales for LLM judges. However, in this case, we're not aiming for a highly accurate judge; we just need a method that works well enough to filter out unrealistic questions. There's no need to make this overly complicated.

Using the LLM judge, you can apply the filter to the generated questions:

class ResponseFiltering(BaseModel):
    explanation: str
    rating: int


llm_with_structured_output_filtering = llm.with_structured_output(ResponseFiltering)

messages_filtering = ChatPromptTemplate.from_messages(
    [("system", system_prompt_rate), ("user", user_prompt_rate)]
)

async def rate_qa_pair(qa_pair):
    compiled_messages = await messages_filtering.ainvoke(
        {"question_to_evaluate": qa_pair.question}
    )
    output = await llm_with_structured_output_filtering.ainvoke(compiled_messages)
    return output


tasks = [rate_qa_pair(qa_pair) for qa_pair in qa_pairs]
results = await asyncio.gather(*tasks)

rated_qa_pairs = [
    {
        "rating": result.rating,
        "explanation": result.explanation,
        "question": qa_pair.question,
        "answer": qa_pair.fact,
    }
    for (result, qa_pair) in zip(results, qa_pairs)
]

df_rated_qa_pairs = pd.DataFrame(
    rated_qa_pairs, columns=["rating", "explanation", "question", "answer"]
)

df_rated_qa_pairs.to_excel(
    "../data/synthetic-data-rag/files/rated_qa_pairs.xlsx", index=False
)

This will result in a list of ResponseFiltering objects, each containing an explanation and a rating for the corresponding question.

You can save the results to a file for later use.
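Before moving on, it helps to check how the ratings are distributed, since you'll use them as a filter when building the evaluation dataset in the next step:

# Distribution of realism ratings (1-5)
df_rated_qa_pairs["rating"].value_counts().sort_index()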

Evaluate the RAG system

Now that we have the filtered QA pairs, it’s time to evaluate your RAG system. You’ll evaluate two parts of the RAG system: retrieval and generation.

Let’s use LangSmith to store our evaluation results. Start by creating a dataset on LangSmith:

langsmith_client = Client()
dataset_name = "Gitlab Handbook QA Evaluation 2"

try:
    dataset = langsmith_client.create_dataset(dataset_name=dataset_name)
    examples = [
        {
            "inputs": {
                "question": h["question"],
            },
            "outputs": {
                "answer": h["answer"],
                "doc": {
                    "id": chunk.id,
                    "path": chunk.path,
                },
            },
        }
        for h, chunk in zip(rated_qa_pairs, golden_docs)
        if h["rating"] >= 5
    ]
    langsmith_client.create_examples(dataset_id=dataset.id, examples=examples)
except Exception:
    print("Dataset already exists, skipping creation.")
    dataset = langsmith_client.read_dataset(dataset_name=dataset_name)

This will create a dataset on LangSmith with the rated QA pairs. Each example will include the question, answer, and the document from which the question was generated.
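To double-check what was uploaded, you can list the dataset's examples and count them (list_examples returns an iterator, so wrap it in a list):

examples_in_dataset = list(langsmith_client.list_examples(dataset_id=dataset.id))
len(examples_in_dataset)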

Retrieval Metrics

To evaluate the retrieval part of the RAG system, you can use metrics such as recall@k, precision@k, Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG).

For this tutorial, you’ll use two metrics: MRR and recall@k.

Mean Reciprocal Rank (MRR)

MRR measures how well the RAG system retrieves relevant documents. It calculates the average of the reciprocal ranks of the first relevant document for each query. It essentially measures how quickly the system retrieves the first relevant document for a given query.

Reciprocal Rank (RR) is calculated for a single query. It is the reciprocal of the rank at which the first relevant document is found. For example, if the first relevant item is at position 1, the RR is 1. If it’s at position 3, the RR is \(1/3\). The formula is:

\[RR = \frac{1}{rank}\]

where \(rank\) is the position of the first relevant document.

MRR is the average of the Reciprocal Rank scores across all your queries. It provides a single, aggregate measure of retrieval performance. An MRR of 1 means you found the correct document at the first position for every query. The formula is:

\[MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}\]

where \(|Q|\) is the number of queries and \(rank_i\) is the position of the first relevant document for query \(i\).

Recall@k

Recall@k measures the proportion of all relevant documents that appear in the top k results. It helps you understand how much of the relevant material the RAG system actually retrieves.

For example, if there are 4 relevant documents in total and 3 of them appear in the top 5 results, recall@5 is 3/4 = 0.75. The formula is:

\[Recall@k = \frac{|\text{relevant documents in top k}|}{|\text{total relevant documents}|}\]

where the numerator is the number of relevant documents retrieved in the top k results, and the denominator is the total number of relevant documents for the query. To get the overall performance, you average the Recall@k values across all queries.

In our specific case, since we only have one relevant document per query, Recall@k will be 1 if the relevant document is in the top k results, and 0 otherwise.

You can define two LangSmith evaluators to calculate these metrics:

def mrr(inputs: dict, outputs: dict, reference_outputs: dict) -> float:
    reference_docs = [str(reference_outputs["doc"]["id"])]
    docs = outputs.get("docs", [])
    if not docs:
        return 0.0
    rank = next((i + 1 for i, doc in enumerate(docs) if doc in reference_docs), None)
    return 1.0 / rank if rank else 0.0


def recall(inputs: dict, outputs: dict, reference_outputs: dict) -> float:
    reference_docs = [str(reference_outputs["doc"]["id"])]
    docs = outputs.get("docs", [])
    if not docs:
        return 0.0
    return float(any(doc in reference_docs for doc in docs))

LangSmith evaluators take inputs, outputs, and reference_outputs as arguments. Here, inputs contains the user query, outputs contains what your RAG pipeline returns (the generated answer plus the IDs of the retrieved documents and the formatted context), and reference_outputs contains the expected answer and the golden document.

Using those values, you can apply the formulas above to compute the MRR and recall@k metrics.
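If you want to sanity-check these evaluators before running the full experiment, you can call them directly with toy values (the IDs below are made up): with the golden document ranked second out of five, MRR should be 0.5 and recall@5 should be 1.0.

toy_outputs = {"docs": ["12", "42", "7", "3", "99"]}
toy_reference = {"doc": {"id": 42}}

print(mrr({}, toy_outputs, toy_reference))     # 0.5: golden doc at rank 2
print(recall({}, toy_outputs, toy_reference))  # 1.0: golden doc is in the top 5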

Generation metrics

In addition to measuring how good the retrieved documents are, you also want to measure whether the LLM makes good use of them to generate answers. For that, Hamel and Shreya recommend using ARES or RAGAS.

The only issue is that ARES requires a human preference validation set of at least 50 examples and the standard RAGAS metrics consume tons of tokens. So, to keep things simple, you’ll build 3 simple metrics using an LLM judge, similar to RAGAS’ Nvidia metrics:

  • Answer accuracy: Measures how accurate the generated answer is compared to the expected answer.
  • Context relevance: Measures if the context provided to the LLM is relevant to the user query.
  • Groundedness: Measures if the generated answer is grounded in the provided context.

Compared to the RAGAS implementations of these metrics, the ones you'll define here don't do multiple runs and average the resulting scores. But, if you want to, you can easily apply a parallelization strategy to do this (see the sketch below). I'd also recommend using reasoning models, as they tend to perform better in these types of tasks.
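For instance, a small helper like this (a rough sketch, with n_runs chosen arbitrarily) would work with any of the async judges defined below:

async def averaged_score(judge, inputs, outputs, reference_outputs, n_runs=3):
    # Call the same LLM judge several times concurrently and average its scores
    scores = await asyncio.gather(
        *[judge(inputs, outputs, reference_outputs) for _ in range(n_runs)]
    )
    return sum(scores) / len(scores)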

Let’s see how to implement these metrics using LangSmith.

Answer accuracy

This metric evaluates how accurate the generated answer is compared to the expected answer. It’s an LLM judge that scores the generated answer against a reference answer using a 0, 1, 2 scale:

system_prompt_answer_accuracy = dedent(
    """
    You are an expert evaluator. Your task is to evaluate the accuracy of a User Answer against a Reference Answer, given a Question.

    Here's the grading scale you must use:

    0 - If the User Answer is not contained in the Reference Answer, is not accurate in all terms, topics, numbers, metrics, dates and units, or does not answer the question.
    2 - If the User Answer is fully contained in and equivalent to the Reference Answer in all terms, topics, numbers, metrics, dates and units.
    1 - If the User Answer is partially contained in and almost equivalent to the Reference Answer in all terms, topics, numbers, metrics, dates and units.

    Your rating must be only 0, 1 or 2 according to the instructions above.

    Your answer must be a JSON object with the following keys:
    1. "explanation": "<a brief explanation of your rating>",
    2. "rating": "<your rating, which must be one of the following: 0, 1, 2>"
    """
)

user_prompt_answer_accuracy = dedent(
    """
    **Question:** `{question}`
    **User Answer:** `{user_answer}`
    **Reference Answer:** `{reference_answer}`
    """
)

messages_answer_accuracy = ChatPromptTemplate.from_messages(
    [("system", system_prompt_answer_accuracy), ("user", user_prompt_answer_accuracy)]
)


class ResponseAnswerAccuracy(BaseModel):
    explanation: str
    rating: int


llm = ChatOpenAI(model="gpt-4.1-mini")

llm_with_structured_output_answer_accuracy = llm.with_structured_output(
    ResponseAnswerAccuracy
)


async def answer_accuracy(
    inputs: dict, outputs: dict, reference_outputs: dict
) -> float:
    compiled_messages = await messages_answer_accuracy.ainvoke(
        {
            "question": inputs["question"],
            "user_answer": outputs["answer"],
            "reference_answer": reference_outputs["answer"],
        }
    )
    output = await llm_with_structured_output_answer_accuracy.ainvoke(compiled_messages)
    return output.rating / 2.0

Similar to the retrieval evaluators, you can define LangSmith evaluators for the answer accuracy metric. The inputs will contain the user question, the outputs will have the generated answer, and the reference_outputs will have the expected answer.

Context relevance

This metric evaluates whether the context provided to the LLM is relevant to the user query. Like the answer accuracy metric, it uses an LLM judge, this time scoring the retrieved context against the question on a 0, 1, 2 scale:

system_prompt_context_relevance = dedent(
    """
    You are an expert evaluator. Your task is to evaluate the relevance of a Context in order to answer a Question. 

    Do not rely on your previous knowledge about the Question. Use only what is written in the Context and in the Question.

    Here's the grading scale you must use:

    0 - If the context does not contain any relevant information to answer the question.
    1 - If the context partially contains relevant information to answer the question.
    2 - If the context contains relevant information to answer the question.

    You must always provide the relevance score of 0, 1, or 2, nothing else.

    Your answer must be a JSON object with the following keys:
    1. "explanation": "<a brief explanation of your rating>",
    2. "rating": "<your rating, which must be one of the following: 0, 1, 2>"
    """
)

user_prompt_context_relevance = dedent(
    """
    **Question:** `{question}`
    **Context:** `{context}`
    """
)

messages_context_relevance = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt_context_relevance),
        ("user", user_prompt_context_relevance),
    ]
)


class ResponseContextRelevance(BaseModel):
    explanation: str
    rating: int


llm = ChatOpenAI(model="gpt-4.1-mini")
llm_with_structured_output_context_relevance = llm.with_structured_output(
    ResponseContextRelevance
)


async def context_relevance(
    inputs: dict, outputs: dict, reference_outputs: dict
) -> float:
    compiled_messages = await messages_context_relevance.ainvoke(
        {
            "question": inputs["question"],
            "context": outputs["context"],
        }
    )
    output = await llm_with_structured_output_context_relevance.ainvoke(
        compiled_messages
    )
    return output.rating / 2

You can define a context_relevance evaluator in the same way. The inputs will contain the user question, and from the outputs you'll use the context that was provided to the LLM.

Groundedness

This metric evaluates if the answer is grounded in the provided context. Like before, you use an LLM judge that scores the groundedness of the answer against the context using a 0, 1, 2 scale:

system_prompt_groundedness = dedent(
    """
    You are an expert evaluator. Your task is to evaluate the groundedness of an assertion against a context. 

    Do not rely on your previous knowledge about the assertion or context. Use only what is written in the assertion and in the context.

    Here's the grading scale you must use:

    0 - If the assertion is not supported by the context. Or, if the context or assertion is empty.
    1 - If the context partially contains relevant information to support the assertion.
    2 - If the context fully supports the assertion.

    You must always provide the relevance score of 0, 1, or 2, nothing else.

    Your answer must be a JSON object with the following keys:

    1. "explanation": "<a brief explanation of your rating>",
    2. "rating": "<your rating, which must be one of the following: 0, 1, 2>"
    """
)

user_prompt_groundedness = dedent(
    """
    **Assertion:** `{answer}`
    **Context:** `{context}`
    """
)

messages_groundedness = ChatPromptTemplate.from_messages(
    [("system", system_prompt_groundedness), ("user", user_prompt_groundedness)]
)


class ResponseGroundedness(BaseModel):
    explanation: str
    rating: int


llm = ChatOpenAI(model="gpt-4.1-mini")
llm_with_structured_output_groundedness = llm.with_structured_output(
    ResponseGroundedness
)


async def groundedness(inputs: dict, outputs: dict, reference_outputs: dict) -> float:
    compiled_messages = await messages_groundedness.ainvoke(
        {
            "answer": outputs["answer"],
            "context": outputs["context"],
        }
    )
    output = await llm_with_structured_output_groundedness.ainvoke(compiled_messages)
    return output.rating / 2

You use the same approach as before to define a LangSmith evaluator for this metric. In this case, the outputs provide both the generated answer (the assertion) and the context that was passed to the LLM.

Run evaluation

Now we can run the full RAG pipeline and evaluate its results using the LangSmith evaluators.

system_prompt_generation = dedent(
    """
    You're a helpful assistant. Provided with a question and the most relevant documents, you must generate a concise and accurate answer based on the information in those documents.
    """
)

user_prompt_generation = dedent(
    """
    QUESTION: {question}

    RELEVANT DOCUMENTS:
    {documents}
    """
)

messages_generation = ChatPromptTemplate.from_messages(
    [("system", system_prompt_generation), ("user", user_prompt_generation)]
)

llm_generation = ChatOpenAI(
    model="gpt-4o-mini",
)

A good starting point is to evaluate different values for the number of retrieved documents (K) and see how each metric responds.

LangSmith requires a wrapper or target function that encapsulates your RAG system. In this case, the function takes a user query, retrieves the most similar documents, generates an answer using the LLM, and returns the generated answer along with the retrieved document IDs and the formatted context.

I’ll run this code for K values of 3, 5, and 10, and compare the results.

for K in [3, 5, 10]:
    print(f"Running evaluation for K={K}")

    @traceable
    async def target(inputs: dict) -> dict:
        relevant_docs = get_similar_docs(inputs["question"], top_k=K)
        formatted_docs = format_docs(relevant_docs)
        messages = await messages_generation.ainvoke(
            {
                "question": inputs["question"],
                "documents": formatted_docs,
            }
        )
        response = await llm_generation.ainvoke(messages)
        return {
            "answer": response.content,
            "docs": [doc.id for doc in relevant_docs],
            "context": formatted_docs,
        }

    experiment_results = await langsmith_client.aevaluate(
        target,
        data=dataset_name,
        evaluators=[recall, mrr, answer_accuracy, context_relevance, groundedness],
        max_concurrency=50,
        experiment_prefix=f"top-{K}",
    )

Using aevaluate, you can run the evaluations concurrently and speed up the process. I got the following results:

k     Answer accuracy    Context relevance    Groundedness    MRR     Recall
3     0.81               0.93                 0.97            0.75    0.80
5     0.80               0.94                 0.97            0.76    0.85
10    0.85               0.97                 0.98            0.77    0.91

You can see that recall improves significantly as you increase the number of retrieved documents, which is expected. Answer accuracy and context relevance only seem to improve noticeably when you go from 5 to 10 retrieved documents.

If you got here, you’ve successfully built a RAG system and evaluated it using synthetic data. The next steps would be to continue making changes to parts of the pipeline and re-running the evaluations to see how they affect the performance of your RAG system.

You will also want to improve the quality of the generated questions, and ideally include real user queries in the evaluation process.

Next, I’ll show you how to squeeze a bit more performance from the retrieval part of the RAG system by using a reranker.

Improve metrics with a reranker

A quick way to improve your RAG system is to rerank the retrieved documents. In addition to doing retrieval using vector similarity or keyword search, you can have a reranking step that uses a more capable model to score and reorder the retrieved documents based on their relevance to the user query.

Let’s use sentence-transformers with an open-source model to do this:

cross_encoder = CrossEncoder("mixedbread-ai/mxbai-rerank-xsmall-v1")

To rerank a set of documents, you take the results from the retrieval step and pass them to the reranker, which will return a new set of documents ordered by their relevance to the user query.

Here’s an example:

query = "What is the process for creating a new learning hub for your team in Level Up at GitLab?"
hits = get_similar_docs(query, top_k=50)
cross_inp = [[query, h.page_content] for h in hits]
reranker_scores = cross_encoder.predict(cross_inp)
sorted_hits = sorted(hits, key=lambda x: reranker_scores[hits.index(x)], reverse=True)

This takes the 50 most similar documents to the query and reranks them using a cross-encoder model. The reranker_scores are used to sort the documents in descending order of relevance.
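To get a feel for what the reranker changes, you can compare the top results before and after reranking:

print("Top 3 before reranking:", [h.path for h in hits[:3]])
print("Top 3 after reranking: ", [h.path for h in sorted_hits[:3]])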

Now you can do the same evaluation as before (for k = 5), but including this new reranking step:

K = 5

def get_reranked_docs(
    query: str, similar_docs: list[RetrievedDoc]
) -> list[RetrievedDoc]:
    cross_inp = [[query, doc.page_content] for doc in similar_docs]
    reranker_scores = cross_encoder.predict(cross_inp)
    sorted_docs = sorted(
        similar_docs, key=lambda x: reranker_scores[similar_docs.index(x)], reverse=True
    )
    return sorted_docs


@traceable
async def target_with_reranking(inputs: dict) -> dict:
    relevant_docs = get_similar_docs(inputs["question"], top_k=75)
    reranked_docs = get_reranked_docs(inputs["question"], relevant_docs)[:K]
    formatted_docs = format_docs(reranked_docs)
    messages = await messages_generation.ainvoke(
        {
            "question": inputs["question"],
            "documents": formatted_docs,
        }
    )
    response = await llm_generation.ainvoke(messages)
    return {
        "answer": response.content,
        "docs": [doc.id for doc in reranked_docs],
        "context": formatted_docs,
    }

experiment_results = await langsmith_client.aevaluate(
    target_with_reranking,
    data=dataset_name,
    evaluators=[recall, mrr, answer_accuracy, context_relevance, groundedness],
    max_concurrency=50,
    experiment_prefix=f"top-{K}-reranked",
)

Here are the results of the original run with k=5 and the run with a reranker:

experiment      answer_accuracy    context_relevance    groundedness    mrr     recall
k=5, vanilla    0.80               0.94                 0.97            0.76    0.85
k=5, rerank     0.97               0.97                 1.00            0.79    0.91

You can see that it immediately improves most metrics, especially the answer accuracy and recall. You should expect some variance in the results, so be aware that the change is not necessarily as big as it seems (but that’s a topic for another article!).

Conclusion

In this tutorial, you’ve learned how to use synthetic data to bootstrap your RAG system evaluations. We covered:

  • Synthetic data generation: How to generate QA pairs from your documents using adversarial techniques with confounding chunks
  • Evaluation metrics: Both retrieval metrics (MRR, Recall@k) and generation metrics (answer accuracy, context relevance, groundedness)
  • Filtering synthetic data: Using an LLM judge to filter out unrealistic questions and improve the dataset quality
  • Performance optimization: How reranking can significantly improve both retrieval and generation metrics

This approach gives you a solid foundation for evaluating RAG systems even when you don’t have real user data. Synthetic data is useful for getting started quickly, but remember that you should incorporate real user queries into your evaluation process, as that is the most realistic way to evaluate your RAG system.

Hope you find this tutorial useful. If you have any questions, leave a comment below.

Citation

BibTeX citation:
@online{castillo2025,
  author = {Castillo, Dylan},
  title = {Using Synthetic Data to Bootstrap Your {RAG} System Evals},
  date = {2025-08-07},
  url = {https://dylancastillo.co/posts/synthetic-data-rag.html},
  langid = {en}
}
For attribution, please cite this work as:
Castillo, Dylan. 2025. “Using Synthetic Data to Bootstrap Your RAG System Evals.” August 7, 2025. https://dylancastillo.co/posts/synthetic-data-rag.html.