What is Retrieval Augmented Generation (RAG)?

llm
rag
python
Published: June 29, 2025

Modified: July 3, 2025

Retrieval Augmented Generation (RAG) is the most popular approach to providing LLMs with external information before they generate a response.

RAG is a technique where you retrieve the information required to solve a user’s query, then augment the context of the LLM with that information, and generate a response. In this tutorial, you’ll learn why RAG is useful, when to use it, and how to build your own RAG pipeline, step-by-step, using Python.

Let’s get started!

What is RAG?

It’s a technique to improve LLM answers by providing them with external information before they generate a response. It consists of three steps:

  1. Retrieve: The system starts by searching a specific knowledge base for relevant information about the query.
  2. Augment: The retrieved information is added to the context the LLM uses to generate a response.
  3. Generate: The LLM uses both your question and the provided information to generate an answer.
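
The three steps above can be sketched in a few lines of Python. This is a toy illustration, not a real pipeline: the retriever ranks documents by shared words, and generate is a placeholder standing in for an LLM call. The function names and the tiny knowledge base are assumptions made for this example.

```python
def retrieve(query: str, knowledge_base: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank documents by how many query words they share."""
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(doc.lower().split())), doc) for doc in knowledge_base
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def augment(query: str, documents: list[str]) -> str:
    """Build the prompt: the question plus the retrieved documents."""
    context = "\n".join(f"- {doc}" for doc in documents)
    return f"Question: {query}\n\nDocuments:\n{context}"


def generate(prompt: str) -> str:
    """Placeholder for an LLM call; it only reports the prompt size."""
    return f"[LLM answer based on a {len(prompt)}-character prompt]"


knowledge_base = [
    "The daily purchase limit is 1,000 euros.",
    "The card can be physical or virtual.",
    "Contactless payments under 50 euros need no PIN.",
]
query = "What is the daily purchase limit?"
docs = retrieve(query, knowledge_base)
prompt = augment(query, docs)
print(generate(prompt))
```

In a real pipeline, retrieve would query a search engine or vector database and generate would call a model, but the shape of the flow stays the same.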

In addition to reducing costs and latency, RAG is useful because it reduces hallucinations, lets you use current data, and builds trust with users by (potentially) providing citations.

Vector databases

A vector database (VectorDB) is a database designed to store and query data as vector embeddings (numerical representations). So, provided with a user query, it’s the engine you use to find the most similar data in your database. It’s one of the most popular components of the retrieval step in RAG pipelines.

In recent years, many new vector databases have been created. In most cases, though, they had to rediscover that many ideas from the older generation, such as BM25-based retrieval, were still valid and useful.

Some popular vector databases are:

  1. New generation: Qdrant, Chroma, Pinecone, Weaviate.
  2. Old generation: Elasticsearch/OpenSearch and Postgres+PGVector.

In this tutorial, you’ll use Chroma. For client projects, I’ve used Elasticsearch, Postgres, Weaviate, and Qdrant. Many companies are already using Elasticsearch or Postgres, so it’s often easier to get started with them.

Why use a VectorDB?

If you have a small dataset, there’s no real reason to use a vector database. But if you’re dealing with thousands or millions of documents, you’ll need to use a vector database to efficiently retrieve the most relevant documents.

They’re useful because they let you pass the LLM only the most relevant documents instead of everything you have:

  1. The more noise in the context provided to the LLM, the more likely it is to produce bad output.
  2. It takes more time to process a longer context.
  3. It costs more to process a longer context.

Retrieval

Retrieval is the process of finding the most relevant documents in the vector database. There are two main approaches when dealing with text-based data: term-based retrieval and embedding-based retrieval.

Term-based retrieval

Term-based retrieval is a technique that uses the terms in the query to find the most relevant documents in the vector database.

It’s based on the following ideas:

  1. TF-IDF: Combines term frequency (TF), how often a term appears in a document, with inverse document frequency (IDF), how rare the term is across all documents. It highlights terms that are important and distinctive to a specific document.
  2. Okapi BM25: Extends TF-IDF with a weighting mechanism for term saturation and document length.
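
To make these ideas concrete, here’s a minimal BM25 scorer in plain Python. It’s a sketch under a few assumptions: tokenization is a simple lowercase split, the IDF uses a common smoothed variant, and k1=1.5 and b=0.75 are typical defaults rather than values taken from this tutorial. Production systems such as Elasticsearch ship their own tuned implementations.

```python
import math
from collections import Counter


def bm25_scores(
    query: str, documents: list[str], k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher weight (IDF).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 caps the benefit of repeated terms (saturation);
            # b penalizes documents longer than average.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


documents = [
    "the card has a daily purchase limit",
    "the card can be physical or virtual",
    "fees apply to cash withdrawals",
]
scores = bm25_scores("daily limit", documents)
best = max(range(len(documents)), key=lambda i: scores[i])
print(documents[best])
```

Only the first document contains the query terms, so it’s the only one with a non-zero score.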

Embedding-based retrieval

Embedding-based retrieval is a technique that uses the embedding of the query to find the most relevant documents in the vector database.

For small datasets, you can use a k-Nearest Neighbors (k-NN) approach: calculate the similarity score between the query vector and every vector stored in the VectorDB, sort the vectors by score, and return the k most similar to the query.
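
Here’s a minimal brute-force k-NN sketch in plain Python. It assumes cosine similarity as the similarity score and uses made-up two-dimensional vectors; real embeddings have hundreds or thousands of dimensions, and you’d typically use NumPy or the database’s own search.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k vectors most similar to the query."""
    # Compare the query against every stored vector (exhaustive search).
    scores = [(cosine_similarity(query, vec), idx) for idx, vec in enumerate(vectors)]
    scores.sort(reverse=True)
    return [idx for _, idx in scores[:k]]


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.0]
top = knn(query, vectors, k=2)
print(top)
```

Because every query touches every vector, the cost grows linearly with the size of the collection, which is why this only works well for small datasets.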

For larger datasets, you can use Approximate Nearest Neighbors (ANN) such as Locality-Sensitive Hashing (LSH) or Hierarchical Navigable Small World (HNSW) to find the most relevant documents.
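
As a rough illustration of the ANN idea, here’s a sketch of one LSH variant, random-hyperplane hashing, in plain Python. It’s an assumption-laden toy: the helper names are hypothetical, the vectors are two-dimensional, and it uses a single hash table, whereas real implementations combine several tables and re-rank the candidates.

```python
import random


def make_hyperplanes(n_planes: int, dim: int, seed: int = 0) -> list[list[float]]:
    """Draw random hyperplanes; each one contributes one bit to the hash."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]


def lsh_signature(vector: list[float], hyperplanes: list[list[float]]) -> str:
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def build_index(
    vectors: list[list[float]], hyperplanes: list[list[float]]
) -> dict[str, list[int]]:
    """Group vector indices into buckets keyed by their signature."""
    buckets: dict[str, list[int]] = {}
    for idx, vec in enumerate(vectors):
        buckets.setdefault(lsh_signature(vec, hyperplanes), []).append(idx)
    return buckets


hyperplanes = make_hyperplanes(n_planes=4, dim=2)
vectors = [[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]]
index = build_index(vectors, hyperplanes)

# Only vectors in the query's bucket are considered, not the whole collection.
candidates = index.get(lsh_signature([3.0, 0.0], hyperplanes), [])
print(candidates)
```

Vectors pointing in the same direction land in the same bucket, so the query is compared against a handful of candidates instead of every stored vector. The trade-off is that a true neighbor can occasionally end up in a different bucket and be missed.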

Prerequisites

To follow this tutorial you’ll need to:

  1. Sign up and generate an API key in OpenAI.
  2. Set the API key as an environment variable called OPENAI_API_KEY.
  3. Download the sample PDF file.
  4. Create a virtual environment in Python and install the requirements:

python -m venv venv
source venv/bin/activate
pip install langchain chromadb langchain-openai langchain-community python-dotenv pypdf jupyter

Once you’ve completed the steps above, you can copy, paste, and run the code from the next sections. You can also download the notebook from here.

RAG without vector database

Let’s go through an example without a VectorDB. We’ll simply augment with the full text of the document.

First, import the necessary libraries and load the required variables from the .env file.

import os

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter

load_dotenv()

The call to load_dotenv() reads the variables defined in the .env file, including OPENAI_API_KEY, into the process environment so the OpenAI client can find the key.
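
If you want to confirm the key is available before calling the model, a small check like this can help. It’s only a sketch: has_api_key is a hypothetical helper written for this example, not part of python-dotenv or LangChain.

```python
import os


def has_api_key(env=os.environ, name: str = "OPENAI_API_KEY") -> bool:
    """Return True when the variable exists and is not blank."""
    return bool(env.get(name, "").strip())


if not has_api_key():
    print("OPENAI_API_KEY is not set; check your .env file or shell environment.")
```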

Read the document (retrieval)

Next, we’ll use a LangChain DocumentLoader to load the document. Since we’re dealing with a PDF file, we’ll use the PyPDFLoader.

There are many DocumentLoaders available in LangChain. You can find the full list here.

file_path = "../_extras/what-is-rag/bbva.pdf"
loader = PyPDFLoader(file_path)
pages = []

for page in loader.lazy_load():
    pages.append(page)

A document loader is a class that processes a document and returns a list of Document objects. In the case of the PyPDFLoader, it will read each page of the PDF file and return the text of each page with some additional metadata.

A single page will look like this:

pages[0].model_dump()
{'id': None,
 'metadata': {'producer': 'Adobe PDF Library 15.0',
  'creator': 'Adobe InDesign 16.1 (Windows)',
  'creationdate': '2021-03-24T14:51:54+01:00',
  'moddate': '2021-03-24T14:51:54+01:00',
  'trapped': '/False',
  'source': '../_extras/what-is-rag/bbva.pdf',
  'total_pages': 4,
  'page': 0,
  'page_label': '1'},
 'page_content': "EDICIÓN AQUA PREP 01-01\nBANCO BILBAO VIZCAYA ARGENTARIA, S.A. - Plaza de San Nicolás, 4 - 48005 BILBAO\nReg. Mer. Vizcaya -T omo 3858, Folio 1, Hoja BI-17 BIS-A, Inscripción 1035ª C.I.F.: A48265169\n1 / 4\nThis document contains the Pre-contractual information and the Prior General Information of the Aqua Pre-paid Card contract \n(hereinafter, the Card) in accordance with the provisions of the Ministerial Order ECE/1263/2019, on the transparency of \ninformation conditions applicable to payment services, and Bank of Spain Circular 5/2012, on the transparency of banking services \nand responsibility in the granting of loans.\nThe information highlighted in bold is especially important, in accordance with Circular 5/2012\n1. ON THE PAYMENT SERVICE PROVIDER\n1.1 Details and registration\nBANCO BILBAO VIZCAYA ARGENTARIA, S.A.\nAddress: Plaza San Nicolás, 4 - 48005 BILBAO. \nPhone number: 900 102 801\nWebsite address: www.bbva.es\nRegistered in the Biscay Commercial Register, Volume 2083, \nFolio 1, Sheet BI-17-A, Entry 1\n1.2 Supervisory Authorities:\nBanco de España (Registry 0182)\n[Spanish National Securities Market Commission]\n2. ON THE USE OF THE PAYMENT SERVICES\n2.1 Main characteristics: PREPAID CARD .\nThe Holder may specify that the card be physical or virtual. 
\nT erms and conditions governing the availability of funds: in \nother words, when and how the holder will obtain the money:\na) The Card, against a balance previously loaded on it, \nmay be used to purchase goods or services in any of \nthe physical or virtual establishments affiliated with the \ncard systems to which the Card belongs and that are \nshown on it.\nb) T o make online payments with the Card, the Account \nHolder must consult the details pertaining to the card \nnumber, expiration date and CVV via the BBVA website \nor mobile app.\nc) Withdraw money from ATMs, Bank branches and \nany other entities that allow it against the balance \npreviously loaded on it.\nT ransactions carried out with the Card will reduce the \navailable balance.\nUnder no circumstances may transactions be carried out \nin excess of the current unused loaded balance at any time \n(available balance).\n2.2 Conducting transactions. Consent.\nT o withdraw money or pay with the Card in physical \nestablishments, you must present the Card and enter your \npersonal identification number (PIN).\nThe Card's contactless technology can be used to pay or \nwithdraw cash with the Card without having to enter the PIN for \ntransactions under 50 euros.\nFor online shop purchases, you must identify yourself in the \nmanner indicated by the Bank, enter the security password and \nfollow the procedure specified by the Bank..\n2.3 Execution period\nThe transactions will be charged to the Direct Debit Account on \nthe date on which they were executed.\nPre-contractual information and \ninformation  booklet  prior to \nconcluding the payment services \ncontract\nAQUA PRE-PAID CARD",
 'type': 'Document'}

You can see that in addition to the page content, it includes metadata about the source file, the page number, etc.

This document describes the terms and conditions of a specific banking product, BBVA’s Aqua Pre-paid Card. We’ll use it to answer questions about the card.

Augment the context

Now that we have all the pages of the PDF available as text, let’s build the context we’ll use to generate a response.

We’ll define a system and a user prompt. In the system prompt, we’ll define the role of the assistant and in the user prompt, we’ll provide the user question and the documents.

system_prompt = """
You are a helpful assistant that can answer questions about the provided context.

Please cite the page number used to answer the question. Write the page number in the format "Page X" at the end of your answer. 

If the answer is not found in the context, please say so.
"""
user_prompt = """
Please answer the following question based on the context provided:

Question: {question}

Documents:
{documents}
"""

pages_str = ""
for i, page in enumerate(pages):
    pages_str += f"--- PAGE {i + 1} ---\n{page.page_content}\n\n"

We’ve set up the system and user prompts, and a variable that stores the pages we extracted as a single string. When we make a request to the model, we’ll combine all of these into messages and send them to the model.

Now, we’re ready to generate a response.

Generate response

To generate a response we’ll use gpt-4.1-mini and combine the system and user prompts we’ve built to augment the model’s context.

model = ChatOpenAI(model="gpt-4.1-mini", temperature=0)

def get_response(context_vars: dict):
    messages = [
        SystemMessage(content=system_prompt),
        HumanMessage(content=user_prompt.format(**context_vars)),
    ]
    response = model.invoke(messages)
    return response.content


question = "What is the main idea of the document?"
response = get_response({"question": question, "documents": pages_str})
print(response)
The main idea of the document is to provide the pre-contractual and general information regarding the Aqua Pre-paid Card offered by Banco Bilbao Vizcaya Argentaria, S.A. (BBVA). It outlines the terms and conditions of the card, including its features, usage, fees, security measures, responsibilities of the cardholder and the bank, contract duration, amendments, termination, applicable law, dispute resolution procedures, and other important legal aspects. The document aims to ensure transparency and inform potential cardholders about their rights and obligations before entering into the contract. 

Page 1 to Page 4

In this code, we’ve combined the system prompt, the user prompt, the pages extracted from the document, and a user question (“What is the main idea of the document?”) into messages the model can understand.

If you run the code, you’ll get an accurate answer from the model. Try running it with a different question.

question = "What are the daily transaction limits?"
response = get_response({"question": question, "documents": pages_str})
print(response)
The daily transaction limits for the Aqua Pre-paid Card are as follows: The daily purchase limit will be determined by the Card's balance and up to a maximum of 1,000 euros per day. The Holder and the Bank may modify the initially specified limits. Additionally, the monthly limit for collecting lottery and gambling prizes is ten thousand euros. (Page 2)

As long as the document contains the information you need, you will likely get an accurate answer from the model.

But you can do better. Right now, the model is using the full text of the document to answer the question. Most questions only require a few sentences from the document.

To answer “What are the daily transaction limits?”, the model used 3,528 input tokens, even though it needed fewer than 500.

For small documents such as this one, the difference isn’t a big deal. But when you’re dealing with thousands of documents and potentially millions of tokens, the difference can be significant in terms of costs, latency, and accuracy.
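
To get a feel for the difference, here’s a back-of-the-envelope calculation. The token counts come from the example above; the price per million tokens and the number of queries are hypothetical placeholders, so plug in your provider’s current pricing and your own traffic.

```python
full_context_tokens = 3_528  # input tokens used with the full document
needed_tokens = 500          # upper bound on what the question required
price_per_million = 0.40     # hypothetical USD price per 1M input tokens
queries = 100_000            # hypothetical number of queries


def input_cost(tokens_per_query: int, n_queries: int, price: float) -> float:
    """Total input cost in USD for n_queries at a given per-million price."""
    return tokens_per_query * n_queries * price / 1_000_000


full_cost = input_cost(full_context_tokens, queries, price_per_million)
trimmed_cost = input_cost(needed_tokens, queries, price_per_million)
ratio = full_context_tokens / needed_tokens

print(f"Full document: ${full_cost:,.2f}")
print(f"Relevant text only: ${trimmed_cost:,.2f}")
print(f"The full document uses {ratio:.1f}x more input tokens")
```

Whatever the actual price, the ratio stays the same: sending the full document uses roughly seven times more input tokens than the question needs.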

Let’s see how we can use a VectorDB to improve this.

Conclusion

In this post, you’ve learned what RAG is, why you’d want to use it, and how to implement it in Python.

You’ve walked through the process of:

  - Extracting text from a PDF file
  - Creating embeddings for the chunks
  - Storing the embeddings in a VectorDB
  - Querying the VectorDB to find the most relevant chunks
  - Using the model to generate a response

I hope you find this article useful. If you have any questions or comments, put them in the comments section below.

Citation

BibTeX citation:
@online{castillo2025,
  author = {Castillo, Dylan},
  title = {What Is {Retrieval} {Augmented} {Generation} {(RAG)?}},
  date = {2025-06-29},
  url = {https://dylancastillo.co/posts/what-is-rag.html},
  langid = {en}
}
For attribution, please cite this work as:
Castillo, Dylan. 2025. “What Is Retrieval Augmented Generation (RAG)?” June 29, 2025. https://dylancastillo.co/posts/what-is-rag.html.