Retrieval-Augmented Generation
Intent
Retrieve relevant information from an external knowledge base before generating a response, grounding the LLM in facts.
Problem
LLMs have a training cutoff date and can't access private/proprietary information. They also hallucinate — confidently stating things that aren't true. You need a way to ground the model in accurate, up-to-date, domain-specific information.
Solution
Before generating a response, retrieve relevant documents from a knowledge base using the user's query. Inject the retrieved content into the prompt as context, then ask the LLM to answer based on the provided information. The typical pipeline: Query → Embed → Search vector database → Retrieve top-K documents → Inject into prompt → Generate response. RAG bridges the gap between what the model knows (training data) and what it needs to know (your specific data).
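The retrieval step of this pipeline can be sketched in a few lines. This is a toy illustration, not a real system: the three-dimensional "embeddings" below are made-up vectors standing in for what an embedding model would produce, and `top_k_retrieve` is a hypothetical helper name.

```python
import numpy as np

def top_k_retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2) -> list[int]:
    """Return indices of the k most similar documents by cosine similarity."""
    # Normalize so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # argsort is ascending; take the last k and reverse for best-first order.
    return list(np.argsort(scores)[-k:][::-1])

# Toy 3-dimensional "embeddings" — a real system would use an embedding model.
docs = ["refund policy", "parental leave policy", "office wifi password"]
doc_vecs = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]])
query_vec = np.array([0.2, 0.95, 0.0])  # pretend query: "how much leave do parents get?"

print([docs[i] for i in top_k_retrieve(query_vec, doc_vecs)])
# → ['parental leave policy', 'refund policy']
```

The retrieved texts would then be pasted into the prompt as context for the generation step, as in the full example at the end of this page.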
Diagram
Query → [Embed Query]
↓
[Vector Search] → Top K documents
↓
[Inject into prompt as context]
↓
[LLM generates answer grounded in retrieved docs]
↓
Response with citations
When to Use
- Knowledge-intensive tasks requiring accurate, up-to-date information
- When you need the LLM to answer about proprietary data
- Reducing hallucination by grounding in source documents
- Customer support, documentation search, legal research
When NOT to Use
- Creative tasks that don't need factual grounding
- When all needed context fits in the prompt without retrieval
- Tasks where the model's training data is sufficient
Pros & Cons
Pros
- Dramatically reduces hallucination with source grounding
- Works with any LLM — no fine-tuning needed
- Knowledge base can be updated without retraining
- Provides citations for verifiable outputs
Cons
- Retrieval quality is the bottleneck — bad retrieval means bad answers
- Chunking and embedding strategy significantly affects quality
- Adds latency from the retrieval step
- Doesn't help if the answer isn't in the knowledge base
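The first con is measurable: a small labeled set of (query, expected document) pairs lets you track whether retrieval finds the right documents at all. The sketch below assumes a `retrieve_fn` callable and uses a fake retriever purely for illustration.

```python
def recall_at_k(labeled_pairs, retrieve_fn, k: int = 3) -> float:
    """Fraction of queries whose expected document id appears in the top-k results.

    labeled_pairs: list of (query, expected_doc_id) tuples.
    retrieve_fn(query, k) -> list of doc ids, best first.
    """
    hits = sum(expected in retrieve_fn(query, k) for query, expected in labeled_pairs)
    return hits / len(labeled_pairs)

# Fake retriever that always returns the same ranking, for demonstration only.
fake_retriever = lambda query, k: ["doc_a", "doc_b", "doc_c"][:k]
pairs = [("q1", "doc_a"), ("q2", "doc_c"), ("q3", "doc_z")]
print(recall_at_k(pairs, fake_retriever, k=2))  # only q1 hits in the top 2 → 1/3
```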
Implementation Steps
1. Prepare your knowledge base: clean, chunk, and embed documents
2. Choose a vector database (Pinecone, Weaviate, Chroma, pgvector)
3. Implement the retrieval pipeline: embed query → search → rank results
4. Design the prompt template that incorporates retrieved context
5. Add citation tracking so responses reference source documents
6. Evaluate retrieval quality: are the right documents being found?
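Step 1 can start as simply as fixed-size chunking with overlap; the chunk and overlap sizes below are illustrative defaults, not recommendations, and real pipelines often chunk on sentence or section boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks.

    Overlap keeps a sentence that straddles a boundary retrievable
    from both of the chunks it falls into.
    """
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 1200, chunk_size=500, overlap=50)
print(len(chunks), [len(c) for c in chunks])  # → 3 [500, 500, 300]
```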
Real-World Example
Internal Documentation Q&A
Employee asks: 'What's our parental leave policy?' System embeds the query, searches the HR document database, retrieves the relevant policy section, and generates a concise answer with a link to the full policy document.
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts with OpenAI's embedding API."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [e.embedding for e in response.data]

def rag_query(question: str, documents: list[str], top_k: int = 3) -> str:
    # Embed the corpus and the query. OpenAI embeddings are unit-normalized,
    # so a dot product is equivalent to cosine similarity.
    doc_embeddings = np.array(embed(documents))
    query_embedding = np.array(embed([question])[0])
    similarities = doc_embeddings @ query_embedding

    # Select the top-k most similar documents, best match first.
    top_indices = np.argsort(similarities)[-top_k:][::-1]

    # Label each retrieved document so the model can cite it by number.
    context = "\n\n".join(
        f"[{rank + 1}] {documents[i]}" for rank, i in enumerate(top_indices)
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context. Cite sources by their [number]."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content