
Retrieval-Augmented Generation

Intermediate · 💾 Memory Patterns · Meta AI (Lewis et al., 2020)

Intent

Retrieve relevant information from an external knowledge base before generating a response, grounding the LLM in facts.

Problem

LLMs have a training cutoff date and can't access private/proprietary information. They also hallucinate — confidently stating things that aren't true. You need a way to ground the model in accurate, up-to-date, domain-specific information.

Solution

Before generating a response, retrieve relevant documents from a knowledge base using the user's query. Inject the retrieved content into the prompt as context, then ask the LLM to answer based on the provided information. The typical pipeline: Query → Embed → Search vector database → Retrieve top-K documents → Inject into prompt → Generate response. RAG bridges the gap between what the model knows (training data) and what it needs to know (your specific data).

Diagram

Query → [Embed Query]
              ↓
        [Vector Search] → Top K documents
              ↓
        [Inject into prompt as context]
              ↓
        [LLM generates answer grounded in retrieved docs]
              ↓
         Response with citations
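
The vector-search step in the diagram reduces to a nearest-neighbor lookup over embedding vectors. A minimal sketch with NumPy, using toy 3-dimensional vectors in place of real embeddings (in practice these come from an embedding model):

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2) -> np.ndarray:
    """Return indices of the k most similar documents by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                        # cosine similarity per document
    return np.argsort(scores)[-k:][::-1]  # highest scores first

# Toy example: 4 documents with 3-dim "embeddings"
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])
print(top_k(query, docs))  # indices of the two documents closest to the query
```

Vector databases implement the same idea with approximate indexes (HNSW, IVF) so the lookup stays fast at millions of documents.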

When to Use

  • Knowledge-intensive tasks requiring accurate, up-to-date information
  • When you need the LLM to answer about proprietary data
  • Reducing hallucination by grounding in source documents
  • Customer support, documentation search, legal research

When NOT to Use

  • Creative tasks that don't need factual grounding
  • When all needed context fits in the prompt without retrieval
  • Tasks where the model's training data is sufficient

Pros & Cons

Pros

  • Dramatically reduces hallucination with source grounding
  • Works with any LLM — no fine-tuning needed
  • Knowledge base can be updated without retraining
  • Provides citations for verifiable outputs

Cons

  • Retrieval quality is the bottleneck — bad retrieval means bad answers
  • Chunking and embedding strategy significantly affects quality
  • Adds latency from the retrieval step
  • Doesn't help if the answer isn't in the knowledge base

Implementation Steps

  1. Prepare your knowledge base: clean, chunk, and embed documents
  2. Choose a vector database (Pinecone, Weaviate, Chroma, pgvector)
  3. Implement the retrieval pipeline: embed query → search → rank results
  4. Design the prompt template that incorporates retrieved context
  5. Add citation tracking so responses reference source documents
  6. Evaluate retrieval quality: are the right documents being found?
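
Step 1's chunking decision has an outsized effect on retrieval quality. A minimal sketch of fixed-size chunking with overlap (the sizes here are illustrative; production systems often split on sentence or section boundaries instead):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks so retrieval doesn't cut facts mid-passage."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # advance by less than chunk_size to overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # last chunk already reaches the end of the text
    return chunks
```

The overlap means a fact straddling a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated storage.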

Real-World Example

Internal Documentation Q&A

An employee asks: "What's our parental leave policy?" The system embeds the query, searches the HR document database, retrieves the relevant policy section, and generates a concise answer with a link to the full policy document.

Python: RAG with Embedding Search
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts with OpenAI's embedding API."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [e.embedding for e in response.data]

def rag_query(question: str, documents: list[str], top_k: int = 3) -> str:
    # Embed the corpus and the query. In production, embed documents once
    # and store them in a vector database instead of on every query.
    doc_embeddings = np.array(embed(documents))
    query_embedding = np.array(embed([question])[0])

    # OpenAI embeddings are unit-normalized, so the dot product
    # equals cosine similarity.
    similarities = doc_embeddings @ query_embedding
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    context = "\n\n".join(documents[i] for i in top_indices)

    # Ground the answer in the retrieved context only.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context. Cite sources."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
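
Retrieval quality (step 6 above) can be evaluated without calling an LLM at all. A minimal recall@k sketch over a hand-labeled evaluation set; the query results and relevant-document ids below are hypothetical placeholders:

```python
def recall_at_k(retrieved: list[list[int]], relevant: list[set[int]], k: int = 3) -> float:
    """Fraction of queries where at least one relevant document appears in the top k."""
    hits = sum(
        1 for ranked, gold in zip(retrieved, relevant)
        if gold & set(ranked[:k])  # any overlap between top-k and relevant set
    )
    return hits / len(retrieved)

# Hypothetical eval set: ranked doc ids per query, and the known-relevant ids
retrieved = [[4, 2, 9], [1, 7, 3], [5, 0, 8]]
relevant = [{2}, {6}, {8}]
print(recall_at_k(retrieved, relevant, k=3))  # 2 of 3 queries hit a relevant doc
```

If recall@k is low, no amount of prompt engineering downstream will fix the answers; tune chunking, embeddings, or the index first.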

References