Retrieval-Augmented Generation
Intent
Retrieve relevant information from an external knowledge base before generating a response, grounding the LLM in facts.
Problem
LLMs have a training cutoff date and can't access private/proprietary information. They also hallucinate — confidently stating things that aren't true. You need a way to ground the model in accurate, up-to-date, domain-specific information.
Solution
Before generating a response, retrieve relevant documents from a knowledge base using the user's query. Inject the retrieved content into the prompt as context, then ask the LLM to answer based on the provided information. The typical pipeline: Query → Embed → Search vector database → Retrieve top-K documents → Inject into prompt → Generate response. RAG bridges the gap between what the model knows (training data) and what it needs to know (your specific data).
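The retrieval step of this pipeline can be sketched in a few lines. This is a toy illustration, not a real system: the three-dimensional "embeddings" below are made-up vectors standing in for what an embedding model would produce, and `top_k_retrieve` is a hypothetical helper name.

```python
import numpy as np

def top_k_retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2) -> list[int]:
    """Return indices of the k most similar documents by cosine similarity."""
    # Normalize so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # argsort is ascending; take the last k and reverse for best-first order.
    return list(np.argsort(scores)[-k:][::-1])

# Toy 3-dimensional "embeddings" — a real system would use an embedding model.
docs = ["refund policy", "parental leave policy", "office wifi password"]
doc_vecs = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]])
query_vec = np.array([0.2, 0.95, 0.0])  # pretend query: "how much leave do parents get?"

print([docs[i] for i in top_k_retrieve(query_vec, doc_vecs)])
# → ['parental leave policy', 'refund policy']
```

The retrieved texts would then be pasted into the prompt as context for the generation step, as in the full example at the end of this page.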
Diagram
Query → [Embed Query]
↓
[Vector Search] → Top K documents
↓
[Inject into prompt as context]
↓
[LLM generates answer grounded in retrieved docs]
↓
Response with citations
When to Use
- Knowledge-intensive tasks requiring accurate, up-to-date information
- When you need the LLM to answer about proprietary data
- Reducing hallucination by grounding in source documents
- Customer support, documentation search, legal research
When NOT to Use
- Creative tasks that don't need factual grounding
- When all needed context fits in the prompt without retrieval
- Tasks where the model's training data is sufficient
Pros & Cons
Pros
- Dramatically reduces hallucination with source grounding
- Works with any LLM — no fine-tuning needed
- Knowledge base can be updated without retraining
- Provides citations for verifiable outputs
Cons
- Retrieval quality is the bottleneck — bad retrieval means bad answers
- Chunking and embedding strategy significantly affects quality
- Adds latency from the retrieval step
- Doesn't help if the answer isn't in the knowledge base
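The first con is measurable: a small labeled set of (query, expected document) pairs lets you track whether retrieval finds the right documents at all. The sketch below assumes a `retrieve_fn` callable and uses a fake retriever purely for illustration.

```python
def recall_at_k(labeled_pairs, retrieve_fn, k: int = 3) -> float:
    """Fraction of queries whose expected document id appears in the top-k results.

    labeled_pairs: list of (query, expected_doc_id) tuples.
    retrieve_fn(query, k) -> list of doc ids, best first.
    """
    hits = sum(expected in retrieve_fn(query, k) for query, expected in labeled_pairs)
    return hits / len(labeled_pairs)

# Fake retriever that always returns the same ranking, for demonstration only.
fake_retriever = lambda query, k: ["doc_a", "doc_b", "doc_c"][:k]
pairs = [("q1", "doc_a"), ("q2", "doc_c"), ("q3", "doc_z")]
print(recall_at_k(pairs, fake_retriever, k=2))  # only q1 hits in the top 2 → 1/3
```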
Implementation Steps
1. Prepare your knowledge base: clean, chunk, and embed documents
2. Choose a vector database (Pinecone, Weaviate, Chroma, pgvector)
3. Implement the retrieval pipeline: embed query → search → rank results
4. Design the prompt template that incorporates retrieved context
5. Add citation tracking so responses reference source documents
6. Evaluate retrieval quality: are the right documents being found?
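Step 1 can start as simply as fixed-size chunking with overlap; the chunk and overlap sizes below are illustrative defaults, not recommendations, and real pipelines often chunk on sentence or section boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks.

    Overlap keeps a sentence that straddles a boundary retrievable
    from both of the chunks it falls into.
    """
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 1200, chunk_size=500, overlap=50)
print(len(chunks), [len(c) for c in chunks])  # → 3 [500, 500, 300]
```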
Real-World Example
Internal Documentation Q&A
Employee asks: 'What's our parental leave policy?' System embeds the query, searches the HR document database, retrieves the relevant policy section, and generates a concise answer with a link to the full policy document.
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts with OpenAI's embedding API."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [e.embedding for e in response.data]

def rag_query(question: str, documents: list[str], top_k: int = 3) -> str:
    # Embed the corpus and the query. OpenAI embeddings are unit-normalized,
    # so a dot product is equivalent to cosine similarity.
    doc_embeddings = np.array(embed(documents))
    query_embedding = np.array(embed([question])[0])
    similarities = doc_embeddings @ query_embedding

    # Select the top-k most similar documents, best match first.
    top_indices = np.argsort(similarities)[-top_k:][::-1]

    # Label each retrieved document so the model can cite it by number.
    context = "\n\n".join(
        f"[{rank + 1}] {documents[i]}" for rank, i in enumerate(top_indices)
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context. Cite sources by their [number]."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content