AI engineering glossary

What is RAG (Retrieval Augmented Generation)?

RAG (Retrieval Augmented Generation) is an AI architecture that retrieves relevant documents from a knowledge base and injects them into a large language model's context window before generating an answer, grounding responses in source material instead of relying purely on the model's parametric memory.

Last updated 2026-04-27 by the BearPlex AI Engineering Team

Overview

RAG was introduced in a 2020 Facebook AI paper and has become a dominant pattern for production enterprise AI. The architecture targets two of the biggest problems with raw LLMs: hallucinations (the model inventing facts) and stale knowledge (the model only knows what it was trained on). By retrieving fresh, authoritative documents at query time and instructing the model to ground its answers in those documents, RAG enables AI systems that cite sources, respect document permissions, and stay current as your knowledge base evolves.

How RAG works

A RAG system has three core components: an indexing pipeline that processes source documents into searchable chunks; a retrieval layer that finds the most relevant chunks for a given query; and a generation step that injects those chunks into the LLM's prompt. At query time: (1) the user's question is converted to an embedding vector; (2) the vector is matched against the indexed document chunks via similarity search (often combined with keyword search for hybrid retrieval); (3) the top-K matches are reranked for relevance; (4) the matches are inserted into a prompt template along with the user's question; (5) the LLM generates an answer grounded in those specific documents, ideally with inline citations.
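
The query-time steps above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not a production design: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the reranking step is omitted, and the final LLM call is left out so the sketch stops at the assembled prompt. All function names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks):
    """Indexing pipeline: store each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index, question, k=2):
    """Retrieval layer: embed the question and return the top-k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question, chunks):
    """Generation step (prompt only): number the sources so the model can cite them."""
    sources = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the numbered sources below and cite them inline.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )

# Hypothetical knowledge base, already split into chunks.
CHUNKS = [
    "Refunds are available within 30 days of purchase.",
    "The API rate limit is 100 requests per minute.",
    "Support is available by email on weekdays.",
]

index = build_index(CHUNKS)
question = "What is the API rate limit?"
top = retrieve(index, question, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

In a real system the prompt would be sent to an LLM, and `embed` and `retrieve` would be backed by an embedding model and a vector index.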

When to use RAG vs alternatives

RAG is the right choice when your knowledge base changes frequently, when you need citations, when you have role-based access controls, or when your data is too large to fit in a model's context window. Fine-tuning is better when you need to change the model's style or behavior rather than its knowledge. Long-context models (1M+ tokens) can sometimes replace RAG for small static knowledge bases, but cost-per-query and retrieval precision usually favor RAG at scale.

RAG architectures (basic to advanced)

  • Naive RAG: split documents → embed → store → retrieve top-K → answer. Works for prototypes but breaks at scale.
  • Advanced RAG: adds query rewriting, hybrid search (dense + sparse), cross-encoder reranking, and citation tracking.
  • GraphRAG: builds a knowledge graph over documents and uses it to traverse multi-hop relationships. Strong for complex reasoning over interconnected data.
  • Agentic RAG: gives the LLM tools to perform multiple retrieval steps, refine queries, and decide when it has enough context.
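
Hybrid search needs some way to merge the dense and sparse result lists. One common option, not named in the text above and shown here only as an illustration, is reciprocal rank fusion (RRF), which scores each document by the sum of 1 / (k + rank) across the lists. The document IDs are hypothetical, and k = 60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document gets 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending total score,
    with ties broken alphabetically by ID.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical outputs of a dense (embedding) search and a sparse (keyword) search.
dense_results = ["doc_a", "doc_b", "doc_c"]
sparse_results = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_results, sparse_results])
print(fused)
```

Documents that rank well in both lists (here `doc_b` and `doc_a`) rise to the top, while documents found by only one retriever are kept rather than discarded.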

Use cases

  • Customer support AI grounded in product documentation, past tickets, and resolution playbooks
  • Legal contract analysis with citations to specific clauses
  • Healthcare clinical decision support with cited evidence from medical literature
  • Internal knowledge management for enterprise wikis, Confluence, SharePoint
  • Compliance and regulatory Q&A with citation back to source regulations
  • Sales enablement with retrieval over call recordings, win/loss notes, and pricing playbooks

Examples in production

Anthropic

Anthropic's Claude provides 'Citations' as a first-class API feature, generating responses with verifiable references to source documents.

OpenAI

OpenAI's Assistants API includes a built-in file search retrieval system implementing RAG patterns over uploaded documents.

Perplexity AI

Perplexity is a consumer-facing RAG system over the web: every answer comes with cited sources, demonstrating the architecture at search-engine scale.

Microsoft Copilot

Microsoft 365 Copilot uses RAG (called Microsoft Graph grounding) to retrieve from a user's emails, files, and calendar before answering.

RAG compared to alternatives

Fine-tuning: modifies the model's weights via additional training on domain examples.
  • Choose RAG when your knowledge changes frequently, when you need citations, or when you have role-based access requirements.
  • Choose fine-tuning when you need to change the model's style, format, or behavior, and you have 1,000+ high-quality training examples.

Long-context models: LLMs with very large context windows (1M+ tokens) that can fit entire knowledge bases.
  • Choose RAG for cost efficiency at scale, fresh data, and access control. RAG queries cost cents; long-context queries cost dollars.
  • Choose long-context for small static knowledge bases (< 1M tokens) where retrieval engineering overhead exceeds inference cost savings.

Prompt-only with browsing: letting the LLM use a web search tool at runtime.
  • Choose RAG when you need consistent, controllable, repeatable answers from your specific document set.
  • Choose browsing when you genuinely need open-web knowledge and accept variability in source quality.

Common pitfalls

  • Treating retrieval as solved: chunking strategy, hybrid search, and reranking can each move accuracy by 10-30 percentage points. Most RAG systems leave 20+ points on the table by skipping these steps.
  • Skipping evaluation: without a golden dataset and faithfulness metrics (RAGAS, LLM-as-judge), you have no way to detect when retrieval quality silently degrades.
  • Forgetting access control: most RAG tutorials show retrieval without permissions. In production, role-based document filtering must happen BEFORE similarity search, not after.
  • Over-retrieving: dumping 50 chunks into a 100K context window doesn't help; it actually hurts (lost-in-the-middle effect). Optimal retrieval typically returns 3-7 high-quality chunks.
  • Ignoring chunking strategy: arbitrary 512-token chunks destroy semantic boundaries. Use semantic chunking, sentence boundaries, or document-structure-aware splitting.
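
As an illustration of the last pitfall, the sketch below packs whole sentences into chunks instead of cutting at a fixed offset. It is a simplified example under stated assumptions: sentence boundaries are detected with a naive regex, and size is measured in whitespace-separated words as a rough stand-in for tokens. The function name and sample text are hypothetical.

```python
import re

def chunk_by_sentences(text, max_words=40):
    """Group whole sentences into chunks of at most max_words words.

    A sentence is never split, so a single sentence longer than max_words
    becomes its own oversized chunk.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = (
    "RAG retrieves documents. It then builds a prompt. "
    "The model answers with citations."
)
chunks = chunk_by_sentences(text, max_words=8)
for chunk in chunks:
    print(chunk)
```

Real pipelines would typically count tokens with the embedding model's tokenizer and also respect headings, tables, and list boundaries.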

FAQ

Questions about RAG.

What is the difference between RAG and fine-tuning?

RAG retrieves knowledge at query time and injects it into the model's context. Fine-tuning bakes knowledge into the model's weights via additional training. RAG is preferred when your knowledge changes (legal, medical, financial), when you need citations, or when you have access control requirements. Fine-tuning is preferred when you need to change the model's style or behavior consistently.

Do I need a vector database for RAG?

For prototypes, no: you can use Postgres with pgvector or even an in-memory FAISS index. For production at scale, a dedicated vector database (Pinecone, Qdrant, Weaviate) provides better latency, hybrid search, and operational features. Choose based on scale: as a rough guide, pgvector handles up to ~10M vectors, while dedicated databases scale to billions.

How do I evaluate a RAG system?

Use a golden dataset of question-answer-document triples curated by subject matter experts. Run automated metrics like RAGAS (faithfulness, answer relevancy, context precision/recall). Layer in LLM-as-judge for nuanced quality assessment. Sample human review for calibration. Track these metrics over time, because RAG quality silently degrades as documents and queries evolve.
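
The retrieval side of such an evaluation can be approximated with simple ID-based precision and recall at k over the golden dataset. The sketch below is not the RAGAS implementation; it is a minimal stand-in that assumes each golden example lists the IDs of the documents an expert marked as relevant. The dataset, document IDs, and function name are hypothetical.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for one query.

    retrieved: ranked list of document IDs returned by the retriever.
    relevant:  set of document IDs marked relevant in the golden dataset.
    """
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical golden examples paired with what the retriever returned.
golden = [
    {"question": "What is the refund window?",
     "relevant": {"d1", "d2"}, "retrieved": ["d1", "d5", "d2", "d9"]},
    {"question": "What is the API rate limit?",
     "relevant": {"d3"}, "retrieved": ["d7", "d8", "d3"]},
]

results = [precision_recall_at_k(ex["retrieved"], ex["relevant"], k=2) for ex in golden]
mean_precision = sum(p for p, _ in results) / len(results)
mean_recall = sum(r for _, r in results) / len(results)
print(f"precision@2={mean_precision:.2f} recall@2={mean_recall:.2f}")
```

Logging these two numbers on every index or prompt change gives an early signal of retrieval drift before it shows up in answer quality.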

Can RAG respect document permissions?

Yes, but it has to be designed in. The pattern is filter-first retrieval: before similarity search, filter the index by the user's permissions (RBAC, ABAC, or row-level security). This ensures an engineering user can't retrieve HR documents even if they're the most semantically relevant result. Bolting permissions on after retrieval is fragile: any restricted chunk the filter misses still passes through the model.
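
A minimal sketch of the filter-first pattern follows. It assumes each chunk carries a set of roles allowed to read it, and it uses a toy word-overlap score in place of real vector similarity. The `Chunk` class, role names, and sample documents are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_roles: frozenset  # roles permitted to read this chunk

def overlap_score(query, text):
    """Toy relevance score: fraction of query words present in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def filter_first_search(chunks, query, user_roles, k=3):
    """Apply the permission filter BEFORE ranking, then return the top-k chunks."""
    visible = [c for c in chunks if c.allowed_roles & user_roles]
    ranked = sorted(visible, key=lambda c: overlap_score(query, c.text), reverse=True)
    return ranked[:k]

CHUNKS = [
    Chunk("salary bands for engineering staff", frozenset({"hr"})),
    Chunk("engineering on-call rotation policy", frozenset({"engineering", "hr"})),
    Chunk("deployment checklist for engineering releases", frozenset({"engineering"})),
]

# An engineering user asks a question whose best semantic match is an HR-only chunk.
results = filter_first_search(CHUNKS, "engineering salary bands", {"engineering"})
for chunk in results:
    print(chunk.text)
```

Because the HR-only chunk is removed before scoring, it can never reach the prompt, no matter how relevant it is to the query.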

How much does RAG cost to run?

Inference cost dominates: roughly $0.001-$0.01 per query for embeddings + LLM generation, depending on model choice. Storage cost is negligible (millions of vectors fit in <$100/mo on managed databases). Retrieval cost varies: Pinecone serverless is pay-per-query (~$0.0001/query); self-hosted Qdrant or pgvector is a fixed infrastructure cost.
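
As a worked example of these figures, the sketch below estimates a monthly bill. The per-query and storage numbers are the rough figures quoted above; the volume of 100,000 queries per month is a hypothetical assumption chosen only for illustration.

```python
def monthly_cost(queries, inference_per_query, retrieval_per_query, storage_per_month):
    """Estimated monthly cost in dollars: per-query costs times volume, plus storage."""
    return queries * (inference_per_query + retrieval_per_query) + storage_per_month

QUERIES_PER_MONTH = 100_000  # hypothetical workload

# Low and high ends of the inference range, with pay-per-query retrieval
# and the upper-bound storage figure.
low = monthly_cost(QUERIES_PER_MONTH, 0.001, 0.0001, 100)
high = monthly_cost(QUERIES_PER_MONTH, 0.01, 0.0001, 100)
print(f"Estimated monthly cost: ${low:,.2f} to ${high:,.2f}")
```

At this volume the estimate ranges from about $210 to about $1,110 per month, and the spread comes almost entirely from the choice of generation model.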
