Do I need a vector database for RAG?

For prototypes, no: you can use Postgres with pgvector or even an in-memory FAISS index. For production at scale, a dedicated vector database (Pinecone, Qdrant, Weaviate) provides better latency, hybrid search, and operational features. Choose based on scale: pgvector handles up to ~10M vectors; dedicated databases scale to billions.

How do I evaluate RAG accuracy?

Use a golden dataset of question-answer-document triples curated by subject matter experts. Run automated metrics like RAGAS (faithfulness, answer relevancy, context precision/recall). Layer in LLM-as-judge for nuanced quality assessment. Sample human review for calibration. Track these metrics over time: RAG quality silently degrades as documents and queries evolve.
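As a rough illustration of the golden-dataset idea, the sketch below scores a retriever against expert-labelled document IDs. It is a simplified, ID-based stand-in for context precision and recall, not the RAGAS implementation; `GoldenExample`, `precision_recall_at_k`, `evaluate`, and the fake retriever are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class GoldenExample:
    question: str
    expected_answer: str
    relevant_doc_ids: set[str]  # documents an expert marked as needed to answer


def precision_recall_at_k(
    retrieved_ids: list[str], relevant_ids: set[str], k: int
) -> tuple[float, float]:
    """ID-based precision and recall over the top-k retrieved documents."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


def evaluate(golden: list[GoldenExample], retrieve, k: int = 5) -> dict[str, float]:
    """Average precision@k and recall@k across the golden dataset."""
    scores = [
        precision_recall_at_k(retrieve(ex.question), ex.relevant_doc_ids, k)
        for ex in golden
    ]
    n = len(scores) or 1
    return {
        "precision_at_k": sum(p for p, _ in scores) / n,
        "recall_at_k": sum(r for _, r in scores) / n,
    }


# Hypothetical golden example and a fake retriever standing in for a real one.
golden = [
    GoldenExample(
        question="How do I reset my password?",
        expected_answer="Use the account settings page.",
        relevant_doc_ids={"doc-1", "doc-7"},
    )
]
report = evaluate(golden, lambda q: ["doc-1", "doc-3", "doc-7", "doc-9"], k=4)
print(report)
```

Logging a report like this on every index or prompt change is one way to notice the silent degradation described above. Faithfulness and answer relevancy need an LLM or human judge and are not covered by this sketch.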

Can RAG handle role-based access control?

Yes, but it has to be designed in. The pattern is filter-first retrieval: before similarity search, filter the index by the user's permissions (RBAC, ABAC, or row-level security). This ensures an engineering user can't retrieve HR documents even if they're the most semantically relevant result. Bolting permissions on after retrieval is broken: sensitive context still passes through the model.
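A minimal sketch of filter-first retrieval, assuming each chunk carries an `allowed_roles` set as metadata. The `Chunk` class, the two-dimensional embeddings, and the role names are made up for illustration; a production system would push the same filter down into the vector database's metadata filter or row-level security.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    embedding: list[float]
    allowed_roles: frozenset[str]  # hypothetical per-chunk permission metadata


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_first_search(
    index: list[Chunk], query_embedding: list[float], user_roles: set[str], top_k: int = 3
) -> list[Chunk]:
    # Step 1: drop every chunk the user may not read, before any scoring.
    visible = [c for c in index if c.allowed_roles & user_roles]
    # Step 2: rank only the permitted chunks by similarity to the query.
    visible.sort(key=lambda c: cosine(c.embedding, query_embedding), reverse=True)
    return visible[:top_k]


index = [
    Chunk("Salary bands for 2025", [1.0, 0.0], frozenset({"hr"})),
    Chunk("Deployment runbook", [0.8, 0.6], frozenset({"engineering"})),
    Chunk("Holiday calendar", [0.0, 1.0], frozenset({"hr", "engineering"})),
]
results = filter_first_search(index, [1.0, 0.0], user_roles={"engineering"})
print([c.text for c in results])
```

In this toy index the HR chunk is the closest match to the query, yet it never reaches the ranking step for an engineering user, so it cannot end up in the prompt.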

What's the typical cost of running RAG?

Inference cost dominates: ~$0.001-$0.01 per query for embeddings + LLM generation, depending on model choice. Storage cost is negligible (millions of vectors fit in <$100/mo on managed databases). Retrieval cost varies: Pinecone serverless is pay-per-query (~$0.0001/query); self-hosted Qdrant or pgvector is fixed infrastructure cost.
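The per-query figure can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculator; the token counts and per-million-token prices are placeholder assumptions, not quotes from any provider, and the query-embedding cost is left out because it is usually tiny by comparison.

```python
def cost_per_query(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    retrieval_cost: float = 0.0,
) -> float:
    """Estimated USD cost of one RAG query: LLM input + LLM output + retrieval."""
    llm_input = prompt_tokens / 1_000_000 * input_price_per_mtok
    llm_output = completion_tokens / 1_000_000 * output_price_per_mtok
    return llm_input + llm_output + retrieval_cost


# Placeholder scenario: 5 retrieved chunks of ~400 tokens plus ~200 tokens of
# question and instructions, a ~300-token answer, hypothetical prices of
# $0.50 / $1.50 per million input / output tokens, and the ~$0.0001
# pay-per-query retrieval figure mentioned above.
estimate = cost_per_query(
    prompt_tokens=5 * 400 + 200,
    completion_tokens=300,
    input_price_per_mtok=0.50,
    output_price_per_mtok=1.50,
    retrieval_cost=0.0001,
)
print(f"${estimate:.5f} per query")
```

Under these assumptions the estimate lands near the low end of the $0.001-$0.01 range. Swapping in a larger model or more retrieved chunks moves it toward the high end.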

What is RAG (Retrieval Augmented Generation)?

What's the difference between RAG and fine-tuning?

RAG retrieves knowledge at query time and injects it into the model's context. Fine-tuning bakes knowledge into the model's weights via additional training. RAG is preferred when your knowledge changes (legal, medical, financial), when you need citations, or when you have access control requirements. Fine-tuning is preferred when you need to change the model's style or behavior consistently.

RAG (Retrieval Augmented Generation) is an AI architecture that retrieves relevant documents from a knowledge base and injects them into a large language model's context window before generating an answer. This grounds responses in source material instead of relying purely on the model's parametric memory.

Last updated 2026-04-27 · BearPlex AI Engineering Team

Overview

RAG was introduced in a 2020 Facebook AI paper and has become the dominant pattern for production enterprise AI. The architecture addresses the two biggest problems with raw LLMs: hallucinations (the model inventing facts) and stale knowledge (the model only knows what it was trained on). By retrieving fresh, authoritative documents at query time and instructing the model to ground its answers in those documents, RAG enables AI systems that cite sources, respect document permissions, and stay current as your knowledge base evolves.

How RAG works

A RAG system has three core components: an indexing pipeline that processes source documents into searchable chunks; a retrieval layer that finds the most relevant chunks for a given query; and a generation step that injects those chunks into the LLM's prompt. At query time: (1) the user's question is converted to an embedding vector; (2) the vector is matched against the indexed document chunks via similarity search (often combined with keyword search for hybrid retrieval); (3) the top-K matches are reranked for relevance; (4) the matches are inserted into a prompt template along with the user's question; (5) the LLM generates an answer grounded in those specific documents, ideally with inline citations.
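The query-time flow can be sketched end to end in a few lines. This is a toy version under stated assumptions: `embed` uses word counts as a stand-in for a real embedding model, reranking (step 3) is skipped, and the final LLM call (step 5) is left out so the example runs without any external service. All function names are illustrative.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Indexing pipeline: embed each chunk once, ahead of query time."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, top_k: int = 2) -> list[str]:
    """Steps 1-2: embed the question and rank chunks by similarity."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(item[1], q), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Step 4: insert numbered chunks and the question into a prompt template."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the sources below and cite them as [n].\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


index = build_index([
    "Refunds are issued within 14 days of purchase.",
    "Support is available by email on weekdays.",
    "Refunds need the original order number.",
])
question = "When are refunds issued?"
prompt = build_prompt(question, retrieve(index, question))
print(prompt)  # step 5 would send this prompt to an LLM
```

Numbering the chunks in the prompt is what makes inline citations possible: the model can answer with "[1]" and the application can map that marker back to the source document.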

When to use RAG vs alternatives

RAG is the right choice when your knowledge base changes frequently, when you need citations, when you have role-based access controls, or when your data is too large to fit in a model's context window. Fine-tuning is better when you need to change the model's style or behavior rather than its knowledge. Long-context models (1M+ tokens) can sometimes replace RAG for small static knowledge bases, but cost-per-query and retrieval precision usually favor RAG at scale.

RAG architectures (basic to advanced)

Naive RAG: split documents → embed → store → retrieve top-K → answer. Works for prototypes but breaks at scale.
Advanced RAG: adds query rewriting, hybrid search (dense + sparse), cross-encoder reranking, and citation tracking.
GraphRAG: builds a knowledge graph over documents and traverses it to follow multi-hop relationships. Strong for complex reasoning over interconnected data.
Agentic RAG: gives the LLM tools to perform multiple retrieval steps, refine queries, and decide when it has enough context.
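Hybrid search needs some way to merge the dense and sparse result lists. One common option, chosen here as an assumption rather than something the text above prescribes, is reciprocal rank fusion (RRF): each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The document IDs below are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    k dampens the influence of top ranks; 60 is a commonly used default.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense = ["doc-a", "doc-b", "doc-c"]   # e.g. from embedding similarity
sparse = ["doc-c", "doc-a", "doc-d"]  # e.g. from keyword (BM25-style) search
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Because RRF uses only ranks, it avoids having to normalize dense similarity scores against keyword scores that live on a different scale. A cross-encoder reranker would typically run on the fused list afterwards.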

Use cases

Customer support AI grounded in product documentation, past tickets, and resolution playbooks
Legal contract analysis with citations to specific clauses
Healthcare clinical decision support with cited evidence from medical literature
Internal knowledge management for enterprise wikis, Confluence, SharePoint
Compliance and regulatory Q&A with citation back to source regulations
Sales enablement with retrieval over call recordings, win/loss notes, and pricing playbooks

Examples in production

Anthropic

Anthropic's Claude provides 'Citations' as a first-class API feature, generating responses with verifiable references to source documents.

OpenAI

OpenAI's Assistants API includes a built-in file search retrieval system implementing RAG patterns over uploaded documents.

Perplexity AI

Perplexity is a consumer-facing RAG system over the web: every answer comes with cited sources, demonstrating the architecture at search-engine scale.

Microsoft Copilot

Microsoft 365 Copilot uses RAG (called Microsoft Graph grounding) to retrieve from a user's emails, files, and calendar before answering.

RAG compared to alternatives

Alternative	Choose RAG when	Choose alternative when
Fine-tuning: modifies the model's weights via additional training on domain examples	Your knowledge changes frequently, you need citations, or you have role-based access requirements.	You need to change the model's style, format, or behavior, and you have 1,000+ high-quality training examples.
Long-context models: LLMs with very large context windows (1M+ tokens) that can fit entire knowledge bases	You need cost efficiency at scale, fresh data, and access control. RAG queries typically cost a cent or less; long-context queries can cost dollars.	Your knowledge base is small and static (< 1M tokens) and retrieval engineering overhead exceeds inference cost savings.
Prompt-only with browsing: letting the LLM use a web search tool at runtime	You need consistent, controllable, repeatable answers from your specific document set.	You genuinely need open-web knowledge and accept variability in source quality.

Common pitfalls

Treating retrieval as solved: chunking strategy, hybrid search, and reranking each move accuracy by 10-30 points. Most RAG systems leave 20+ points on the table by skipping these steps.
Skipping evaluation: without a golden dataset and faithfulness metrics (RAGAS, LLM-as-judge), you have no way to detect when retrieval quality silently degrades.
Forgetting access control: most RAG tutorials show retrieval without permissions. In production, role-based document filtering must happen BEFORE similarity search, not after.
Over-retrieving: dumping 50 chunks into a 100K context window doesn't help; it actually hurts (lost-in-the-middle effect). Optimal retrieval typically returns 3-7 high-quality chunks.
Ignoring chunking strategy: arbitrary 512-token chunks destroy semantic boundaries. Use semantic chunking, sentence boundaries, or document-structure-aware splitting.
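As one illustration of boundary-aware splitting, the sketch below packs whole sentences into chunks instead of cutting at a fixed offset. It is deliberately simple: word count stands in for token count, and the regex sentence splitter will mis-handle abbreviations such as "e.g.", so treat it as a starting point rather than a recommended chunker. The function name and the `max_words` parameter are assumptions for this example.

```python
import re


def chunk_by_sentences(text: str, max_words: int = 50) -> list[str]:
    """Pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversized chunk
    rather than being cut in the middle.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and current_words + n > max_words:
            chunks.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "RAG retrieves documents. It then builds a prompt. The model answers with citations."
chunks = chunk_by_sentences(text, max_words=8)
print(chunks)
```

Document-structure-aware splitting follows the same pattern with headings or paragraphs as the unit instead of sentences.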

Related BearPlex services

RAG & Knowledge Systems
Model Engineering & Fine-Tuning
Autonomous AI Agents

FAQ

Questions about RAG.

Need help implementing RAG?

BearPlex builds production AI systems that use RAG for Fortune 500s and high-growth scale-ups. Outcome-based pricing. 90-day embedded sprints.
