Do we need overlap between chunks?

Usually yes. Without overlap, a fact that straddles a chunk boundary (e.g., 'X is defined as...' in chunk N, the actual definition in chunk N+1) may not be retrieved when it is needed. An overlap of 10-20% (50-100 tokens for 500-token chunks) catches most boundary issues without significant index bloat.
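A minimal sketch of fixed-size chunking with overlap, to make the boundary behaviour concrete. It assumes the text is already tokenized into a list; the function name `chunk_with_overlap` and the placeholder tokens are illustrative, not taken from any library.

```python
def chunk_with_overlap(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder tokens; a real pipeline would use the embedding model's tokenizer.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_with_overlap(tokens, size=500, overlap=50)
```

With these settings the last 50 tokens of each chunk reappear at the start of the next one, so a definition that begins near the end of one chunk is also present in full or in part in its neighbour.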

Should we use semantic chunking or simpler approaches?

For most production cases, recursive character splitting on paragraph/sentence boundaries is enough. Semantic chunking (using embedding similarity to find topic boundaries) is meaningfully better for documents with clear topic shifts (research papers, policy documents) but adds preprocessing cost. Decide based on document type and measured quality lift on your eval set.
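The semantic approach can be sketched as follows. This is an illustration under strong assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the `0.2` threshold is arbitrary; in practice both would be replaced and tuned on eval data.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, threshold=0.2, embed=toy_embed):
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    prev = embed(sentences[0])
    for sentence in sentences[1:]:
        cur = embed(sentence)
        if cosine(prev, cur) < threshold:
            groups.append([])  # similarity dropped: treat this as a topic boundary
        groups[-1].append(sentence)
        prev = cur
    return [" ".join(group) for group in groups]


sentences = [
    "Refunds are issued within thirty days.",
    "Refunds are issued to the original payment method.",
    "Shipping takes five business days.",
    "Shipping takes longer for large orders.",
]
topic_chunks = semantic_chunks(sentences)
```

Here the refund sentences and the shipping sentences end up in separate chunks. The extra cost mentioned above comes from embedding every sentence before any chunk is indexed.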

What about chunking code or structured data?

Use structure-aware chunking. For code, chunk on function and class boundaries (tree-sitter or AST-based parsers). For markdown, chunk on header sections. For JSON, chunk on logical record boundaries. Generic text chunkers will split mid-function or mid-record, breaking semantic units that should stay together.
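For Python source specifically, the standard-library `ast` module is enough to sketch the idea; other languages would need a parser such as tree-sitter. The function name `chunk_python_source` and the sample source are illustrative assumptions.

```python
import ast


def chunk_python_source(source):
    """Return one chunk per top-level function or class in a Python module."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                # Source text of the definition itself (decorators are not included).
                "text": ast.get_source_segment(source, node),
            })
    return chunks


SAMPLE = '''
import os

def load(path):
    with open(path) as f:
        return f.read()

class Index:
    def add(self, doc):
        self.docs.append(doc)
'''
code_chunks = chunk_python_source(SAMPLE)
```

Each chunk is a whole function or class, so a retrieved result never starts or ends mid-definition. Module-level statements such as imports are skipped in this sketch; a fuller version might collect them into a separate chunk.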

What is Document Chunking in RAG?

Chunking is the preprocessing step in RAG pipelines where source documents are split into smaller passages (typically 200-800 tokens each) before embedding and indexing. This enables fine-grained retrieval that returns just the relevant section of a long document rather than the entire document.

Last updated 2026-04-28, BearPlex AI Engineering Team

Overview

Chunking is the most underestimated component of RAG quality. The chunking strategy directly determines what units of content can be retrieved: too small and chunks lose context, too large and irrelevant content drowns relevant content. Bad chunking is the single most common cause of mediocre RAG quality we see when auditing client systems. Frameworks like LlamaIndex and LangChain provide chunking utilities, but the right chunking strategy is heavily document-type dependent: legal contracts chunk differently from API documentation, which chunks differently from chat transcripts. At BearPlex, chunking strategy is something we tune empirically against eval data, not a parameter we set once and forget.

Common chunking strategies

(1) Fixed-size chunking: split documents into equal-size chunks (e.g., 500 tokens each) with optional overlap (50-100 tokens). Simplest, and often a reasonable baseline, but it ignores semantic boundaries, so a chunk might end mid-sentence.
(2) Recursive character splitting: split on natural boundaries (paragraphs, sentences, lines) hierarchically until chunks are below the target size. Better than fixed-size for natural-language documents.
(3) Semantic chunking: use embedding similarity between adjacent sentences to detect topic boundaries, and break at those boundaries. Higher quality, more expensive to compute.
(4) Document-structure chunking: use document structure (markdown headers, HTML elements, code function boundaries) to chunk along natural unit boundaries. Best for structured documents (markdown, HTML, code).
(5) Hierarchical chunking: store chunks at multiple granularities (paragraph, section, document) and retrieve at the right level for the query.
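Recursive character splitting, strategy (2), can be sketched in a few lines. This is a simplified illustration, not the implementation used by LangChain or LlamaIndex: it measures size in characters rather than tokens, and separators that fall on a chunk boundary are dropped.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; use finer ones only for oversized parts."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks, buffer = [], ""
    for part in text.split(sep):
        candidate = part if not buffer else buffer + sep + part
        if len(candidate) <= max_chars:
            buffer = candidate  # keep packing parts into the current chunk
            continue
        if buffer.strip():
            chunks.append(buffer)
        buffer = ""
        if len(part) <= max_chars:
            buffer = part
        else:
            # This part alone is too large: split it with the finer separators.
            chunks.extend(recursive_split(part, max_chars, finer))
    if buffer.strip():
        chunks.append(buffer)
    return chunks


doc = "Short intro paragraph.\n\n" + "Long sentence number one is here. " * 10
pieces = recursive_split(doc, max_chars=80)
```

The short paragraph stays whole, while the long one is broken at sentence boundaries because no paragraph or line break is available inside it.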

Choosing chunk size

Two forces compete:
(1) Smaller chunks give more precise retrieval (the LLM gets exactly the relevant passage, not paragraphs of surrounding noise) but lose context (a chunk about 'the policy' might not say which policy).
(2) Larger chunks carry more context (the chunk includes enough surrounding content to be self-contained) but lower precision (irrelevant content competes for the LLM's attention).
Production sweet spot for most use cases: 300-600 tokens per chunk with 50-100 token overlap. Adjust based on document type: shorter for FAQ-style content, longer for legal/policy documents that require surrounding context.

Beyond basic chunking

Production RAG systems often add:
(1) Chunk metadata: document title, section header, page number, last-updated date, and permissions, available at retrieval time for filtering and provided to the LLM as context.
(2) Hypothetical questions: for each chunk, generate likely questions it could answer and embed those questions for retrieval. This improves retrieval recall on long-tail queries.
(3) Summary embedding: embed both the full chunk and a summary; query against the summary, return the full chunk.
(4) Parent-child chunking: index small chunks for precise retrieval, and return the larger parent chunk to the LLM for context.
The right pattern is empirical: measure on your eval data, don't pick by gut feel.
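Parent-child chunking, pattern (4), is sketched below. Word-overlap scoring is a toy stand-in for vector search, and the function names, `child_size`, and sample texts are assumptions made for illustration.

```python
def build_parent_child_index(parents, child_size=8):
    """Cut each parent into small child chunks that remember their parent's id."""
    children = []
    for parent_id, parent_text in enumerate(parents):
        words = parent_text.split()
        for start in range(0, len(words), child_size):
            children.append({
                "parent_id": parent_id,
                "text": " ".join(words[start:start + child_size]),
            })
    return children


def retrieve_parent(query, children, parents):
    """Toy retrieval: pick the child with the most query words, return its parent."""
    query_words = set(query.lower().split())
    best = max(children, key=lambda c: len(query_words & set(c["text"].lower().split())))
    return parents[best["parent_id"]]


parents = [
    "Refund policy. Customers may request a refund within thirty days of purchase. "
    "Refunds go to the original payment method.",
    "Shipping policy. Standard shipping takes five business days. "
    "Express shipping takes two business days.",
]
children = build_parent_child_index(parents)
context = retrieve_parent("how long does express shipping take", children, parents)
```

The match is made against a short child chunk, but the LLM receives the whole shipping section, including the sentence that names the policy.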

Use cases

Preprocessing documents before embedding for RAG
Building searchable knowledge bases from PDFs, web content, or internal documents
Indexing long-form content (legal contracts, research papers, policy manuals)
Code search where chunking should respect function and class boundaries
Meeting transcript indexing where chunking should respect speaker turns

Examples in production

LlamaIndex

Provides comprehensive chunking utilities (SentenceSplitter, SemanticSplitterNodeParser, MarkdownNodeParser) for production RAG pipelines.

LangChain

RecursiveCharacterTextSplitter is one of the most widely used chunkers in production. It splits on natural boundaries with configurable overlap.

Anthropic Contextual Retrieval

Anthropic's Contextual Retrieval pattern (2024) prepends each chunk with LLM-generated context describing how the chunk relates to its source document. Anthropic reports that contextual embeddings alone reduced the rate of failed retrievals by 35% in its benchmarks.
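The prepending step itself is simple to sketch. The `describe` callable below is hypothetical: in a real pipeline it would be an LLM call that sees the whole document and the chunk, and `stub_describe` only stands in for it so the example runs without any API.

```python
def contextualize_chunks(document, chunks, describe):
    """Prepend a chunk-specific context string to each chunk before embedding."""
    contextualized = []
    for chunk in chunks:
        context = describe(document, chunk)
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized


def stub_describe(document, chunk):
    # Hypothetical stand-in for an LLM call; it only reuses the document's first line.
    first_line = document.strip().splitlines()[0]
    return f"This excerpt is from the document titled '{first_line}'."


document = "Q2 Revenue Report\nRevenue grew 3% over the previous quarter.\nCosts were flat."
chunks = ["Revenue grew 3% over the previous quarter.", "Costs were flat."]
texts_to_embed = contextualize_chunks(document, chunks, stub_describe)
```

The contextualized text is what gets embedded and indexed, so a chunk like 'Costs were flat.' carries enough information to be matched to the right report.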

Chunking compared to alternatives

Alternative	Choose chunking when	Choose the alternative when
Whole-document indexing (embed entire documents without chunking)	The document is longer than ~500 tokens; chunking gives much more precise retrieval	Documents are very short (FAQs, single-paragraph items)
Sentence-level indexing (embed and index individual sentences)	In most cases; sentences usually lack enough context for retrieval to work well	Very specific use cases (legal precedent matching) where granularity matters more than context

Common pitfalls

Default 1000-token chunks for everything: works for some content, terrible for others
No overlap between chunks: facts straddling chunk boundaries get lost
Chunking before cleaning: embedded HTML, navigation menus, footers pollute chunks
Chunking without preserving document structure (headers, sections): context gets stripped
Setting chunk size once and never re-evaluating: chunking should be tuned on eval data
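One way to avoid the structure-stripping pitfall for markdown is to chunk on headers and keep the header path as metadata. The sketch below is illustrative: the function name is an assumption, it handles only `#`-style headers, and it does not special-case `#` lines inside fenced code.

```python
import re


def chunk_markdown_by_headers(markdown):
    """Split on '#' headers and attach the header path to each section."""
    chunks, path, body = [], [], []

    def flush():
        # Emit the text collected under the current header path, if any.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headers": list(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headers at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


MD = "# Guide\nIntro text.\n## Install\nRun the installer.\n## Usage\nCall the API.\n"
sections = chunk_markdown_by_headers(MD)
```

Each section keeps its place in the outline, for example `["Guide", "Install"]`, which can be stored as chunk metadata or prepended to the chunk text before embedding.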

FAQ

Questions about Chunking.

What chunk size should we use?

300-600 tokens for most use cases, with 50-100 token overlap. Adjust based on document type and empirical evaluation: shorter (200-400) for Q&A-style content, longer (600-1000) for legal or policy documents that require surrounding context. Always measure on your eval data, because generic recommendations don't beat empirical tuning.

Need help implementing Chunking?

BearPlex builds production AI systems that use Chunking for Fortune 500s and high-growth scale-ups. Outcome-based pricing. 90-day embedded sprints.
