What is Document Chunking in RAG?
Chunking is the preprocessing step in RAG pipelines where source documents are split into smaller passages (typically 200-800 tokens each) before embedding and indexing. It enables fine-grained retrieval that returns only the relevant section of a long document rather than the entire document.
Overview
Chunking is the most underestimated component of RAG quality. The chunking strategy directly determines what units of content can be retrieved: too small and chunks lose context, too large and irrelevant content drowns relevant content. Bad chunking is the single most common cause of mediocre RAG quality we see when auditing client systems. Frameworks like LlamaIndex and LangChain provide chunking utilities, but the right chunking strategy is heavily document-type dependent: legal contracts chunk differently from API documentation, which chunks differently from chat transcripts. At BearPlex, chunking strategy is something we tune empirically against eval data, not a parameter we set once and forget.
Common chunking strategies
(1) Fixed-size chunking: split documents into equal-size chunks (e.g., 500 tokens each) with optional overlap (50-100 tokens). Simplest, often a reasonable baseline. Loses semantic boundaries: a chunk might end mid-sentence.
(2) Recursive character splitting: split on natural boundaries (paragraphs, sentences, lines) hierarchically until chunks are below target size. Better than fixed-size for natural-language documents.
(3) Semantic chunking: use embedding similarity between adjacent sentences to detect topic boundaries, then break at those boundaries. Higher quality, more expensive to compute.
(4) Document-structure chunking: use document structure (markdown headers, HTML elements, code function boundaries) to chunk along natural unit boundaries. Best for structured documents (markdown, HTML, code).
(5) Hierarchical chunking: store chunks at multiple granularities (paragraph, section, document) and retrieve at the right level for the query.
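As a rough illustration of the fixed-size baseline, the sketch below slides a window over a token list. The function name `fixed_size_chunks` and the synthetic token list are assumptions made for this example; a real pipeline would use the tokenizer of its embedding model.

```python
def fixed_size_chunks(tokens, chunk_size=500, overlap=50):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Synthetic tokens stand in for real tokenizer output (assumption).
tokens = [f"t{i}" for i in range(1200)]
chunks = fixed_size_chunks(tokens, chunk_size=500, overlap=50)
print([len(c) for c in chunks])  # [500, 500, 300]
```

Note that the window ignores sentence boundaries entirely, which is exactly the weakness described for strategy (1).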
Choosing chunk size
Two competing forces: (1) Smaller chunks = more precise retrieval (the LLM gets exactly the relevant passage, not paragraphs of surrounding noise) but lose context (a chunk about 'the policy' might not say which policy); (2) Larger chunks = more context (the chunk includes enough surrounding content to be self-contained) but lower precision (irrelevant content competes for the LLM's attention). Production sweet spot for most use cases: 300-600 tokens per chunk with 50-100 token overlap. Adjust based on document type: shorter for FAQ-style content, longer for legal/policy documents that require surrounding context.
Beyond basic chunking
Production RAG systems often add:
(1) Chunk metadata: document title, section header, page number, last-updated date, and permissions, available at retrieval time for filtering and provided to the LLM as context.
(2) Hypothetical questions: for each chunk, generate likely questions it could answer and embed those questions for retrieval. This improves retrieval recall on long-tail queries.
(3) Summary embedding: embed both the full chunk and a summary; query against the summary, return the full chunk.
(4) Parent-child chunking: index small chunks for precise retrieval, return the larger parent chunk to the LLM for context.
The right pattern is empirical: measure on your eval data, don't pick by gut feel.
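A minimal sketch of parent-child chunking with attached metadata might look like the following. The `Chunk` dataclass, the ID scheme, and the word-based child size are all assumptions for illustration; the retrieval step itself (vector search over the children) is left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def build_parent_child(doc_title, sections, child_words=40):
    """sections: list of (header, text). Returns (parents by id, list of children)."""
    parents, children = {}, []
    for s_idx, (header, text) in enumerate(sections):
        parent_id = f"{doc_title}#s{s_idx}"
        meta = {"title": doc_title, "section": header}
        parents[parent_id] = Chunk(parent_id, text, None, meta)
        words = text.split()
        # Small children are what gets embedded and searched.
        for c_idx, start in enumerate(range(0, len(words), child_words)):
            child_text = " ".join(words[start:start + child_words])
            children.append(Chunk(f"{parent_id}.c{c_idx}", child_text, parent_id, meta))
    return parents, children


def expand_to_parents(hit_children, parents):
    """Map retrieved children to de-duplicated parents, keeping rank order."""
    seen, out = set(), []
    for child in hit_children:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            out.append(parents[child.parent_id])
    return out


sections = [("Refunds", "word " * 100), ("Shipping", "ship " * 30)]
parents, children = build_parent_child("policy", sections)
# Pretend a vector search returned these children, best match first.
context = expand_to_parents([children[2], children[0], children[3]], parents)
print([p.metadata["section"] for p in context])  # ['Refunds', 'Shipping']
```

Two children from the same section collapse into one parent, so the LLM sees each section once even when several of its children match.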
Use cases
- Preprocessing documents before embedding for RAG
- Building searchable knowledge bases from PDFs, web content, or internal documents
- Indexing long-form content (legal contracts, research papers, policy manuals)
- Code search where chunking should respect function and class boundaries
- Meeting transcript indexing where chunking should respect speaker turns
Examples in production
LlamaIndex
Provides comprehensive chunking utilities (SentenceSplitter, SemanticSplitterNodeParser, MarkdownNodeParser) for production RAG pipelines.
LangChain
RecursiveCharacterTextSplitter is one of the most widely-used chunkers in production: splits on natural boundaries with configurable overlap.
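The idea behind recursive splitting can be sketched in plain Python. This is not LangChain's implementation or API: `recursive_split`, its separator list, and the character-based size limit are assumptions for illustration, and overlap is omitted to keep the sketch short.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; use finer ones only for oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No natural boundary left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) <= max_chars:
            if part.strip():
                pieces.append(part)
        else:
            pieces.extend(recursive_split(part, max_chars, finer))
    # Greedily merge neighbouring pieces back together up to the size limit.
    merged = []
    for piece in pieces:
        if merged and len(merged[-1]) + len(sep) + len(piece) <= max_chars:
            merged[-1] = merged[-1] + sep + piece
        else:
            merged.append(piece)
    return merged


text = "Para one is short.\n\n" + "Sentence number one is here. " * 10 + "\n\nLast para."
chunks = recursive_split(text, max_chars=100)
print(len(chunks), max(len(c) for c in chunks))
```

Short paragraphs survive intact, while the oversized middle paragraph is broken at sentence boundaries. Separators are re-inserted only approximately when pieces are merged, which is acceptable for a sketch but worth handling properly in production.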
Anthropic Contextual Retrieval
Anthropic's Contextual Retrieval pattern (2024) prepends each chunk with LLM-generated context describing how the chunk relates to its source document. In Anthropic's published benchmarks, contextual embeddings reduced the top-20-chunk retrieval failure rate by 35%.
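The prepending step of this pattern can be sketched as below. The `describe` callable is a hypothetical stand-in for the LLM call; the prompt, the model, and the function names are assumptions, and `fake_describe` exists only so the example runs without a model.

```python
from typing import Callable, List


def contextualize_chunks(
    document: str,
    chunks: List[str],
    describe: Callable[[str, str], str],
) -> List[str]:
    """Prepend a short, document-aware description to each chunk before embedding."""
    contextualized = []
    for chunk in chunks:
        context = describe(document, chunk).strip()
        contextualized.append(f"{context}\n\n{chunk}" if context else chunk)
    return contextualized


def fake_describe(document: str, chunk: str) -> str:
    # Stand-in for an LLM call (assumption): just names the source document.
    title = document.splitlines()[0]
    return f"From the document '{title}'."


doc = (
    "Acme Refund Policy\n"
    "Refunds are issued within 30 days.\n"
    "The policy excludes gift cards."
)
chunks = doc.splitlines()[1:]
for text in contextualize_chunks(doc, chunks, fake_describe):
    print(repr(text))
```

The second chunk mentions only "the policy"; after contextualization, the embedded text also names which policy it belongs to.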
Chunking compared to alternatives
| Alternative | Choose Chunking when | Choose alternative when |
|---|---|---|
| Whole-document indexing (embed entire documents without chunking) | Use chunking for any document longer than ~500 tokens; it gives much more precise retrieval | Whole-document indexing only for very short documents (FAQs, single-paragraph items) |
| Sentence-level indexing (embed and index individual sentences) | Use chunk-level indexing; sentences usually lack enough context for retrieval to work well | Sentence-level indexing for very specific use cases (legal precedent matching) where granularity matters more than context |
Common pitfalls
- Default 1000-token chunks for everything: works for some content, terrible for others
- No overlap between chunks: facts straddling chunk boundaries get lost
- Chunking before cleaning: embedded HTML, navigation menus, footers pollute chunks
- Chunking without preserving document structure (headers, sections): context gets stripped
- Setting chunk size once and never re-evaluating: chunking should be tuned on eval data
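One way to avoid the last pitfall is a small sweep over chunk sizes against an eval set. The sketch below is a toy under stated assumptions: word-based chunking, a lexical-overlap scorer instead of an embedding model, and a made-up corpus and eval set. Only the shape of the loop is the point.

```python
def chunk_words(text, size, overlap):
    """Word-based sliding window; words stand in for tokens (assumption)."""
    words = text.split()
    stride = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, stride)]


def overlap_score(query, chunk):
    """Toy relevance score: number of query words present in the chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def hit_rate(corpus, eval_set, size, overlap, k=3):
    """Fraction of eval queries whose expected answer appears in the top-k chunks."""
    chunks = [c for doc in corpus for c in chunk_words(doc, size, overlap)]
    hits = 0
    for query, answer in eval_set:
        ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
        hits += any(answer.lower() in c.lower() for c in ranked[:k])
    return hits / len(eval_set)


# Made-up corpus and eval pairs of (query, text that must be retrieved).
corpus = [
    "alpha policy allows refunds within thirty days "
    + "filler " * 50
    + "beta policy requires receipts for exchanges"
]
eval_set = [
    ("alpha policy refunds", "thirty days"),
    ("beta policy receipts", "for exchanges"),
]
for size in (10, 20, 40):
    print(size, hit_rate(corpus, eval_set, size=size, overlap=2, k=1))
```

In a real system the scorer would be the production retriever and the eval set would come from labeled queries, but the sweep-and-compare structure stays the same.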
Questions about Chunking.
Overlap between chunks is usually worth using. Without it, a fact straddling a chunk boundary (e.g., 'X is defined as...' in chunk N, the actual definition in chunk N+1) might fail to retrieve when needed. 10-20% overlap (50-100 tokens for 500-token chunks) catches most boundary issues without significant index bloat.
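The index cost of overlap can be estimated with simple arithmetic: with a sliding window, each chunk after the first adds only `chunk_size - overlap` new tokens. The helper below is a sketch of that calculation; the function name and the 100,000-token document are assumptions for illustration.

```python
import math


def chunk_count(n_tokens, chunk_size, overlap):
    """Number of sliding-window chunks needed to cover n_tokens."""
    if n_tokens <= chunk_size:
        return 1 if n_tokens > 0 else 0
    stride = chunk_size - overlap  # new tokens contributed by each later chunk
    return math.ceil((n_tokens - overlap) / stride)


for overlap in (0, 50, 100):
    print(overlap, chunk_count(100_000, 500, overlap))
# 0 200
# 50 223
# 100 250
```

For this example, 10% overlap grows the index by roughly 12% and 20% overlap by 25%, which is the trade-off to weigh against fewer boundary misses.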
For most production cases, recursive character splitting on paragraph/sentence boundaries is enough. Semantic chunking (using embedding similarity to find topic boundaries) is meaningfully better for documents with clear topic shifts (research papers, policy documents) but adds preprocessing cost. Decide based on document type and measured quality lift on your eval set.
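To make the semantic approach concrete, here is a minimal sketch that starts a new chunk whenever adjacent sentences fall below a similarity threshold. The bag-of-words `toy_embed` is a stand-in for a real embedding model, and the threshold of 0.2 is an arbitrary assumption that would need tuning.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Bag-of-words counts; a stand-in for a real embedding model (assumption)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Group consecutive sentences; break where adjacent similarity drops."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])  # topic shift: start a new chunk
        else:
            chunks[-1].append(sentence)
    return [" ".join(c) for c in chunks]


sentences = [
    "The refund policy covers all online orders.",
    "Refund requests under the policy take five days.",
    "Our API uses token authentication.",
    "Each API token expires after one hour.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

With these sentences the break lands between the refund sentences and the API sentences. The preprocessing cost mentioned above comes from the one embedding call per sentence.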
For code and structured data, use structure-aware chunking. For code, chunk on function and class boundaries (tree-sitter or AST-based parsers). For markdown, chunk on header sections. For JSON, chunk on logical record boundaries. Generic text chunkers will split mid-function or mid-record, breaking semantic units that should stay together.
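As a small illustration of structure-aware chunking, the sketch below uses Python's standard `ast` module to chunk Python source by top-level definitions, plus a simple header-based splitter for markdown. Both function names are assumptions. The markdown splitter does not handle `#` lines inside fenced code blocks, and `ast.get_source_segment` does not include decorators.

```python
import ast
import re


def chunk_python_source(source):
    """Return (name, code) for each top-level function or class."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks


def chunk_markdown_by_headers(markdown):
    """Start a new chunk at every ATX header line (# through ######)."""
    sections, current = [], []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


source = (
    "import os\n\n"
    "def add(a, b):\n    return a + b\n\n"
    "class Greeter:\n    def hello(self):\n        return 'hi'\n"
)
print([name for name, _ in chunk_python_source(source)])  # ['add', 'Greeter']

md = "# Title\nintro\n## A\ntext a\n## B\ntext b"
print(chunk_markdown_by_headers(md))
```

For languages other than Python, a tree-sitter grammar plays the role that `ast` plays here.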