Most RAG systems fail at retrieval, not generation. Engineers blame the model. The problem is upstream.
The standard debugging sequence is backwards: tweak the prompt, upgrade the model, add a reranker. The actual cause is almost always the same: the wrong chunk landed in the context window. Fix what gets retrieved, and generation quality usually follows.
This paper covers the decisions that determine retrieval quality: how you chunk, how you index, and how you search. These are not configuration details. They are the architecture.
---
An embedding model maps a piece of text to a fixed-size vector. That vector is a lossy compression of the text's meaning. When you retrieve by vector similarity, you retrieve the texts whose compressed representations are closest to the compressed representation of your query.
This works when one chunk carries one coherent idea. It breaks when a chunk mixes topics, or when the idea you need is spread across multiple chunks, or when a chunk is so small that its meaning only makes sense in context.
Chunking is the process of deciding what a "piece of text" is. Many RAG failures trace back to a chunk that was too large, too small, or cut at the wrong boundary.
Fixed-size chunking splits the document into chunks of N tokens with an overlap of M tokens. It is simple, fast, and deterministic.
The overlap is load-bearing. Without it, a sentence that straddles a chunk boundary is cut in two, and neither chunk contains it whole. An overlap of roughly 10–20% (e.g., 50–100 tokens on a 512-token chunk) restores continuity at the boundary.
Fixed-size chunking is computationally efficient: finding boundaries requires no embedding calls and no boundary detection logic, so the only embedding cost at ingestion is the chunks themselves. For high-volume pipelines or datasets with uniform text structure (transcripts, logs, database records), it is often the right choice.
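The windowing described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `fixed_size_chunks` is hypothetical, and the example uses placeholder strings where a real pipeline would use the embedding model's own tokenizer output.

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: placeholder tokens stand in for real model tokens.
tokens = [f"t{i}" for i in range(1200)]
chunks = fixed_size_chunks(tokens, chunk_size=512, overlap=50)
print([len(c) for c in chunks])  # [512, 512, 276]
```

The last 50 tokens of each chunk reappear as the first 50 of the next, so a sentence cut at one boundary is still intact in the neighbouring chunk as long as it is shorter than the overlap.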
The failure mode is boundary blindness. A 512-token window cut mid-paragraph discards semantic structure the document author encoded deliberately. The chunk may be grammatically complete but semantically truncated.
A 2025 NAACL paper evaluated chunking strategies across multiple retrieval benchmarks and found that fixed-size chunking with well-chosen parameters performed comparably to more expensive methods when using high-quality embeddings. The recommendation was direct: "Just use fixed-size chunking in practice" — with the caveat that embedding model quality matters more than chunking sophistication.
Recursive chunking splits by a priority hierarchy of separators: double newlines first (paragraph boundaries), then single newlines, then sentences, then tokens. The split is applied recursively until chunks fall within the target size.
This is the approach used by LangChain's RecursiveCharacterTextSplitter (whose default separators are paragraph breaks, line breaks, spaces, and finally individual characters) and the sensible baseline for structured prose documents (articles, documentation, legal text). It respects the document's own structure: paragraphs stay together unless they exceed the size limit, in which case they split at the next finer separator.
Recursive chunking adds no embedding overhead. For structured documents it is usually a better fit than fixed-size chunking, at near-zero additional cost. It should be the default for document ingestion pipelines.
Semantic chunking embeds every sentence, computes the similarity between each pair of consecutive sentences, and splits where similarity drops below a threshold.
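A simplified sketch of the separator hierarchy follows. It is not LangChain's implementation: `recursive_split` and `SEPARATORS` are hypothetical names, sizes are measured in characters rather than tokens to keep the example dependency-free, and a separator that falls exactly on a chunk boundary is dropped.

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]  # paragraph, line, sentence, word


def recursive_split(text, max_chars=200, separators=SEPARATORS):
    """Split on the coarsest separator available, recursing only into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        return recursive_split(text, max_chars, finer)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging neighbours while they fit
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current.strip():
        chunks.append(current)
    return chunks


doc = (
    "First paragraph. It has two sentences.\n\n"
    "Second paragraph is short.\n\n" + "Long sentence " * 30
)
for chunk in recursive_split(doc, max_chars=80):
    print(len(chunk), repr(chunk[:40]))
```

The two short paragraphs fit together under the limit and stay in one chunk; only the oversized third paragraph is broken down with finer separators.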
The result is chunks that correspond to topical coherence rather than token count. Each chunk covers one idea.
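The split rule can be illustrated as below. This is a sketch under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the threshold of 0.2 is arbitrary. With a real model, only the `embed` argument and the threshold would change.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk wherever consecutive-sentence similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) < threshold:
            chunks.append(current)  # topic shift: close the running chunk
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


sentences = [
    "The cache stores attention keys.",
    "The cache also stores attention values.",
    "Pricing starts at ten dollars.",
    "Pricing tiers start at twenty dollars.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

The two cache sentences end up in one chunk and the two pricing sentences in another, because the only similarity drop below the threshold sits between them.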
The cost is real: every sentence must be embedded during ingestion just to locate boundaries. For a 100-page document with 2,000 sentences, that is 2,000 sentence embeddings (fewer API calls if you batch, but the same compute), an order of magnitude more expensive than recursive splitting. Index build time increases proportionally.
A systematic evaluation published at NAACL 2025 found that semantic chunking rarely justifies its computational cost. When high-quality embeddings are used, chunking strategy matters less than expected. The marginal retrieval gains from semantic chunking do not reliably offset the indexing overhead.
The right context for semantic chunking: infrequent indexing (you index once, query many times), retrieval quality is the dominant concern, and ingestion cost is bounded. For real-time or high-volume ingestion, recursive splitting is more practical.
Late chunking inverts the standard pipeline. Instead of chunking first and embedding each chunk independently, you run the entire document through a long-context embedding model, which retains full document context at the token level, and then split the resulting token embeddings into chunks, pooling each span into a single chunk vector.
The key property: each chunk's embedding carries information from the full document, not just its local window. A chunk that contains a pronoun ("it") retains context about what "it" refers to, even if the referent appeared 2,000 tokens earlier.
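The pooling step can be sketched as follows, assuming the long-context model has already produced one embedding per token for the whole document. The function name `late_chunk`, the tiny 2-dimensional vectors, and the use of mean pooling are illustrative assumptions.

```python
def late_chunk(token_embeddings, spans):
    """Mean-pool document-level token embeddings over each (start, end) token span."""
    chunk_vectors = []
    for start, end in spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty span ({start}, {end})")
        dim = len(window[0])
        chunk_vectors.append(
            [sum(token[d] for token in window) / len(window) for d in range(dim)]
        )
    return chunk_vectors


# Hypothetical model output: one vector per token, each computed with the
# whole document in view, so context is already baked into every row.
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
spans = [(0, 2), (2, 4)]  # chunk boundaries expressed as token offsets
print(late_chunk(token_embeddings, spans))  # [[2.0, 0.0], [0.0, 3.0]]
```

The chunk boundaries themselves can come from any splitter; what changes is that the embedding happens before the split rather than after it.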
Jina AI introduced late chunking in 2024 with empirical evidence showing consistent improvements on retrieval benchmarks, particularly on long documents where standard chunking loses inter-chunk context. The improvement is largest when documents are long and contain dense cross-references — legal documents, technical manuals, scientific papers.
Late chunking requires a long-context embedding model (e.g., jina-embeddings-v3, nomic-embed-text-v1.5). It is not applicable to standard sentence transformers with a 512-token context limit.
Anthropic published contextual retrieval in 2024 as a complementary technique to any chunking strategy. Before indexing, prepend each chunk with a brief LLM-generated summary of its position in the document: "This chunk is from Section 3 of a technical paper on transformer inference. The section covers attention mechanisms, specifically multi-head attention computation."
This addresses Document-Level Retrieval Mismatch — the failure where the retriever selects a chunk from the wrong document because two documents share superficial lexical overlap. The prepended context gives the embedding something to distinguish the chunk's provenance, not just its local content.
Anthropic reported a 49% reduction in the top-20 retrieval failure rate when contextual embeddings were combined with contextual BM25, and a 67% reduction with a reranker added. The cost is one LLM inference call per chunk during indexing, which is significant at scale but amortized over many queries.
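The indexing-time step amounts to a small wrapper around the chunk list. In this sketch, `contextualize_chunks` is a hypothetical helper and `fake_describe` is a placeholder for the LLM call; a real pipeline would send the document and the chunk to a model and use its response as the context.

```python
def contextualize_chunks(document, chunks, describe):
    """Prepend a short situating context to each chunk before it is embedded or BM25-indexed."""
    contextualized = []
    for chunk in chunks:
        context = describe(document, chunk).strip()
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized


def fake_describe(document, chunk):
    # Placeholder for an LLM call (assumption): it only reports the title line.
    title = document.splitlines()[0]
    return f"This chunk is from the document titled '{title}'."


document = "Transformer Inference Notes\nSection 3 covers attention mechanisms."
chunks = ["Multi-head attention splits the projection into several heads."]
print(contextualize_chunks(document, chunks, fake_describe)[0])
```

The contextualized string is what gets embedded and indexed; keeping the original chunk text alongside it lets you pass the unmodified chunk to the generator if you prefer.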
Chunk size is not a dial you tune until it feels right. It encodes an assumption about where the answer lives.
Small chunks (128–256 tokens) capture precise spans. Retrieval recall is high — the right sentence is likely in the index — but each chunk lacks the context to be interpreted in isolation. The answer exists; it cannot be understood.
Large chunks (1,024+ tokens) carry context but compress multiple topics into one vector. The embedding averages across the chunk's content, diluting the signal for any specific sub-topic. You retrieve the right page but not the right answer.
The critical mismatch: your embedding model has a context window too. Most sentence transformers have a 512-token maximum context. A 1,024-token chunk fed to a 512-token embedding model gets silently truncated — the second half of the chunk is discarded. You index what you think you indexed, but the embeddings only cover half of it.
Check your embedding model's context window before choosing chunk size. They must be compatible. This is the single most common silent misconfiguration in production RAG systems.
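A pre-indexing check along these lines can catch the mismatch. The helper names are hypothetical, and the whitespace token counter is a rough stand-in: a real check should count tokens with the embedding model's own tokenizer, which usually produces more tokens than whitespace splitting does.

```python
def whitespace_token_count(text):
    """Rough stand-in for the model tokenizer (assumption, undercounts in practice)."""
    return len(text.split())


def find_truncated_chunks(chunks, model_max_tokens, count_tokens=whitespace_token_count):
    """Return (index, token_count, tokens_lost) for every chunk the model would truncate."""
    report = []
    for index, chunk in enumerate(chunks):
        n_tokens = count_tokens(chunk)
        if n_tokens > model_max_tokens:
            report.append((index, n_tokens, n_tokens - model_max_tokens))
    return report


chunks = ["word " * 300, "word " * 1024]
print(find_truncated_chunks(chunks, model_max_tokens=512))  # [(1, 1024, 512)]
```

Running this over the chunk set before building the index turns a silent truncation into an explicit report.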
A 2025 multi-dataset analysis on chunk size and retrieval found that optimal chunk size varies significantly by document type and answer locality. Datasets with short, precise answers favor small chunks; datasets requiring multi-sentence reasoning favor larger chunks. The practical implication: tune chunk size against your actual data, not against a default.
---
Once chunked and embedded, your vectors live in an index. The index determines query latency and retrieval recall. Both matter in production.
A flat index performs exact nearest-neighbor search. Every query computes cosine similarity against every vector in the index. Recall is 100%: you will always find the closest vector. Latency scales linearly with corpus size.
Flat indexes are correct for development, prototyping, and small corpora (under ~100K vectors). They become unusable at production scale. At 10M vectors, a single query touches every vector in the index — hundreds of milliseconds of compute per request.
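Exact search is short enough to write out in full, which also makes the linear cost visible: one similarity computation per stored vector, per query. The names below are illustrative, and a production flat index would use vectorized math rather than Python loops.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def flat_search(query, vectors, k=3):
    """Exact search: score every stored vector, then keep the k best."""
    scored = [(cosine(query, vector), index) for index, vector in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(index, score) for score, index in scored[:k]]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
hits = flat_search([1.0, 0.1], vectors, k=2)
print([index for index, _ in hits])  # [0, 2]
```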
Hierarchical Navigable Small World (HNSW) is a graph-based approximate nearest-neighbor index. It trades a small recall loss (typically 1–5%) for dramatically faster query time, often around a millisecond or less on millions of vectors.
HNSW is the dominant production index. Qdrant, Weaviate, and ChromaDB all use it, among others. Two parameters determine its behavior:
- ef_construct: the number of candidate nodes considered when building the graph. Higher values improve recall at the cost of build time. Common production value: 200 (library defaults vary).
- m: the number of bidirectional links per node. Higher values improve recall but increase memory. Common production value: 16.

Build time with high ef_construct can be significant for large corpora, so plan for it. Query time remains low regardless.
HNSW is the right default for production deployments with more than 100K vectors.
Dense retrieval (vector search) captures semantic similarity. A query for "transformer memory efficiency" retrieves chunks about KV cache optimization, even if those chunks never use the word "transformer."
It fails on exact matches. Product codes, proper nouns, rare technical terms, and precise identifiers have thin embedding signal — the model has seen few examples and represents them poorly. A query for "TK-421" returns whatever is semantically nearest, not the chunk that contains "TK-421."
BM25 catches what dense retrieval misses. BM25 is a lexical scoring function: it ranks documents by the query terms they contain, weighting each match by term frequency, term rarity across the corpus, and document length. It has no understanding of meaning, but its signal is precise for exact-match lookups.
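A compact BM25 scorer shows why an identifier like "TK-421" is easy for sparse retrieval. This is a sketch: `bm25_scores` is a hypothetical helper, tokenization is plain whitespace splitting, the parameters k1=1.5 and b=0.75 are conventional choices rather than values from the text, and the IDF form is the non-negative variant used by Lucene-style implementations.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # only exact token matches contribute
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "replace the tk-421 unit before startup",
    "the unit reduces memory use during inference",
    "kv cache optimization lowers memory pressure",
]
print(bm25_scores("tk-421", docs))
```

Only the first document gets a non-zero score, because it is the only one that contains the exact token.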
Production RAG systems benefit from running both in parallel and fusing the results. The ACL 2025 paper on operational advice for dense and sparse retrievers found that the optimal index choice depends on corpus size and query characteristics — with hybrid approaches consistently outperforming single-mode retrieval across diverse query types.
---
Hybrid search runs a dense vector query and a BM25 sparse query in parallel, then fuses the two ranked result lists into a single ranking.
The standard fusion algorithm is Reciprocal Rank Fusion (RRF). For each document, its RRF score is the sum of 1 / (k + rank_i) across result lists, where k is a smoothing constant (typically 60). RRF is parameter-light, robust to score scale differences between the two lists, and consistently effective.
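The formula translates directly into code. The sketch below assumes each retriever returns an ordered list of chunk identifiers with rank 1 as the best hit; the function name `rrf_fuse` and the identifiers are illustrative.

```python
from collections import defaultdict


def rrf_fuse(result_lists, k=60):
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense = ["c3", "c1", "c7"]   # ranked output of the vector query
sparse = ["c1", "c9", "c3"]  # ranked output of the BM25 query
print(rrf_fuse([dense, sparse]))  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that appear in both lists rise to the top, and only ranks are used, so the incompatible score scales of cosine similarity and BM25 never need to be normalized.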
Measured results on hybrid search in production RAG systems show a substantial lift over single-mode retrieval. One 2026 analysis on Qdrant with a 100K document corpus reported:
| Configuration | Recall@5 | MRR |
|---|---|---|
| Dense-only | 0.72 | 0.61 |
| BM25-only | 0.68 | 0.57 |
| Hybrid RRF | 0.84 | 0.74 |
| Hybrid RRF + reranker | 0.91 | 0.83 |
The lift from hybrid to hybrid + reranker (a cross-encoder scoring each retrieved chunk against the query) is real but expensive. Reranking adds one forward pass per retrieved chunk. For top-10 retrieval before reranking, that is 10 model calls per query. Batch and parallelize if you add a reranker.
Hybrid search is the fastest single improvement you can make to an existing dense-only RAG pipeline. Add a BM25 index alongside your vector store, fuse with RRF at k=60, and measure. In the results above, Recall@5 rose from 0.72 to 0.84, a relative gain of about 17%; some practitioners report 30–50%, so expect the size of the lift to depend on your corpus and queries.
---
The embedding model is a first-class architectural choice, not a default you set once and forget.
An embedding model trained on general web text represents "revenue growth" and "sales increase" as nearby vectors. It represents "HER2-positive" and "receptor tyrosine kinase" as nearby vectors only if its training corpus included medical literature.
Using a general-purpose embedding model for domain-specific content — legal, medical, financial, scientific — without evaluating its performance on that domain is one of the most common production RAG failures. The system retrieves chunks that are semantically similar in general language but wrong for your domain.
Evaluate embedding models against your actual corpus before committing to an index. Changing the embedding model after indexing requires re-embedding the entire corpus. Track which model version produced each set of vectors. When you upgrade, re-index.
The MTEB (Massive Text Embedding Benchmark) leaderboard provides standardized scores across retrieval, classification, and clustering tasks. Use it as a starting point for model selection, not as a final answer — MTEB scores do not substitute for evaluation on your own data.
---
The most important operational principle in RAG is to measure retrieval separately from generation.
Build a retrieval evaluation set: a sample of queries with known-correct chunk identifiers. Compute Recall@k (is the right chunk in the top-k results?) and MRR (where does the right chunk rank?). Run this against every chunking and indexing change.
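Both metrics are a few lines each. The sketch assumes one known-correct chunk per query and retrieval results stored as ranked lists of chunk identifiers; the function names and sample data are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of queries whose correct chunk appears in the top-k results."""
    hits = sum(1 for qid, gold in relevant.items() if gold in retrieved.get(qid, [])[:k])
    return hits / len(relevant)


def mean_reciprocal_rank(retrieved, relevant):
    """Average of 1/rank of the correct chunk, counting 0 when it was not retrieved."""
    total = 0.0
    for qid, gold in relevant.items():
        ranking = retrieved.get(qid, [])
        if gold in ranking:
            total += 1.0 / (ranking.index(gold) + 1)
    return total / len(relevant)


# Evaluation set: query id -> the chunk id that answers it.
relevant = {"q1": "c4", "q2": "c9", "q3": "c2"}
# Retriever output: query id -> ranked chunk ids.
retrieved = {
    "q1": ["c4", "c1", "c7"],
    "q2": ["c3", "c9", "c5"],
    "q3": ["c8", "c6", "c0"],
}
print(recall_at_k(retrieved, relevant, k=3))      # 2 of 3 queries hit
print(mean_reciprocal_rank(retrieved, relevant))  # (1 + 1/2 + 0) / 3
```

Storing these two numbers for every chunking or indexing change gives you a baseline to compare against before touching the generation prompt.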
If your LLM produces wrong answers, check whether the right chunk was retrieved before changing anything else. In most cases it was not. Tuning the generation prompt to fix a retrieval failure does not work — the model cannot generate correct answers from incorrect context.
Fix retrieval first. The generation model is the last place to look.
---
| Scenario | Chunking | Index | Retrieval |
|---|---|---|---|
| Uniform short docs (FAQs, records) | Fixed-size, 256–512 tokens | HNSW | Dense-only |
| Mixed structured docs | Recursive | HNSW | Dense-only |
| Long technical/legal docs | Late chunking or contextual | HNSW | Hybrid RRF |
| High exact-match query load | Recursive + contextual | HNSW + BM25 | Hybrid RRF |
| Quality-critical, slow ingestion | Semantic or late chunking | HNSW | Hybrid RRF + reranker |
There is no universal optimal configuration. The right combination depends on your document structure, query distribution, and latency budget. The table above is a starting point, not a prescription.
What is universal: measure retrieval quality directly, tune against that measurement, and treat chunking and indexing as first-class architectural decisions — not defaults you accept from a framework.
---
1. Qu, R. et al. (2025). "Is Semantic Chunking Worth the Computational Cost?" Findings of NAACL 2025. https://aclanthology.org/2025.findings-naacl.114.pdf
2. Aumiller, D. et al. (2026). "A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity." arXiv:2603.06976. https://arxiv.org/html/2603.06976
3. Karpukhin, V. et al. (2026). "Chunking Methods on Retrieval-Augmented Generation – Effectiveness Evaluation Against Computational Cost and Limitations." arXiv:2606.00881. https://arxiv.org/html/2606.00881
4. Günther, M. et al. (2024). "Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models." Jina AI. arXiv:2409.04701. https://arxiv.org/pdf/2409.04701
5. Anthropic. (2024). "Contextual Retrieval." Anthropic Blog. https://www.anthropic.com/news/contextual-retrieval
6. Lin, J. et al. (2025). "Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?" ACL 2025 Industry Track. https://aclanthology.org/2025.acl-industry.61.pdf
7. Huang, Z. (2026). "Hybrid Search for RAG: Combining BM25 and Dense Vector Search." Denser AI Blog. https://denser.ai/blog/hybrid-search-for-rag/
8. TopReviewed.ai. (2026). "Hybrid Search RAG in Production: BM25 + Dense Vectors + RRF (With Measured Results)." https://topreviewed.ai/blog/hybrid-search-rag-in-production-bm25-dense-vectors-rrf-with-measured-results
9. Casey, M. (2025). "RAG Failure Modes: Common Pitfalls and Solutions." Snorkel AI Blog. https://snorkel.ai/blog/retrieval-augmented-generation-rag-failure-modes-and-how-to-fix-them/
10. Teymoori, A. (2025). "Production RAG Systems with Hybrid Search." https://amirteymoori.com/building-production-rag-systems-with-hybrid-search-in-2025/