Most RAG debugging sessions look the same. The LLM gives a wrong answer. The engineer tweaks the prompt. Nothing changes. They try a smarter model. Still wrong.
The actual problem was settled at chunk boundaries, three steps earlier.
When a RAG pipeline fails, the cause is almost always retrieval — not generation. The LLM can only work with what it receives. If the retrieved chunks don't contain the answer, no amount of prompt engineering will surface it. The model hallucinates because you gave it nothing better to say.
Anthropic quantified this directly: combining contextual embeddings, BM25, and reranking cut the top-20-chunk retrieval failure rate from 5.7% to 1.9%, a 67% reduction in failed retrievals. The generation model didn't change. The retrieval did.
Cosine similarity is not relevance. It is a geometric proxy for semantic overlap, trained on a specific objective, on a specific corpus, regularized in a specific way. Outside that distribution, it drifts.
A 2026 paper on bi-encoder failure modes found 100% failure rates on negation, numerical, and temporal queries. Not degraded performance — total failure. The embedding model encoded topic proximity, not logical relationship. Cross-encoders, which compare query and document jointly, eliminate these failures entirely. But cross-encoders are slow. So most pipelines use bi-encoders and accept the blind spots.
Knowing what your embedding model cannot do is as important as knowing what it can.
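Finding those blind spots can start with a very small probe. The sketch below assumes a hypothetical `embed` callable and a hand-written probe set. `toy_embed` is a bag-of-words stand-in included only so the example runs; it is not a bi-encoder, so it shows the shape of the test rather than any real model's behavior. Swap in your own embedding call and your own cases.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedder (assumption): bag-of-words counts. Replace with your model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def failure_rate(embed, cases):
    """Fraction of cases where the distractor scores at least as high as the answer."""
    failures = 0
    for query, right_doc, wrong_doc in cases:
        q = embed(query)
        if cosine(q, embed(wrong_doc)) >= cosine(q, embed(right_doc)):
            failures += 1
    return failures / len(cases)


# Hypothetical probe set: (query, chunk that answers it, on-topic distractor).
NEGATION_CASES = [
    ("which plans do not include phone support",
     "the basic plan has no phone support",
     "these plans include phone support"),
]

print(failure_rate(toy_embed, NEGATION_CASES))
```

With the stand-in embedder, the distractor wins because it shares more words with the query, which is the topic-over-logic pattern described above. Whether your real model does the same is exactly what the probe is for.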
Fixed-size chunking is a convenience, not a design. Splitting a document into 512-token windows with 64-token overlap is a default borrowed from early experiments, not a considered decision for your data.
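For reference, the default being described is roughly the following. This is a minimal sketch that assumes the document is already tokenized into a list; the function name is illustrative, and real pipelines would count tokens with the embedding model's own tokenizer.

```python
def fixed_size_chunks(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# A 1,000-token document becomes three windows; the last one is shorter.
chunks = fixed_size_chunks(list(range(1000)))
print([len(c) for c in chunks])
```

Note that nothing in this function looks at the text. Boundaries fall wherever the token count says they fall, including mid-sentence or mid-table.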
A 2025 NAACL paper found that semantic chunking, widely believed to be worth the computational cost, does not consistently outperform fixed-size chunking. The gains depend on data structure, query type, and embedding model. There is no free win. Sentence chunking matches semantic chunking for many tasks at a fraction of the cost, with performance peaking at a context length of around 2,500 tokens.
The right chunk size is determined by your documents and your queries. Not by the default in your framework.
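A sentence-level chunker is cheap to try as a baseline. The sketch below is one possible version, not a reference implementation: it splits on end-of-sentence punctuation with a regular expression, counts whitespace-separated words as tokens, and uses an arbitrary `max_tokens` budget. All three are assumptions to replace with your own sentence splitter, tokenizer, and measured budget.

```python
import re


def sentence_chunks(text, max_tokens=200):
    """Pack whole sentences into chunks of at most `max_tokens` whitespace tokens.

    A single sentence longer than the budget is kept whole as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))  # budget reached: close the chunk
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


print(sentence_chunks("One two three. Four five six. Seven eight nine.", max_tokens=6))
```

Because `max_tokens` is a single parameter, sweeping it against your own recall numbers is straightforward.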
Increasing the number of retrieved chunks (top-k) improves recall but hurts precision. More chunks mean more noise. For long-document QA, accuracy saturates around k = 10; beyond that, irrelevant content enters the context and degrades generation. NeurIPS 2024 work on RankRAG demonstrated this saturation empirically.
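The trade-off is easy to see once both metrics are computed over a sweep of k. The sketch below uses the standard definitions of recall@k and precision@k; the chunk ids and relevance labels are made up for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    relevant = set(relevant_ids)
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


# Hypothetical retriever output for one query, best match first.
ranked = ["c3", "c7", "c1", "c9", "c4", "c8"]
relevant = {"c3", "c1"}

for k in (1, 3, 6):
    print(k, recall_at_k(ranked, relevant, k), round(precision_at_k(ranked, relevant, k), 2))
```

On this made-up ranking, recall reaches 1.0 by k = 3, while precision falls from 1.0 at k = 1 to 0.33 at k = 6. Every chunk past the third is pure noise for the generator.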
This is a fundamental tension: you cannot retrieve broadly and precisely at the same time with a single-vector model. Single-vector embeddings are also limited by their dimensionality: at a fixed dimension, some combinations of documents can never be returned together as the top-k set, however the query is phrased.
1. Evaluate retrieval separately. Measure chunk recall@k before measuring answer quality. If recall is low, fix retrieval. Do not touch the prompt.
2. Add a reranker. Running a cross-encoder reranker over the top-50 candidates and passing the top 10 to the LLM consistently outperforms vanilla top-10 retrieval.
3. Test your embedding model on your query types. Negation queries, numerical comparisons, temporal reasoning — run these against your retriever before deploying.
4. Tune chunk size to your corpus. Start with sentence-level chunks. Measure. Adjust.
5. Use hybrid retrieval. BM25 catches keyword matches that dense embeddings miss. Combine them.
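Steps 2 and 5 fit together in a few lines. The sketch below is one possible wiring, under stated assumptions: `bm25_search`, `dense_search`, and `rerank_score` are hypothetical callables you would back with your own index and cross-encoder, and reciprocal rank fusion with the commonly used constant k = 60 is just one way to combine the two rankings, not something prescribed above.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: each id scores the sum of 1 / (k + rank) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def retrieve_then_rerank(query, bm25_search, dense_search, rerank_score,
                         candidates=50, final=10):
    """Hybrid retrieval over `candidates` chunks, then rerank and keep `final`."""
    fused = reciprocal_rank_fusion([
        bm25_search(query, candidates),   # hypothetical: returns ranked chunk ids
        dense_search(query, candidates),  # hypothetical: returns ranked chunk ids
    ])[:candidates]
    reranked = sorted(fused, key=lambda doc_id: rerank_score(query, doc_id),
                      reverse=True)
    return reranked[:final]


# Stub backends, for illustration only.
def stub_bm25(query, n):
    return ["a", "b", "c"][:n]


def stub_dense(query, n):
    return ["c", "a", "d"][:n]


def stub_rerank(query, doc_id):
    return {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.2}[doc_id]


print(retrieve_then_rerank("example query", stub_bm25, stub_dense, stub_rerank, final=2))
```

Fusing on ranks rather than raw scores avoids having to calibrate BM25 scores against cosine similarities, which live on different scales. The reranker then only has to score the fused candidates, which keeps the slow model off the full corpus.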
The LLM is not the bottleneck. The retriever is. Fix that first.