
Embeddings Are Not Optional

inspiration | devinfo.dev | May 31, 2026 | devinfo.dev:2026.0017

Every RAG pipeline, semantic search index, and similarity feature runs on embeddings. The generation model gets the credit. The embedding model does the work.

An embedding model converts text into a fixed-length vector of floats. That vector encodes meaning — not syntax, not keywords, but semantic content. Two sentences about the same concept land close together in vector space, even if they share no words.
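To make "close together in vector space" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three vectors are hand-written toy values chosen for illustration, not output from any real embedding model, and all names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors standing in for real embeddings (assumed values).
reset_password = [0.9, 0.1, 0.0, 0.2]  # "How do I reset my password?"
forgot_login = [0.8, 0.2, 0.1, 0.3]    # "I forgot my login credentials"
pasta_recipe = [0.0, 0.1, 0.9, 0.1]    # "Best recipe for carbonara"

related = cosine_similarity(reset_password, forgot_login)
unrelated = cosine_similarity(reset_password, pasta_recipe)
print(f"related: {related:.3f}, unrelated: {unrelated:.3f}")
```

The first two sentences share no keywords, yet their toy vectors score far higher against each other than against the unrelated one. A real model produces hundreds or thousands of dimensions, but the comparison works the same way.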

This matters because retrieval quality sets a hard ceiling on generation quality. A language model cannot reason over context it never received. Garbage in, hallucination out. The embedding model determines what goes in.

The MTEB Signal

The Massive Text Embedding Benchmark (MTEB) evaluates models across 8 task categories: retrieval, classification, clustering, semantic similarity, reranking, pair classification, summarization, and bitext mining. Its original release spans 58 datasets and 112 languages, and the benchmark has grown since.

The leaderboard as of early 2026 shows the top open-weight performers: Harrier-OSS-v1-27B (74.3), NV-Embed-v2 (72.31), Qwen3-Embedding-8B (70.58). These numbers compress enormous variation — a model that ranks well on semantic similarity may rank poorly on retrieval. Read task-specific scores, not the aggregate.

For most engineers, the aggregate is a trap. Pick the benchmark category that matches your actual workload.
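A small sketch of why the aggregate can mislead. The model names and scores below are made-up placeholders for illustration, not real leaderboard numbers, and the helper functions are hypothetical.

```python
# Placeholder per-task scores. These are NOT real MTEB results.
scores = {
    "model_a": {"retrieval": 55.0, "sts": 85.0, "clustering": 50.0},
    "model_b": {"retrieval": 62.0, "sts": 78.0, "clustering": 47.0},
}


def best_by_average(scores):
    """Pick the model with the highest mean score across all tasks."""
    return max(scores, key=lambda m: sum(scores[m].values()) / len(scores[m]))


def best_for_task(scores, task):
    """Pick the model with the highest score on one task category."""
    return max(scores, key=lambda m: scores[m][task])


print("best on average:  ", best_by_average(scores))
print("best on retrieval:", best_for_task(scores, "retrieval"))
```

With these placeholder numbers, model_a wins on the average while model_b wins on retrieval. A RAG workload would want model_b, and the aggregate alone would have hidden that.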

What Runs Locally

Three models stand out for local inference:

nomic-embed-text-v2 (MoE architecture, 305M active parameters out of 475M total, ~500MB): scores 52.86 on BEIR and 65.80 on MIRACL. Supports 100+ languages. Runs in Ollama. Needs 4–8GB RAM. Nomic describes it as the first general-purpose mixture-of-experts text embedding model.

BGE-M3 (568M parameters, >2GB): scores 48.80 on BEIR, 69.20 on MIRACL. Multi-granularity — supports dense, sparse, and multi-vector retrieval from a single model. Best if you need multilingual retrieval quality above all else.

all-MiniLM-L6-v2 (22M parameters, ~90MB of weights, 384 dimensions): the fast, lean baseline. Not a leaderboard contender. Runs on 4GB RAM. Adequate for constrained environments or prototyping.

Nomic Embed v1.5 added support for Matryoshka Representation Learning, a training technique that lets you truncate embedding dimensions at inference time with minimal accuracy loss. Useful when storage or index size is a constraint.
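Truncating a Matryoshka-style embedding can be sketched as: keep the leading dimensions, then re-normalize to unit length so cosine and dot-product scores stay comparable. This is a minimal illustration under assumptions. The function name is hypothetical, the input vector is a toy value, and the trick only preserves quality for models actually trained with Matryoshka Representation Learning.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` values of `vec` and rescale to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Toy 4-dimensional vector standing in for a real embedding (assumed values).
full = [3.0, 4.0, 12.0, 0.0]
short = truncate_embedding(full, 2)
print(short)  # two values, unit length
```

Halving the dimensions halves the storage per vector. How much retrieval quality survives depends on the model, so measure on your own data before committing to a size.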

The Right Mental Model

Embedding models are not a preprocessing step. They are the indexing layer of your AI system. Choose them with the same care you choose a database engine.

Dimensions, context window, retrieval task performance, inference speed, and memory footprint are not footnotes; they are the spec. A 1024-dimension model indexing 10 million documents needs about 41 GB for the raw float32 vectors alone, before any index overhead. A model with a 512-token context window silently truncates anything longer.
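The memory cost is simple arithmetic, sketched here. The function name is hypothetical, and the estimate covers raw vector storage only: it assumes one vector per document and ignores index structures, metadata, and replication, all of which add to the total.

```python
def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw storage for the vectors alone (float32 = 4 bytes per value)."""
    return num_vectors * dims * bytes_per_value


docs = 10_000_000

full = index_size_bytes(docs, 1024)                     # float32, 1024 dims
half = index_size_bytes(docs, 1024, bytes_per_value=2)  # float16, 1024 dims
small = index_size_bytes(docs, 256)                     # float32, 256 dims

for label, size in [("1024d float32", full), ("1024d float16", half), ("256d float32", small)]:
    print(f"{label}: {size / 1e9:.2f} GB")
```

At 10 million vectors, 1024 float32 dimensions come to 40.96 GB, half that in float16, and a quarter of it at 256 dimensions. If documents are chunked, the vector count is the number of chunks, not the number of documents.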

Match the model to the workload. Run it locally. Know its numbers.

References

1. Muennighoff, N. et al. (2023). MTEB: Massive Text Embedding Benchmark. EACL 2023. https://arxiv.org/abs/2210.07316

2. Nussbaum, Z. et al. (2024). Nomic Embed: Training a Reproducible Long Context Text Embedder. arXiv:2402.01613. https://arxiv.org/abs/2402.01613

3. Nomic AI. (2025). Nomic Embed Text V2: An Open Source, Multilingual, Mixture-of-Experts Embedding Model. https://www.nomic.ai/blog/posts/nomic-embed-text-v2

4. Chen, J. et al. (2024). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216. https://arxiv.org/abs/2402.03216

5. MTEB Leaderboard (2026). Hugging Face. https://huggingface.co/spaces/mteb/leaderboard

6. Awesome Agents. (2026). Embedding Model Leaderboard: MTEB Rankings March 2026. https://awesomeagents.ai/leaderboards/embedding-model-leaderboard-mteb-march-2026/