An embedding is not a summary of meaning. It's a position. That distinction sounds pedantic until it explains every weird thing your retriever does.
When an embedding model turns a chunk of text into a vector of, say, 1,024 numbers, it's not compressing the meaning into those numbers the way a human would write an abstract. It's placing the text at a point in space such that texts the model considers similar end up nearby. The whole apparatus — the embedding model, the distance metric, the index — exists to answer one geometric question fast: given this query point, what are the nearest document points? Everything in this post is a consequence of taking that literally.
What the model is actually optimizing
Embedding models are trained on pairs. "This question goes with this answer." "This sentence paraphrases that one." Contrastive training pulls matched pairs together in the space and pushes mismatched ones apart. After enough pairs, the geometry encodes whatever notion of similarity the training data rewarded.
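To make "pulls matched pairs together and pushes mismatched ones apart" concrete, here is a minimal sketch of one common contrastive objective, an in-batch softmax loss often called InfoNCE. It is an illustration under assumptions, not the recipe of any particular model: the function name, the temperature value, and the toy two-dimensional vectors are all made up for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce_loss(queries, passages, temperature=0.05):
    """Mean in-batch contrastive loss; queries[i] is the match for passages[i].

    Every other passage in the batch acts as a mismatched (negative) example.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, p) / temperature for p in passages]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Loss is low when the matched passage dominates the softmax.
        total += log_denominator - logits[i]
    return total / len(queries)

# Matched pairs already point the same way: loss is near zero.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Matched pairs point in different directions: loss is large.
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

Minimizing this loss moves each query toward its own passage and away from the others in the batch, so what counts as "similar" is decided entirely by which pairs the training data contains.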
That's why an embedding tuned for one job can flop at another. A model trained mostly on web Q&A learns that questions and their answers belong together — great for search. A model trained on duplicate-detection learns that paraphrases belong together — which is subtly different, and will rank a reworded version of your query above the passage that actually answers it. There is no universal "semantic similarity." There's only the similarity some model was paid to learn.
So "which embedding model?" isn't a leaderboard question, even though MTEB (the Massive Text Embedding Benchmark) gives you a tempting leaderboard. MTEB is the right starting point and the wrong stopping point. Use it to build a shortlist; at the time of writing, the top of the board is a mix of open models like BGE-M3, E5, and GTE, and closed APIs like OpenAI's text-embedding-3-large, Cohere's Embed line, and Voyage. Then ignore the ranking and test the shortlist on your data, because a model that wins on academic retrieval can lose on your legal contracts or your support tickets or your codebase.
A few things that matter more than leaderboard rank:
- Domain. Code, legal, biomedical, multilingual — specialized or instruction-tuned models often beat a higher-ranked generalist on the domain they were built for.
- Asymmetry. Many models distinguish a short query from a long passage and want you to prefix or call them differently. Embed both sides the same way and you quietly lose accuracy.
- Max sequence length. If the model truncates at 512 tokens and your chunks run 800, you're embedding the first 512 tokens and silently discarding the remaining 288, more than a third of each chunk.
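The asymmetry and sequence-length points can both be checked before you index anything. The sketch below is hypothetical throughout: `prepare_inputs` and `truncation_report` are illustrative helpers, the "query: " and "passage: " strings follow the prefix convention that E5-family models document (your model card may specify something else, or nothing), and the token counts are assumed to come from your model's own tokenizer.

```python
def prepare_inputs(texts, prefix):
    """Prepend the role prefix a model expects (hypothetical helper)."""
    return [prefix + text for text in texts]

def truncation_report(token_counts, max_tokens):
    """Return (chunks over the limit, fraction of all tokens that get cut off)."""
    over = sum(1 for n in token_counts if n > max_tokens)
    total = sum(token_counts)
    dropped = sum(max(0, n - max_tokens) for n in token_counts)
    fraction = dropped / total if total else 0.0
    return over, fraction

# Queries and passages get different prefixes, so they are embedded differently.
queries = prepare_inputs(["how do I reset my password"], "query: ")
passages = prepare_inputs(["To reset your password, open Settings."], "passage: ")

# Token counts per chunk, assumed to be measured with the model's tokenizer.
over, dropped = truncation_report([800, 300, 512], max_tokens=512)
print(queries[0], "|", passages[0])
print(over, "chunk(s) over the limit;", f"{dropped:.1%} of tokens never embedded")
```

Running a report like this over a sample of real chunks is a cheap way to find out whether truncation is a rounding error or a systematic loss.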
Dimensions, and the cost nobody budgets for
Bigger vectors can hold more, and they cost more: in storage, in memory, and in every distance computation you'll ever run. A 3,072-dimension vector is twice the size of a 1,536-dimension one, and across millions of chunks that doubling can be the difference between a search index that fits in RAM and one that doesn't.
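The arithmetic is worth doing once. This sketch assumes 32-bit floats (4 bytes per value) and counts only the raw vectors; index structures, metadata, and replicas add more on top. The corpus size of ten million chunks is an arbitrary example.

```python
GIB = 1024 ** 3

def raw_vector_bytes(num_vectors, dims, bytes_per_value=4):
    """Bytes needed to hold the vectors alone, with no index overhead."""
    return num_vectors * dims * bytes_per_value

for dims in (1536, 3072):
    size_gib = raw_vector_bytes(10_000_000, dims) / GIB
    print(f"{dims} dims: {size_gib:.1f} GiB")
```

Plug in your own chunk count and the memory of the machines you actually plan to run on; the answer tells you whether dimension is a detail or a constraint.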
The interesting development is that you often don't have to choose up front. Models trained with Matryoshka representation learning pack a coarse-to-fine ordering into the vector: the first 256 numbers are a usable embedding on their own, the first 512 are better, the full 1,536 best. You can truncate to a shorter prefix and keep most of the quality. That turns dimension into a tuning knob — index at full width, search a truncated width, or store short and rerank long. If your model supports it, it's close to a free lever.
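Mechanically, truncating a Matryoshka-style embedding is just slicing off a prefix and rescaling it back to unit length so cosine and dot product still behave. A minimal sketch, with a hypothetical helper name and a toy vector; it only preserves quality if the model was actually trained this way.

```python
import math

def truncate_and_renormalize(vec, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    prefix = vec[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # nothing to rescale
    return [x / norm for x in prefix]

full = [3.0, 4.0, 12.0]  # toy stand-in for a full-width embedding
short = truncate_and_renormalize(full, 2)
print(short)
```

Queries and documents have to be truncated to the same width, and some hosted APIs expose a dimensions option that does the shortening for you; check the model's documentation rather than assuming.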
Distance: pick the one your model was trained for
Three metrics show up: cosine similarity (angle between vectors), dot product (angle and magnitude), and Euclidean distance (straight-line gap). The right answer is almost never a matter of taste — it's a matter of matching whatever the model was trained with. Most modern text embedders normalize to unit length and train for cosine, at which point cosine and dot product rank things identically. Use what the model card says. Picking a metric the model wasn't trained on is a quiet way to degrade everything downstream and never know why.
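A small experiment shows both halves of that claim: on raw vectors the metrics can disagree, and after normalizing to unit length the dot product reproduces the cosine ranking. The vectors are toy values chosen so that one document is long but points in a worse direction.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

def ranking(query, docs, score):
    """Document indexes ordered from best to worst under the given score."""
    return sorted(range(len(docs)), key=lambda i: score(query, docs[i]), reverse=True)

query = [1.0, 0.2]
docs = [[1.0, 0.0], [3.0, 3.0], [0.0, 0.5]]

raw_dot = ranking(query, docs, dot)        # magnitude lets the long vector win
raw_cos = ranking(query, docs, cosine)     # angle only
unit_dot = ranking(normalize(query), [normalize(d) for d in docs], dot)

print(raw_dot, raw_cos, unit_dot)
```

If your store is configured for dot product and the model does not emit unit-length vectors, the first ranking is the one you get, whether or not the model was trained for it.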
The index is an approximation, on purpose
Here's the part that surprises people: your vector store does not find the true nearest neighbors. It finds probably the nearest neighbors, fast, and calls it close enough.
Exact search means comparing the query against every vector — fine for ten thousand chunks, hopeless at fifty million. So vector stores use approximate nearest neighbor (ANN) indexes that trade a sliver of accuracy for orders of magnitude of speed.
The dominant index is HNSW (Hierarchical Navigable Small World), which builds a layered graph you can hop through to reach a query's neighborhood in a few jumps. It's fast and accurate, and it's memory-hungry, because the graph typically lives in RAM. IVF (inverted file) splits the space into clusters and searches only the few nearest ones; cheaper memory, a bit more tuning. Flat is exact brute force, perfectly fine until your corpus isn't small. The approximate indexes expose knobs, such as HNSW's ef_search and IVF's number of probes, that trade recall for latency. Turn them down and search gets faster and starts missing real neighbors. There's no free setting; there's only the recall you've decided you can live with.
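The recall-versus-latency trade is easy to see with a toy IVF-style index measured against exact search. Everything here is a simplified sketch: the centroids are sampled at random instead of trained with k-means, the data is uniform random 2-D points, and `nprobe` is simply the name this example gives to the number of clusters searched.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_top_k(query, vectors, k):
    """Flat search: score every vector."""
    return sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:k]

def build_ivf(vectors, centroids):
    """Assign each vector to the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        lists[nearest].append(i)
    return lists

def ivf_top_k(query, vectors, centroids, lists, k, nprobe):
    """Search only the `nprobe` clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]

def recall_at_k(approx, exact):
    return len(set(approx) & set(exact)) / len(exact)

rng = random.Random(0)
vectors = [[rng.random(), rng.random()] for _ in range(2000)]
centroids = rng.sample(vectors, 16)  # crude stand-in for trained centroids
lists = build_ivf(vectors, centroids)
queries = [[rng.random(), rng.random()] for _ in range(50)]

def mean_recall(nprobe, k=10):
    scores = []
    for q in queries:
        exact = exact_top_k(q, vectors, k)
        approx = ivf_top_k(q, vectors, centroids, lists, k, nprobe)
        scores.append(recall_at_k(approx, exact))
    return sum(scores) / len(scores)

for nprobe in (1, 4, 16):
    print("nprobe =", nprobe, "mean recall@10 =", round(mean_recall(nprobe), 3))
```

Probing all 16 clusters is exact search with extra steps; probing fewer does less work and can miss neighbors that sit just across a cluster boundary. The same measurement, approximate results scored against a brute-force baseline on a sample of real queries, is how you would choose ef_search or the probe count in a real store.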
Quantization is the other lever. Scalar or product quantization shrinks each vector to a fraction of its size so more of the index fits in memory, at some cost to precision — usually recovered by re-scoring the top candidates at full precision. Binary quantization is the aggressive end: one bit per dimension, search screaming fast, then rerank. For large corpora this is how people keep the index affordable.
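The binary-then-rerank pattern fits in a few lines. This is a sketch under assumptions: sign-based binarization is one simple scheme, the `oversample` factor is a made-up parameter name, and real engines pack bits and compute Hamming distance with far faster machinery than Python integers.

```python
def to_bits(vec):
    """Binary quantization: keep one bit per dimension (1 if positive, else 0)."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    """Number of bit positions where the two codes differ."""
    return bin(a ^ b).count("1")

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query, vectors, codes, k, oversample=4):
    """Shortlist by Hamming distance on the codes, then rerank at full precision."""
    qcode = to_bits(query)
    shortlist = sorted(range(len(vectors)), key=lambda i: hamming(qcode, codes[i]))
    shortlist = shortlist[: k * oversample]
    return sorted(shortlist, key=lambda i: dot(query, vectors[i]), reverse=True)[:k]

vectors = [
    [1.0, 1.0, 1.0, 1.0],
    [-1.0, -1.0, -1.0, -1.0],
    [1.0, 1.0, 1.0, -1.0],
]
codes = [to_bits(v) for v in vectors]  # what the compact index would store
print(search([1.0, 1.0, 1.0, 1.0], vectors, codes, k=1, oversample=2))
```

The oversampling is what buys the precision back: the cheap pass only has to get the true neighbors into the shortlist, not rank them correctly.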
Choosing a store without overthinking it
The vector store is the least interesting decision here, which is freeing. They mostly implement the same HNSW/IVF ideas with different operational stories.
- Already on Postgres and under, say, a few million vectors? pgvector keeps your vectors next to your relational data and your transactions. One fewer system to run. Reach for it first.
- Need a dedicated engine with strong filtered search and quantization built in? Qdrant, Weaviate, Milvus all qualify; Pinecone if you'd rather not run it yourself.
- Prototyping locally? Chroma, LanceDB, or raw FAISS are fine and you can graduate later.
The thing that actually bites in production isn't the engine — it's metadata filtering. Real queries are "find this, but only in docs from this team, in this date range, that this user can see." A store that filters after the ANN search can return you ten neighbors and zero survivors. One that filters during the graph traversal stays accurate. Test filtered recall before you commit, because it's the failure mode the benchmarks never show you.
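The "ten neighbors and zero survivors" failure is easy to reproduce. In this sketch both searches are exact scans over four toy documents, so it shows the ordering problem only; filtering first stands in for a store that applies the filter during traversal, and the document fields and function names are invented for the example.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def post_filter_search(query, docs, k, allowed):
    """Take the k nearest first, then drop whatever fails the filter."""
    nearest = sorted(docs, key=lambda d: sq_dist(query, d["vec"]))[:k]
    return [d["id"] for d in nearest if allowed(d)]

def pre_filter_search(query, docs, k, allowed):
    """Restrict to allowed documents, then take the k nearest among them."""
    eligible = [d for d in docs if allowed(d)]
    nearest = sorted(eligible, key=lambda d: sq_dist(query, d["vec"]))[:k]
    return [d["id"] for d in nearest]

def is_legal(doc):
    return doc["team"] == "legal"

docs = [
    {"id": "a", "vec": [0.0, 0.1], "team": "sales"},
    {"id": "b", "vec": [0.1, 0.0], "team": "sales"},
    {"id": "c", "vec": [0.9, 0.9], "team": "legal"},
    {"id": "d", "vec": [1.0, 1.0], "team": "legal"},
]
query = [0.0, 0.0]

print(post_filter_search(query, docs, 2, is_legal))  # both nearest hits are filtered out
print(pre_filter_search(query, docs, 2, is_legal))   # the nearest allowed documents
```

A filtered-recall test for a real store follows the same shape: run representative filtered queries, compute the true answer by brute force over the eligible subset, and compare.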
The through-line
Pick the embedding model for your domain and verify it on your data, not the leaderboard's. Match the distance metric to the model. Treat the index as a recall-vs-latency dial you set on purpose, not a default you inherit. And remember the index is approximate by design — when retrieval misses, it might not be the model's judgment that failed but the approximation you didn't know you'd turned down too far.
None of these is the glamorous part of RAG. All of them quietly cap how good the rest of the pipeline can be, because every reranker and every clever prompt downstream can only work with the neighbors this layer agreed to return.