Pick a chunk size of 500 tokens and you've made a decision worth more than your choice of embedding model — and you probably made it by copying a tutorial. That's the part that bothers me about chunking. It's the cheapest thing to get wrong and the last thing anyone tunes. People will swap vector databases and benchmark four rerankers before they'll question the splitter they pasted in on day one.
So let's question the splitter. Not with a "best chunk size" number — there isn't one, and anyone who gives you one is selling something — but with the actual decisions that move retrieval quality, and the order to make them in.
Why does chunk size matter so much?
Because a chunk is two things at once, and they pull in opposite directions.
A chunk is the unit you embed. One vector has to summarize everything in it. Pack 1,500 tokens covering four subtopics into a chunk and its vector becomes a muddy average — close to nothing in particular, so it loses to sharper, smaller chunks even when it holds the answer.
A chunk is also the unit you retrieve and hand to the model. Make it tiny (a sentence or two) and the embedding is crisp, but the model gets a fragment with no surrounding argument. It reads "this raised the threshold to 80%" and has no idea what "this" is.
Small chunks: precise to find, starved of context. Large chunks: rich in context, blurry to find. Every strategy below is a different bet on how to escape that tradeoff instead of just sitting in the middle of it.
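To put a number on the "muddy average," here is a deliberately idealized sketch. It assumes (my assumption, not a property of any real embedding model) that each subtopic is its own orthogonal direction and that a chunk's vector is the plain mean of the subtopics it covers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Assumption: four subtopics, each an orthogonal unit direction.
topics = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

query = topics[0]            # a question about subtopic 0 only
focused_chunk = topics[0]    # small chunk covering just that subtopic
# Big chunk covering all four subtopics: the mean of their vectors.
muddy_chunk = [sum(column) / len(topics) for column in zip(*topics)]

print(cosine(query, focused_chunk))
print(cosine(query, muddy_chunk))
```

Under those assumptions the focused chunk scores 1.0 against the query and the four-topic chunk scores 0.5, even though both contain the answer. Real embeddings are messier, but the direction of the effect is the point.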
Fixed-size: the baseline everyone starts with
Chop the text every N tokens, add an overlap of maybe 10–20% so a sentence straddling a boundary survives in at least one piece. It's fast, predictable, and completely blind. It will cut through the middle of a table, split a numbered list from its heading, and separate a definition from its use. The overlap is a band-aid over the fact that the splitter can't see structure.
Fixed-size isn't wrong, exactly — it's the thing to beat. If you're starting today, start here, write down what it scores, then earn every change.
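For reference, the baseline fits in a dozen lines. This is a minimal sketch, not anyone's library code: the function name and defaults are mine, and whitespace splitting stands in for a real tokenizer.

```python
def fixed_size_chunks(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` tokens.

    Each window shares `overlap` tokens with the one before it.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window reaches the end, so the last chunk is not
        # followed by a shorter one that is entirely overlap.
        if start + size >= len(tokens):
            break
    return chunks

# Assumption: whitespace "tokens" stand in for a real tokenizer's output.
tokens = "one two three four five six seven eight nine ten".split()
for chunk in fixed_size_chunks(tokens, size=4, overlap=1):
    print(" ".join(chunk))
```

Notice that nothing in it looks at the text. That blindness is the whole tradeoff.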
Recursive: respect the document's seams
The upgrade most people should make first costs almost nothing. Recursive character splitting (LangChain's recommended general-purpose splitter, for a reason) tries a priority list of separators: paragraph breaks first, then line breaks, then sentences, then words. It falls back to the next, finer separator only when a piece is still over the size limit. The effect is that it breaks on natural seams when it can and only butchers text when it must.
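The mechanism is easier to see in code than in a description. What follows is a hand-rolled sketch of the idea, not LangChain's implementation: the separator list, the character-based size limit, and the function name are all my assumptions.

```python
# Assumed priority list: paragraph, line, sentence, word.
SEPARATORS = ["\n\n", "\n", ". ", " "]

def recursive_split(text, max_len=200, separators=SEPARATORS):
    """Split on the first separator in the list, recursing into finer ones
    only for pieces that are still longer than max_len characters."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No seams left: hard cut as a last resort.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        return recursive_split(text, max_len, finer)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate          # keep merging small neighbors
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # This piece alone is too big: try the next, finer separator.
            chunks.extend(recursive_split(piece, max_len, finer))
    if current.strip():
        chunks.append(current)
    return chunks

sample = "Para one is short.\n\nPara two is also short.\n\n" + "word " * 100
for chunk in recursive_split(sample, max_len=60):
    print(repr(chunk))
```

One wart worth knowing about: the separator at each cut point is dropped, so a cut on ". " loses that period. A production splitter would keep it.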
For prose and most documents this gets you most of the benefit of anything fancier. For structured formats it's better still to split on the structure itself: Markdown by header, code by function or class, HTML by section. The principle is the same: let the document's own boundaries be the chunk boundaries.
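For Markdown, splitting on structure can be as plain as walking the headers. Here is a rough sketch under simplifying assumptions: it only recognizes `#`-style headers, and it will be fooled by a `#` line inside a fenced code block. The function name is mine.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")

def split_markdown_by_header(markdown):
    """Return (section_path, body) pairs, one per header-delimited section."""
    sections = []
    path = []   # stack of (level, title) for the headers currently open
    body = []   # lines collected since the last header

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append((" > ".join(title for _, title in path), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close any open headers at the same or deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return sections

doc = "# Guide\nIntro.\n## Install\npip stuff\n## Usage\nrun it\n"
for section_path, text in split_markdown_by_header(doc):
    print(section_path, "|", text)
```

Each section comes back with its header path ("Guide > Install"), which costs nothing to keep and turns out to be useful later.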
Semantic chunking: split where the meaning turns
The idea is elegant: instead of cutting at a fixed length, embed sentence by sentence and start a new chunk wherever the meaning shifts — where consecutive sentences stop being similar to each other. Topic boundaries become chunk boundaries. In principle you get chunks that are each "about one thing," which is exactly what the embedding step wants.
In practice semantic chunking is more expensive (you embed everything just to decide where to cut) and the wins are inconsistent. On clean, topically segmented text it helps. On messy, repetitive, or list-heavy text it can do worse than a plain recursive split. Treat it as a hypothesis to test on your corpus, not a default to reach for. The honest version of "semantic chunking is better" is "semantic chunking is sometimes better, and you have to measure to find out."
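If you do want to test it, the core loop is short. This sketch makes two loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the 0.2 threshold is a number I picked for the example, not a recommendation.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk wherever consecutive sentences fall below
    `threshold` similarity."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])     # meaning turned: open a new chunk
        else:
            chunks[-1].append(sentence)
    return [" ".join(chunk) for chunk in chunks]

sentences = [
    "The cache stores recent query results.",
    "A cache hit skips the database query.",
    "Our office moved to a new building in March.",
    "The new office building has a larger kitchen.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

With a real model you would pass your own `embed` and a similarity function that matches its vectors. The threshold is the part you end up tuning, which is one reason the results vary so much between corpora.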
The move that actually matters: give the chunk its context back
Here's the technique I'd reach for before tuning chunk size to the second decimal. The deepest failure in chunking isn't size — it's that a chunk gets severed from what makes it interpretable. "The policy raised it to 80%" is a useless chunk on its own. Which policy? Eighty percent of what?
So put the context back. Before embedding each chunk, prepend a short, model-generated blurb that situates it: which document, which section, what it's about. Anthropic called their 2024 version of this contextual retrieval and reported that contextual embeddings alone cut the top-20 retrieval failure rate by about a third, and by roughly half when paired with BM25 over the same contextualized chunks, because now the embedded text carries its own coordinates. A related trick, late chunking (Jina, 2024), runs a long-context embedding model over the whole document first and pools the token vectors into chunks afterward, so each chunk's vector already "saw" its neighbors before being split off.
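The pooling half of late chunking is simple enough to sketch. This assumes you already have one vector per token from a long-context embedding model run over the whole document (that part is not shown), plus chunk boundaries as token index ranges. The function name and the tiny vectors are illustrative only.

```python
def late_chunk_vectors(token_vectors, spans):
    """Mean-pool per-token vectors into one vector per (start, end) span.

    `token_vectors` is assumed to come from a single pass of a long-context
    embedding model over the full document, so every token vector has
    already been influenced by its neighbors.
    """
    pooled = []
    for start, end in spans:
        window = token_vectors[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dim = len(window[0])
        pooled.append([sum(vec[i] for vec in window) / len(window)
                       for i in range(dim)])
    return pooled

# Toy input: four tokens with 2-dimensional vectors, split into two chunks.
token_vectors = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
print(late_chunk_vectors(token_vectors, spans=[(0, 2), (2, 4)]))
```

The difference from ordinary chunking is entirely in the order of operations: embed first, cut second.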
Both attack the same root cause from different ends: the embedded representation should know more than the literal words inside the boundary. That's a bigger lever than 400-vs-600 tokens will ever be.
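Here is roughly what "carrying its own coordinates" looks like at index time. It is a sketch with assumptions spelled out: the `Chunk` fields, the example policy document, and `template_context` are hypothetical, and the template is a cheap stand-in for the model-generated blurb described above. Swap in a function that calls your model where `describe` is passed.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_title: str
    section_path: str
    text: str

def template_context(chunk):
    """Cheap stand-in for a model-generated blurb: built from metadata only."""
    return f"From '{chunk.doc_title}', section '{chunk.section_path}'."

def contextualize(chunk, describe=template_context):
    """Return the text to embed: a context blurb followed by the chunk text."""
    return f"{describe(chunk)}\n\n{chunk.text}"

# Hypothetical example document and section names.
chunk = Chunk(
    doc_title="Remote Work Policy 2024",
    section_path="Eligibility > Attendance",
    text="The policy raised it to 80%.",
)
print(contextualize(chunk))
```

You embed the contextualized string instead of the bare chunk. The useless sentence from earlier now answers "which policy?" before the retriever ever sees a query.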
So what do you actually do?
In rough priority order, because order is the part tutorials skip:
- Make recursive splitting on natural boundaries your first change over the fixed-size baseline. The cost is close to zero and it stops the worst cuts.
- Split structured formats on their structure. Markdown by header, code by function. Don't tokenize a table into confetti.
- Add context to each chunk — a prepended section path or a one-line summary — before you touch anything more exotic. This is the highest-ROI change on the list.
- Only then tune size, and tune it against a retrieval metric on your own questions, not vibes. Smaller for dense reference material and fact lookup; larger for narrative or argumentative text where context matters more than precision.
- Try semantic chunking last, as an experiment, and keep it only if the number says keep it.
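"Tune it against a retrieval metric" can sound heavier than it is. A minimal sketch, assuming you have a handful of your own questions paired with a short answer string, and a `retrieve(question, k)` function that returns chunk texts. The keyword retriever and the three chunks below are toys of mine, there only so the example runs.

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose answer string appears in a top-k chunk.

    Matching on answer text rather than chunk ids keeps the metric
    comparable across chunking strategies, since ids change whenever
    the splitter does.
    """
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for question, answer in eval_set:
        if any(answer in chunk for chunk in retrieve(question, k)):
            hits += 1
    return hits / len(eval_set)

# Toy corpus and retriever (assumptions for the example only).
chunks = [
    "The refund window is 30 days.",
    "Shipping takes 5 business days.",
    "Support is open on weekdays.",
]

def keyword_retrieve(question, k):
    """Rank chunks by how many words they share with the question."""
    words = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: -len(words & set(c.lower().split())))
    return ranked[:k]

eval_set = [
    ("how long is the refund window", "30 days"),
    ("how long does shipping take", "5 business days"),
]
print(hit_rate_at_k(eval_set, keyword_retrieve, k=1))
```

An answer that got split across two chunks counts as a miss here, which is exactly the failure you want the number to punish. Re-run it after every chunking change and keep the change only if the score holds up.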
The thread through all of it: chunking isn't a preprocessing chore you finish and forget. It decides what your retriever is even able to find. A perfect embedding model and a state-of-the-art reranker can't recover an answer that got split across two chunks and stripped of its context at index time. Get this layer right and everything downstream has something to work with. Get it wrong and you spend the rest of the pipeline compensating for a decision you made in the first forty lines.
No tidy rule, then. Just an order of operations and a refusal to let the splitter be the one part of the system nobody looks at.