"A picture is worth a thousand words" is wrong by about an order of magnitude. To a transformer, a picture is worth a few hundred tokens — and that number, not any vibe about images being rich, is the thing that explains the cost, the latency, and the failure modes of every multimodal model you'll use. Once you know how a pixel grid becomes tokens, the rest of multimodal stops feeling like magic and starts feeling like accounting.
The headline shift is that this is no longer a bolt-on. Frontier models now treat vision, and increasingly audio, as native inputs sharing one model — not a separate captioner duct-taped to a text model. That changes what you can build, especially for agents. But it runs on a mechanism worth understanding first.
How an image becomes tokens
A transformer only consumes a sequence of vectors. Text gets there through a tokenizer. Images get there through a trick introduced by the Vision Transformer (ViT) in 2020, whose title says it outright: an image is worth 16×16 words.
Chop the image into a grid of fixed-size patches — say 16×16 pixels each. A 224×224 image becomes a 14×14 grid, so 196 patches. Flatten each patch's pixels into a vector, run it through a small learned linear layer, and you've got a patch embedding — a vector living in the same kind of space as a word embedding. Add a positional encoding so the model knows where each patch sat in the grid. Now you have a sequence of ~196 vectors, and the transformer doesn't care that they came from pixels instead of letters. To self-attention, they're just tokens.
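To make the patch step concrete, here is a minimal sketch in plain Python. It assumes a grayscale image stored as nested lists, a tiny embedding width, and random numbers standing in for the learned projection and position vectors; `patchify` and `embed_patches` are illustrative names, not any library's API.

```python
import random

def patchify(image, patch):
    """Split an H x W grid of pixel values into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "sketch assumes exact tiling"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [image[y][x]
                    for y in range(top, top + patch)
                    for x in range(left, left + patch)]
            patches.append(flat)
    return patches

def embed_patches(patches, weights, positions):
    """Linear projection of each flattened patch, plus an additive position vector."""
    tokens = []
    for patch_vec, pos_vec in zip(patches, positions):
        projected = [sum(p * w for p, w in zip(patch_vec, row)) for row in weights]
        tokens.append([v + b for v, b in zip(projected, pos_vec)])
    return tokens

rng = random.Random(0)
PATCH, SIDE, DIM = 16, 224, 8  # DIM is tiny here; real models use far wider embeddings
image = [[rng.random() for _ in range(SIDE)] for _ in range(SIDE)]  # grayscale stand-in
patches = patchify(image, PATCH)

# Random stand-ins for parameters that would be learned during training.
weights = [[rng.gauss(0, 0.02) for _ in range(PATCH * PATCH)] for _ in range(DIM)]
positions = [[rng.gauss(0, 0.02) for _ in range(DIM)] for _ in patches]

tokens = embed_patches(patches, weights, positions)
print(len(patches), len(patches[0]), len(tokens), len(tokens[0]))
```

The printout shows 196 patches of 256 pixel values each going in, and 196 vectors of width 8 coming out. A color image would flatten three channels per patch, which changes the input width of the projection but not the number of tokens.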
That's the whole conceptual move. An image token isn't a magic visual primitive; it's a patch of pixels projected into the embedding space and treated like any other token. Higher resolution means more patches, and more patches means more tokens. That is why a 4K screenshot sent at full resolution can cost as much as several pages of text, and why models downscale large images or tile them into regions and process each region separately.
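The arithmetic behind that scaling is easy to sketch. The snippet below assumes plain 16-pixel patches with no downscaling or pooling, and a hypothetical 512-pixel tile size; production models apply their own resizing rules, so treat the numbers as raw patch counts rather than anyone's billing formula.

```python
import math

def patch_count(width, height, patch=16):
    """Number of patches needed to cover the image, rounding partial patches up."""
    return math.ceil(width / patch) * math.ceil(height / patch)

def tile_count(width, height, tile=512):
    """Number of fixed-size tiles needed to cover the image (tile size is an assumption)."""
    return math.ceil(width / tile) * math.ceil(height / tile)

for w, h in [(224, 224), (1024, 768), (3840, 2160)]:
    print(f"{w}x{h}: {patch_count(w, h)} patches, {tile_count(w, h)} tiles")
```

Patch count grows with area, so doubling both dimensions quadruples it. That quadratic growth is the reason resizing and tiling policies exist at all.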
Wiring vision into a language model
There are two broad ways to get those visual tokens talking to a language model, and the field has tried both.
Encoder-plus-projector. Train (or borrow) a vision encoder, then learn a small projector that maps its output into the language model's embedding space. CLIP's image encoder, from the 2021 paper that learned to align images and text in a shared space, is the classic choice. The LLM consumes projected image tokens as if they were a foreign-language prefix to the prompt. LLaVA (2023) showed you could build a capable vision-language model this cheaply: keep a strong vision encoder frozen, start from a strong LLM, train the bridge between them first, then fine-tune the bridge and the LLM together on instruction data. Flamingo (2022) had earlier shown how to interleave images and text and do few-shot visual reasoning by inserting cross-attention layers into a frozen LLM.
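The "bridge" in the encoder-plus-projector design can be sketched as a single linear map followed by concatenation. Everything here is a toy assumption: the dimensions are tiny, the weights are random rather than trained, and `project` is an illustrative helper, not a function from any real framework.

```python
import random

def project(features, weight, bias):
    """Map each vision feature vector (length d_vision) to the LLM width (d_model)."""
    return [[sum(f * w for f, w in zip(vec, row)) + b for row, b in zip(weight, bias)]
            for vec in features]

rng = random.Random(0)
D_VISION, D_MODEL = 6, 4            # toy widths; real encoders and LLMs are far wider
N_IMAGE_TOKENS, N_TEXT_TOKENS = 5, 3

# Stand-ins for the vision encoder's output and the LLM's text embeddings.
vision_features = [[rng.random() for _ in range(D_VISION)] for _ in range(N_IMAGE_TOKENS)]
text_embeddings = [[rng.random() for _ in range(D_MODEL)] for _ in range(N_TEXT_TOKENS)]

# The projector is the only piece this sketch treats as trainable.
weight = [[rng.gauss(0, 0.1) for _ in range(D_VISION)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

image_tokens = project(vision_features, weight, bias)
llm_input = image_tokens + text_embeddings  # image tokens act as a prefix to the prompt
print(len(llm_input), len(llm_input[0]))
```

Once the projected vectors have the same width as the text embeddings, the language model has no way to tell them apart structurally. That is what makes this design cheap to train.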
Native multimodal. Train the model on text and images (and audio) together from early on, so there's no seam — one model, one set of weights, multiple input types entering the same token stream. This is the direction frontier models converged on, and it's why "multimodal" stopped being a product tier and became the default.
For audio, the recipe rhymes: turn the waveform into a spectrogram and patch it like an image, or quantize it into discrete acoustic tokens with a learned codec. Either way you're back to a sequence of vectors, and the transformer treats them like everything else.
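For the spectrogram route, a rough sketch looks like this. It uses a naive DFT from the standard library and skips the window function and mel scaling that a real audio front end would typically apply; the frame length and hop size are arbitrary illustrative values.

```python
import cmath
import math

def spectrogram(samples, frame_len, hop):
    """Magnitude spectrum per frame via a naive DFT (no windowing, for brevity)."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        mags = []
        for k in range(frame_len // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / frame_len)
                        for t, x in enumerate(frame))
            mags.append(abs(coeff))
        frames.append(mags)
    return frames

FRAME_LEN, HOP = 32, 16
# A pure tone that completes exactly 4 cycles per frame, so it lands in frequency bin 4.
samples = [math.sin(2 * math.pi * 4 * t / FRAME_LEN) for t in range(128)]

spec = spectrogram(samples, FRAME_LEN, HOP)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin)
```

The result is a time-by-frequency grid of numbers, which is the same shape of object as a grayscale image. From there, the patch-and-project step applies unchanged.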
What it unlocks for agents
The interesting consequence isn't better captioning. It's that an agent can now see. Three capabilities matter in practice:
- Screenshots as a universal interface. An agent that can read a rendered screen can operate software that has no API — click the button it sees, read the dialog that popped up, notice the spinner that means "still loading." Vision turns the GUI into something an agent can act on directly.
- Documents as they actually exist. Invoices, forms, slide decks, scanned PDFs, charts. A vision model reads the layout — that this number is in the "total" column, that this figure's caption belongs to that figure — which pure OCR-then-text throws away.
- Grounding for the physical and visual world. Diagrams, UI mockups, photos of a whiteboard. The model can reason over them in the same breath as text.
For an agent loop, that means a step can be "look at the current screen" the way another step is "call this tool." The perception and the reasoning live in one model, so there's no lossy hand-off to a separate vision service.
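As a rough sketch of that loop shape: the code below treats "look" as just another action kind next to "tool". The `decide` policy, `capture_screen`, and the `click` tool are all hypothetical stubs so the example runs without a model or a display; a real agent would call a multimodal model where `fake_decide` sits.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                      # "look", "tool", or "done"
    name: str = ""
    args: dict = field(default_factory=dict)

def run_agent(decide, capture_screen, tools, max_steps=10):
    """Loop until the policy says 'done'; image and text observations share one history."""
    history = []
    for _ in range(max_steps):
        action = decide(history)
        if action.kind == "done":
            break
        if action.kind == "look":
            observation = {"type": "image", "data": capture_screen()}
        elif action.kind == "tool":
            observation = {"type": "text", "data": tools[action.name](**action.args)}
        else:
            raise ValueError(f"unknown action kind: {action.kind}")
        history.append((action, observation))
    return history

# Stubs standing in for a real screen grabber and a real model.
def fake_capture_screen():
    return b"<png bytes>"

def fake_decide(history):
    script = [Action("look"),
              Action("tool", "click", {"x": 10, "y": 20}),
              Action("look"),
              Action("done")]
    return script[len(history)]

tools = {"click": lambda x, y: f"clicked ({x}, {y})"}
history = run_agent(fake_decide, fake_capture_screen, tools)
print([obs["type"] for _, obs in history])
```

The design point is that image and text observations land in the same history, so the model that decides the next step sees both without a translation layer in between.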
Where it still breaks
Treating images as tokens is powerful and lossy in specific, predictable ways. Know them before you ship:
- Fine print and dense text. Patchification has a resolution budget. Tiny text in a high-res image can fall below it, and the model guesses — confidently — at characters it can't actually resolve. Tiling and higher-res modes help; they cost tokens.
- Counting and precise spatial logic. "How many people are in this photo?" and "which box is to the left of which" are famously shaky. Attention blends; it doesn't tally. Exact counts and rigid geometry are not its strength.
- Hallucinated detail. Same root cause as text hallucination — a plausible continuation beats an honest "I can't tell." A vision model will sometimes describe an object that fits the scene's gist but isn't in the pixels.
- Token economics. High-resolution images are expensive. If you only need the text off a clean document, OCR-then-text can be cheaper than sending the raw image. Send pixels when layout or visual content actually carries information.
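A back-of-the-envelope check can make the first and last of these trade-offs visible before you send anything. The sketch below assumes 16-pixel patches, a hypothetical 1024-pixel cap on the longest side, and a rough four characters per text token; none of these are a specific model's real parameters, so swap in the numbers your provider documents.

```python
import math

def resize_scale(width, height, max_side):
    """Downscale factor applied so the longest side fits within max_side (never upscales)."""
    return min(1.0, max_side / max(width, height))

def image_tokens(width, height, patch=16, max_side=None):
    """Raw patch count after an optional downscale; real models may pool or tile instead."""
    if max_side is not None:
        scale = resize_scale(width, height, max_side)
        width, height = round(width * scale), round(height * scale)
    return math.ceil(width / patch) * math.ceil(height / patch)

def text_tokens(char_count, chars_per_token=4):
    """Crude text token estimate; the chars-per-token ratio is an assumption."""
    return math.ceil(char_count / chars_per_token)

def glyph_height_after_resize(glyph_px, width, height, max_side):
    """How tall a character ends up, in pixels, once the image has been downscaled."""
    return glyph_px * resize_scale(width, height, max_side)

# Illustrative case: a 2480x3508 page scan with 33 px tall text and about 3000 characters.
page_w, page_h, cap = 2480, 3508, 1024
print("image tokens:", image_tokens(page_w, page_h, max_side=cap))
print("text tokens: ", text_tokens(3000))
print("glyph height:", round(glyph_height_after_resize(33, page_w, page_h, cap), 1), "px")
```

If the glyph height after resizing drops to only a few pixels, expect the model to guess at characters. If the text estimate is a small fraction of the image estimate and layout doesn't matter, extracting the text first is likely the cheaper path.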
The practical mindset: an image is a budget of tokens, and resolution is the dial that trades cost against how much detail survives the patchification. Spend the budget where the visual information lives — the chart, the screenshot, the form — and don't pay image prices for text you could send as text. Multimodal-by-default doesn't mean send everything as an image. It means you finally can when it helps, and the skill is knowing when it does.