The Transformer, Intuitively

Most explanations of the transformer open with a wall of matrices. That's the wrong end of the telescope. The matrices are how it's computed; they're not what it is. Strip them away and the transformer is built on a single, almost mundane idea — a lookup where the keys match fuzzily instead of exactly. Get that idea and the rest is bookkeeping.

So I'm going to skip the linear algebra for as long as I can and come back for it only when it earns its place.

What was broken before

Before 2017, the dominant sequence models were recurrent. An RNN reads a sentence one word at a time, carrying a hidden state forward like a person reading with a single sticky note they keep rewriting. Two problems followed from that design, and both were fatal at scale.

First, it's sequential. Word 50 can't be processed until words 1 through 49 are done. You can't parallelize across the sequence, which means you can't throw modern GPU compute at it efficiently.

Second, the sticky note is small. By the time the model reaches the end of a long paragraph, the beginning has been overwritten many times. Information decays with distance. People bolted on tricks — LSTMs, attention-over-RNN (Bahdanau, 2014, which let a decoder peek back at all encoder states) — but the recurrence remained the bottleneck.
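Both problems show up in even the smallest recurrent loop. What follows is a rough sketch of a plain tanh RNN, not any particular published model; the sizes and the random weights `W_x` and `W_h` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 50, 8, 16          # illustrative sizes (assumptions)

xs = rng.normal(size=(seq_len, d_in))        # one input vector per word
W_x = rng.normal(size=(d_in, d_hidden)) * 0.1    # random stand-ins for learned weights
W_h = rng.normal(size=(d_hidden, d_hidden)) * 0.1

h = np.zeros(d_hidden)                       # the "sticky note"
for x in xs:
    # Step t needs the h produced by step t-1, so this loop cannot be run in
    # parallel across the sequence. h also stays the same size no matter how
    # long the input is, so old information has to be overwritten.
    h = np.tanh(x @ W_x + h @ W_h)

print(h.shape)
```

The loop body is cheap; the trouble is that each iteration has to wait for the previous one.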

The 2017 paper's title was the thesis: Attention Is All You Need. Drop the recurrence entirely. Keep only the peeking-back mechanism, and apply it everywhere.

The one idea: a soft dictionary

Think of a Python dict. You ask for a key, you get exactly its value. Attention is the blurry version of that.

Every token produces three vectors by multiplying its embedding against three learned weight matrices:

a query — what this token is looking for,
a key — what this token offers as a label,
a value — the actual content it would contribute.
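In code, the three projections are just three matrix multiplications. This is a minimal sketch rather than a trained model: the sizes `d_model` and `d_k`, the toy embeddings `X`, and the random matrices `W_q`, `W_k`, `W_v` are all assumptions invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8   # illustrative sizes, not from any real model

# One embedding row per token (random stand-ins for learned embeddings).
X = rng.normal(size=(seq_len, d_model))

# Three weight matrices. In a real model these are learned; here they are random.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # what each token is looking for
K = X @ W_k   # what each token offers as a label
V = X @ W_v   # the content each token would contribute

print(Q.shape, K.shape, V.shape)
```

Every token gets its own row in each of `Q`, `K`, and `V`, all computed in one shot.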

To update a token, you take its query and compare it against the key of every token in the sequence. Close match → high score. The scores get softmaxed into weights that sum to one, and the token's new representation is the weighted average of all the values. Not "fetch the one matching entry" but "fetch a blend of everything, weighted by how well each one matches."

That's it. That's attention. A word like "it" in "the trophy didn't fit in the suitcase because it was too small" sends out a query that matches the key for "suitcase" more than the key for "trophy", and so it pulls in the suitcase's meaning. Nobody hand-coded that. The weight matrices learned to produce queries and keys that line up the right way.
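The contrast between a hard lookup and a soft one fits in a few lines. The two-dimensional keys, the query, and the scalar values below are hypothetical numbers chosen by hand so the query leans toward "suitcase"; nothing here comes from a real model.

```python
import numpy as np

# A hard lookup: the key must match exactly, and one value comes back.
hard = {"trophy": 1.0, "suitcase": 2.0}
print(hard["suitcase"])

# A soft lookup: keys and the query are vectors, and every value contributes.
keys = np.array([[1.0, 0.0],     # hypothetical key for "trophy"
                 [0.0, 1.0]])    # hypothetical key for "suitcase"
values = np.array([1.0, 2.0])    # the same two values as in the dict
query = np.array([0.2, 1.5])     # hypothetical query, closer to "suitcase"

scores = keys @ query                              # one match score per key
weights = np.exp(scores) / np.exp(scores).sum()    # softmax: positive, sums to 1
blended = weights @ values                         # weighted average of the values

print(weights, blended)
```

The blended result lands between the two values, nearer the one whose key matched better.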

Figure: Attention as a soft dictionary lookup. A query scores every key, then blends the values.

The actual formula is one line — scaled dot-product attention — and it's worth seeing once, because it really is this small:

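import numpy as np

def softmax(x, axis=-1):
    # subtract the max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):             # Q, K, V: 2-D arrays, one row per token
    d_k = K.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)   # how well each query matches each key
    weights = softmax(scores, axis=-1)  # turn into a distribution per query
    return weights @ V                  # blend the values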

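The √d_k divisor keeps the scores from blowing up as vectors get longer: a dot product of two vectors with d_k independent, roughly unit-variance components has a standard deviation near √d_k, and scores that large push the softmax toward a one-hot output with vanishing gradients. That's the whole engine. Everything else in a transformer exists to support, stack, or stabilize this operation.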
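You can watch the scaling do its job with random vectors. This is a small empirical sketch: it assumes the query and key components are independent standard normals, which is a simplification rather than a claim about trained models. The helper `dot_product_std` is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(d_k, n_samples=2000):
    """Return (raw, scaled) standard deviation of q.k over random pairs."""
    q = rng.normal(size=(n_samples, d_k))   # assumed unit-variance components
    k = rng.normal(size=(n_samples, d_k))
    dots = (q * k).sum(axis=-1)             # one dot product per pair
    return dots.std(), (dots / np.sqrt(d_k)).std()

for d_k in (4, 64, 256):
    raw, scaled = dot_product_std(d_k)
    print(f"d_k={d_k:4d}  raw std={raw:6.2f}  scaled std={scaled:5.2f}")
```

The raw spread grows roughly like √d_k, while the scaled spread stays near 1 whatever the dimension.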

Why "self" and why "multi-head"

In self-attention, the queries, keys, and values all come from the same sequence, so every token attends to every other token, including itself. That's what lets the model build context-dependent meaning: the representation of "bank" becomes different in "river bank" versus "savings bank" because it attended to different neighbors.

A single set of query/key/value matrices yields one attention pattern per token, so on its own it tends to capture one kind of relationship. So transformers run several in parallel: multi-head attention. One head might track grammatical subject-verb links, another might follow coreference ("it" → "suitcase"), another might just copy nearby tokens. They run at the same time, their outputs get concatenated, and a final matrix mixes them. More heads, more relationship types learned at once. We don't usually know what each head does until we go looking, and the interpretability research that pries these heads open is still very much open work.
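Here is roughly what "run several in parallel, concatenate, then mix" looks like in NumPy. Treat it as a sketch under assumptions: the weights are random rather than learned, `W_o` is my name for the final mixing matrix, and the heads are carved out of one wide projection, which is a common layout but not the only one.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads          # each head works in a smaller slice

    def split(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    # Self-attention: queries, keys, and values all come from the same X.
    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq, seq)
    weights = softmax(scores, axis=-1)                    # one pattern per head
    heads = weights @ V                                   # (n_heads, seq, d_head)

    # Concatenate the heads back to (seq_len, d_model), then mix with W_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4     # illustrative sizes (assumptions)
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, weights.shape)
```

Each head gets its own `seq_len × seq_len` attention pattern, while the output keeps the input's shape, which is what makes the layer stackable.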

The parts I skipped, briefly

Because attention is order-blind (shuffle the input tokens and each token still gets exactly the same blend), you have to inject position somehow. The original paper added fixed sinusoidal positional encodings to the embeddings; many modern models use rotary embeddings (RoPE, 2021), which rotate the query and key vectors by an angle proportional to position, so that the dot product depends only on the relative distance between the two tokens. That property is one reason context windows were able to grow.
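The relative-distance property of rotary embeddings is easy to check numerically. The sketch below makes assumptions of its own: the `rope` helper, the interleaved pairing of dimensions, and the `base` of 10000 follow one common convention, and real implementations differ in layout details.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x by angles proportional to pos (d must be even)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one rotation frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                   # the two halves of each pair
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # standard 2-D rotation
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)   # random stand-ins for a query and a key

# Same offset (3 positions apart) at two very different absolute positions.
near = rope(q, 5) @ rope(k, 2)
far = rope(q, 105) @ rope(k, 102)
print(np.isclose(near, far))
```

Shifting both positions by the same amount leaves the score unchanged, so the attention score only sees how far apart the two tokens are.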

Each attention layer is followed by a feed-forward network (applied to each position independently), and both sublayers are wrapped in residual connections and layer normalization that keep gradients sane through dozens of stacked layers. Stack the block (attention, then feed-forward, both wrapped in residuals) sixty or a hundred times, and you have a frontier model's spine. The block's structure doesn't change. You just repeat it and widen it.
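A stripped-down version of the block, stacked a few times, might look like this. It is a sketch under several assumptions: one attention head instead of many, normalization applied before each sublayer (the "pre-norm" arrangement, which differs from the original paper's ordering), no learned scale or shift in the layer norm, and random weights throughout. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector; learned scale and shift omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V

def feed_forward(x, W1, W2):
    # Applied to each position independently: expand, ReLU, project back.
    return np.maximum(0.0, x @ W1) @ W2

def block(x, p):
    x = x + self_attention(layer_norm(x), p["W_q"], p["W_k"], p["W_v"])  # residual 1
    x = x + feed_forward(layer_norm(x), p["W1"], p["W2"])                # residual 2
    return x

def init_params(rng, d_model, d_ff):
    shapes = {"W_q": (d_model, d_model), "W_k": (d_model, d_model),
              "W_v": (d_model, d_model), "W1": (d_model, d_ff), "W2": (d_ff, d_model)}
    return {name: rng.normal(size=s) * 0.1 for name, s in shapes.items()}

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, n_layers = 6, 16, 64, 4     # illustrative sizes
layers = [init_params(rng, d_model, d_ff) for _ in range(n_layers)]

x = rng.normal(size=(seq_len, d_model))
for p in layers:          # same block structure, different weights per layer
    x = block(x, p)
print(x.shape)
```

The input and output shapes match, so making the model deeper means nothing more than making the `layers` list longer.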

Why it won

It's not that attention is smarter than recurrence in some deep sense. It's that attention is parallel and distance-flat. Every token reaches every other token in one step, no matter how far apart, and all those comparisons happen simultaneously on hardware built for exactly that kind of dense matrix math. RNNs asked GPUs to wait their turn. Transformers let them run flat out. When an architecture both models long-range dependencies better and trains far faster on the same hardware, scale does the rest, and scale is the story of everything that followed.

The math is real and you'll want it eventually. But the intuition holds the whole structure up: a token asks a question, every other token answers in proportion to how well it fits, and the answers get blended back in. Do that in parallel, with several questions at once, stacked a hundred deep. That's the machine.