Paper Notes: Attention Is All You Need

This is the paper that introduced the Transformer, the architecture that now sits underneath GPT, BERT, Gemini, and basically every large model you've heard of. I read it as part of trying to properly understand why modern AI works the way it does, not just use it. These are my notes, written to make sense to me, so they are deliberately plain.

Paper: Vaswani et al., 2017 — arxiv.org/abs/1706.03762

## background — what existed before this

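Before Transformers, sequence tasks (translation, text generation, etc.) were handled by RNNs and LSTMs. These process tokens one at a time, left to right. That sounds fine until you realise the problems. First, for word 50 to use anything from word 1, that information has to survive being passed through every step in between, and a lot of early context fades along the way. The training-time version of this is the "vanishing gradient" problem: the error signal shrinks as it is propagated back through many steps, so long-range dependencies are hard to learn (LSTMs reduce this but don't remove it). Second, because each step depends on the previous one, the computation can't be parallelised across positions within a sequence, which is the limitation the paper itself stresses.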

Attention mechanisms were already being bolted onto RNNs to help with this — letting the model look back at earlier tokens directly. The insight of this paper was: what if we throw away the RNN entirely and just use attention?

## the core idea — self-attention

The key mechanism is self-attention. For every token in a sequence, the model asks: "which other tokens in this sequence should I pay attention to when computing my representation?" And it asks this for every token simultaneously, in parallel.

Here's how it actually works. Each token's embedding is multiplied by three learned weight matrices, giving three vectors:

Q — Query: "what am I looking for?"
K — Key: "what do I contain?"
V — Value: "what do I actually pass forward?"

To compute attention, you take the dot product of a Query with all the Keys, scale it, run a softmax to get weights, then take a weighted sum of the Values. In one formula:

Attention(Q, K, V) = softmax( QK^T / sqrt(d_k) ) * V

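Dividing by sqrt(d_k) keeps the dot products from growing with the key dimension: if the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance d_k. Without the scaling, large scores push the softmax into regions with tiny gradients. Small detail, but it matters.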
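
To make the formula concrete, here is a minimal sketch in plain Python (standard library only, lists instead of tensors). The function names and the toy Q/K/V numbers are my own, made up for illustration, and this is not how a real implementation would do it: real code uses batched matrix multiplies.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of vectors, one per token."""
    d_k = len(K[0])
    out, weights = [], []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output component at a time.
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
        weights.append(w)
    return out, weights

# Toy input: 3 tokens, d_k = d_v = 2 (values are made up).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, weights = attention(Q, K, V)
for w in weights:
    print([round(x, 3) for x in w])
```

Each printed row is one token's attention weights over all three tokens and sums to 1. The loop over queries is what a real implementation replaces with the single QK^T matrix product.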

### why this is powerful

In an RNN, the distance between two tokens matters: information has to travel through every token in between. In self-attention, every token attends to every other token directly, so the path length between any two positions is always 1. A word at position 1 and a word at position 200 can directly influence each other. The price is a score for every pair of tokens, so the cost of a layer grows quadratically with sequence length.

### multi-head attention

Instead of doing attention once, the paper runs it h times in parallel (h = 8 in the base model), each with its own learned Q/K/V projections into a smaller space of size d_k = d_model / h. Each "head" can learn to attend to different kinds of relationships: one head might track syntactic dependencies, another might track coreference, etc. The outputs of all heads are concatenated, which brings the width back to d_model, and passed through one more learned projection W_O.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O
where head_i = Attention(Q*W_Qi, K*W_Ki, V*W_Vi)
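
A rough sketch of the two formulas above for the self-attention case, where Q, K and V are all the same input X. Everything here is an assumption for illustration: the helper names (`matmul`, `random_matrix`, `multi_head`), the tiny sizes (d_model = 8, h = 2), and the random untrained weights. It only shows how the shapes fit together.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with matrices as lists of rows."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Stand-in for learned weights: small random values."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    head_outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
                    for W_Q, W_K, W_V in heads]
    # Concatenate the heads along the feature axis: h * d_k = d_model columns.
    concat = [[x for head in head_outputs for x in head[t]] for t in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)
n_tokens, d_model, h = 5, 8, 2
d_k = d_model // h  # each head works in a smaller space

X = random_matrix(n_tokens, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
W_O = random_matrix(h * d_k, d_model, rng)

Y = multi_head(X, heads, W_O)
print(len(Y), len(Y[0]))  # 5 8
```

The output has the same shape as the input (one d_model-sized vector per token), which is what lets these layers be stacked.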

## the full architecture

The Transformer is an encoder-decoder architecture. For translation: the encoder reads the source sentence, the decoder generates the target sentence one token at a time.

Encoder: stack of 6 identical layers, each with two sublayers (multi-head self-attention, then a position-wise feed-forward network); every sublayer is wrapped in a residual connection followed by layer norm
Decoder: also 6 layers, same structure but with an extra cross-attention sublayer that attends to the encoder output, and masked self-attention so it can't look ahead
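
The "residual connections + layer norm" part is easy to skim past, so here is a small sketch. In the paper the output of each sublayer is LayerNorm(x + Sublayer(x)), and the feed-forward sublayer is max(0, x W1 + b1) W2 + b2. The code below applies both to a single token vector. The function names, the toy sizes and the random weights are my own assumptions, and the layer norm here leaves out the learned gain and bias.

```python
import math
import random

def layer_norm(x, eps=1e-6):
    """Normalise one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """LayerNorm(x + Sublayer(x)) for a single token vector."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: max(0, x W1 + b1) W2 + b2."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

# Toy sizes; random weights stand in for learned ones.
rng = random.Random(0)
d_model, d_ff = 4, 8
W1 = [[rng.gauss(0.0, 0.5) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
W2 = [[rng.gauss(0.0, 0.5) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model

x = [1.0, -2.0, 0.5, 3.0]
out = add_and_norm(x, lambda v: feed_forward(v, W1, b1, W2, b2))
print([round(v, 3) for v in out])
```

The same `add_and_norm` wrapper would go around the attention sublayers too; only the `sublayer` function changes.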

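The masking in the decoder is important. During training the full target sequence is fed in at once, shifted right by one position, and each position can only attend to itself and earlier positions. Together these mean the prediction for position i depends only on the tokens before i. Otherwise the model would just copy the answer.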
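
A small sketch of how the mask can be applied, assuming the usual approach of setting disallowed scores to minus infinity before the softmax. Note that in the mask itself position i may attend to positions up to and including i; it is the right-shifted decoder input that keeps the prediction for position i from seeing token i. The function names and the all-zero scores are made up for illustration.

```python
import math

def causal_mask(n):
    """mask[i][j] is True where position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row, with disallowed positions set to -inf first."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because the diagonal is always allowed
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

n = 4
scores = [[0.0] * n for _ in range(n)]  # uniform raw scores, for illustration only
mask = causal_mask(n)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, the first position puts all its weight on itself, the second splits it between the first two positions, and so on. Everything above the diagonal is exactly zero.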

### positional encoding

Self-attention has no concept of order — it just looks at a bag of tokens and attends between them. To give the model a sense of position, the paper adds a positional encoding to each token embedding before feeding it in. They use sine and cosine functions at different frequencies:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

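The reason the paper gives for this specific formula: the authors hypothesised it would let the model learn to attend by relative position, because for any fixed offset k, PE(pos+k) can be expressed as a linear function of PE(pos). They also tried learned positional embeddings and got nearly identical results, and kept the sinusoids because they might extrapolate to sequences longer than those seen in training. Later work (RoPE, ALiBi) proposed alternatives that handle relative position more directly, but for 2017 it was elegant.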
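
The "linear function" claim is easy to check numerically. The sketch below computes the encoding from the two formulas and then applies a fixed map that depends only on the offset k: each (sin, cos) pair gets rotated using the angle-addition identities. The function names are mine, and the code assumes d_model is even.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position: sin at even indices, cos at odd."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift(pe, k, d_model):
    """Map PE(pos) to PE(pos + k) without knowing pos.

    Each (sin, cos) pair is rotated by an angle that depends only on k and i,
    so the whole map is linear in pe.
    """
    out = []
    for i in range(d_model // 2):
        theta = k / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        out.extend([s * math.cos(theta) + c * math.sin(theta),   # sin(a + theta)
                    c * math.cos(theta) - s * math.sin(theta)])  # cos(a + theta)
    return out

d_model = 8
pe_7 = positional_encoding(7, d_model)
pe_12 = positional_encoding(12, d_model)
shifted = shift(pe_7, 5, d_model)
print(max(abs(a - b) for a, b in zip(shifted, pe_12)))  # tiny, floating-point error only
```

The same `shift` with k = 5 works for any starting position, which is the property the paper points to.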

## what surprised me

How much of the architecture is just engineering decisions layered on top of one core idea. The core idea — replace recurrence with attention — is one paragraph. Everything else (multi-head, scaling, positional encoding, residuals, layer norm, the feed-forward sublayers) is making that one idea stable and trainable at scale.

Also how readable the paper actually is. I was expecting dense math. It's surprisingly well-written — the authors clearly wanted people to understand it.

"We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely." — the abstract. They were not understating it.