
The attention mechanism is the mathematical core of transformer models: it enables each token in the input to dynamically assess and weigh the relevance of every other token, allowing the model to understand context, resolve ambiguity, and capture long-range dependencies in language. Introduced in the 2017 paper "Attention Is All You Need," attention replaced the sequential processing of earlier architectures (RNNs, LSTMs) with parallel computation over entire sequences, enabling the massive scale of modern LLMs. At its core, attention answers the question: "When processing this token, how much should I pay attention to each other token in the sequence?" The answer comes from a learned similarity function that produces attention scores; each token's updated representation is then a weighted combination of the values of all tokens, with weights that reflect contextual relevance.
Why it matters
The attention mechanism is the breakthrough that made Large Language Models possible. Before attention, neural networks processed sequences one element at a time, bottlenecking all previous context through a single fixed-size vector, which lost information over longer sequences. Attention allows every token to directly attend to every other token, regardless of distance, enabling models to understand that "it" in paragraph four refers to the "company" mentioned in paragraph one. This capability scales with context window size: a 200K-token context window works precisely because attention can connect any two tokens across that entire span. Understanding attention also explains key practical constraints: attention has O(n²) computational complexity in sequence length, meaning doubling the context window quadruples the compute cost. This is why longer prompts cost disproportionately more, and why techniques like sparse attention and prompt optimization have significant economic value.
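The quadratic scaling is easy to verify with back-of-envelope arithmetic. This short sketch (the function name is illustrative, not from any library) counts the query-key dot products a single full-attention layer performs:

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of query-key score computations in one full attention layer:
    every token attends to every token, so the count is n^2."""
    return n_tokens * n_tokens

# Doubling the context window quadruples the score matrix:
print(attention_pairs(4_000))   # prints 16000000
print(attention_pairs(8_000))   # prints 64000000: 4x the work for 2x the tokens
```

This is the arithmetic behind disproportionate long-context pricing: the per-token marginal cost itself grows with sequence length.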
How it works
Attention uses three learned linear transformations to convert each token's representation into a Query (Q), Key (K), and Value (V) vector. The attention score between any two tokens is the dot product of one token's query with the other's key, divided by the square root of the key dimension d_k: Attention(Q,K,V) = softmax(QK^T/√d_k)V. This produces attention weights, a probability distribution over all tokens indicating how much each should influence the current token's updated representation. In practice, transformers use multi-head attention: instead of computing a single attention function, they run multiple attention operations in parallel (typically 8-96 heads), each learning to focus on different types of relationships. One head might capture syntactic dependencies, another semantic similarity, another coreference. The outputs of all heads are concatenated and linearly projected to form the final representation. Decoder-only LLMs also use causal (masked) attention, preventing each token from attending to future positions during generation.
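The formula above can be sketched directly in NumPy. This is a minimal single-head illustration, not a production implementation (real systems batch over sequences and use fused kernels); the function name and toy shapes are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head.

    Q, K, V: (n_tokens, d_k) arrays produced by the learned linear maps.
    Returns the updated token representations and the attention weights.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n) similarity matrix
    if causal:
        # Masked attention: token i may only attend to positions <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Row-wise softmax: each row becomes a probability distribution.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy run: 4 tokens with d_k = 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))
out, w = scaled_dot_product_attention(Q, K, V, causal=True)
print(out.shape)       # prints (4, 8)
```

Note how the causal mask zeroes out attention to future positions: with `causal=True`, the weight matrix is lower-triangular, which is exactly the masking used during autoregressive generation.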
Example
When a language model processes the sentence "The bank approved the loan because the company had strong financials," the attention mechanism enables "bank" to strongly attend to "loan," "approved," and "financials," disambiguating that this is a financial institution, not a riverbank. Meanwhile, "the company" strongly attends to "financials" and "bank," establishing the semantic relationship. Different attention heads capture different aspects: one head tracks subject-verb agreement ("bank approved"), another tracks coreference ("company…strong financials"), and another manages the causal relationship ("because"). An enterprise deploying an LLM with a 128K-token context window to process technical manuals directly benefits from attention's ability to connect a troubleshooting step on page 50 to a component definition on page 3, a connection impossible with pre-attention architectures. However, it also bears the O(n²) cost: processing a 100K-token document requires 10 billion query-key score computations per layer, which is why long-context inference is measurably more expensive.
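The division of labor among heads described above comes from the structure of multi-head attention: each head attends within its own learned subspace of the model dimension, and the head outputs are concatenated and linearly projected. A mechanical sketch follows; the weights here are random, so these heads capture nothing meaningful, and all names and shapes are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Split Q/K/V into n_heads subspaces, attend in each, then merge."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)   # this head's subspace
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])
    # Concatenate head outputs and apply the final linear projection.
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(1)
n, d_model, n_heads = 6, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = rng.normal(size=(4, d_model, d_model)) * 0.1
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)   # prints (6, 16)
```

In a trained model, the per-head projections are what allow one head to specialize in syntax and another in coreference: each head computes the same attention formula, but over a different learned view of the tokens.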