Encoder-Decoder Architecture Explained | AI Dictionary

What is an Encoder-Decoder Architecture?

An encoder-decoder architecture is a neural network design with two distinct components: an encoder that reads the input and encodes it into an internal representation, and a decoder that uses that representation to produce output. The transformer family includes three variants: encoder-only, decoder-only, and full encoder-decoder models.

Why It Matters

Understanding encoder-decoder architectures explains why different AI models excel at different tasks. BERT (encoder-only) is well suited to classification and other tasks that depend on understanding the input. GPT (decoder-only) excels at text generation. T5 (encoder-decoder) handles tasks that map one sequence to another, such as translation and summarization. Knowing the architecture helps you choose the right model for a given task.

How It Works

The three transformer variants:

1. Encoder-only (e.g., BERT, RoBERTa):

Processes the full input bidirectionally (sees all tokens at once)
Produces rich contextual representations of the input
Best for: classification, named entity recognition, semantic similarity
Poorly suited to: generating new text
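
To make "sees all tokens at once" concrete, here is a minimal sketch of unmasked (bidirectional) attention weights. The score values are made up for illustration, and a real encoder would compute them from learned query and key projections across many heads and layers.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def bidirectional_attention(scores):
    """Normalize each row over ALL positions (no mask), as an encoder does."""
    return [softmax(row) for row in scores]

# Hypothetical raw attention scores for a 3-token input (row = query token).
scores = [
    [1.0, 0.5, 0.2],
    [0.3, 1.0, 0.4],
    [0.1, 0.6, 1.0],
]
weights = bidirectional_attention(scores)

# Every token places some weight on every other token, earlier or later.
for row in weights:
    print([round(w, 3) for w in row])
```

Because no entry is masked out, the first token's representation can already depend on the last token, which is what makes these representations useful for classification-style tasks.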

2. Decoder-only (e.g., GPT, Claude, LLaMA):

Processes tokens left-to-right (autoregressive)
Each token can only attend to previous tokens (causal attention)
Best for: text generation, chat, code completion
The dominant architecture for modern LLMs
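
Causal attention can be sketched the same way: each row is normalized only over the positions at or before it. This is an illustrative simplification with made-up scores; production implementations usually add a large negative mask to the scores before the softmax rather than slicing, but the resulting weights are the same.

```python
import math

def causal_attention(scores):
    """Row i is normalized over positions 0..i; later positions get weight 0."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                      # token i and everything before it
        peak = max(visible)
        exps = [math.exp(x - peak) for x in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - i - 1)         # future tokens are masked out
        weights.append([e / total for e in exps] + hidden)
    return weights

# Hypothetical raw attention scores for a 3-token sequence (row = query token).
scores = [
    [1.0, 0.5, 0.2],
    [0.3, 1.0, 0.4],
    [0.1, 0.6, 1.0],
]
causal_weights = causal_attention(scores)

for row in causal_weights:
    print([round(w, 3) for w in row])
```

The weight matrix is lower-triangular, so the model can be trained to predict each next token without ever seeing it.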

3. Encoder-decoder (e.g., T5, BART, mBART):

Encoder reads the full input bidirectionally
Decoder generates output autoregressively, attending to both previous output tokens and the encoder's representation
Best for: translation, summarization, question answering with structured input
Cross-attention connects encoder output to the decoder
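
Cross-attention is the piece that distinguishes this variant, so a small sketch may help. It assumes a single head and uses the encoder outputs directly as keys and values, with no learned projection matrices; the vectors and the function name `cross_attention` are hypothetical.

```python
import math

def cross_attention(decoder_states, encoder_outputs):
    """For each decoder state (query), return a weighted mix of encoder outputs.

    Simplification: keys and values are the encoder outputs themselves.
    """
    dim = len(encoder_outputs[0])
    mixed = []
    for query in decoder_states:
        # Scaled dot-product score between this query and every input position.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in encoder_outputs
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the encoder outputs, one value per dimension.
        mixed.append([
            sum(w * value[d] for w, value in zip(weights, encoder_outputs))
            for d in range(dim)
        ])
    return mixed

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 input tokens
decoder_states = [[2.0, 0.0], [0.0, 2.0]]               # 2 output tokens so far
context = cross_attention(decoder_states, encoder_outputs)

print(len(context), "context vectors, one per decoder position")
```

Note that the queries come from the decoder while the keys and values come from the encoder, so the result has one vector per output position, each summarizing the whole input.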

The original transformer paper ("Attention Is All You Need") described the full encoder-decoder model for translation. Later work showed that each half is powerful on its own: BERT keeps only the encoder stack, and GPT keeps only the decoder stack.

Example

A machine translation system such as Google Translate follows the encoder-decoder pattern: the encoder reads the English sentence "I love AI" and creates an internal representation of its meaning. The decoder then generates the Dutch translation "Ik hou van AI" from that representation, one token at a time.
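
The control flow of this example can be sketched as a toy program. Nothing here is a real translation model: `encode`, `next_token`, and the hard-coded lookup table are hypothetical stand-ins for trained networks, and the sketch only shows that the encoder runs once while the decoder runs once per output token.

```python
def encode(sentence):
    """Stand-in encoder: a real one returns one vector per input token."""
    return tuple(sentence.split())

def next_token(memory, generated):
    """Stand-in decoder step: picks the next token from a hard-coded table."""
    if memory != ("I", "love", "AI"):
        raise ValueError("the toy table only covers the sentence 'I love AI'")
    table = {
        (): "Ik",
        ("Ik",): "hou",
        ("Ik", "hou"): "van",
        ("Ik", "hou", "van"): "AI",
        ("Ik", "hou", "van", "AI"): "<eos>",
    }
    return table[tuple(generated)]

def translate(sentence, max_steps=10):
    memory = encode(sentence)          # the encoder runs once over the full input
    generated = []
    for _ in range(max_steps):         # the decoder runs once per output token
        token = next_token(memory, generated)
        if token == "<eos>":           # end-of-sequence marker stops generation
            break
        generated.append(token)
    return " ".join(generated)

print(translate("I love AI"))  # prints "Ik hou van AI"
```

Each decoder step receives both the encoder's representation and the tokens generated so far, which mirrors the two attention paths described above.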
