Encoder-only, Decoder-only, and Encoder-Decoder Models¶
When people say "Transformer model", they often mean one of three related model families:
- Encoder-only models such as BERT
- Decoder-only models such as GPT
- Encoder-decoder models such as T5 and BART
They all use the Transformer idea, but they are built for different jobs. The main differences are:
- what part of the Transformer they keep
- what information each token is allowed to see
- how they are trained
- what tasks they are best at
This distinction is important because the architecture strongly shapes what the model can learn efficiently.
The Big Picture¶
It is useful to think of the three families in a simple way:
- Encoder-only models are good at understanding text.
- Decoder-only models are good at generating text one token at a time.
- Encoder-decoder models are good at turning one sequence into another sequence.
For example:
- sentiment classification: usually encoder-only
- chat and open-ended writing: usually decoder-only
- translation and summarization: usually encoder-decoder
1. Encoder-only Models¶
Encoder-only models keep the Transformer encoder stack.
In the encoder, each token can attend to all other tokens in the input sentence. This is called bidirectional attention.
If the input is:
The patient did not report chest pain.
the word "pain" can look at "did not report" on its left and "chest" on its right. This makes encoder-only models strong at building a rich representation of the whole sentence.
Model structure¶
- input tokens go through token embeddings and positional embeddings
- a stack of encoder blocks processes the full input
- the output is a contextual representation for every token
- for many tasks, we add a task-specific head on top, such as a classifier
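The bidirectional attention at the heart of an encoder block can be sketched in a few lines of NumPy. This toy version omits the learned query/key/value projections and multi-head structure of a real model; it only shows that every output position mixes information from all input positions:

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention with no mask: every token attends
    to every other token (bidirectional), as in an encoder block.
    In this sketch the queries, keys, and values are the inputs
    themselves; a real model applies learned projections first."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)          # (seq, seq) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax rows
    return weights @ X                     # each output mixes ALL tokens

X = np.random.randn(5, 8)                  # 5 tokens, 8-dim embeddings
out = self_attention(X)
print(out.shape)                           # (5, 8)
```

Because no positions are masked out, the representation of "pain" in the earlier example can draw on both its left and right context.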
How they are trained¶
Encoder-only models are usually trained with a masked language modeling objective.
We hide some input words and ask the model to predict them.
Example:
The patient has [MASK] blood pressure.
The model uses both left and right context to guess the missing word.
This training setup is useful for learning deep language representations, but it is not naturally aligned with long free-form text generation.
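A toy version of this masking step, assuming a simple replace-with-[MASK] scheme (real BERT-style training also sometimes keeps or randomizes the selected tokens instead of always masking them):

```python
import random

def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    """Toy masked-language-modeling setup: hide some tokens and keep
    the originals as prediction targets."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            targets.append(tok)        # model must recover this token
        else:
            inputs.append(tok)
            targets.append(None)       # no loss at unmasked positions
    return inputs, targets

tokens = "the patient has high blood pressure".split()
inputs, targets = make_mlm_example(tokens, mask_prob=0.3)
print(inputs)
```

The loss is computed only at masked positions, and the model can use the full sentence, on both sides of each mask, to fill them in.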
When to use encoder-only models¶
Use encoder-only models when the main goal is to understand or classify text.
Common tasks:
- sentence classification
- token classification such as named entity recognition
- information extraction
- semantic similarity
- retrieval and reranking
Typical examples¶
- BERT
- RoBERTa
- DistilBERT
- BioBERT
2. Decoder-only Models¶
Decoder-only models keep the Transformer decoder stack (without the cross-attention to an encoder, since there is no encoder). In modern large language models the key feature is causal masking.
At position \(t\), the model can only attend to tokens from positions \(\leq t\). It cannot look into the future.
If the text is:
The diagnosis is
the model predicts the next token, then uses that new token to predict the following one, and so on.
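The causal mask itself is simple: attention scores for future positions are set to \(-\infty\) so the softmax gives them zero weight. A NumPy sketch using uniform scores for illustration:

```python
import numpy as np

T = 4
# Position t may only attend to positions <= t: put -inf above the
# diagonal so those entries become zero after the softmax.
mask = np.triu(np.full((T, T), -np.inf), k=1)

scores = np.zeros((T, T)) + mask                  # uniform scores + mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
# Row t spreads its attention uniformly over positions 0..t and
# gives zero weight to every future position.
```

In a real model the same mask is added to the learned attention scores inside every decoder block.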
Model structure¶
- input tokens enter a stack of decoder blocks
- each block uses masked self-attention
- the hidden state at each position is used to predict the next token
- generation is autoregressive, meaning one token is produced at a time
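The autoregressive loop above can be sketched with a stand-in for the model's forward pass. The `next_token` function and the lookup-table "model" here are hypothetical, chosen only to show the control flow:

```python
def generate(prompt_tokens, next_token, max_new_tokens=5, stop_token="<eos>"):
    """Autoregressive generation: repeatedly predict one token from
    everything produced so far, append it, and continue."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)        # model sees positions <= t only
        tokens.append(tok)
        if tok == stop_token:
            break
    return tokens

# Toy "model": a lookup table from the last token to the next token.
continuation = {"The": "diagnosis", "diagnosis": "is", "is": "pending",
                "pending": "<eos>"}

def toy_next_token(tokens):
    return continuation.get(tokens[-1], "<eos>")

out = generate(["The"], toy_next_token)
print(out)  # ['The', 'diagnosis', 'is', 'pending', '<eos>']
```

Swapping `toy_next_token` for a real model's forward pass plus a sampling rule gives the generation loop used by actual LLMs.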
How they are trained¶
Decoder-only models are usually trained with a next-token prediction objective.
Given tokens \((x_1, x_2, \dots, x_{t-1})\), the model predicts \(x_t\).
In practice, the model learns to maximize the log-likelihood of the training sequence:

\[
\sum_{t} \log P(x_t \mid x_1, x_2, \dots, x_{t-1})
\]
This objective matches generation exactly: training and inference follow the same left-to-right pattern.
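As a small illustration of this objective, here is the log-likelihood of a sequence under a toy model; the `prob_next` function stands in for a real model's predicted probability of a token given its prefix:

```python
import math

def sequence_log_likelihood(tokens, prob_next):
    """Sum of log P(x_t | x_1..x_{t-1}): the quantity that
    next-token training maximizes."""
    total = 0.0
    for t in range(1, len(tokens)):
        p = prob_next(tokens[:t], tokens[t])   # P(x_t | prefix)
        total += math.log(p)
    return total

# Toy model that assigns probability 0.5 to every next token.
ll = sequence_log_likelihood(["a", "b", "c"], lambda prefix, tok: 0.5)
print(ll)  # 2 * log(0.5) ≈ -1.386
```

Training pushes the model to assign higher probability to the tokens that actually follow, which is exactly the skill exercised at inference time.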
When to use decoder-only models¶
Use decoder-only models when you want the model to generate text, code, or other sequences.
Common tasks:
- chat assistants
- question answering in a generative style
- code generation
- report drafting
- instruction following
- open-ended completion
Typical examples¶
- GPT family
- Llama family
- Mistral family
- Gemma family
3. Encoder-Decoder Models¶
Encoder-decoder models use both parts of the original Transformer.
- the encoder reads the input sequence
- the decoder generates the output sequence
This is sometimes called a sequence-to-sequence or seq2seq model.
Model structure¶
The encoder first builds a representation of the source input.
Then the decoder generates the target output while attending to:
- previous output tokens through masked self-attention
- the encoder output through cross-attention
This is a natural design when the input and output are different but related.
Example:
- input: a long clinical note
- output: a short discharge summary
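The cross-attention step can be sketched as follows. This toy version omits learned projections and multiple heads; it only shows that the queries come from the decoder while the keys and values come from the encoder output, so every target position can read from the whole source:

```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(decoder_states, encoder_states):
    """Cross-attention sketch: decoder queries attend over
    encoder keys/values (projections omitted for brevity)."""
    d = decoder_states.shape[-1]
    scores = decoder_states @ encoder_states.T / np.sqrt(d)  # (tgt, src)
    weights = softmax_rows(scores)
    return weights @ encoder_states

enc = np.random.randn(7, 8)   # 7 source tokens (encoder output)
dec = np.random.randn(3, 8)   # 3 target positions generated so far
out = cross_attention(dec, enc)
print(out.shape)  # (3, 8)
```

Note that no causal mask is applied here: the decoder may look at the *entire* source, only its own previous outputs are masked.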
How they are trained¶
Encoder-decoder models are usually trained to map an input sequence to a target sequence.
Examples:
- source sentence \(\rightarrow\) translated sentence
- article \(\rightarrow\) summary
- question \(\rightarrow\) answer
During training, the decoder sees the correct previous output tokens and learns to predict the next one. This is often called teacher forcing.
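Teacher forcing is easiest to see as a data-preparation step: the decoder input is the gold output shifted right by one position, and the labels are the gold output itself. A minimal sketch, assuming `<s>` and `</s>` as hypothetical start and end tokens:

```python
# Gold target sequence for one training example.
target = ["the", "patient", "is", "stable", "</s>"]

decoder_input = ["<s>"] + target[:-1]   # what the decoder reads
labels = target                         # what it must predict per step

for inp, lab in zip(decoder_input, labels):
    print(f"given ...{inp!r} -> predict {lab!r}")
```

At inference time the gold tokens are unavailable, so the decoder's own previous predictions take their place in the input.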
When to use encoder-decoder models¶
Use encoder-decoder models when the task is clearly input-to-output transformation.
Common tasks:
- translation
- summarization
- paraphrasing
- structured text generation from a source document
- converting one domain-specific text format into another
Typical examples¶
- T5
- FLAN-T5
- BART
Side-by-side Comparison¶
| Family | Attention pattern | Main training objective | Best at | Common output style |
|---|---|---|---|---|
| Encoder-only | Bidirectional | Masked language modeling | Understanding the input | Labels, token tags, embeddings |
| Decoder-only | Causal, left-to-right | Next-token prediction | Free-form generation | Generated text token by token |
| Encoder-decoder | Encoder is bidirectional, decoder is causal with cross-attention | Sequence-to-sequence training | Mapping one sequence to another | Generated target sequence |
How Their Training Objectives Differ¶
The training objective matters as much as the architecture.
Encoder-only: fill in missing pieces¶
The model learns by recovering masked tokens. This encourages strong contextual representations because the model must use both sides of the sentence.
Decoder-only: continue the sequence¶
The model learns to continue text from left to right. This is a very direct objective for text generation.
Encoder-decoder: transform one sequence into another¶
The model learns a conditional distribution:

\[
P(y \mid x) = \prod_{t} P(y_t \mid y_1, \dots, y_{t-1}, x)
\]
where \(x\) is the input sequence and \(y\) is the output sequence.
This is ideal for tasks where the output depends strongly on a source input.
When Should You Use Which?¶
Here is a practical rule of thumb.
Choose encoder-only if:¶
- you need strong text representations
- you are doing classification or extraction
- you want efficiency for understanding tasks
Choose decoder-only if:¶
- you need flexible text generation
- you want one model interface for many prompting tasks
- you are building a chatbot, coding assistant, or general-purpose LLM
Choose encoder-decoder if:¶
- the problem is naturally sequence in, sequence out
- the input should be fully understood before generating the output
- the output should stay tightly grounded in the source input
Why Did Decoder-only Models Become So Successful?¶
This is an important question. Decoder-only models were not the only Transformer design, but they became dominant for large language models.
There are several reasons.
1. Their training objective matches generation directly¶
Decoder-only models are trained to predict the next token, which is exactly what they do at inference time.
This alignment is powerful. There is no gap between "pretraining task" and "generation task". The model simply learns to continue sequences better and better.
2. The objective scales well with large text corpora¶
Almost any text on the internet can be turned into next-token prediction training data. You do not need labels. You only need sequences of tokens.
That makes it easy to scale data collection and pretraining.
3. One interface can solve many tasks¶
With decoder-only models, many tasks can be written as prompting:
- "Summarize this note: ..."
- "Translate this sentence to Spanish: ..."
- "Answer this question: ..."
- "Write Python code that does ..."
Instead of building a separate head for each task, we can often use one model and phrase the task in text.
4. In-context learning emerged at scale¶
Large decoder-only models showed a surprising ability to learn from examples placed inside the prompt.
For example, if the prompt contains a few question-answer pairs, the model may continue with the correct pattern on a new example. This is called in-context learning.
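A minimal few-shot prompt can be built as plain text; the question-answer pairs here are only illustrative:

```python
# Hypothetical few-shot prompt: the "training examples" live inside
# the prompt itself, and the model continues the pattern.
examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
new_question = "What is the capital of Italy?"

prompt = "".join(f"Q: {q}\nA: {a}\n" for q, a in examples)
prompt += f"Q: {new_question}\nA:"
print(prompt)
```

No weights are updated: the model simply predicts the most likely continuation of the text, and a large enough model picks up the question-answer pattern from the prompt alone.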
This property made decoder-only models much more general-purpose than many earlier NLP systems.
5. They are convenient for product development¶
From an engineering perspective, decoder-only models provide a simple pattern:
- give the model a prompt
- generate tokens
- stop when a condition is met
This simplicity made them attractive for chat systems, assistants, code tools, and agent-like workflows.
6. Instruction tuning and RLHF fit naturally on top¶
After pretraining, decoder-only models can be further adapted with:
- supervised fine-tuning on instruction-response pairs
- preference learning such as RLHF or DPO
These methods improved helpfulness and alignment without changing the core left-to-right generation framework.
Important Caveat¶
Decoder-only models are not always the best choice.
For classification, retrieval, and many structured prediction tasks, encoder-only models can still be more efficient and easier to use. For translation and summarization, encoder-decoder models can still be very strong because their architecture is directly matched to the task.
So the success of decoder-only models does not mean the other two families are obsolete. It mostly means decoder-only models became the most flexible general-purpose foundation for language generation.
Summary¶
- Encoder-only models read the whole input and are strong at understanding tasks.
- Decoder-only models generate one token at a time and are strong at open-ended generation.
- Encoder-decoder models read an input sequence and generate an output sequence, making them strong for translation and summarization.
If you remember just one idea, remember this:
the architecture and training objective should match the task you want to solve.
That is why these three model families coexist, and why choosing the right one matters.
