Transformer¶
The Transformer is a neural network architecture used for natural language processing tasks. It was introduced in the paper Attention Is All You Need and is built on the attention mechanism. We first introduce the architecture of the Transformer and then the training process.
Transformer block¶
The Transformer block is the main building block of the Transformer.
Let \(X \in \mathbb{R}^{n \times d}\) be the input matrix, where \(n\) is the sequence length and \(d\) is the embedding dimension. The Transformer block consists of the following operations:
- Multi-Head Attention:
For multi-head attention with \(h\) heads:
\[ \text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \, W^O \]
where each head is:
\[ \text{head}_i = \text{softmax}\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i, \quad Q_i = X W_i^Q, \; K_i = X W_i^K, \; V_i = X W_i^V, \]
with \(d_k = d / h\) the per-head dimension.
- Residual Connection and Layer Normalization (Add & Norm):
\[ X' = \text{LayerNorm}(X + \text{MultiHead}(X)) \]
where the layer normalization is similar to batch normalization, but instead of computing the mean and variance over the batch, we compute them over the embedding dimension. In PyTorch, you can use torch.nn.LayerNorm to implement the layer normalization.
- Feed-Forward Network (FFN):
\[ \text{FFN}(X') = \text{ReLU}(X' W_1 + b_1) W_2 + b_2 \]
- Second Residual Connection and Layer Normalization (Add & Norm):
\[ \text{Output} = \text{LayerNorm}(X' + \text{FFN}(X')) \]
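As a quick numeric check of the Add & Norm step's normalization (a minimal sketch with toy tensor sizes), torch.nn.LayerNorm computes its statistics over the embedding dimension of each position, not over the batch, and a manual computation matches it at initialization:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 5, 8)  # (batch, sequence, embedding)

ln = nn.LayerNorm(8)
out = ln(x)

# LayerNorm normalizes over the embedding dimension of each position,
# not over the batch: a manual computation matches (up to the eps term).
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + ln.eps)

print(torch.allclose(out, manual, atol=1e-5))  # True
```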
Using PyTorch torch.nn.MultiheadAttention, we can implement the Transformer block as follows:

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_hidden_dim):
        """
        Args:
            embed_dim (int): Dimensionality of the input embeddings.
            num_heads (int): Number of attention heads.
            ff_hidden_dim (int): Hidden layer dimensionality in the feed-forward network.
        """
        super(TransformerBlock, self).__init__()
        # Multi-head attention layer. We use batch_first=True so that the input
        # shape is (batch_size, sequence_length, embed_dim).
        self.mha = nn.MultiheadAttention(embed_dim=embed_dim,
                                         num_heads=num_heads,
                                         batch_first=True)
        # First layer normalization, applied after the multi-head attention residual addition.
        self.attention_norm = nn.LayerNorm(embed_dim)
        # Feed-forward network: two linear layers with a ReLU activation.
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ff_hidden_dim),
            nn.ReLU(),
            nn.Linear(ff_hidden_dim, embed_dim)
        )
        # Second layer normalization, applied after the feed-forward residual addition.
        self.ffn_norm = nn.LayerNorm(embed_dim)

    def forward(self, x, attn_mask=None, key_padding_mask=None):
        # Apply multi-head self-attention, where Q = K = V = x.
        # nn.MultiheadAttention returns (attn_output, attn_weights); unpack accordingly.
        attn_output, _ = self.mha(x, x, x,
                                  attn_mask=attn_mask,
                                  key_padding_mask=key_padding_mask,
                                  need_weights=False)
        # First residual connection and layer normalization:
        # X' = LayerNorm(x + attn_output)
        x = self.attention_norm(x + attn_output)
        # Feed-forward network (FFN)
        ffn_output = self.ffn(x)
        # Second residual connection and layer normalization:
        # Output = LayerNorm(x + ffn_output)
        output = self.ffn_norm(x + ffn_output)
        return output
```
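As a quick shape check of the nn.MultiheadAttention module used inside the block (dimensions chosen small for illustration), the attention output has the same shape as the input, which is what makes the residual addition possible:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 64, 4
mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads, batch_first=True)

x = torch.randn(2, 10, embed_dim)  # (batch_size, sequence_length, embed_dim)

# Self-attention: query = key = value = x
attn_output, attn_weights = mha(x, x, x)

print(attn_output.shape)   # (2, 10, 64): same shape as the input
print(attn_weights.shape)  # (2, 10, 10): attention weights, averaged over heads by default
```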
Transformer encoder¶
The Transformer encoder is a stack of multiple Transformer blocks, typically followed by a final fully connected layer that produces the classification output.
Using the TransformerBlock we defined above, we can build the encoder as follows:

```python
class TransformerEncoder(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_hidden_dim, num_layers):
        super(TransformerEncoder, self).__init__()
        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, ff_hidden_dim)
            for _ in range(num_layers)
        ])

    def forward(self, x, attn_mask=None, key_padding_mask=None):
        for block in self.blocks:
            x = block(x, attn_mask=attn_mask, key_padding_mask=key_padding_mask)
        return x
```
| Model | Layers | Hidden Size | Attention Heads | Feedforward Size | Parameters |
|---|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 3072 | 110M |
| BERT-Large | 24 | 1024 | 16 | 4096 | 340M |
| DistilBERT | 6 | 768 | 12 | 3072 | 66M |
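The parameter counts in the table can be roughly sanity-checked. The back-of-the-envelope sketch below assumes a 30,522-token WordPiece vocabulary for BERT-Base and ignores biases, LayerNorm parameters, and positional embeddings:

```python
# Rough parameter count for BERT-Base (illustrative approximation).
d, ff, num_layers, vocab = 768, 3072, 12, 30522

attn_params = 4 * d * d   # W_Q, W_K, W_V, W_O projections
ffn_params = 2 * d * ff   # two linear layers in the FFN
per_layer = attn_params + ffn_params

embedding_params = vocab * d
total = num_layers * per_layer + embedding_params
print(f"{total / 1e6:.1f}M")  # roughly 108M, close to the reported 110M
```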
Transformer decoder¶
The Transformer decoder is similar to the encoder but with a key difference: it uses masked self-attention in its first sublayer. This masking prevents the decoder from attending to future positions during training, which is essential for autoregressive generation.
Masked Self-Attention¶
In the decoder's masked self-attention, we modify the attention mechanism to ensure that the prediction for position \(i\) can only depend on known outputs at positions less than \(i\). This is achieved by masking future positions in the attention weights:
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V \]
where \(M\) is a mask matrix with:
\[ M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases} \]
When we apply softmax to a row containing \(-\infty\) values, those positions effectively receive \(0\) attention weight, preventing information flow from future tokens.
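A minimal sketch of this masking on toy single-head tensors (dimensions are illustrative): adding \(-\infty\) above the diagonal before the softmax drives those attention weights to exactly zero.

```python
import torch
import torch.nn.functional as F

n, d = 4, 8
torch.manual_seed(0)
q = torch.randn(n, d)
k = torch.randn(n, d)
v = torch.randn(n, d)

# Additive mask: 0 where attention is allowed (j <= i), -inf for future positions (j > i)
mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

scores = q @ k.T / d ** 0.5 + mask
weights = F.softmax(scores, dim=-1)

# Future positions receive exactly zero attention weight; each row still sums to 1.
print(weights)
```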
The Transformer decoder stacks multiple masked self-attention layers. Modern generative language models like GPT-2 and GPT-3 use the decoder-only architecture with a stack of masked self-attention layers followed by a feed-forward network.
There is an interactive visualization of the transformer in transformer-explainer.
GPT-2 was, for years, the last model whose weights OpenAI fully open-sourced. Its small, medium, and large variants have 124M, 355M, and 774M parameters, respectively.
| Model | Layers | Hidden Size | Attention Heads | Feedforward Size | Parameters |
|---|---|---|---|---|---|
| GPT-2 Small | 12 | 768 | 12 | 3072 | 124M |
| GPT-2 Medium | 24 | 1024 | 16 | 4096 | 355M |
| GPT-2 Large | 36 | 1280 | 20 | 5120 | 774M |
From Decoder Output to Token Generation¶
After the final transformer decoder layer, we have a hidden state vector for each position. To generate the next token, we need to convert this vector into a probability distribution over the vocabulary. This happens in two steps: a linear projection (the language model head) and a softmax (optionally with temperature).
Logits: Projecting to Vocabulary Space¶
Let \(\mathbf{h}_L \in \mathbb{R}^d\) be the hidden state at the last decoder position (i.e., after all transformer blocks). The vocabulary has size \(V\) (e.g., 50,257 for GPT-2). We apply a linear layer without bias:
\[ \mathbf{z} = \mathbf{h}_L \, W_{\text{lm}} \]
where \(W_{\text{lm}} \in \mathbb{R}^{d \times V}\) is the language model head. The vector \(\mathbf{z} = (z_1, z_2, \ldots, z_V)\) contains one score (called a logit) per vocabulary token. These logits are unnormalized: higher values mean the model considers that token more likely.
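A minimal sketch of this projection, using illustrative GPT-2-like sizes:

```python
import torch
import torch.nn as nn

d, V = 768, 50257          # GPT-2-like sizes (illustrative)
h_L = torch.randn(1, d)    # hidden state at the last position

lm_head = nn.Linear(d, V, bias=False)  # W_lm, no bias
z = lm_head(h_L)           # logits: one unnormalized score per vocabulary token
print(z.shape)             # torch.Size([1, 50257])
```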
Softmax: Logits to Probabilities¶
The softmax function converts logits into a valid probability distribution:
\[ p_i = \frac{e^{z_i}}{Z} = \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}} \]
where \(Z = \sum_{j=1}^{V} e^{z_j}\) is the partition function (normalizing constant). Each \(p_i \in (0, 1)\) and \(\sum_{i=1}^{V} p_i = 1\). The model then samples the next token from this distribution.
Temperature¶
During generation, we often introduce a temperature hyperparameter \(T > 0\) to control how sharply we pick tokens. We divide the logits by \(T\) before applying softmax:
\[ p_i = \frac{e^{z_i / T}}{\sum_{j=1}^{V} e^{z_j / T}} \]
Effect of temperature:
| Temperature | Effect | Typical use |
|---|---|---|
| \(T = 1\) | Standard softmax; no change | Default, balanced behavior |
| \(T < 1\) (e.g. 0.5, 0.2) | Sharper distribution: high-probability tokens get more mass, low-probability ones get suppressed | More deterministic, conservative output |
| \(T > 1\) (e.g. 1.2, 2.0) | Flatter distribution: probabilities become more uniform | More diverse, creative output |
Intuitively: with \(T \to 0\), we approach argmax (always pick the highest-scoring token); with \(T \to \infty\), we approach uniform random sampling over the vocabulary.
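These two limits can be checked numerically on a toy logit vector:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])

# T -> 0: the distribution approaches argmax (almost all mass on the largest logit)
cold = F.softmax(logits / 0.01, dim=-1)

# T -> infinity: the distribution approaches uniform sampling
hot = F.softmax(logits / 100.0, dim=-1)

print(cold)  # close to [1, 0, 0]
print(hot)   # close to [1/3, 1/3, 1/3]
```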
Putting It Together¶
For autoregressive generation at each step:
- Run the decoder on the current input sequence to get \(\mathbf{h}_L\).
- Compute logits: \(\mathbf{z} = \mathbf{h}_L \, W_{\text{lm}}\).
- (Optional) Apply temperature: \(\tilde{z}_i = z_i / T\).
- Apply softmax to get \(p_i = e^{\tilde{z}_i} / \sum_j e^{\tilde{z}_j}\).
- Sample the next token from this distribution (e.g., multinomial sampling or argmax for greedy decoding).
Below is sample code for softmax (with temperature) and sampling. We assume logits is a tensor of shape (batch_size, vocab_size) from the model's language model head:

```python
import torch
import torch.nn.functional as F


def logits_to_probs(logits, temperature=1.0):
    """Apply temperature scaling and softmax to get probabilities."""
    # Temperature: divide logits by T before the softmax.
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)
    return probs


def sample_next_token(probs, method="multinomial"):
    """
    Sample the next token from the probability distribution.

    Args:
        probs: (batch_size, vocab_size) - probability distribution over the vocabulary
        method: "greedy" for argmax, "multinomial" for random sampling
    """
    if method == "greedy":
        # Always pick the highest-probability token (T -> 0 behavior).
        next_token = probs.argmax(dim=-1)
    elif method == "multinomial":
        # Sample according to the distribution.
        next_token = torch.multinomial(probs, num_samples=1).squeeze(-1)
    else:
        raise ValueError(f"Unknown sampling method: {method}")
    return next_token
```
```python
# Example: get logits from a GPT-2 forward pass, then generate the next token
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits  # (batch_size, seq_len, vocab_size)

# Take the last position (prediction for the next token after "is")
last_logits = logits[:, -1, :]  # (batch_size, vocab_size)

# Softmax with temperature
probs = logits_to_probs(last_logits, temperature=0.8)
print("Top-5 token probabilities:", probs[0].topk(5))

# Sampling
next_token_greedy = sample_next_token(probs, method="greedy")
next_token_sampled = sample_next_token(probs, method="multinomial")
print("Greedy:", tokenizer.decode(next_token_greedy))
print("Sampled:", tokenizer.decode(next_token_sampled))
```
In practice, Hugging Face handles these steps internally when you call model.generate() with do_sample=True and a temperature argument.
Encoder-Decoder Transformer¶
The encoder-decoder transformer is a variant of the Transformer that uses both encoder and decoder blocks. It is used for sequence-to-sequence tasks such as translation and summarization.
Choosing Transformer Architecture¶
We list below the best use cases for each type of transformer architecture.
Encoder-Only Models
- Best for: Understanding and analyzing input text (classification, entity recognition, sentiment analysis)
- Examples: BERT, RoBERTa, DistilBERT
- Characteristics: Bidirectional attention (can see full context in both directions)
- Use when: Your task requires deep understanding of input text without generating new text
Decoder-Only Models
- Best for: Text generation tasks (completion, creative writing, chat)
- Examples: GPT-2
- Characteristics: Autoregressive generation with masked self-attention
- Use when: Your primary goal is to generate coherent, contextually relevant text
Encoder-Decoder Models
- Best for: Sequence-to-sequence tasks (translation, summarization)
- Examples: T5, BART, Pegasus
- Characteristics: Encoder processes input, decoder generates output based on encoder representations
- Use when: Your task involves transforming one sequence into another related sequence
Why Did Decoder-only Models Become So Successful?¶
This is an important question. Decoder-only models were not the only Transformer design, but they became dominant for large language models.
There are several reasons.
Their training objective matches generation directly¶
Decoder-only models are trained to predict the next token, which is exactly what they do at inference time.
This alignment is powerful. There is no gap between "pretraining task" and "generation task". The model simply learns to continue sequences better and better.
The objective scales well with large text corpora¶
Almost any text on the internet can be turned into next-token prediction training data. You do not need labels. You only need sequences of tokens.
This makes it easy to scale data collection and pretraining.
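A toy sketch of this idea, using a whitespace "tokenizer" purely for illustration: next-token training pairs are just the token sequence and the same sequence shifted by one, with no labels required.

```python
# Turning raw text into next-token prediction pairs requires no labels:
# inputs are tokens [0..n-2], targets are the same tokens shifted by one.
text = "to be or not to be"
tokens = text.split()  # toy whitespace "tokenizer" for illustration

vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
ids = [vocab[tok] for tok in tokens]

inputs = ids[:-1]   # the model sees these...
targets = ids[1:]   # ...and is trained to predict these
print(inputs, targets)  # [0, 1, 2, 3, 0] [1, 2, 3, 0, 1]
```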
One interface can solve many tasks¶
With decoder-only models, many tasks can be written as prompting:
- "Summarize this note: ..."
- "Translate this sentence to Spanish: ..."
- "Answer this question: ..."
- "Write Python code that does ..."
Instead of building a separate head for each task, we can often use one model and phrase the task in text.
In-context learning emerged at scale¶
Large decoder-only models showed a surprising ability to learn from examples placed inside the prompt.
For example, if the prompt contains a few question-answer pairs, the model may continue with the correct pattern on a new example. This is called in-context learning.
This property made decoder-only models much more general-purpose than many earlier NLP systems.
They are convenient for product development¶
From an engineering perspective, decoder-only models provide a simple pattern:
- give the model a prompt
- generate tokens
- stop when a condition is met
This simplicity made them attractive for chat systems, assistants, code tools, and agent-like workflows.
Instruction tuning and RLHF fit naturally on top¶
After pretraining, decoder-only models can be further adapted with:
- supervised fine-tuning on instruction-response pairs
- preference learning such as RLHF or DPO
These methods improved helpfulness and alignment without changing the core left-to-right generation framework.
Transformer with Hugging Face¶
Rather than building transformer blocks from scratch, Hugging Face 🤗 gives you production-ready encoder-only, decoder-only, and encoder-decoder models in a single line. The table below maps each architecture to the right AutoModel class.
| Architecture | HF class | Example model |
|---|---|---|
| Encoder-only | `AutoModel` / `AutoModelForSequenceClassification` | `bert-base-uncased` |
| Decoder-only | `AutoModelForCausalLM` | `gpt2` |
| Encoder-Decoder | `AutoModelForSeq2SeqLM` | `t5-small`, `facebook/bart-large-cnn` |
```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Encoder-only (BERT)
enc_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc_model = AutoModel.from_pretrained("bert-base-uncased")
inputs = enc_tokenizer("Hello, how are you?", return_tensors="pt")
enc_out = enc_model(**inputs)
print(enc_out.last_hidden_state.shape)  # (batch, seq_len, 768)

# Decoder-only (GPT-2)
dec_tokenizer = AutoTokenizer.from_pretrained("gpt2")
dec_model = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = dec_tokenizer("Once upon a time", return_tensors="pt")
gen_ids = dec_model.generate(**prompt, max_new_tokens=20)
print(dec_tokenizer.decode(gen_ids[0], skip_special_tokens=True))

# Encoder-Decoder (T5)
t5_tokenizer = AutoTokenizer.from_pretrained("t5-small")
t5_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
src_ids = t5_tokenizer("translate English to French: Hello world", return_tensors="pt")
tgt_ids = t5_tokenizer("Bonjour le monde", return_tensors="pt").input_ids
t5_out = t5_model(**src_ids, labels=tgt_ids)
print(t5_out.loss)  # cross-entropy loss for training
```
For low-level research or custom architectures you can still use PyTorch's nn.TransformerEncoder/Decoder directly:
```python
import torch
import torch.nn as nn

# Encoder
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
src = torch.rand(32, 10, 512)  # (batch, seq_len, d_model) with batch_first=True
enc_out = encoder(src)

# Decoder
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
tgt = torch.rand(32, 10, 512)
dec_out = decoder(tgt, enc_out)

# Full encoder-decoder in one call
transformer = nn.Transformer(d_model=512, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6,
                             batch_first=True)
out = transformer(src, tgt)
```