Module 06: Transformer

Introduction

The transformer decoder block is the building block of GPT-style language models. Stack 12-96 of these blocks, and you get models like GPT-2, GPT-3, or LLaMA.

In this module, we’ll combine everything we’ve built so far:

  • Multi-head attention from Module 05
  • Feed-forward networks (mini neural networks for each token)
  • Layer normalization (stabilizes training)
  • Residual connections (enables deep networks)

Each block performs two main operations:

  1. Multi-head attention: Tokens communicate with each other
  2. Feed-forward network: Each token is processed independently
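In pseudocode, a block is just two residual updates around these operations. A minimal sketch (the layer norms ln1/ln2, the x + ... residual pattern, and the components themselves are all built from scratch later in this module; every name here is a placeholder):

# Minimal sketch of one pre-norm transformer block; all names are placeholders
# for components built from scratch later in this module.
def transformer_block(x, ln1, attention, ln2, ffn):
    x = x + attention(ln1(x))  # sublayer 1: tokens communicate with each other
    x = x + ffn(ln2(x))        # sublayer 2: each token processed independently
    return x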

Decoder-Only vs Encoder-Decoder

This module implements decoder-only transformers (GPT-style). There are two main transformer architectures:

Architecture    | Examples                       | Use Case                   | Attention
Decoder-only    | GPT, LLaMA, Claude             | Text generation            | Causal (can’t see future)
Encoder-Decoder | T5, BART, original Transformer | Translation, summarization | Bidirectional encoder + causal decoder

We focus on decoder-only because it’s simpler and powers most modern LLMs.

What You’ll Learn

By the end of this module, you will be able to:

  • Understand the complete GPT-style transformer architecture
  • Implement LayerNorm, GELU, and feed-forward networks from scratch
  • Build a full transformer block with residual connections
  • Assemble a complete language model from components
  • Calculate parameter counts for different model sizes

Complete Model Architecture

Tip: Interactive Architecture Walkthrough

Use the slider above to step through the forward pass. Each stage shows how tensor shapes transform as data flows through the model:

  • Input: Raw token IDs (integers)
  • Embeddings: Dense vectors capturing meaning and position
  • Blocks: Iterative refinement through attention and FFN
  • Output: Probability distribution over vocabulary

Single Transformer Block (Pre-Norm)

The key innovation is the residual connections (the + nodes). Instead of y = f(x), we compute y = x + f(x). This:

  • Helps gradients flow through deep networks
  • Makes it easy to learn identity (just set f(x) = 0)
  • Enables training of 100+ layer networks

The Components

In this section, we build each component from scratch before showing the PyTorch equivalents. The pattern is: understand the math, implement it simply, then see how PyTorch optimizes it.

LayerNorm from Scratch

The Idea: Neural network activations can drift to extreme values during training, causing gradients to explode or vanish. Layer normalization fixes this by normalizing each token’s embedding to have mean 0 and variance 1, then applying learnable scale and shift parameters.

The Formula:

\[\text{LayerNorm}(x) = \gamma \times \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta\]

where:

  • \(\mu\) = mean across the embedding dimension
  • \(\sigma^2\) = variance across the embedding dimension
  • \(\gamma\) (gamma) = learnable scale parameter (initialized to 1)
  • \(\beta\) (beta) = learnable shift parameter (initialized to 0)
  • \(\epsilon\) = small constant for numerical stability (typically 1e-5)

From Scratch Implementation:

import numpy as np
import torch
import torch.nn as nn

class LayerNormScratch:
    """Layer normalization from scratch using NumPy-style operations."""

    def __init__(self, dim, eps=1e-5):
        # Learnable parameters
        self.gamma = np.ones((dim,))   # scale (initialized to 1)
        self.beta = np.zeros((dim,))   # shift (initialized to 0)
        self.eps = eps

    def __call__(self, x):
        """
        Args:
            x: input array of shape (..., dim)
        Returns:
            normalized array of same shape
        """
        # Step 1: Compute mean across last dimension
        mean = x.mean(axis=-1, keepdims=True)

        # Step 2: Compute variance across last dimension
        var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)

        # Step 3: Normalize (the "norm" in LayerNorm)
        x_norm = (x - mean) / np.sqrt(var + self.eps)

        # Step 4: Scale and shift with learnable parameters
        return self.gamma * x_norm + self.beta


# Test our from-scratch implementation
x = np.array([[2.0, 4.0, 6.0, 8.0],
              [1.0, 2.0, 3.0, 4.0]])

ln_scratch = LayerNormScratch(dim=4)
out_scratch = ln_scratch(x)

print("LayerNorm from Scratch:")
print(f"  Input:\n{x}")
print(f"  Output:\n{np.round(out_scratch, 4)}")
print(f"  Output mean per row: {out_scratch.mean(axis=-1).round(6)}")
print(f"  Output std per row: {out_scratch.std(axis=-1).round(4)}")

PyTorch’s nn.LayerNorm:

# PyTorch's optimized implementation
ln_pytorch = nn.LayerNorm(4, elementwise_affine=True)

# Initialize to match our scratch version (gamma=1, beta=0)
nn.init.ones_(ln_pytorch.weight)
nn.init.zeros_(ln_pytorch.bias)

x_torch = torch.tensor(x, dtype=torch.float32)
out_pytorch = ln_pytorch(x_torch)

print("PyTorch LayerNorm:")
print(f"  Output:\n{out_pytorch.detach().numpy().round(4)}")
print(f"  Matches scratch: {np.allclose(out_scratch, out_pytorch.detach().numpy(), atol=1e-5)}")

Key Insight: LayerNorm is just normalize-scale-shift. The learnable \(\gamma\) and \(\beta\) let the network undo the normalization if needed, but start from a stable baseline. Unlike BatchNorm, LayerNorm normalizes across features (embedding dimension) rather than across batch, making it suitable for variable-length sequences.
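To make the axis difference concrete, here is a small sketch contrasting the statistics LayerNorm uses (one mean per token, across features) with the BatchNorm-style statistics (one mean per feature, across the batch):

import numpy as np

# Two "tokens", each with a 4-dimensional embedding (same example as above)
x = np.array([[2.0, 4.0, 6.0, 8.0],
              [1.0, 2.0, 3.0, 4.0]])

# LayerNorm statistics: per token, across the embedding dimension (last axis)
ln_mean = x.mean(axis=-1, keepdims=True)   # shape (2, 1): one mean per token
# BatchNorm-style statistics: per feature, across the batch dimension
bn_mean = x.mean(axis=0, keepdims=True)    # shape (1, 4): one mean per feature

print(f"LayerNorm means (one per token):   {ln_mean.ravel()}")   # [5.  2.5]
print(f"BatchNorm means (one per feature): {bn_mean.ravel()}")   # [1.5 3.  4.5 6. ]
# LayerNorm's statistics are independent of the rest of the batch, which is why
# it works with variable-length sequences and even batch size 1.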

Dropout: Regularization by Noise

The Idea: During training, randomly “drop” (zero out) some activations. This prevents the network from relying too heavily on any single feature and encourages redundancy. The key trick: scale remaining values by \(\frac{1}{1-p}\) so the expected value stays the same.

Why it works:

  • Forces the network to learn redundant representations
  • Acts like training an ensemble of sub-networks
  • At inference time, use all neurons (no dropout)

From Scratch Implementation:

class DropoutScratch:
    """Dropout from scratch."""

    def __init__(self, p=0.1):
        """
        Args:
            p: probability of dropping each element (not keeping!)
        """
        self.p = p

    def __call__(self, x, training=True):
        """
        Args:
            x: input array
            training: if False, return x unchanged
        Returns:
            x with dropout applied (if training)
        """
        if not training or self.p == 0:
            return x

        # Create random mask: True where we KEEP the value
        keep_prob = 1 - self.p
        mask = np.random.random(x.shape) < keep_prob

        # Apply mask and scale by 1/(1-p)
        # This keeps the expected value the same:
        # E[x * mask / keep_prob] = x * keep_prob / keep_prob = x
        return x * mask / keep_prob


# Demonstrate dropout
np.random.seed(42)
x = np.ones((2, 8))

dropout = DropoutScratch(p=0.5)

print("Dropout from Scratch (p=0.5):")
print(f"  Input (all 1s): {x[0]}")

# Apply dropout multiple times to see the randomness
for i in range(3):
    np.random.seed(i)
    out = dropout(x, training=True)
    print(f"  Trial {i+1}: {out[0].round(2)}")
    print(f"    Mean: {out[0].mean():.2f} (should be ~1.0 on average)")

The Scaling Trick Explained:

# Why divide by (1-p)?
# Without scaling, dropout reduces expected output
# With scaling, expected output stays the same

p = 0.5
np.random.seed(0)
x = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

# Without scaling
mask = np.random.random(x.shape) < (1-p)
out_no_scale = x * mask
print(f"Without scaling: {out_no_scale} -> mean = {out_no_scale.mean():.2f}")

# With scaling (divide by keep probability)
np.random.seed(0)
mask = np.random.random(x.shape) < (1-p)
out_scaled = x * mask / (1-p)
print(f"With scaling:    {out_scaled} -> mean = {out_scaled.mean():.2f}")
print(f"\nThe scaling keeps expected value at 1.0 despite dropping 50% of values")

PyTorch’s nn.Dropout:

# PyTorch handles training/mode automatically
dropout_pytorch = nn.Dropout(p=0.5)

x_torch = torch.ones(2, 8)

# Training mode (dropout active)
dropout_pytorch.train()
torch.manual_seed(42)
out_train = dropout_pytorch(x_torch)
print(f"PyTorch Dropout (training): {out_train[0].numpy()}")

# Inference mode (dropout disabled)
dropout_pytorch.eval()
out_inference = dropout_pytorch(x_torch)
print(f"PyTorch Dropout (inference): {out_inference[0].numpy()}")

Key Insight: Dropout is just random masking with scaling. The \(\frac{1}{1-p}\) factor during training means we do not need to modify anything at inference time - the expected value is already correct.

Residual Connections: The Highway for Gradients

The Idea: Instead of computing \(y = f(x)\), compute \(y = x + f(x)\). This “skip connection” lets gradients flow directly through the network, solving the vanishing gradient problem in deep networks.

Why it helps:

import matplotlib.pyplot as plt

# Visualize gradient flow with and without residuals
def gradient_flow_demo():
    """Show how residuals help gradients in deep networks."""

    # Simulate a function that shrinks gradients (common in deep nets)
    def layer_gradient(g, shrink=0.8):
        return g * shrink

    # Without residual: gradients multiply
    # d(f(f(f(x))))/dx = f'(x) * f'(f(x)) * f'(f(f(x)))
    gradients_no_residual = [1.0]
    for _ in range(20):
        gradients_no_residual.append(layer_gradient(gradients_no_residual[-1]))

    # With residual: gradients add
    # d(x + f(x))/dx = 1 + f'(x)  (the 1 always flows through!)
    gradients_with_residual = [1.0]
    for _ in range(20):
        # Gradient through residual = 1 (skip) + shrink (through f)
        g = 1.0 + layer_gradient(gradients_with_residual[-1]) * 0.1
        gradients_with_residual.append(min(g, gradients_with_residual[-1] * 1.05))

    plt.figure(figsize=(10, 5))
    plt.semilogy(gradients_no_residual, 'b-o', label='Without residual', markersize=4)
    plt.semilogy(gradients_with_residual, 'r-o', label='With residual', markersize=4)
    plt.xlabel('Layer depth')
    plt.ylabel('Gradient magnitude (log scale)')
    plt.title('Residual Connections Prevent Vanishing Gradients')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

    print("Without residuals: gradients vanish exponentially")
    print(f"  After 20 layers: {gradients_no_residual[-1]:.6f}")
    print("With residuals: gradients stay healthy")
    print(f"  After 20 layers: {gradients_with_residual[-1]:.2f}")

gradient_flow_demo()

Pre-Norm vs Post-Norm:

The original Transformer used “post-norm”: normalize after the residual addition.

# Post-Norm (original Transformer)
x = LayerNorm(x + Attention(x))
x = LayerNorm(x + FFN(x))

Modern LLMs use “pre-norm”: normalize before each sublayer.

# Pre-Norm (GPT-2, LLaMA, modern LLMs)
x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))

Why Pre-Norm is Better:

# Demonstrate the stability difference

def simulate_forward_pass(num_layers, prenorm=True):
    """Simulate activation magnitudes through layers."""
    x = 1.0  # Starting activation magnitude

    for _ in range(num_layers):
        if prenorm:
            # Pre-norm: the sublayer sees a normalized input, so its output is
            # bounded and the residual stream grows only linearly with depth
            normed = 1.0  # After LayerNorm, magnitude is ~1
            sublayer_out = normed * 0.5  # Sublayer output
            x = x + sublayer_out  # Residual addition
        else:
            # Post-norm: residual can grow, then we normalize
            sublayer_out = x * 0.5
            x = x + sublayer_out  # Can grow unboundedly before norm
            x = 1.0  # LayerNorm resets to ~1

    return x

print("Activation stability comparison:")
print(f"  Pre-norm after 24 layers:  ~{simulate_forward_pass(24, prenorm=True):.1f}")
print(f"  Post-norm after 24 layers: ~{simulate_forward_pass(24, prenorm=False):.1f}")
print("\nPre-norm has a cleaner gradient path because the skip connection")
print("bypasses normalization - gradients flow directly from output to input.")

Key Insight: Residual connections transform y = f(x) into y = x + f(x). The gradient of this is dy/dx = 1 + df/dx. That + 1 is crucial - it means gradients always have a direct path through the network, even if df/dx is tiny.
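You can check the 1 + df/dx claim directly with autograd. A small sketch, using a function whose derivative is 0.1 everywhere:

import torch

def f(x):
    return 0.1 * x   # df/dx = 0.1 everywhere

x = torch.tensor(2.0, requires_grad=True)

# Without a residual: dy/dx = df/dx = 0.1
f(x).backward()
print(f"Gradient without residual: {x.grad.item():.1f}")   # 0.1

# With a residual: dy/dx = 1 + df/dx = 1.1
x.grad = None
(x + f(x)).backward()
print(f"Gradient with residual:    {x.grad.item():.1f}")   # 1.1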

The Full Transformer Block from Scratch

The Idea: Now we assemble all the pieces into a complete transformer block:

  1. LayerNorm + Multi-Head Attention + Residual
  2. LayerNorm + Feed-Forward Network + Residual

We first build the feed-forward network from scratch, then the block that uses it:

class FeedForwardScratch:
    """Simple feed-forward network from scratch."""

    def __init__(self, embed_dim, ff_dim):
        # Initialize weights with small random values
        scale = 0.02
        self.w1 = np.random.randn(embed_dim, ff_dim) * scale
        self.b1 = np.zeros(ff_dim)
        self.w2 = np.random.randn(ff_dim, embed_dim) * scale
        self.b2 = np.zeros(embed_dim)

    def gelu(self, x):
        """GELU activation: x * Phi(x) where Phi is standard normal CDF."""
        return 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))

    def __call__(self, x):
        # Up projection: embed_dim -> ff_dim
        h = x @ self.w1 + self.b1
        # Activation
        h = self.gelu(h)
        # Down projection: ff_dim -> embed_dim
        return h @ self.w2 + self.b2


# NOTE: This uses a SIMPLIFIED attention (just linear projection) to focus on
# the overall block structure. Real attention with Q, K, V is in attention.py
class TransformerBlockScratch:
    """
    A complete transformer block from scratch (with simplified attention).

    Architecture (Pre-Norm):
        x = x + Attention(LayerNorm(x))
        x = x + FeedForward(LayerNorm(x))

    WARNING: The attention here is simplified to a linear projection for
    demonstration purposes. See m05_attention for full attention implementation.
    """

    def __init__(self, embed_dim, num_heads, ff_dim, dropout_p=0.1):
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        # Layer norms
        self.ln1 = LayerNormScratch(embed_dim)
        self.ln2 = LayerNormScratch(embed_dim)

        # Attention projections (simplified: no actual attention computation)
        # In a full implementation, this would include Q, K, V projections
        scale = 0.02
        self.attn_proj = np.random.randn(embed_dim, embed_dim) * scale

        # Feed-forward network
        self.ff = FeedForwardScratch(embed_dim, ff_dim)

        # Dropout
        self.dropout = DropoutScratch(dropout_p)

    def __call__(self, x, training=True):
        """
        Args:
            x: input of shape (batch, seq, embed_dim)
            training: whether to apply dropout
        Returns:
            output of shape (batch, seq, embed_dim)
        """
        # === Attention sub-block ===
        # 1. Layer norm (pre-norm)
        normed = self.ln1(x)

        # 2. Attention (simplified: just a linear projection for demo)
        # Real implementation would compute Q, K, V and attention weights
        attn_out = normed @ self.attn_proj

        # 3. Dropout
        attn_out = self.dropout(attn_out, training=training)

        # 4. Residual connection
        x = x + attn_out

        # === Feed-forward sub-block ===
        # 1. Layer norm (pre-norm)
        normed = self.ln2(x)

        # 2. Feed-forward network
        ff_out = self.ff(normed)

        # 3. Dropout
        ff_out = self.dropout(ff_out, training=training)

        # 4. Residual connection
        x = x + ff_out

        return x


# Test the from-scratch transformer block
np.random.seed(42)
block_scratch = TransformerBlockScratch(
    embed_dim=64,
    num_heads=4,
    ff_dim=256,
    dropout_p=0.0  # Disable dropout for reproducibility
)

x = np.random.randn(2, 8, 64)  # batch=2, seq=8, embed=64
out = block_scratch(x, training=False)

print("Transformer Block from Scratch:")
print(f"  Input shape:  {x.shape}")
print(f"  Output shape: {out.shape}")
print(f"  Input mean:   {x.mean():.4f}")
print(f"  Output mean:  {out.mean():.4f}")
print(f"\nThe block transforms each token while preserving shape.")
print("Residual connections keep the output close to input initially.")

The Complete Picture:

# Visualize the transformer block structure
print("""
Transformer Block (Pre-Norm Architecture):
==========================================

    Input x
        |
        +------------------+
        |                  |
        v                  |
    LayerNorm              |
        |                  |
        v                  |
    Multi-Head Attention   |
        |                  |
        v                  |
    Dropout                |
        |                  |
        +--------(+)-------+  <- Residual connection
                  |
        +------------------+
        |                  |
        v                  |
    LayerNorm              |
        |                  |
        v                  |
    Feed-Forward           |
        |                  |
        v                  |
    Dropout                |
        |                  |
        +--------(+)-------+  <- Residual connection
                  |
                  v
              Output
""")

Key Insight: A Transformer block is just attention + MLP + residuals + norms. That is it. The magic comes from stacking many of these simple blocks and training on lots of data.

PyTorch Transformer Modules

PyTorch provides optimized versions of everything we built from scratch.

Comparison Table:

Component  | From Scratch             | PyTorch
LayerNorm  | Manual mean/var          | nn.LayerNorm
Dropout    | Random mask + scale      | nn.Dropout
FFN        | Two linear layers + GELU | Custom or nn.Sequential
Full Block | Manual assembly          | nn.TransformerDecoderLayer

# PyTorch's TransformerDecoderLayer
# Note: This is for encoder-decoder models; for decoder-only like GPT,
# we typically build our own (as in transformer.py)

from torch.nn import TransformerDecoderLayer

# Create a decoder layer similar to our scratch implementation
pytorch_block = TransformerDecoderLayer(
    d_model=64,
    nhead=4,
    dim_feedforward=256,
    dropout=0.1,
    activation='gelu',
    batch_first=True,
    norm_first=True  # Pre-norm architecture
)

x_torch = torch.randn(2, 8, 64)

# For decoder-only, we use self-attention (memory = x)
pytorch_block.eval()
out_pytorch = pytorch_block(x_torch, x_torch)

print("PyTorch TransformerDecoderLayer:")
print(f"  Input shape:  {tuple(x_torch.shape)}")
print(f"  Output shape: {tuple(out_pytorch.shape)}")
print(f"  Parameters:   {sum(p.numel() for p in pytorch_block.parameters()):,}")

When to Use What:

  • Learning: Build from scratch to understand every step
  • Production: Use PyTorch’s optimized modules
  • Custom architectures: Mix both - understand the components, then optimize

# Our module's TransformerBlock (production quality)
from transformer import TransformerBlock

our_block = TransformerBlock(
    embed_dim=64,
    num_heads=4,
    ff_dim=256,
    dropout=0.1
)

x_torch = torch.randn(2, 8, 64)
our_block.eval()
out_ours = our_block(x_torch)

print("Our TransformerBlock (from transformer.py):")
print(f"  Input shape:  {tuple(x_torch.shape)}")
print(f"  Output shape: {tuple(out_ours.shape)}")
print(f"  Parameters:   {sum(p.numel() for p in our_block.parameters()):,}")
print("\nThis is what we use for training - it includes proper")
print("causal attention, not the simplified version in scratch code.")

More Component Details

Layer Normalization (PyTorch Details)

Normalizes activations across the embedding dimension:

\[\text{LayerNorm}(x) = \gamma \times \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta\]

where \(\mu\) and \(\sigma^2\) are the mean and variance across the embedding dimension, and \(\gamma\), \(\beta\) are learnable parameters.

Why it helps:

  • Stabilizes activations: Prevents values from exploding or vanishing
  • Faster training: More stable gradients
  • Independent per token: Each token normalized separately

Feed-Forward Network (FFN)

The FFN is a mini neural network applied to each token independently:

  • 4x expansion: More capacity to learn complex transformations
  • GELU activation: Smoother than ReLU, better gradients
  • Same for all tokens: Unlike attention, no mixing between positions
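In PyTorch this is just two linear layers around a GELU. A minimal sketch of a GPT-2-style FFN (our module's FeedForward class in transformer.py may differ in details such as dropout placement):

import torch
import torch.nn as nn

class FeedForwardSketch(nn.Module):
    """GPT-2-style position-wise FFN: expand 4x, apply GELU, project back."""

    def __init__(self, embed_dim: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),   # up projection (d -> 4d)
            nn.GELU(),                             # smooth nonlinearity
            nn.Linear(4 * embed_dim, embed_dim),   # down projection (4d -> d)
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Applied to each (batch, position) independently - no mixing across tokens
        return self.net(x)

ffn = FeedForwardSketch(embed_dim=64)
print(ffn(torch.randn(2, 8, 64)).shape)   # torch.Size([2, 8, 64])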

Pre-Norm vs Post-Norm

We use Pre-Norm (LayerNorm before attention/FFN) rather than Post-Norm:

# Pre-Norm (GPT-2, LLaMA, modern LLMs)
x = x + Attention(LayerNorm(x))

# Post-Norm (original Transformer paper)
x = LayerNorm(x + Attention(x))

Why Pre-Norm is preferred:

  1. Cleaner gradient path: The residual connection bypasses normalization, so gradients flow directly
  2. More stable training: Especially important for deep networks (24+ layers)
  3. One consequence - a final LayerNorm is required: Since the last block’s output isn’t normalized, we add a final LayerNorm before the output projection

Post-Norm can achieve slightly better final performance with careful hyperparameter tuning, but Pre-Norm is more robust and easier to train.

Code Walkthrough

Let’s build and explore transformer blocks:

import sys
import importlib.util
from pathlib import Path

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

print(f"PyTorch version: {torch.__version__}")
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Device: {device}")

GELU Activation

GPT uses GELU instead of ReLU. Let’s see why:

import torch.nn.functional as F

x = torch.linspace(-3, 3, 100)
gelu_out = F.gelu(x)
relu_out = torch.relu(x)

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.plot(x.numpy(), relu_out.numpy(), 'b-', label='ReLU', linewidth=2)
plt.plot(x.numpy(), gelu_out.numpy(), 'r-', label='GELU', linewidth=2)
plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.xlabel('x')
plt.ylabel('Activation')
plt.legend()
plt.title('GELU vs ReLU')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
x_zoom = torch.linspace(-1, 1, 100)
plt.plot(x_zoom.numpy(), torch.relu(x_zoom).numpy(), 'b-', label='ReLU', linewidth=2)
plt.plot(x_zoom.numpy(), F.gelu(x_zoom).numpy(), 'r-', label='GELU', linewidth=2)
plt.xlabel('x')
plt.title('Zoomed: GELU is smooth at 0')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key difference: GELU is smooth everywhere!")
print("ReLU has a sharp corner at x=0, which can cause gradient issues.")

GELU formula: \(\text{GELU}(x) = x \cdot \Phi(x)\) where \(\Phi\) is the standard normal CDF.

Approximation used in practice: \(0.5x(1 + \tanh(\sqrt{2/\pi}(x + 0.044715x^3)))\)
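The exact form and the tanh approximation agree closely. A quick numerical check, computing both by hand (the exact CDF via the error function) rather than relying on any particular library flag:

import math
import torch

x = torch.linspace(-3, 3, 101)

# Exact GELU: x * Phi(x), with Phi written via the error function
gelu_exact = 0.5 * x * (1 + torch.erf(x / math.sqrt(2)))

# Tanh approximation (the formula above)
gelu_tanh = 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

print(f"max |exact - approx| on [-3, 3]: {(gelu_exact - gelu_tanh).abs().max().item():.6f}")
# The curves are nearly indistinguishable, which is why GPT-2 uses the cheap approximation.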

Activation function choices in modern LLMs:

Model          | Activation | Notes
GPT-2, BERT    | GELU       | Smooth, good gradients
LLaMA, Mistral | SwiGLU     | Gated variant, better performance
GPT-3          | GELU       | Same as GPT-2

SwiGLU (used in LLaMA) is a gated linear unit: \(\text{SwiGLU}(x) = \text{Swish}(xW_1) \odot xW_2\), an elementwise product of a Swish-gated projection and a second linear projection. It requires an extra linear layer but often improves performance. Our implementation uses standard GELU to match GPT-2.
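For reference, a minimal SwiGLU feed-forward sketch (LLaMA-style naming; purely illustrative, not our module's code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Illustrative SwiGLU FFN: gate SiLU(x W1) elementwise with x W3, project with W2."""

    def __init__(self, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(embed_dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(embed_dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, embed_dim, bias=False)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# hidden_dim of roughly (8/3)*embed_dim keeps the parameter count near a 4x GELU FFN
ffn = SwiGLUFFN(embed_dim=64, hidden_dim=int(64 * 8 / 3))
print(ffn(torch.randn(2, 8, 64)).shape)   # torch.Size([2, 8, 64])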

Layer Normalization Demo

# Manual LayerNorm demonstration
x = torch.tensor([[2.0, 4.0, 6.0, 8.0]])

mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
normalized = (x - mean) / torch.sqrt(var + 1e-5)

print("Manual LayerNorm:")
print(f"  Input: {x.numpy().tolist()}")
print(f"  Mean: {mean.item():.2f}")
print(f"  Variance: {var.item():.2f}")
print(f"  Normalized: {normalized.numpy().round(2).tolist()}")
print(f"  New mean: {normalized.mean().item():.4f}")
print(f"  New std: {normalized.std().item():.4f}")
# PyTorch LayerNorm (with learnable gamma and beta)
ln = nn.LayerNorm(4)
pytorch_normalized = ln(x)

print(f"PyTorch LayerNorm output: {pytorch_normalized.detach().numpy().round(2).tolist()}")
print("(gamma and beta are learnable parameters)")

Residual Connections Demo

# Demonstrate gradient flow with and without residuals
def simple_layer(x):
    """A simple transformation that shrinks values."""
    return x * 0.5 + 0.1

# Stack 10 layers WITHOUT residual
x = torch.tensor([1.0])
outputs_no_residual = [x.item()]
for _ in range(10):
    x = simple_layer(x)
    outputs_no_residual.append(x.item())

# Stack 10 layers WITH residual
x = torch.tensor([1.0])
outputs_with_residual = [x.item()]
for _ in range(10):
    x = x + simple_layer(x) * 0.1  # x + f(x)
    outputs_with_residual.append(x.item())

plt.figure(figsize=(10, 4))
plt.plot(outputs_no_residual, 'b-o', label='Without residual')
plt.plot(outputs_with_residual, 'r-o', label='With residual')
plt.xlabel('Layer')
plt.ylabel('Value')
plt.title('Effect of Residual Connections')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Without residual: value shrinks to {outputs_no_residual[-1]:.4f}")
print(f"With residual: value stays near {outputs_with_residual[-1]:.4f}")

Weight Initialization

Proper initialization is critical for training deep networks. Our implementation uses:

  • Embeddings: Normal distribution with std=0.02
  • Linear layers in FFN: Normal distribution with std=0.02, biases initialized to 0
  • Attention projections: Xavier uniform initialization
# Demonstrate the importance of initialization
import torch.nn as nn

# Bad initialization - too large
bad_linear = nn.Linear(768, 768)
nn.init.normal_(bad_linear.weight, std=1.0)  # Too large!

# Good initialization - small weights
good_linear = nn.Linear(768, 768)
nn.init.normal_(good_linear.weight, std=0.02)  # GPT-2 style

x = torch.randn(1, 10, 768)
bad_out = bad_linear(x)
good_out = good_linear(x)

print("Effect of initialization on output magnitude:")
print(f"  Bad init (std=1.0):  output std = {bad_out.std().item():.2f}")
print(f"  Good init (std=0.02): output std = {good_out.std().item():.2f}")
print("\nLarge outputs can cause exploding gradients and NaN losses!")

GPT-2’s initialization trick: Scale the final projection in each residual block by \(1/\sqrt{2N}\) where N is the number of layers. This keeps the variance stable as depth increases.
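As a sketch, the trick just shrinks the init std of the projections that feed the residual stream (illustrative; our transformer.py may organize initialization differently):

import math
import torch.nn as nn

num_layers = 12
embed_dim = 768

# The last projection of each residual branch (attention output projection and
# the FFN down-projection) gets a smaller std than the usual 0.02.
residual_proj = nn.Linear(embed_dim, embed_dim)
nn.init.normal_(residual_proj.weight, std=0.02 / math.sqrt(2 * num_layers))
nn.init.zeros_(residual_proj.bias)

print(f"Residual-projection init std: {0.02 / math.sqrt(2 * num_layers):.4f}")  # ~0.0041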

Building a Transformer Block

from transformer import (
    FeedForward,
    TransformerBlock,
    GPTModel,
    create_gpt_tiny,
    create_gpt_small,
)

# Create a transformer block
embed_dim = 64
num_heads = 4
ff_dim = 256

block = TransformerBlock(
    embed_dim=embed_dim,
    num_heads=num_heads,
    ff_dim=ff_dim,
    dropout=0.0
)

print(f"Transformer Block:")
print(f"  Embed dim: {embed_dim}")
print(f"  Num heads: {num_heads}")
print(f"  Head dim: {embed_dim // num_heads}")
print(f"  FF dim: {ff_dim}")
print(f"\nTotal parameters: {sum(p.numel() for p in block.parameters()):,}")
# Forward pass
x = torch.randn(1, 8, embed_dim)  # batch=1, seq=8
output, attention = block(x, return_attention=True)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention shape: {attention.shape}")
# Visualize attention patterns from the block
fig, axes = plt.subplots(1, 4, figsize=(14, 3))

for head in range(4):
    ax = axes[head]
    w = attention[0, head].detach().numpy()
    ax.imshow(w, cmap='Blues', vmin=0, vmax=w.max())
    ax.set_title(f'Head {head}')
    ax.set_xlabel('Key')
    ax.set_ylabel('Query')

plt.suptitle('Attention Patterns in Transformer Block (Causal Masked)', fontsize=12)
plt.tight_layout()
plt.show()

Complete GPT Model

# Create a tiny GPT model
model = create_gpt_tiny(vocab_size=1000)

print("GPT Tiny Model:")
print(f"  Vocab size: {model.vocab_size}")
print(f"  Embed dim: {model.embed_dim}")
print(f"  Num layers: {len(model.blocks)}")
print(f"  Max seq len: {model.max_seq_len}")
print(f"\nTotal parameters: {model.num_params:,}")
# Parameter breakdown
counts = model.count_parameters()

print("Parameter breakdown:")
for name, count in counts.items():
    if count > 0:
        pct = 100 * count / counts['total']
        print(f"  {name}: {count:,} ({pct:.1f}%)")

# Visualize
labels = [k for k, v in counts.items() if v > 0 and k != 'total']
sizes = [counts[k] for k in labels]

plt.figure(figsize=(8, 6))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.title(f'Parameter Distribution ({model.num_params:,} total)')
plt.show()
# Forward pass
token_ids = torch.randint(0, 1000, (2, 32))  # batch=2, seq=32
logits = model(token_ids)

print(f"Input token IDs: {token_ids.shape}")
print(f"Output logits: {logits.shape}")
print(f"  (batch=2, seq=32, vocab=1000)")

# Get predictions
probs = torch.softmax(logits[0, -1], dim=-1)
top_5 = torch.topk(probs, 5)

print("\nTop 5 predicted next tokens (untrained, so random):")
for i, (idx, prob) in enumerate(zip(top_5.indices, top_5.values)):
    print(f"  {i+1}. Token {idx.item()}: {prob.item()*100:.2f}%")

Hidden States Through Layers

# Get hidden states from all layers
logits, hidden_states = model(token_ids, return_hidden_states=True)

print(f"Number of hidden states: {len(hidden_states)}")
print(f"  (1 after embedding + {len(model.blocks)} after each block)")

# Show how representations change through layers
norms = [h.norm(dim=-1).mean().item() for h in hidden_states]

plt.figure(figsize=(10, 4))
plt.plot(range(len(norms)), norms, 'b-o')
plt.xlabel('Layer')
plt.ylabel('Average Embedding Norm')
plt.title('Embedding Norms Through the Network')
plt.xticks(range(len(norms)), ['Embed'] + [f'Block {i}' for i in range(len(model.blocks))])
plt.grid(True, alpha=0.3)
plt.show()

Weight Tying

GPT shares weights between token embedding and output projection:

# Check weight tying
print("Weight Tying:")
print(f"  Token embedding weight id: {id(model.token_embedding.weight)}")
print(f"  LM head weight id: {id(model.lm_head.weight)}")
print(f"  Are they the same object? {model.token_embedding.weight is model.lm_head.weight}")

# This saves parameters!
vocab_size = 1000
embed_dim = 128
saved_params = vocab_size * embed_dim
print(f"\nParameters saved by weight tying: {saved_params:,}")

Model Sizes Comparison

Model        | Layers | Heads | Embed Dim | Params
Tiny (ours)  | 4      | 4     | 128       | ~1M
Small (ours) | 6      | 6     | 384       | ~10M
GPT-2 Small  | 12     | 12    | 768       | 117M
GPT-2 Medium | 24     | 16    | 1024      | 345M
GPT-2 Large  | 36     | 20    | 1280      | 774M
GPT-2 XL     | 48     | 25    | 1600      | 1.5B

Parameter Counting Formulas

Understanding where parameters come from helps with model sizing:

Per Transformer Block:

  • Attention Q, K, V projections: \(3 \times d \times d\) (where \(d\) = embed_dim)
  • Attention output projection: \(d \times d\)
  • FFN first linear: \(d \times 4d\)
  • FFN second linear: \(4d \times d\)
  • LayerNorm (x2): \(2 \times 2d\) (gamma and beta for each)

Total per block: \(\approx 12d^2\) parameters

Full Model:

  • Token embedding: \(V \times d\) (V = vocab size)
  • Position embedding: \(L \times d\) (L = max sequence length)
  • N transformer blocks: \(N \times 12d^2\)
  • Final LayerNorm: \(2d\)
  • LM head: 0 (weight-tied with token embedding)

Approximate formula: \(\text{Params} \approx V \times d + 12Nd^2\)

For GPT-2 Small (V=50257, d=768, N=12): \(50257 \times 768 + 12 \times 12 \times 768^2 \approx 124M\), close to the commonly quoted 117M figure (which slightly undercounts the actual checkpoint size).
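These formulas are easy to turn into a quick sanity check. A small sketch of the approximate count (ignoring biases and LayerNorm parameters):

def approx_params(vocab_size: int, embed_dim: int, num_layers: int, max_seq_len: int = 1024) -> int:
    """Approximate parameter count for a GPT-style, weight-tied decoder."""
    embeddings = vocab_size * embed_dim + max_seq_len * embed_dim   # token + position
    per_block = 12 * embed_dim ** 2                                 # attention 4d^2 + FFN 8d^2
    return embeddings + num_layers * per_block

# GPT-2 Small: V=50257, d=768, N=12
print(f"{approx_params(50257, 768, 12):,}")   # ~124 million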

Scaling laws: more layers, heads, and dimensions lead to better performance (but diminishing returns and higher compute cost).

Architectural Variations

Modern LLMs have evolved beyond the original GPT-2 architecture. Here are key variations:

Normalization

Variant   | Used By          | Description
LayerNorm | GPT-2, GPT-3     | Normalize across embedding dimension
RMSNorm   | LLaMA, Mistral   | Simpler: just divide by RMS, no mean subtraction
Pre-Norm  | Most modern LLMs | Normalize before sublayer (more stable)
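RMSNorm drops both the mean subtraction and the bias term. A minimal sketch (illustrative, LLaMA-style; our module uses standard LayerNorm):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Illustrative RMSNorm: divide by the root-mean-square, learnable scale only."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))   # scale (no shift parameter)
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

x = torch.randn(2, 8, 64)
print(RMSNorm(64)(x).shape)   # torch.Size([2, 8, 64])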

Position Embeddings

Variant          | Used By        | Description
Learned absolute | GPT-2          | Separate embedding for each position
Rotary (RoPE)    | LLaMA, Mistral | Encode position in attention via rotation
ALiBi            | BLOOM          | Add position bias to attention scores

Our implementation uses learned absolute position embeddings (GPT-2 style), which are simple but limit the model to the maximum trained sequence length.

Feed-Forward Networks

Variant  | Used By | Expansion           | Activation
Standard | GPT-2   | 4x                  | GELU
SwiGLU   | LLaMA   | 8/3x (after gating) | SiLU (Swish)

Common Pitfalls

When implementing or training transformers, watch out for:

  1. Forgetting the causal mask: Without it, the model can “cheat” by looking at future tokens during training, leading to poor generation at inference time (see the sketch after this list).

  2. Wrong normalization axis: LayerNorm should normalize across the embedding dimension (last axis), not the sequence or batch dimensions.

  3. Residual connection placement: Make sure to add the residual after dropout but before the next LayerNorm in Pre-Norm architecture.

  4. Large learning rates: Transformers are sensitive to learning rate. Start with 1e-4 to 3e-4 for Adam, use warmup.

  5. Numerical instability: Use float32 for training initially. Half precision (fp16/bf16) requires careful scaling.

  6. Forgetting final LayerNorm: In Pre-Norm, the output of the last block isn’t normalized. The final LayerNorm before the LM head is essential.
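The first two pitfalls have short fixes. A sketch of building a causal mask and of normalizing over the correct axis:

import torch
import torch.nn as nn

seq_len, embed_dim = 6, 64

# Pitfall 1: the causal mask. True above the diagonal marks positions to block,
# so position i can only attend to positions 0..i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask.int())

# Pitfall 2: LayerNorm over the embedding dimension only (the last axis),
# i.e. normalized_shape=embed_dim, never the sequence or batch dimension.
ln = nn.LayerNorm(embed_dim)
x = torch.randn(2, seq_len, embed_dim)
print(ln(x).shape)   # torch.Size([2, 6, 64])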

Interactive Exploration

Experiment with transformer architecture choices to understand where parameters come from and how they scale.

Tip: Try This
  1. FFN dominates: Set embed_dim=768, layers=12. Notice the Feed-Forward bars are ~2x the Attention bars (because FFN has 8d² params vs Attention’s 4d²).

  2. Embedding cost at small scale: With vocab=50257 and embed_dim=768, token embeddings are ~38M params - a large fraction for small models.

  3. Scaling law: Double embed_dim from 512 to 1024. Total params roughly quadruple (because most params scale with d²).

  4. Load GPT-2 presets and see how the 117M, 345M, 774M models break down.

  5. Head dimension check: Try numHeads that doesn’t divide embedDim evenly - you’ll see a warning.

Exercises

Exercise 1: Build a Custom Block

# Create a transformer block with different configurations
custom_block = TransformerBlock(
    embed_dim=128,
    num_heads=8,
    ff_dim=512,  # 4x expansion
    dropout=0.1
)

# Test it
x = torch.randn(4, 16, 128)  # batch=4, seq=16
output = custom_block(x)
print(f"Custom block: {x.shape} -> {output.shape}")
print(f"Parameters: {sum(p.numel() for p in custom_block.parameters()):,}")

Exercise 2: Compare Model Scales

# Compare tiny vs small model
tiny = create_gpt_tiny(vocab_size=10000)
small = create_gpt_small(vocab_size=10000)

print(f"{'Model':<10} {'Embed':<8} {'Layers':<8} {'Heads':<8} {'Params':<15}")
print("-" * 50)
print(f"{'Tiny':<10} {tiny.embed_dim:<8} {len(tiny.blocks):<8} {tiny.blocks[0].attention.mha.num_heads:<8} {tiny.num_params:,}")
print(f"{'Small':<10} {small.embed_dim:<8} {len(small.blocks):<8} {small.blocks[0].attention.mha.num_heads:<8} {small.num_params:,}")

Exercise 3: Information Flow

# See how a single token's representation changes through layers
model = create_gpt_tiny(vocab_size=100)
token_ids = torch.randint(0, 100, (1, 8))

_, hidden = model(token_ids, return_hidden_states=True)

# Track first token through layers
first_token_norms = [h[0, 0].norm().item() for h in hidden]

plt.figure(figsize=(8, 4))
plt.bar(range(len(first_token_norms)), first_token_norms)
plt.xlabel('Layer')
plt.ylabel('Embedding Norm (first token)')
plt.title('First Token Representation Through Layers')
plt.xticks(range(len(first_token_norms)), ['Embed'] + [f'Block {i}' for i in range(len(model.blocks))])
plt.show()

Summary

Key takeaways:

  1. Transformer architecture: Input embeddings -> N transformer blocks -> Final LayerNorm -> Output projection

  2. Each block has two sublayers:

    • Multi-head attention (tokens communicate)
    • Feed-forward network (tokens processed independently)
  3. Pre-Norm architecture: LayerNorm before each sublayer, with a “clean” residual path for stable gradients

  4. Layer normalization: Normalizes across the embedding dimension, keeping activations in a stable range

  5. Residual connections: x + f(x) enables gradient flow through very deep networks (100+ layers)

  6. Feed-forward networks: 4x expansion with GELU activation provides computational capacity

  7. Weight tying: Sharing token embedding and output projection reduces parameters and improves performance

  8. Initialization matters: Small initial weights (std=0.02) prevent exploding activations

  9. Parameter scaling: Total params \(\approx V \times d + 12Nd^2\) (dominated by FFN for large models)

  10. Architectural variations: Modern LLMs (LLaMA, Mistral) use RMSNorm, RoPE, and SwiGLU for better efficiency

What’s Next

In Module 07: Training, we’ll train our transformer on actual data using cross-entropy loss, learning rate scheduling, and gradient accumulation.