Module 04: Embeddings

Introduction

Embeddings convert token IDs into dense vectors. In vector space, similar tokens cluster together.

The model learns embedding vectors during training. The model does not treat token ID 42 as a bare number; it assigns a dense vector like [0.2, -0.5, 0.8, ...] that encodes semantic content.

Why embeddings matter for LLMs:

  • Similarity: Similar words have similar vectors (“cat” and “dog” are close)
  • Composition: Vectors can be combined meaningfully
  • Learning: The model learns these representations during training
  • Position: We also embed WHERE tokens are in the sequence

Two types of embeddings in transformers:

  1. Token embeddings: What the token means
  2. Positional embeddings: Where the token is in the sequence

What You’ll Learn

After this module, you will be able to:

  • Explain how embedding lookups work as matrix multiplication
  • Implement token and positional embeddings from scratch
  • Understand how gradients flow through embedding layers
  • Choose between learned and sinusoidal positional embeddings
  • Recognize the role of embeddings in the transformer architecture

Prerequisites

This module requires familiarity with:

Memory and Scale Considerations

In most language models, the token embedding table is one of the largest single weight matrices, and in small models it can account for a large share of all parameters. Its parameter count equals vocab_size × embed_dim:

Model           Vocab Size   Embed Dim   Embedding Params   Memory (fp32)
GPT-2 Small     50,257       768         38.6M              147 MB
LLaMA 2 7B      32,000       4,096       131M               500 MB
LLaMA 3 8B      128,256      4,096       525M               2 GB
GPT-4 (est.)    ~100,000     ~12,288     ~1.2B              ~4.7 GB
LLaMA 3 70B     128,256      8,192       1.05B              4 GB

Vocabulary size drives embedding memory and demands careful consideration. A larger vocabulary means each token carries more information (fewer tokens per text), but the embedding table grows proportionally.
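The table rows can be reproduced with a few lines of arithmetic. The sketch below uses a hypothetical helper, embedding_table_size, and assumes the table's "MB" figures use 1 MB = 1024² bytes, which is what matches the GPT-2 and LLaMA 2 rows.

```python
def embedding_table_size(vocab_size: int, embed_dim: int, bytes_per_param: int = 4):
    """Return (parameter count, size in MiB) for a vocab_size x embed_dim table."""
    params = vocab_size * embed_dim
    mib = params * bytes_per_param / (1024 ** 2)
    return params, mib

# Rows taken from the table above (GPT-4 is omitted because its figures are estimates)
for name, vocab, dim in [
    ("GPT-2 Small", 50_257, 768),
    ("LLaMA 2 7B", 32_000, 4_096),
    ("LLaMA 3 8B", 128_256, 4_096),
]:
    params, mib = embedding_table_size(vocab, dim)
    print(f"{name:12s} {params / 1e6:7.1f}M params  {mib:8.1f} MiB (fp32)")
```

Passing bytes_per_param=2 gives the fp16/bf16 footprint, which is half the fp32 figure.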

Intuition: Coordinates in Meaning Space

Embeddings are coordinates in “meaning space”:

Token: "cat"  -> [0.8, 0.1, 0.9, ...]   <- captures "animal", "pet", etc.
Token: "dog"  -> [0.7, 0.2, 0.8, ...]   <- similar to cat
Token: "code" -> [0.1, 0.9, 0.2, ...]   <- very different

Distance("cat", "dog") < Distance("cat", "code")
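As a rough check of the inequality above, the sketch below computes Euclidean distance and cosine similarity on the three coordinates shown for each token. The "..." components are unknown, so the numbers are illustrative only, and the names toy, euclidean, and cosine are made up for this example.

```python
import math

# Only the first three coordinates listed above; not real model values.
toy = {
    "cat": [0.8, 0.1, 0.9],
    "dog": [0.7, 0.2, 0.8],
    "code": [0.1, 0.9, 0.2],
}

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

for other in ("dog", "code"):
    print(f"cat vs {other:4s}: distance = {euclidean(toy['cat'], toy[other]):.3f}, "
          f"cosine = {cosine(toy['cat'], toy[other]):.3f}")
```

Cosine similarity is the more common choice in practice because it ignores vector length and compares direction only.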

Positional embeddings add “where in the sequence” information:

Position 0: [1.0, 0.0, 0.5, ...]   <- "I'm first"
Position 1: [0.9, 0.1, 0.4, ...]   <- "I'm second"
Position 2: [0.8, 0.2, 0.3, ...]   <- "I'm third"

The model combines them:

Final embedding = Token embedding + Positional embedding

Embedding Architecture

The diagram below shows how embeddings work in a transformer. Step through the pipeline to trace token IDs as they transform into dense vector representations:

Tip: Try This

Use the slider to step through the embedding pipeline. Notice how token IDs become vectors through table lookup, then get combined with position information.

Embeddings Are Just Lookup Tables

The key insight: embedding lookup is sparse matrix multiplication. When we say “look up embedding for token 3”, we’re actually doing:

  1. Create a one-hot vector: token 3 in vocab of 10 becomes [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
  2. Multiply by the weight matrix: one_hot @ W
  3. Only one row of W participates (the row for token 3)

This is exactly what happens mathematically. PyTorch skips one-hot creation as an optimization, but the matrix multiplication view reveals how gradients flow.

From Scratch: One-Hot Embedding Lookup

Let’s build an embedding layer using explicit one-hot vectors and matrix multiplication:

import numpy as np
import torch
import torch.nn as nn

class ScratchEmbedding:
    """Embedding layer using explicit one-hot multiplication.

    This shows what's really happening: embedding lookup is
    just sparse matrix multiplication.
    """

    def __init__(self, vocab_size: int, embed_dim: int):
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        # Weight matrix: each row is the embedding for a token
        self.W = np.random.randn(vocab_size, embed_dim) * 0.02

    def __call__(self, token_ids: np.ndarray) -> np.ndarray:
        """
        token_ids: shape (batch, seq_len) - integer token IDs
        returns: shape (batch, seq_len, embed_dim)
        """
        batch_size, seq_len = token_ids.shape

        # Create one-hot encodings: (batch, seq_len, vocab_size)
        one_hot = np.zeros((batch_size, seq_len, self.vocab_size))

        # Set the appropriate positions to 1
        for b in range(batch_size):
            for t in range(seq_len):
                one_hot[b, t, token_ids[b, t]] = 1.0

        # Matrix multiply: (batch, seq_len, vocab_size) @ (vocab_size, embed_dim)
        # = (batch, seq_len, embed_dim)
        embeddings = one_hot @ self.W

        return embeddings

# Test it
vocab_size, embed_dim = 10, 4
scratch_emb = ScratchEmbedding(vocab_size, embed_dim)

# Sample tokens: batch of 2, sequence length 3
token_ids = np.array([[3, 7, 1],
                      [5, 3, 9]])

result = scratch_emb(token_ids)
print(f"Token IDs shape: {token_ids.shape}")
print(f"Embeddings shape: {result.shape}")
print(f"\nToken 3's embedding (row 3 of W):")
print(f"  From lookup: {result[0, 0]}")
print(f"  Direct W[3]: {scratch_emb.W[3]}")
print(f"  Match: {np.allclose(result[0, 0], scratch_emb.W[3])}")
Token IDs shape: (2, 3)
Embeddings shape: (2, 3, 4)

Token 3's embedding (row 3 of W):
  From lookup: [-0.01596401  0.02334137 -0.00067434  0.01795087]
  Direct W[3]: [-0.01596401  0.02334137 -0.00067434  0.01795087]
  Match: True

The one-hot multiplication selects exactly one row from W. Watch the math:

# Visualize the one-hot multiplication
token_id = 3
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

print("One-hot vector for token 3:")
print(f"  {one_hot}")
print(f"\nWeight matrix W (10 x 4):")
print(f"  Row 0: {scratch_emb.W[0]}")
print(f"  Row 1: {scratch_emb.W[1]}")
print(f"  Row 2: {scratch_emb.W[2]}")
print(f"  Row 3: {scratch_emb.W[3]}  <-- selected")
print(f"  ...")
print(f"\none_hot @ W = {one_hot @ scratch_emb.W}")
print(f"W[3] directly = {scratch_emb.W[3]}")
One-hot vector for token 3:
  [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]

Weight matrix W (10 x 4):
  Row 0: [-0.02989842  0.04040301 -0.0013854  -0.03810892]
  Row 1: [-0.03675501  0.00448419 -0.0115205  -0.00566359]
  Row 2: [-0.02009978  0.02561311  0.03978262  0.0102098 ]
  Row 3: [-0.01596401  0.02334137 -0.00067434  0.01795087]  <-- selected
  ...

one_hot @ W = [-0.01596401  0.02334137 -0.00067434  0.01795087]
W[3] directly = [-0.01596401  0.02334137 -0.00067434  0.01795087]
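The double loop in ScratchEmbedding is written for clarity, not speed. A possible vectorized variant is sketched below: indexing an identity matrix builds the one-hot tensor in one step, and indexing W directly skips it altogether. The weights here are freshly sampled for the example, so the values differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4
W = rng.normal(scale=0.02, size=(vocab_size, embed_dim))
token_ids = np.array([[3, 7, 1],
                      [5, 3, 9]])

# Option 1: build the one-hot tensor without Python loops.
# Row i of the identity matrix is the one-hot vector for token i.
one_hot = np.eye(vocab_size)[token_ids]   # (2, 3, 10)
via_matmul = one_hot @ W                  # (2, 3, 4)

# Option 2: skip the one-hot tensor and index rows of W directly.
via_indexing = W[token_ids]               # (2, 3, 4)

print(f"Shapes: {via_matmul.shape} and {via_indexing.shape}")
print(f"Match: {np.allclose(via_matmul, via_indexing)}")
```

Option 2 is the lookup-table view: it never allocates a (batch, seq_len, vocab_size) tensor, which matters once vocab_size reaches the tens of thousands.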

PyTorch’s nn.Embedding

PyTorch provides the same functionality but optimized - it skips creating the one-hot vector entirely:

# PyTorch equivalent
torch_emb = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim)

# Copy weights from scratch version for comparison
with torch.no_grad():
    torch_emb.weight.copy_(torch.from_numpy(scratch_emb.W).float())

# Same token IDs
token_ids_torch = torch.tensor(token_ids)
result_torch = torch_emb(token_ids_torch)

print(f"Scratch result (token 3): {result[0, 0]}")
print(f"PyTorch result (token 3): {result_torch[0, 0].detach().numpy()}")
print(f"Match: {np.allclose(result, result_torch.detach().numpy())}")
Scratch result (token 3): [-0.01596401  0.02334137 -0.00067434  0.01795087]
PyTorch result (token 3): [-0.01596401  0.02334137 -0.00067434  0.01795087]
Match: True
Tip: Key Insight

nn.Embedding is just an optimized lookup - no one-hot materialization. But mathematically, it’s identical to one-hot times weight matrix. Understanding this helps when debugging gradient flow.

Making Lookups Differentiable

How do gradients flow through an embedding lookup? The answer comes directly from the matrix multiplication view.

From Scratch: Gradient Flow

When we compute output = one_hot @ W, the gradient with respect to W follows standard matrix calculus:

dL/dW = one_hot.T @ dL/doutput

Only the selected rows receive gradients. If we looked up tokens [3, 7, 1], only rows 3, 7, and 1 of W get updated during training.

class ScratchEmbeddingWithGrad:
    """Embedding with gradient computation."""

    def __init__(self, vocab_size: int, embed_dim: int):
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.W = np.random.randn(vocab_size, embed_dim) * 0.02
        self.grad_W = None
        self._last_one_hot = None  # Store for backward

    def forward(self, token_ids: np.ndarray) -> np.ndarray:
        batch_size, seq_len = token_ids.shape

        # Create one-hot: (batch, seq_len, vocab_size)
        one_hot = np.zeros((batch_size, seq_len, self.vocab_size))
        for b in range(batch_size):
            for t in range(seq_len):
                one_hot[b, t, token_ids[b, t]] = 1.0

        self._last_one_hot = one_hot
        return one_hot @ self.W

    def backward(self, grad_output: np.ndarray):
        """
        grad_output: shape (batch, seq_len, embed_dim) - gradient from next layer
        """
        # dL/dW = one_hot.T @ grad_output
        # Transpose for per-example matmul: (vocab_size, seq_len) @ (seq_len, embed_dim)
        one_hot_T = self._last_one_hot.transpose(0, 2, 1)  # (batch, vocab, seq_len)

        # Accumulate gradients across batch
        self.grad_W = np.zeros_like(self.W)
        for b in range(grad_output.shape[0]):
            self.grad_W += one_hot_T[b] @ grad_output[b]

        return self.grad_W

# Demonstrate gradient flow
emb = ScratchEmbeddingWithGrad(vocab_size=10, embed_dim=4)
token_ids = np.array([[3, 7, 1]])  # batch=1, seq_len=3

# Forward
output = emb.forward(token_ids)

# Simulate gradient from loss (all ones for simplicity)
grad_from_loss = np.ones_like(output)

# Backward
grad_W = emb.backward(grad_from_loss)

print("Gradient magnitude per row of W:")
for i in range(10):
    magnitude = np.abs(grad_W[i]).sum()
    marker = " <-- used" if i in [3, 7, 1] else ""
    print(f"  Row {i}: {magnitude:.4f}{marker}")

print("\nOnly rows 1, 3, 7 received gradients!")
Gradient magnitude per row of W:
  Row 0: 0.0000
  Row 1: 4.0000 <-- used
  Row 2: 0.0000
  Row 3: 4.0000 <-- used
  Row 4: 0.0000
  Row 5: 0.0000
  Row 6: 0.0000
  Row 7: 4.0000 <-- used
  Row 8: 0.0000
  Row 9: 0.0000

Only rows 1, 3, 7 received gradients!
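The example above uses each token once. When a token repeats within a batch, its row accumulates one gradient contribution per occurrence. The sketch below shows this with the one-hot formula and with an equivalent scatter-add via np.add.at; the all-ones gradient is a stand-in for a real upstream gradient.

```python
import numpy as np

vocab_size, embed_dim = 10, 4
token_ids = np.array([[3, 7, 3]])            # token 3 appears twice
grad_output = np.ones((1, 3, embed_dim))     # stand-in upstream gradient

flat_ids = token_ids.reshape(-1)             # (3,)
flat_grad = grad_output.reshape(-1, embed_dim)  # (3, 4)

# One-hot route: dL/dW = one_hot.T @ grad_output, flattened over batch and sequence
one_hot = np.eye(vocab_size)[flat_ids]       # (3, 10)
grad_dense = one_hot.T @ flat_grad           # (10, 4)

# Scatter-add route: add each position's gradient into its token's row.
# np.add.at accumulates on repeated indices; `grad[flat_ids] += flat_grad` would not.
grad_scatter = np.zeros((vocab_size, embed_dim))
np.add.at(grad_scatter, flat_ids, flat_grad)

print(f"Row 3 gradient sum: {grad_dense[3].sum()}")   # two occurrences
print(f"Row 7 gradient sum: {grad_dense[7].sum()}")   # one occurrence
print(f"Routes match: {np.allclose(grad_dense, grad_scatter)}")
```

The scatter-add form is closer to what an optimized embedding backward pass does, since it never builds the one-hot tensor.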

PyTorch: Automatic Gradient Tracking

PyTorch handles this automatically when requires_grad=True:

# PyTorch does this automatically
torch_emb = nn.Embedding(10, 4)

token_ids = torch.tensor([[3, 7, 1]])
output = torch_emb(token_ids)

# Fake loss: sum of embeddings
loss = output.sum()
loss.backward()

print("PyTorch gradient magnitude per row:")
for i in range(10):
    magnitude = torch_emb.weight.grad[i].abs().sum().item()
    marker = " <-- used" if i in [3, 7, 1] else ""
    print(f"  Row {i}: {magnitude:.4f}{marker}")
PyTorch gradient magnitude per row:
  Row 0: 0.0000
  Row 1: 4.0000 <-- used
  Row 2: 0.0000
  Row 3: 4.0000 <-- used
  Row 4: 0.0000
  Row 5: 0.0000
  Row 6: 0.0000
  Row 7: 4.0000 <-- used
  Row 8: 0.0000
  Row 9: 0.0000
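Note: Sparse Updates

This “sparse gradient” property is part of why embedding layers can have millions of parameters but train efficiently: each batch produces nonzero gradients for only a small subset of rows. By default PyTorch still stores the gradient as a dense tensor with zeros in the unused rows; passing sparse=True to nn.Embedding makes it emit a sparse gradient instead.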

Positional Information

Attention (without positional embeddings) is permutation-equivariant: if you reorder the input tokens, the attention scores simply reorder to match. The relationship between “cat” and “sat” is the same regardless of whether they’re at positions [0,1] or [1,0]. This means the model can’t distinguish “the cat sat” from “sat the cat” — a critical limitation since word order carries meaning.

Position embeddings solve this by giving each position a learnable vector that gets added to the token embedding.

From Scratch: Learnable Position Embeddings

Position embeddings are just another lookup table, indexed by position instead of token ID:

class ScratchPositionEmbedding:
    """Learnable position embeddings - same as token embeddings but indexed by position."""

    def __init__(self, max_seq_len: int, embed_dim: int):
        self.max_seq_len = max_seq_len
        self.embed_dim = embed_dim
        # Each position gets its own learnable vector
        self.W = np.random.randn(max_seq_len, embed_dim) * 0.02

    def __call__(self, seq_len: int) -> np.ndarray:
        """
        seq_len: how many positions to return
        returns: shape (seq_len, embed_dim)
        """
        # Just slice the first seq_len positions
        return self.W[:seq_len]


class ScratchCombinedEmbedding:
    """Token embeddings + position embeddings."""

    def __init__(self, vocab_size: int, embed_dim: int, max_seq_len: int):
        self.token_emb = ScratchEmbedding(vocab_size, embed_dim)
        self.pos_emb = ScratchPositionEmbedding(max_seq_len, embed_dim)

    def __call__(self, token_ids: np.ndarray) -> np.ndarray:
        """
        token_ids: shape (batch, seq_len)
        returns: shape (batch, seq_len, embed_dim)
        """
        batch_size, seq_len = token_ids.shape

        # Get token embeddings: (batch, seq_len, embed_dim)
        tok_emb = self.token_emb(token_ids)

        # Get position embeddings: (seq_len, embed_dim)
        pos_emb = self.pos_emb(seq_len)

        # Add position embeddings (broadcasts over batch dimension)
        return tok_emb + pos_emb

# Test combined embedding
combined = ScratchCombinedEmbedding(vocab_size=100, embed_dim=8, max_seq_len=32)

# Same token (ID=42) at different positions
tokens = np.array([[42, 42, 42, 42]])  # Same token, 4 positions
embeddings = combined(tokens)

print("Same token (42) at different positions:")
for pos in range(4):
    print(f"  Position {pos}: {embeddings[0, pos, :4]}...")

print("\nAll different due to position embeddings!")
Same token (42) at different positions:
  Position 0: [ 0.02047942 -0.01433417 -0.00140515 -0.00449191]...
  Position 1: [-0.02315795 -0.00529424 -0.02574222  0.01374025]...
  Position 2: [-0.01883963 -0.05258541 -0.01737721  0.03494184]...
  Position 3: [-0.00709432 -0.02265589  0.00666619  0.05086564]...

All different due to position embeddings!

Why Position Matters: Attention is Permutation-Equivariant

Without position embeddings, attention treats tokens as an unordered set: reordering the input only reorders the scores.

# Demonstrate permutation equivariance
def simple_attention_scores(embeddings):
    """Compute raw attention scores (Q @ K.T) without position."""
    # In real attention, Q = emb @ W_q, K = emb @ W_k
    # For simplicity, use embeddings directly
    return embeddings @ embeddings.T

# Create two orderings of the same tokens
token_emb = ScratchEmbedding(vocab_size=10, embed_dim=8)

# "the cat sat" = tokens [1, 5, 7]
order1 = np.array([[1, 5, 7]])
# "sat the cat" = tokens [7, 1, 5]
order2 = np.array([[7, 1, 5]])

emb1 = token_emb(order1)[0]  # (3, 8)
emb2 = token_emb(order2)[0]  # (3, 8)

scores1 = simple_attention_scores(emb1)
scores2 = simple_attention_scores(emb2)

# Pass data to OJS
ojs_define(
    attention_scores1=scores1.tolist(),
    attention_scores2=scores2.tolist()
)
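The two score matrices can also be compared numerically without the visualization. The sketch below is self-contained and uses its own randomly sampled table W (an assumption, standing in for token_emb.W) to check that permuting the tokens permutes the rows and columns of the score matrix in the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(10, 8))   # hypothetical embedding table

order1 = np.array([1, 5, 7])               # "the cat sat"
perm = np.array([2, 0, 1])                 # reorders to [7, 1, 5] = "sat the cat"
order2 = order1[perm]

scores1 = W[order1] @ W[order1].T          # (3, 3)
scores2 = W[order2] @ W[order2].T          # (3, 3)

# Equivariance: scores2[i, j] == scores1[perm[i], perm[j]]
equivariant = np.allclose(scores2, scores1[np.ix_(perm, perm)])

# The set of score values is unchanged, so nothing in them encodes word order.
same_values = np.allclose(np.sort(scores1, axis=None), np.sort(scores2, axis=None))

print(f"Scores permute with the tokens: {equivariant}")
print(f"Same values in both orderings:  {same_values}")
```

Adding a position-dependent vector to each row before computing the scores breaks this symmetry, which is the point of positional embeddings.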

PyTorch: Combined Token + Position Embedding

class PyTorchCombinedEmbedding(nn.Module):
    """Standard transformer embedding: token + position."""

    def __init__(self, vocab_size: int, embed_dim: int, max_seq_len: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_seq_len, embed_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len = token_ids.shape

        # Token embeddings
        tok_emb = self.token_emb(token_ids)

        # Position embeddings (create position indices)
        positions = torch.arange(seq_len, device=token_ids.device)
        pos_emb = self.pos_emb(positions)

        return tok_emb + pos_emb

# Compare scratch vs PyTorch
pytorch_combined = PyTorchCombinedEmbedding(vocab_size=100, embed_dim=8, max_seq_len=32)

tokens_torch = torch.tensor([[42, 42, 42, 42]])
embeddings_torch = pytorch_combined(tokens_torch)

print("PyTorch: Same token (42) at different positions:")
for pos in range(4):
    print(f"  Position {pos}: {embeddings_torch[0, pos, :4].tolist()}")
PyTorch: Same token (42) at different positions:
  Position 0: [-0.1875818967819214, -1.412695050239563, -0.3101547360420227, 1.7676509618759155]
  Position 1: [-0.08385199308395386, 0.3031729459762573, 1.5947062969207764, 0.42350661754608154]
  Position 2: [0.8799778819084167, 1.5273566246032715, 0.5429931879043579, -1.7162342071533203]
  Position 3: [0.9340940713882446, 2.3949697017669678, 0.48528745770454407, 1.65337073802948]
Tip: Key Insight

Position embeddings are just another embedding table - they work identically to token embeddings but are indexed by position. The “magic” is simply: final = token_emb[token_id] + pos_emb[position].

The Math

Token Embeddings

Simple lookup table: E[token_id] = embedding_vector

Mathematically equivalent to one-hot multiplication:

one_hot = [0, 0, 0, 1, 0, ...]  # 1 at position token_id
embedding = one_hot @ E         # selects row token_id from E

Positional Embeddings

Several approaches encode position:

1. Learned positional embeddings (GPT-2, BERT):

# Position table: (max_seq_len, embed_dim)
P = torch.randn(max_seq_len, embed_dim)
positions = P[:seq_len]  # Get positions for current sequence

Each position gets a trainable vector. Simple and effective, but cannot generalize to positions beyond max_seq_len.

2. Sinusoidal positional embeddings (original Transformer):

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

This creates a unique pattern for each position using waves of different frequencies. The key insight: PE(pos+k) equals a linear transformation of PE(pos), so the model can learn relative positions directly.

3. Rotary Position Embedding - RoPE (LLaMA, Mistral): Rather than adding position embeddings to token embeddings, RoPE rotates the query and key vectors based on position. The rotation angle depends on both position and dimension, so the dot product between a rotated query and a rotated key depends only on their relative offset. RoPE tends to handle longer contexts better than learned absolute embeddings, although going far beyond the training length usually requires extra techniques such as rescaling the rotation frequencies.

4. ALiBi - Attention with Linear Biases (BLOOM): Instead of adding position information to embeddings, ALiBi adds a position-dependent bias directly to the attention scores. The bias is a penalty that grows linearly with the distance between query and key, so closer tokens get higher scores. Because the bias is applied inside attention, the embedding layer carries no position information at all.
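Approaches 3 and 4 can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: rope_rotate pairs adjacent dimensions (real implementations differ in how they pair them), alibi_bias handles a single causal head, and the slope value is arbitrary. Both function names are made up for this example.

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate each pair (x[2i], x[2i+1]) by the angle pos * base**(-2i/d)."""
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]
    angles = pos * base ** (-np.arange(0, d, 2) / d)   # one angle per pair
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

def alibi_bias(seq_len: int, slope: float) -> np.ndarray:
    """Causal ALiBi bias for one head: -slope * (query_pos - key_pos)."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    bias = -slope * (i - j).astype(float)
    bias[j > i] = -np.inf             # future keys are masked out
    return bias

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (4) at two different absolute positions gives the same score.
s1 = rope_rotate(q, 3) @ rope_rotate(k, 7)
s2 = rope_rotate(q, 10) @ rope_rotate(k, 14)
print(f"RoPE scores at offset 4: {s1:.6f} vs {s2:.6f}")

print(alibi_bias(4, slope=0.5))
```

In both cases position enters through the attention computation, so neither method needs a position table of size max_seq_len × embed_dim.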

Combined Embeddings

# Input: token_ids of shape (batch, seq_len)
token_emb = token_embedding[token_ids]     # (batch, seq_len, embed_dim)
pos_emb = position_embedding[:seq_len]      # (seq_len, embed_dim)
x = token_emb + pos_emb                     # (batch, seq_len, embed_dim)

Same Token, Different Positions

The same token (“the”) appears at multiple positions in a sentence. Even though it has the same token embedding, the final embedding differs because of position.

Tip: Try This
  1. Same position: Set both positions to the same value (e.g., 0 and 0). The similarity becomes 1.0 (identical).

  2. Adjacent positions: Compare positions 0 and 1. They are very similar (>0.95) because position embeddings change gradually.

  3. Distant positions: Compare positions 0 and 5. The similarity drops because position embeddings diverge.

The key insight: same token ID + different position = different final embedding. This is how the model knows word order matters.

Code Walkthrough

Let’s explore embeddings interactively:

import torch
import torch.nn as nn
import numpy as np

print(f"PyTorch version: {torch.__version__}")
PyTorch version: 2.10.0+cu128

Token Embeddings Basics

A token embedding is just a lookup table: token ID -> vector

# Create a simple token embedding
vocab_size = 100
embed_dim = 32

# nn.Embedding is PyTorch's lookup table
token_emb = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim)

print(f"Vocabulary size: {vocab_size}")
print(f"Embedding dimension: {embed_dim}")
print(f"Total parameters: {vocab_size * embed_dim:,}")
print(f"Embedding table shape: {token_emb.weight.shape}")
Vocabulary size: 100
Embedding dimension: 32
Total parameters: 3,200
Embedding table shape: torch.Size([100, 32])
# Look up embeddings for some tokens
token_ids = torch.tensor([[5, 10, 15, 20]])
embeddings = token_emb(token_ids)

print(f"Input token IDs: {token_ids[0].tolist()}")
print(f"Output shape: {tuple(embeddings.shape)}")
print(f"\nToken 5's embedding (first 8 dims):")
print(f"  {embeddings[0, 0, :8].tolist()}")
Input token IDs: [5, 10, 15, 20]
Output shape: (1, 4, 32)

Token 5's embedding (first 8 dims):
  [0.35363224148750305, -0.6603528261184692, -0.47760874032974243, 1.9065693616867065, 0.05114458128809929, -1.4478737115859985, 1.489136815071106, 0.17601217329502106]
# Same token always gets the same embedding
e1 = token_emb(torch.tensor([[42]]))
e2 = token_emb(torch.tensor([[42]]))

print(f"Token 42 embedding (call 1): {e1[0, 0, :4].tolist()}")
print(f"Token 42 embedding (call 2): {e2[0, 0, :4].tolist()}")
print(f"Equal: {torch.allclose(e1, e2)}")
Token 42 embedding (call 1): [-0.10501593351364136, 1.1953333616256714, 0.7824326753616333, 0.2594313621520996]
Token 42 embedding (call 2): [-0.10501593351364136, 1.1953333616256714, 0.7824326753616333, 0.2594313621520996]
Equal: True

Sinusoidal Positional Encoding

The original Transformer uses sin/cos functions to encode position. The key idea is to create a unique “fingerprint” for each position using waves of different frequencies:

  • Low-frequency components (high dimensions): Change slowly across positions, capturing coarse position
  • High-frequency components (low dimensions): Change rapidly, capturing fine-grained position

This is analogous to how a Fourier series represents a periodic function as a sum of sines and cosines:

import math

def create_sinusoidal_encoding(max_seq_len: int, embed_dim: int) -> torch.Tensor:
    """Create sinusoidal positional encoding matrix."""
    pe = torch.zeros(max_seq_len, embed_dim)
    position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)

    # Compute div_term: 1 / 10000^(2i/d) = exp(2i * -log(10000) / d)
    div_term = torch.exp(
        torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim)
    )

    # Apply sin to even indices, cos to odd indices
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)

    return pe

# Create positional encoding
pe = create_sinusoidal_encoding(max_seq_len=128, embed_dim=64)
print(f"Positional encoding shape: {pe.shape}")
Positional encoding shape: torch.Size([128, 64])
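To make the frequency claim concrete, the wavelength of each sin/cos pair can be computed from the formula: pair i oscillates with period 2π · 10000^(2i/d) positions. The helper name pair_wavelengths below is hypothetical.

```python
import math

def pair_wavelengths(embed_dim: int, base: float = 10000.0):
    """Wavelength in positions of each sin/cos pair i = 0 .. embed_dim/2 - 1."""
    return [2 * math.pi * base ** (2 * i / embed_dim) for i in range(embed_dim // 2)]

wavelengths = pair_wavelengths(64)
first, last = wavelengths[0], wavelengths[-1]
print(f"Pair 0  (dims 0, 1):   one cycle every {first:.2f} positions")
print(f"Pair 31 (dims 62, 63): one cycle every {last:.0f} positions")
```

The wavelengths form a geometric progression from 2π up to just under 2π · 10000, so the lowest dimensions distinguish neighbouring positions while the highest ones barely change over a 128-token sequence.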
# Prepare data for OJS visualization
pe_heatmap_data = pe[:50].numpy().tolist()

# Extract specific dimensions for line plot
pe_dim_lines = {
    'dim0': pe[:50, 0].numpy().tolist(),
    'dim1': pe[:50, 1].numpy().tolist(),
    'dim10': pe[:50, 10].numpy().tolist(),
    'dim11': pe[:50, 11].numpy().tolist(),
    'dim30': pe[:50, 30].numpy().tolist(),
    'dim31': pe[:50, 31].numpy().tolist()
}

# Compute cosine similarity between all position pairs
pe_subset = pe[:20]
pe_norm = pe_subset / pe_subset.norm(dim=1, keepdim=True)
similarity = pe_norm @ pe_norm.T
pe_similarity_data = similarity.numpy().tolist()

ojs_define(
    pe_heatmap=pe_heatmap_data,
    pe_lines=pe_dim_lines,
    pe_similarity=pe_similarity_data
)

Notice that:

  1. Nearby positions are similar: Positions 5 and 6 are more similar than positions 5 and 15
  2. The pattern is symmetric: sim(i, j) = sim(j, i)
  3. Each position is unique: No two positions have identical encodings

The sinusoidal encoding also has a key mathematical property: for any fixed offset k, the encoding PE(pos+k) can be expressed as a linear transformation of PE(pos). This helps the model learn relative positions (e.g., “this token is 3 positions before that token”).
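This property follows from the angle-addition identities, and it can be checked numerically. The sketch below builds a block-diagonal matrix for a fixed offset k and confirms that it maps PE(pos) to PE(pos+k) for several positions. It is a standalone NumPy version with hypothetical helper names (sinusoidal_pe, shift_matrix), using the same sin-even, cos-odd layout as create_sinusoidal_encoding.

```python
import numpy as np

def sinusoidal_pe(pos: float, embed_dim: int, base: float = 10000.0) -> np.ndarray:
    """Encoding of one position: sin on even indices, cos on odd indices."""
    freqs = base ** (-np.arange(0, embed_dim, 2) / embed_dim)   # one per pair
    pe = np.empty(embed_dim)
    pe[0::2] = np.sin(pos * freqs)
    pe[1::2] = np.cos(pos * freqs)
    return pe

def shift_matrix(k: float, embed_dim: int, base: float = 10000.0) -> np.ndarray:
    """Block-diagonal M_k such that PE(pos + k) = M_k @ PE(pos) for every pos."""
    freqs = base ** (-np.arange(0, embed_dim, 2) / embed_dim)
    M = np.zeros((embed_dim, embed_dim))
    for i, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        # sin((p+k)w) =  cos(kw) sin(pw) + sin(kw) cos(pw)
        # cos((p+k)w) = -sin(kw) sin(pw) + cos(kw) cos(pw)
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return M

embed_dim, k = 64, 3
M = shift_matrix(k, embed_dim)   # depends on the offset k only, not on pos
checks = [np.allclose(M @ sinusoidal_pe(p, embed_dim), sinusoidal_pe(p + k, embed_dim))
          for p in (0, 5, 40)]
print(f"PE(pos + {k}) == M @ PE(pos) for pos in (0, 5, 40): {checks}")
```

Because M depends only on k, a single linear map can in principle express "k positions later" regardless of where in the sequence it is applied.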

Combined Transformer Embedding

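We add token embeddings and positional embeddings together. The implementation in embeddings.py handles the four points below; the simplified class shown after the list covers only scaling and dropout and keeps nn.Embedding's default N(0, 1) initialization: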

  1. Scaling by sqrt(embed_dim): Token embeddings are multiplied by sqrt(embed_dim) before adding positional embeddings. This prevents the positional signal from dominating when embed_dim is large (since embeddings are typically initialized with small values like std=0.02).

  2. Initialization: We initialize embeddings from a normal distribution with a small standard deviation (0.02). A small initialization keeps early activations and gradients in a modest range.

  3. Padding token handling: The embedding for the padding token (usually ID 0) is set to zeros and excluded from gradient updates.

  4. Dropout: Dropout follows the combination step for regularization.
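Points 2 and 3 in the list above can be illustrated with a scratch table in the style of the earlier NumPy classes. This is a sketch under assumptions, not the code from embeddings.py: the class name PaddedEmbedding, the sgd_step method, and the learning rate are all hypothetical. In PyTorch, the padding_idx argument of nn.Embedding provides comparable behaviour for the padding row.

```python
import numpy as np

class PaddedEmbedding:
    """Hypothetical scratch table: small-std init, zeroed and frozen padding row."""

    def __init__(self, vocab_size: int, embed_dim: int, pad_id: int = 0,
                 std: float = 0.02, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.pad_id = pad_id
        self.W = rng.normal(scale=std, size=(vocab_size, embed_dim))
        self.W[pad_id] = 0.0               # padding contributes nothing

    def forward(self, token_ids: np.ndarray) -> np.ndarray:
        return self.W[token_ids]

    def sgd_step(self, token_ids: np.ndarray, grad_output: np.ndarray, lr: float = 0.1):
        """Plain SGD update that skips the padding row."""
        grad_W = np.zeros_like(self.W)
        np.add.at(grad_W, token_ids.reshape(-1),
                  grad_output.reshape(-1, self.W.shape[1]))
        grad_W[self.pad_id] = 0.0          # exclude padding from updates
        self.W -= lr * grad_W

emb_pad = PaddedEmbedding(vocab_size=20, embed_dim=4)
before = emb_pad.W.copy()

tokens_padded = np.array([[5, 9, 0, 0]])   # two real tokens, two padding tokens
out = emb_pad.forward(tokens_padded)
emb_pad.sgd_step(tokens_padded, np.ones_like(out))

print(f"Padding row still zero: {np.all(emb_pad.W[0] == 0.0)}")
print(f"Token 5 row changed:    {not np.allclose(emb_pad.W[5], before[5])}")
```

Zeroing the row at construction and masking its gradient are both needed: the first makes padding contribute nothing to the forward pass, the second keeps it that way during training.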

class TransformerEmbedding(nn.Module):
    """Combined token + positional embedding."""

    def __init__(self, vocab_size: int, embed_dim: int, max_seq_len: int, dropout: float = 0.1):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        self.position_embedding = nn.Embedding(max_seq_len, embed_dim)
        self.dropout = nn.Dropout(dropout)
        self.scale = math.sqrt(embed_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        seq_len = token_ids.shape[1]

        # Get token embeddings and scale
        token_emb = self.token_embedding(token_ids) * self.scale

        # Get positional embeddings
        positions = torch.arange(seq_len, device=token_ids.device)
        pos_emb = self.position_embedding(positions)

        # Combine and apply dropout
        return self.dropout(token_emb + pos_emb)

# Create embedding layer
emb = TransformerEmbedding(
    vocab_size=1000,
    embed_dim=64,
    max_seq_len=128,
    dropout=0.0  # Disable for visualization
)

# Process some tokens
tokens = torch.randint(0, 1000, (1, 10))
output = emb(tokens)

print(f"Input tokens: {tokens[0].tolist()}")
print(f"Output shape: {tuple(output.shape)}")
Input tokens: [812, 732, 151, 586, 549, 610, 449, 44, 397, 620]
Output shape: (1, 10, 64)
# Show that same token at different positions has different embeddings
# Put token 42 at positions 0, 5, and 9
tokens = torch.tensor([[42, 1, 2, 3, 4, 42, 6, 7, 8, 42]])
output = emb(tokens)

# Get the embeddings for token 42 at each position
pos_0 = output[0, 0].detach()
pos_5 = output[0, 5].detach()
pos_9 = output[0, 9].detach()

print("Token 42 at different positions:")
print(f"  Position 0: {pos_0[:4].tolist()}")
print(f"  Position 5: {pos_5[:4].tolist()}")
print(f"  Position 9: {pos_9[:4].tolist()}")
print(f"\nAll different due to positional encoding!")
Token 42 at different positions:
  Position 0: [-11.515233039855957, -0.3167179822921753, 0.5003924369812012, -0.6841390132904053]
  Position 5: [-9.903129577636719, -0.2589734196662903, 1.3149402141571045, -0.7201709151268005]
  Position 9: [-10.547170639038086, 0.5124090313911438, 1.126982569694519, -1.363930583000183]

All different due to positional encoding!
# Prepare data for OJS visualization
token_only = emb.token_embedding(tokens)[0].detach().numpy()
positions = torch.arange(10)
pos_only = emb.position_embedding(positions).detach().numpy()
combined = output[0].detach().numpy()

ojs_define(
    token_emb_only=token_only.tolist(),
    pos_emb_only=pos_only.tolist(),
    combined_emb=combined.tolist()
)

Embedding Similarity

Embeddings capture meaning - similar tokens should have similar embeddings:

# Let's simulate "training" by manually setting some embeddings to be similar
# In practice, these patterns emerge from training on real text

vocab_size = 20
embed_dim = 16

token_emb = nn.Embedding(vocab_size, embed_dim)

# Manually set some tokens to have similar embeddings
# (simulating what would happen after training on related words)
with torch.no_grad():
    # Tokens 0-4: "numbers" (similar to each other)
    base_number = torch.randn(embed_dim)
    for i in range(5):
        token_emb.weight[i] = base_number + torch.randn(embed_dim) * 0.1

    # Tokens 5-9: "letters" (similar to each other, different from numbers)
    base_letter = torch.randn(embed_dim)
    for i in range(5, 10):
        token_emb.weight[i] = base_letter + torch.randn(embed_dim) * 0.1

# Compute all pairwise similarities
all_embeds = token_emb.weight[:10]
all_embeds_norm = all_embeds / all_embeds.norm(dim=1, keepdim=True)
similarity_matrix = (all_embeds_norm @ all_embeds_norm.T).detach().numpy()

# PCA via SVD for 2D visualization (no sklearn needed)
embeddings_np = all_embeds.detach().numpy()
centered = embeddings_np - embeddings_np.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
emb_2d = U[:, :2] * S[:2]  # Project to 2D
variance_explained = (S**2) / (S**2).sum()

ojs_define(
    token_similarity=similarity_matrix.tolist(),
    pca_coords=emb_2d.tolist(),
    pca_variance=[float(variance_explained[0]), float(variance_explained[1])]
)

print("Notice: Numbers (N) are similar to each other, letters (L) are similar to each other,")
print("but numbers and letters are different from each other.")

2D Visualization with PCA

Interactive Exploration

Explore how sinusoidal position encodings create unique patterns for each position. The key insight: low dimensions change rapidly (high frequency), while high dimensions change slowly (low frequency).

Tip: Try This
  1. Frequency gradient: Look at the heatmap from left to right. Low dimensions (left) have rapid oscillation, high dimensions (right) change slowly.

  2. Adjacent positions: Set positions to 0 and 1. Notice high similarity (≈0.99+). The encodings are almost identical, differing only slightly.

  3. Distant positions: Compare positions 0 and 32. Similarity drops significantly because more dimension-waves have cycled.

  4. Unique fingerprints: Slide through different positions in the line plot. Each position has a unique “fingerprint” pattern.

  5. Sin/Cos pairs: In the line plot, blue dots are sin (even dims), orange dots are cos (odd dims). They’re 90° out of phase.

Exercises

Exercise 1: Compare Learned vs Sinusoidal Positional Embeddings

Learned and sinusoidal embeddings have different tradeoffs:

Aspect          Learned                                        Sinusoidal
Training        Updated via backprop                           Fixed (no parameters)
Extrapolation   Cannot generalize beyond max_seq_len           Can theoretically extrapolate
Memory          Adds parameters                                Zero parameter overhead
Performance     Similar to sinusoidal in the original paper    Good parameter-free baseline
# Compare learned vs sinusoidal positional embeddings
learned = nn.Embedding(50, 32)  # Learned (random initialization)
sinusoidal = create_sinusoidal_encoding(50, 32)  # Fixed pattern

ojs_define(
    learned_pe=learned.weight.detach().numpy().tolist(),
    sinusoidal_pe=sinusoidal.numpy().tolist()
)

print("Learned embeddings start random but are trained to capture position.")
print("Sinusoidal embeddings have a fixed pattern that encodes relative positions.")

Exercise 2: Effect of Embedding Dimension

The embedding dimension affects both model capacity and computational cost. Larger dimensions can represent more nuanced semantic distinctions but require more memory and computation in every layer of the model.

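The sqrt(embed_dim) scaling factor interacts with initialization. It is meant for small initializations (for example std = 0.02 or std = 1/sqrt(embed_dim)), where it lifts token embeddings to a magnitude comparable to the positional signal. The TransformerEmbedding class used here keeps nn.Embedding's default N(0, 1) initialization, so multiplying by sqrt(embed_dim) makes the output variance grow roughly in proportion to embed_dim, as the run below shows.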

# What happens with different embedding dimensions?

for dim in [8, 32, 128, 512]:
    emb = TransformerEmbedding(
        vocab_size=1000,
        embed_dim=dim,
        max_seq_len=128,
        dropout=0.0
    )
    tokens = torch.randint(0, 1000, (1, 32))
    output = emb(tokens)

    # Compute variance of output
    variance = output.var().item()
    print(f"Embed dim {dim:3d}: output variance = {variance:.4f}")

print("\nThe scale factor (sqrt(embed_dim)) helps keep variance stable!")
Embed dim   8: output variance = 9.4354
Embed dim  32: output variance = 31.3584
Embed dim 128: output variance = 125.7513
Embed dim 512: output variance = 509.7528

The scale factor (sqrt(embed_dim)) helps keep variance stable!
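
To see what the scale factor does to the balance between token and positional signals, the sketch below compares root-mean-square magnitudes in plain Python. It assumes a small initialization (std 0.02) and the standard sinusoidal formulation; `rms`, `sinusoidal_encoding`, and `INIT_STD` are hypothetical names for this example and are not taken from the module's TransformerEmbedding.

```python
import math
import random

def rms(values):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def sinusoidal_encoding(pos, dim, base=10000.0):
    """Standard sin/cos positional encoding for a single position."""
    enc = []
    for i in range(0, dim, 2):
        angle = pos / (base ** (i / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc

random.seed(0)
INIT_STD = 0.02  # assumed small initialization

results = {}
for dim in [8, 32, 128, 512]:
    # One randomly initialized token vector, before and after scaling
    token = [random.gauss(0.0, INIT_STD) for _ in range(dim)]
    scaled = [v * math.sqrt(dim) for v in token]
    pos = sinusoidal_encoding(5, dim)
    results[dim] = (rms(token), rms(scaled), rms(pos))
    print(f"dim {dim:3d}: token rms = {rms(token):.3f}, "
          f"scaled = {rms(scaled):.3f}, positional = {rms(pos):.3f}")
```

The positional RMS is sqrt(1/2) ≈ 0.707 at every position, because each sin/cos pair contributes sin² + cos² = 1. An unscaled token vector at std 0.02 is far smaller than that, and the sqrt(embed_dim) factor narrows the gap as the dimension grows.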

Exercise 3: Memory Usage of Embeddings

Embedding memory is a fixed cost to budget for when sizing a model. With weight tying (sharing embeddings between input and output layers), you pay this cost once. Without weight tying, you pay twice.

# Memory usage of embeddings
# (rounded, illustrative configurations rather than exact published model sizes)

configs = [
    {"vocab": 1000, "dim": 64, "name": "Tiny"},
    {"vocab": 8000, "dim": 256, "name": "Small"},
    {"vocab": 32000, "dim": 512, "name": "Medium"},
    {"vocab": 50000, "dim": 768, "name": "Large (GPT-2)"},
    {"vocab": 100000, "dim": 4096, "name": "Large (LLaMA)"},
]

print("Embedding Table Memory Usage:")
print("=" * 60)

for cfg in configs:
    params = cfg["vocab"] * cfg["dim"]
    memory_mb = params * 4 / (1024 * 1024)  # 4 bytes per float32
    memory_fp16 = memory_mb / 2  # fp16/bf16 halves memory
    print(f"{cfg['name']:15s}: {cfg['vocab']:6d} vocab x {cfg['dim']:4d} dim = "
          f"{params:>12,} params ({memory_mb:>7.1f} MB fp32, {memory_fp16:>6.1f} MB fp16)")
Embedding Table Memory Usage:
============================================================
Tiny           :   1000 vocab x   64 dim =       64,000 params (    0.2 MB fp32,    0.1 MB fp16)
Small          :   8000 vocab x  256 dim =    2,048,000 params (    7.8 MB fp32,    3.9 MB fp16)
Medium         :  32000 vocab x  512 dim =   16,384,000 params (   62.5 MB fp32,   31.2 MB fp16)
Large (GPT-2)  :  50000 vocab x  768 dim =   38,400,000 params (  146.5 MB fp32,   73.2 MB fp16)
Large (LLaMA)  : 100000 vocab x 4096 dim =  409,600,000 params ( 1562.5 MB fp32,  781.2 MB fp16)

Note: Modern models typically use fp16 or bf16, which halves the memory requirement. Quantization (int8, int4) can reduce it further.
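
The same arithmetic extends to other precisions and to the tied versus untied case. The helper below is a minimal sketch: `embedding_memory_mb` and `BITS_PER_PARAM` are hypothetical names for this example, and the calculation ignores the small overhead that quantized formats add for scales and zero points.

```python
BITS_PER_PARAM = {"fp32": 32, "fp16/bf16": 16, "int8": 8, "int4": 4}

def embedding_memory_mb(vocab_size, embed_dim, bits, tied=True):
    """Memory for the input table plus, if untied, a separate output matrix."""
    params = vocab_size * embed_dim
    if not tied:
        params *= 2  # separate output projection of the same shape
    return params * bits / 8 / (1024 * 1024)

for name, bits in BITS_PER_PARAM.items():
    tied_mb = embedding_memory_mb(50_000, 768, bits, tied=True)
    untied_mb = embedding_memory_mb(50_000, 768, bits, tied=False)
    print(f"{name:>9s}: tied = {tied_mb:7.1f} MB, untied = {untied_mb:7.1f} MB")
```

The fp32 tied figure reproduces the 146.5 MB shown for the 50,000 × 768 configuration above; each halving of the bit width halves the footprint.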

Weight tying shares the embedding matrix between the input layer and output projection, halving the embedding parameter count:

# Weight tying: share embedding weights with output projection
# GPT-2 ties these weights; LLaMA 2 keeps a separate output matrix

import torch.nn as nn

class SimpleLMWithWeightTying(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # nn.Linear stores its weight as (vocab_size, embed_dim), the same shape
        # as the embedding table, and applies the transpose in the forward pass
        self.output_proj = nn.Linear(embed_dim, vocab_size, bias=False)
        # Tie weights: both layers now share a single parameter tensor
        self.output_proj.weight = self.embedding.weight

    def forward(self, x):
        emb = self.embedding(x)  # (batch, seq, embed_dim)
        logits = self.output_proj(emb)  # (batch, seq, vocab_size)
        return logits

model = SimpleLMWithWeightTying(vocab_size=1000, embed_dim=256)
print(f"Embedding params: {model.embedding.weight.numel():,}")
print(f"Output proj params: {model.output_proj.weight.numel():,}")
print(f"Are weights shared? {model.embedding.weight is model.output_proj.weight}")
Embedding params: 256,000
Output proj params: 256,000
Are weights shared? True
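
Both layers report 256,000 parameters, but the tensor exists only once. The sketch below checks this by counting unique parameters and inspecting the gradient after a backward pass. `TinyTiedLM` is a hypothetical stand-in that mirrors the tied model above so the example runs on its own.

```python
import torch
import torch.nn as nn

class TinyTiedLM(nn.Module):
    """Hypothetical stand-in for a model with tied input/output weights."""
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.output_proj = nn.Linear(embed_dim, vocab_size, bias=False)
        self.output_proj.weight = self.embedding.weight  # tie the two layers

    def forward(self, x):
        return self.output_proj(self.embedding(x))

model = TinyTiedLM(vocab_size=1000, embed_dim=256)

# parameters() yields a shared tensor only once, so this is the true count
total_params = sum(p.numel() for p in model.parameters())
print(f"Unique trainable params: {total_params:,}")

# One backward pass: the lookup and the projection both write into one gradient
logits = model(torch.tensor([[1, 2, 3]]))
logits.sum().backward()
grad = model.embedding.weight.grad
print(f"Gradient shape: {tuple(grad.shape)}")
print(f"Same gradient tensor: {model.output_proj.weight.grad is grad}")
```

Because the gradient is shared, rows for tokens that never appear in the input still receive updates through the output projection.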

Using the Module’s Embeddings

The embeddings.py file contains production-ready embedding classes:

from embeddings import (
    TokenEmbedding,
    LearnedPositionalEmbedding,
    SinusoidalPositionalEmbedding,
    TransformerEmbedding as ModuleTransformerEmbedding,
    demonstrate_embeddings
)

# Run the demonstration
demo_emb = demonstrate_embeddings(
    vocab_size=100,
    embed_dim=32,
    seq_len=8,
    verbose=True
)
============================================================
EMBEDDING DEMONSTRATION
============================================================

Created TransformerEmbedding:
  Vocab size: 100
  Embed dim: 32
  Max seq len: 128

Input token IDs (batch=2, seq=8):
  [71, 85, 68, 38, 60, 50, 50, 75]
  [46, 10, 46, 28, 28, 86, 44, 93]

Output shape: (2, 8, 32)
  (batch_size, seq_len, embed_dim)

First token's embedding (first 8 dims):
  [0.17343755066394806, 0.06077679991722107, 0.1327059119939804, 0.040316108614206314, -0.04842858761548996, 0.036001987755298615, 0.09963357448577881, -0.14788827300071716]

Token embedding lookup is consistent:
  Token 5 (first call):  [0.013927988708019257, -0.0007973717874847353, -0.012984905391931534, 0.041555918753147125]
  Token 5 (second call): [0.013927988708019257, -0.0007973717874847353, -0.012984905391931534, 0.041555918753147125]
  Equal: True

Summary

Key takeaways:

  1. Token embeddings are lookup tables that convert token IDs to vectors
  2. Positional embeddings add information about where tokens are in the sequence
  3. Sinusoidal positional embeddings use fixed sin/cos patterns - no parameters, can theoretically extrapolate
  4. Learned positional embeddings are trained like any other parameter - they performed about the same as sinusoidal in the original Transformer experiments
  5. Modern approaches (RoPE, ALiBi) handle position differently and extrapolate better to long sequences
  6. Similar tokens end up with similar embeddings after training - capturing semantic relationships
  7. Scaling by sqrt(embed_dim) sets the magnitude of token embeddings relative to positional embeddings
  8. Weight tying between input embeddings and output layer is common and reduces parameters

Common Pitfalls

  • Forgetting the sqrt scale: Without this scaling, positional embeddings can dominate or be ignored depending on embed_dim
  • Exceeding max_seq_len: Learned positional embeddings have no entry for positions at or beyond max_seq_len, so longer sequences fail with an index error
  • Ignoring padding: Padding tokens should be zero vectors and excluded from gradients
  • Poor initialization: Large initial values cause training instability; use small std (0.01-0.02)
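
Two of these pitfalls are easy to reproduce with nn.Embedding directly. The sketch below is illustrative only: `PAD_ID` and `MAX_SEQ_LEN` are hypothetical values, and it uses plain nn.Embedding layers on CPU instead of the module's own classes.

```python
import torch
import torch.nn as nn

PAD_ID = 0       # hypothetical padding token ID
MAX_SEQ_LEN = 8  # hypothetical maximum length seen in training

# padding_idx zero-initializes row PAD_ID and keeps its gradient at zero
token_emb = nn.Embedding(10, 4, padding_idx=PAD_ID)
tokens = torch.tensor([[5, 3, PAD_ID, PAD_ID]])
token_emb(tokens).sum().backward()

pad_row = token_emb.weight[PAD_ID]
pad_grad = token_emb.weight.grad[PAD_ID]
print("Padding row all zeros:     ", bool((pad_row == 0).all()))
print("Padding gradient all zeros:", bool((pad_grad == 0).all()))

# A learned positional table has exactly MAX_SEQ_LEN rows (positions 0..7)
pos_emb = nn.Embedding(MAX_SEQ_LEN, 4)
try:
    pos_emb(torch.arange(MAX_SEQ_LEN + 1))  # position 8 has no row
    overflow_error = None
except IndexError as err:
    overflow_error = err
print("Lookup past max_seq_len raised:", type(overflow_error).__name__)
```

Re-initializing the whole weight after construction (for example with nn.init.normal_) also overwrites the zero padding row, so that row would need to be zeroed again.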

Going Deeper

What’s Next

Module 05: Attention covers the core mechanism that allows tokens to “look at” each other. Attention operates on embedding vectors. Position information encoded here becomes essential for computing relationships between tokens.