Module 04: Embeddings

Introduction

Converting token IDs into dense vectors. Embeddings give meaning to numbers - similar tokens end up close together in vector space.

Embeddings are learned vector representations. Instead of treating token ID 42 as just the number 42, we give it a rich vector like [0.2, -0.5, 0.8, ...] that captures its meaning.

Why embeddings matter for LLMs:

  • Similarity: Similar words have similar vectors (“cat” and “dog” are close)
  • Composition: Vectors can be combined meaningfully
  • Learning: The model learns these representations during training
  • Position: We also embed WHERE tokens are in the sequence

Two types of embeddings in transformers:

  1. Token embeddings: What the token means
  2. Positional embeddings: Where the token is in the sequence

What You’ll Learn

By the end of this module, you will be able to:

  • Explain how embedding lookups work as matrix multiplication
  • Implement token and positional embeddings from scratch
  • Understand how gradients flow through embedding layers
  • Choose between learned and sinusoidal positional embeddings
  • Recognize the role of embeddings in the transformer architecture

Memory and Scale Considerations

Embeddings are often the largest single component of a language model. The parameter count is simply vocab_size x embed_dim:

Model         Vocab Size   Embed Dim   Embedding Params   Memory (fp32)
GPT-2 Small   50,257       768         38.6M              147 MB
LLaMA 7B      32,000       4,096       131M               500 MB
LLaMA 70B     32,000       8,192       262M               1 GB

This is why vocabulary size is a critical design decision. A larger vocabulary means each token carries more information (fewer tokens per text), but the embedding table grows proportionally.
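A quick sanity check of the numbers in the table above (parameters = vocab_size x embed_dim, 4 bytes per fp32 parameter):

# Verify the table: parameters = vocab_size * embed_dim, fp32 = 4 bytes per parameter
for name, vocab, dim in [("GPT-2 Small", 50_257, 768),
                         ("LLaMA 7B",    32_000, 4_096),
                         ("LLaMA 70B",   32_000, 8_192)]:
    params = vocab * dim
    print(f"{name:12s}: {params / 1e6:6.1f}M params, {params * 4 / 2**20:6.0f} MB fp32")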

Intuition: Coordinates in Meaning Space

Think of embeddings as coordinates in “meaning space”:

Token: "cat"  -> [0.8, 0.1, 0.9, ...]   <- captures "animal", "pet", etc.
Token: "dog"  -> [0.7, 0.2, 0.8, ...]   <- similar to cat
Token: "code" -> [0.1, 0.9, 0.2, ...]   <- very different

Distance("cat", "dog") < Distance("cat", "code")

Positional embeddings add “where in the sequence” information:

Position 0: [1.0, 0.0, 0.5, ...]   <- "I'm first"
Position 1: [0.9, 0.1, 0.4, ...]   <- "I'm second"
Position 2: [0.8, 0.2, 0.3, ...]   <- "I'm third"

The model combines both:

Final embedding = Token embedding + Positional embedding
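To make the distance claim concrete, here is a toy check using the made-up vectors sketched above, truncated to three dimensions (real embeddings are learned during training, not hand-set):

import numpy as np

# Toy, hand-set vectors from the sketch above
cat  = np.array([0.8, 0.1, 0.9])
dog  = np.array([0.7, 0.2, 0.8])
code = np.array([0.1, 0.9, 0.2])

def dist(a, b):
    return np.linalg.norm(a - b)

print(f"Distance(cat, dog)  = {dist(cat, dog):.3f}")   # small: similar meanings
print(f"Distance(cat, code) = {dist(cat, code):.3f}")  # large: different meanings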

Embedding Architecture

Here’s how embeddings work in a transformer. Step through the pipeline to see how token IDs become rich vector representations:

Tip: Try This

Use the slider to step through the embedding pipeline. Notice how token IDs become vectors through table lookup, then get combined with position information.

Embeddings Are Just Lookup Tables

The key insight: embedding lookup is sparse matrix multiplication. When we say “look up embedding for token 3”, we’re actually doing:

  1. Create a one-hot vector: token 3 in vocab of 10 becomes [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
  2. Multiply by the weight matrix: one_hot @ W
  3. Only one row of W participates (the row for token 3)

This isn’t just conceptual - it’s exactly what happens mathematically. PyTorch optimizes this by skipping the one-hot creation, but understanding the matrix multiplication view is essential for understanding gradients.

From Scratch: One-Hot Embedding Lookup

Let’s build an embedding layer using explicit one-hot vectors and matrix multiplication:

import numpy as np
import torch
import torch.nn as nn

class ScratchEmbedding:
    """Embedding layer using explicit one-hot multiplication.

    This shows what's really happening: embedding lookup is
    just sparse matrix multiplication.
    """

    def __init__(self, vocab_size: int, embed_dim: int):
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        # Weight matrix: each row is the embedding for a token
        self.W = np.random.randn(vocab_size, embed_dim) * 0.02

    def __call__(self, token_ids: np.ndarray) -> np.ndarray:
        """
        token_ids: shape (batch, seq_len) - integer token IDs
        returns: shape (batch, seq_len, embed_dim)
        """
        batch_size, seq_len = token_ids.shape

        # Create one-hot encodings: (batch, seq_len, vocab_size)
        one_hot = np.zeros((batch_size, seq_len, self.vocab_size))

        # Set the appropriate positions to 1
        for b in range(batch_size):
            for t in range(seq_len):
                one_hot[b, t, token_ids[b, t]] = 1.0

        # Matrix multiply: (batch, seq_len, vocab_size) @ (vocab_size, embed_dim)
        # = (batch, seq_len, embed_dim)
        embeddings = one_hot @ self.W

        return embeddings

# Test it
vocab_size, embed_dim = 10, 4
scratch_emb = ScratchEmbedding(vocab_size, embed_dim)

# Sample tokens: batch of 2, sequence length 3
token_ids = np.array([[3, 7, 1],
                      [5, 3, 9]])

result = scratch_emb(token_ids)
print(f"Token IDs shape: {token_ids.shape}")
print(f"Embeddings shape: {result.shape}")
print(f"\nToken 3's embedding (row 3 of W):")
print(f"  From lookup: {result[0, 0]}")
print(f"  Direct W[3]: {scratch_emb.W[3]}")
print(f"  Match: {np.allclose(result[0, 0], scratch_emb.W[3])}")

The one-hot multiplication selects exactly one row from W. Watch the math:

# Visualize the one-hot multiplication
token_id = 3
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

print("One-hot vector for token 3:")
print(f"  {one_hot}")
print(f"\nWeight matrix W (10 x 4):")
print(f"  Row 0: {scratch_emb.W[0]}")
print(f"  Row 1: {scratch_emb.W[1]}")
print(f"  Row 2: {scratch_emb.W[2]}")
print(f"  Row 3: {scratch_emb.W[3]}  <-- selected")
print(f"  ...")
print(f"\none_hot @ W = {one_hot @ scratch_emb.W}")
print(f"W[3] directly = {scratch_emb.W[3]}")

PyTorch’s nn.Embedding

PyTorch provides the same functionality but optimized - it skips creating the one-hot vector entirely:

# PyTorch equivalent
torch_emb = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim)

# Copy weights from scratch version for comparison
with torch.no_grad():
    torch_emb.weight.copy_(torch.from_numpy(scratch_emb.W).float())

# Same token IDs
token_ids_torch = torch.tensor(token_ids)
result_torch = torch_emb(token_ids_torch)

print(f"Scratch result (token 3): {result[0, 0]}")
print(f"PyTorch result (token 3): {result_torch[0, 0].detach().numpy()}")
print(f"Match: {np.allclose(result, result_torch.detach().numpy())}")

Tip: Key Insight

nn.Embedding is just an optimized lookup - no one-hot materialization. But mathematically, it’s identical to one-hot times weight matrix. Understanding this helps when debugging gradient flow.

Making Lookups Differentiable

How do gradients flow through an embedding lookup? The answer comes directly from the matrix multiplication view.

From Scratch: Gradient Flow

When we compute output = one_hot @ W, the gradient with respect to W follows standard matrix calculus:

dL/dW = one_hot.T @ dL/doutput

This means only the selected rows receive gradients. If we looked up tokens [3, 7, 1], only rows 3, 7, and 1 of W get updated during training.

class ScratchEmbeddingWithGrad:
    """Embedding with gradient computation."""

    def __init__(self, vocab_size: int, embed_dim: int):
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.W = np.random.randn(vocab_size, embed_dim) * 0.02
        self.grad_W = None
        self._last_one_hot = None  # Store for backward

    def forward(self, token_ids: np.ndarray) -> np.ndarray:
        batch_size, seq_len = token_ids.shape

        # Create one-hot: (batch, seq_len, vocab_size)
        one_hot = np.zeros((batch_size, seq_len, self.vocab_size))
        for b in range(batch_size):
            for t in range(seq_len):
                one_hot[b, t, token_ids[b, t]] = 1.0

        self._last_one_hot = one_hot
        return one_hot @ self.W

    def backward(self, grad_output: np.ndarray):
        """
        grad_output: shape (batch, seq_len, embed_dim) - gradient from next layer
        """
        # dL/dW = one_hot.T @ grad_output
        # Reshape for batch matmul: (batch, vocab_size, seq_len) @ (batch, seq_len, embed_dim)
        one_hot_T = self._last_one_hot.transpose(0, 2, 1)  # (batch, vocab, seq_len)

        # Accumulate gradients across batch
        self.grad_W = np.zeros_like(self.W)
        for b in range(grad_output.shape[0]):
            self.grad_W += one_hot_T[b] @ grad_output[b]

        return self.grad_W

# Demonstrate gradient flow
emb = ScratchEmbeddingWithGrad(vocab_size=10, embed_dim=4)
token_ids = np.array([[3, 7, 1]])  # batch=1, seq_len=3

# Forward
output = emb.forward(token_ids)

# Simulate gradient from loss (all ones for simplicity)
grad_from_loss = np.ones_like(output)

# Backward
grad_W = emb.backward(grad_from_loss)

print("Gradient magnitude per row of W:")
for i in range(10):
    magnitude = np.abs(grad_W[i]).sum()
    marker = " <-- used" if i in [3, 7, 1] else ""
    print(f"  Row {i}: {magnitude:.4f}{marker}")

print("\nOnly rows 1, 3, 7 received gradients!")

PyTorch: Automatic Gradient Tracking

PyTorch handles this automatically when requires_grad=True:

# PyTorch does this automatically
torch_emb = nn.Embedding(10, 4)

token_ids = torch.tensor([[3, 7, 1]])
output = torch_emb(token_ids)

# Fake loss: sum of embeddings
loss = output.sum()
loss.backward()

print("PyTorch gradient magnitude per row:")
for i in range(10):
    magnitude = torch_emb.weight.grad[i].abs().sum().item()
    marker = " <-- used" if i in [3, 7, 1] else ""
    print(f"  Row {i}: {magnitude:.4f}{marker}")

Note: Sparse Updates

This “sparse gradient” property is why embedding layers can have millions of parameters but train efficiently - each batch only updates a small subset of rows.
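PyTorch can expose this sparsity directly. A small sketch using the sparse=True option (sparse gradients must be paired with an optimizer that accepts them, such as optim.SGD or optim.SparseAdam):

# With sparse=True, the gradient is a sparse tensor holding only the used rows
sparse_emb = nn.Embedding(10, 4, sparse=True)
out = sparse_emb(torch.tensor([[3, 7, 1]]))
out.sum().backward()

grad = sparse_emb.weight.grad
print(f"Gradient is sparse: {grad.is_sparse}")
print(f"Rows with gradients: {grad.coalesce().indices().tolist()}")  # only rows 1, 3, 7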

Positional Information

Attention (without positional embeddings) is permutation-equivariant: if you reorder the input tokens, the attention scores simply reorder to match. The relationship between “cat” and “sat” is the same regardless of whether they’re at positions [0,1] or [1,0]. This means the model can’t distinguish “the cat sat” from “sat the cat” — a critical limitation since word order carries meaning.

Position embeddings solve this by giving each position a learnable vector that gets added to the token embedding.

From Scratch: Learnable Position Embeddings

Position embeddings are just another lookup table, indexed by position instead of token ID:

class ScratchPositionEmbedding:
    """Learnable position embeddings - same as token embeddings but indexed by position."""

    def __init__(self, max_seq_len: int, embed_dim: int):
        self.max_seq_len = max_seq_len
        self.embed_dim = embed_dim
        # Each position gets its own learnable vector
        self.W = np.random.randn(max_seq_len, embed_dim) * 0.02

    def __call__(self, seq_len: int) -> np.ndarray:
        """
        seq_len: how many positions to return
        returns: shape (seq_len, embed_dim)
        """
        # Just slice the first seq_len positions
        return self.W[:seq_len]


class ScratchCombinedEmbedding:
    """Token embeddings + position embeddings."""

    def __init__(self, vocab_size: int, embed_dim: int, max_seq_len: int):
        self.token_emb = ScratchEmbedding(vocab_size, embed_dim)
        self.pos_emb = ScratchPositionEmbedding(max_seq_len, embed_dim)

    def __call__(self, token_ids: np.ndarray) -> np.ndarray:
        """
        token_ids: shape (batch, seq_len)
        returns: shape (batch, seq_len, embed_dim)
        """
        batch_size, seq_len = token_ids.shape

        # Get token embeddings: (batch, seq_len, embed_dim)
        tok_emb = self.token_emb(token_ids)

        # Get position embeddings: (seq_len, embed_dim)
        pos_emb = self.pos_emb(seq_len)

        # Add position embeddings (broadcasts over batch dimension)
        return tok_emb + pos_emb

# Test combined embedding
combined = ScratchCombinedEmbedding(vocab_size=100, embed_dim=8, max_seq_len=32)

# Same token (ID=42) at different positions
tokens = np.array([[42, 42, 42, 42]])  # Same token, 4 positions
embeddings = combined(tokens)

print("Same token (42) at different positions:")
for pos in range(4):
    print(f"  Position {pos}: {embeddings[0, pos, :4]}...")

print("\nAll different due to position embeddings!")

Why Position Matters: Attention Is Permutation-Equivariant

Without position embeddings, attention treats tokens as an unordered set:

import matplotlib.pyplot as plt

# Demonstrate permutation invariance
def simple_attention_scores(embeddings):
    """Compute raw attention scores (Q @ K.T) without position."""
    # In real attention, Q = emb @ W_q, K = emb @ W_k
    # For simplicity, use embeddings directly
    return embeddings @ embeddings.T

# Create two orderings of the same tokens
token_emb = ScratchEmbedding(vocab_size=10, embed_dim=8)

# "the cat sat" = tokens [1, 5, 7]
order1 = np.array([[1, 5, 7]])
# "sat the cat" = tokens [7, 1, 5]
order2 = np.array([[7, 1, 5]])

emb1 = token_emb(order1)[0]  # (3, 8)
emb2 = token_emb(order2)[0]  # (3, 8)

scores1 = simple_attention_scores(emb1)
scores2 = simple_attention_scores(emb2)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Key insight: The pairwise relationships (attention scores) are the same,
# just permuted. Without position info, the model can't tell order apart.
im1 = axes[0].imshow(scores1, cmap='Blues')
axes[0].set_title('Order: [1, 5, 7]\n"the cat sat"')
axes[0].set_xlabel('Key position')
axes[0].set_ylabel('Query position')
plt.colorbar(im1, ax=axes[0])

im2 = axes[1].imshow(scores2, cmap='Blues')
axes[1].set_title('Order: [7, 1, 5]\n"sat the cat"')
axes[1].set_xlabel('Key position')
axes[1].set_ylabel('Query position')
plt.colorbar(im2, ax=axes[1])

plt.suptitle('Attention scores are just permuted\n(without position embeddings, order is lost)')
plt.tight_layout()
plt.show()

PyTorch: Combined Token + Position Embedding

class PyTorchCombinedEmbedding(nn.Module):
    """Standard transformer embedding: token + position."""

    def __init__(self, vocab_size: int, embed_dim: int, max_seq_len: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_seq_len, embed_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len = token_ids.shape

        # Token embeddings
        tok_emb = self.token_emb(token_ids)

        # Position embeddings (create position indices)
        positions = torch.arange(seq_len, device=token_ids.device)
        pos_emb = self.pos_emb(positions)

        return tok_emb + pos_emb

# Compare scratch vs PyTorch
pytorch_combined = PyTorchCombinedEmbedding(vocab_size=100, embed_dim=8, max_seq_len=32)

tokens_torch = torch.tensor([[42, 42, 42, 42]])
embeddings_torch = pytorch_combined(tokens_torch)

print("PyTorch: Same token (42) at different positions:")
for pos in range(4):
    print(f"  Position {pos}: {embeddings_torch[0, pos, :4].tolist()}")

Tip: Key Insight

Position embeddings are just another embedding table - they work identically to token embeddings but are indexed by position. The “magic” is simply: final = token_emb[token_id] + pos_emb[position].

The Math

Token Embeddings

Simple lookup table: E[token_id] = embedding_vector

Mathematically equivalent to one-hot multiplication:

one_hot = [0, 0, 0, 1, 0, ...]  # 1 at position token_id
embedding = one_hot @ E         # selects row token_id from E

Positional Embeddings

There are several approaches to encoding position:

1. Learned positional embeddings (GPT-2, BERT):

# Position table: (max_seq_len, embed_dim)
P = torch.randn(max_seq_len, embed_dim)
positions = P[:seq_len]  # Get positions for current sequence

Each position gets a trainable vector. Simple and effective, but cannot generalize to positions beyond max_seq_len.

2. Sinusoidal positional embeddings (original Transformer):

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

This creates a unique pattern for each position using waves of different frequencies. The key insight: PE(pos+k) can be represented as a linear function of PE(pos), allowing the model to learn relative positions.

3. Rotary Position Embedding - RoPE (LLaMA, Mistral): Rather than adding position embeddings to token embeddings, RoPE rotates the query and key vectors based on position. The rotation angle depends on both position and dimension, encoding relative positions naturally in the attention computation. This approach extrapolates well to longer sequences than seen during training.

4. ALiBi - Attention with Linear Biases (BLOOM): Instead of adding position information to embeddings, ALiBi adds a position-dependent bias directly to the attention scores: closer tokens get higher scores. This is applied during attention, not in the embedding layer.
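Neither RoPE nor ALiBi is implemented in this module, but a minimal sketch shows the core idea of both (assuming the interleaved-pair RoPE formulation and a single illustrative ALiBi slope; real implementations operate on batched multi-head tensors):

import math
import torch

def rope_rotate(x: torch.Tensor, pos: int) -> torch.Tensor:
    """Rotate consecutive (even, odd) dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    # One frequency per dimension pair, same schedule as the sinusoidal encoding
    freqs = 10000.0 ** (-torch.arange(0, d, 2).float() / d)
    angles = pos * freqs                       # (d/2,)
    cos, sin = torch.cos(angles), torch.sin(angles)
    out = torch.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

# Key RoPE property: the rotated dot product depends only on the relative offset
torch.manual_seed(0)
q, k = torch.randn(8), torch.randn(8)
print(f"q.k at positions (3, 5):   {(rope_rotate(q, 3) @ rope_rotate(k, 5)).item():.4f}")
print(f"q.k at positions (10, 12): {(rope_rotate(q, 10) @ rope_rotate(k, 12)).item():.4f}")

# ALiBi: a distance-based penalty added to (causal) attention scores, not to embeddings
seq_len, slope = 5, 0.5                        # one slope per head in the real model
i = torch.arange(seq_len).unsqueeze(1)
j = torch.arange(seq_len).unsqueeze(0)
alibi_bias = -slope * (i - j).clamp(min=0)     # 0 on the diagonal, more negative further left
print("ALiBi bias added to attention scores before softmax:")
print(alibi_bias)

Both RoPE scores print the same value: rotating q and k by their absolute positions leaves only their relative offset in the dot product, which is exactly the relative-position behavior described above.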

Combined Embeddings

# Input: token_ids of shape (batch, seq_len)
token_emb = token_embedding[token_ids]     # (batch, seq_len, embed_dim)
pos_emb = position_embedding[:seq_len]      # (seq_len, embed_dim)
x = token_emb + pos_emb                     # (batch, seq_len, embed_dim)

Same Token, Different Positions

The same token (“the”) appears at multiple positions in a sentence. Even though it has the same token embedding, the final embedding differs because of position.

Tip: Try This
  1. Same position: Set both positions to the same value (e.g., 0 and 0). The similarity becomes 1.0 (identical).

  2. Adjacent positions: Compare positions 0 and 1. They are very similar (>0.95) because position embeddings change gradually.

  3. Distant positions: Compare positions 0 and 5. The similarity drops because position embeddings diverge.

The key insight: same token ID + different position = different final embedding. This is how the model knows word order matters.
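If you don't have the interactive widget handy, here is a small self-contained check of these three cases, assuming sinusoidal position encodings and a randomly drawn token vector (a freshly initialized learned position table would not show the smooth falloff until it has been trained):

import math
import torch
import torch.nn.functional as F

# Sinusoidal position encodings for a short sequence (embed_dim=32)
d, max_len = 32, 16
pos_enc = torch.zeros(max_len, d)
position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d, 2).float() * (-math.log(10000.0) / d))
pos_enc[:, 0::2] = torch.sin(position * div_term)
pos_enc[:, 1::2] = torch.cos(position * div_term)

token_vec = torch.randn(d) * 0.02              # one token's embedding (small init)

def combined(p: int) -> torch.Tensor:
    return token_vec + pos_enc[p]              # same token, different positions

def sim(p1: int, p2: int) -> float:
    return F.cosine_similarity(combined(p1), combined(p2), dim=0).item()

print(f"pos 0 vs 0: {sim(0, 0):.3f}")   # identical -> 1.0
print(f"pos 0 vs 1: {sim(0, 1):.3f}")   # adjacent  -> high
print(f"pos 0 vs 5: {sim(0, 5):.3f}")   # distant   -> noticeably lower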

Code Walkthrough

Let’s explore embeddings interactively:

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np

print(f"PyTorch version: {torch.__version__}")

Token Embeddings Basics

A token embedding is just a lookup table: token ID -> vector

# Create a simple token embedding
vocab_size = 100
embed_dim = 32

# nn.Embedding is PyTorch's lookup table
token_emb = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim)

print(f"Vocabulary size: {vocab_size}")
print(f"Embedding dimension: {embed_dim}")
print(f"Total parameters: {vocab_size * embed_dim:,}")
print(f"Embedding table shape: {token_emb.weight.shape}")
# Look up embeddings for some tokens
token_ids = torch.tensor([[5, 10, 15, 20]])
embeddings = token_emb(token_ids)

print(f"Input token IDs: {token_ids[0].tolist()}")
print(f"Output shape: {tuple(embeddings.shape)}")
print(f"\nToken 5's embedding (first 8 dims):")
print(f"  {embeddings[0, 0, :8].tolist()}")
# Same token always gets the same embedding
e1 = token_emb(torch.tensor([[42]]))
e2 = token_emb(torch.tensor([[42]]))

print(f"Token 42 embedding (call 1): {e1[0, 0, :4].tolist()}")
print(f"Token 42 embedding (call 2): {e2[0, 0, :4].tolist()}")
print(f"Equal: {torch.allclose(e1, e2)}")

Sinusoidal Positional Encoding

The original Transformer uses sin/cos functions to encode position. The key idea is to create a unique “fingerprint” for each position using waves of different frequencies:

  • Low-frequency components (high dimensions): Change slowly across positions, capturing coarse position
  • High-frequency components (low dimensions): Change rapidly, capturing fine-grained position

This is analogous to how Fourier series can represent any periodic function as a sum of sines and cosines:

import math

def create_sinusoidal_encoding(max_seq_len: int, embed_dim: int) -> torch.Tensor:
    """Create sinusoidal positional encoding matrix."""
    pe = torch.zeros(max_seq_len, embed_dim)
    position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)

    # Compute div_term = 1 / 10000^(2i/d) = exp(-2i * log(10000) / d)
    div_term = torch.exp(
        torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim)
    )

    # Apply sin to even indices, cos to odd indices
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)

    return pe

# Create positional encoding
pe = create_sinusoidal_encoding(max_seq_len=128, embed_dim=64)
print(f"Positional encoding shape: {pe.shape}")
# Visualize as heatmap
plt.figure(figsize=(14, 6))
plt.imshow(pe[:50].numpy(), aspect='auto', cmap='RdBu', vmin=-1, vmax=1)
plt.colorbar(label='Value')
plt.xlabel('Embedding Dimension')
plt.ylabel('Position')
plt.title('Sinusoidal Positional Encoding')
plt.show()
# Plot individual dimensions to see the patterns
plt.figure(figsize=(14, 4))

for dim in [0, 1, 10, 11, 30, 31]:
    plt.plot(pe[:50, dim].numpy(), label=f'Dim {dim} ({"sin" if dim % 2 == 0 else "cos"})')

plt.xlabel('Position')
plt.ylabel('Value')
plt.title('Positional Encoding by Dimension\n(Lower dims = higher frequency)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Compare positions by computing similarity
pe_subset = pe[:20]

# Compute cosine similarity between all position pairs
pe_norm = pe_subset / pe_subset.norm(dim=1, keepdim=True)
similarity = pe_norm @ pe_norm.T

plt.figure(figsize=(8, 6))
plt.imshow(similarity.numpy(), cmap='Blues')
plt.colorbar(label='Cosine Similarity')
plt.xlabel('Position')
plt.ylabel('Position')
plt.title('Positional Encoding Similarity Matrix\n(Nearby positions are more similar)')
plt.show()

Notice that:

  1. Nearby positions are similar: Positions 5 and 6 are more similar than positions 5 and 15
  2. The pattern is symmetric: sim(i, j) = sim(j, i)
  3. Each position is unique: No two positions have identical encodings

The sinusoidal encoding also has a key mathematical property: for any fixed offset k, the encoding PE(pos+k) can be expressed as a linear transformation of PE(pos). This helps the model learn relative positions (e.g., “this token is 3 positions before that token”).
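This is easy to verify numerically. A quick check that reuses the pe matrix from above (embed_dim=64), with an arbitrarily chosen (sin, cos) pair at dimensions 4 and 5 and a fixed offset k=3:

# For one frequency, PE(pos+k) is a fixed rotation of PE(pos), independent of pos
k = 3
pair = 4  # dims (4, 5) share one frequency
theta = math.exp(-math.log(10000.0) * pair / 64)    # matches div_term above
rot = torch.tensor([[math.cos(k * theta),  math.sin(k * theta)],
                    [-math.sin(k * theta), math.cos(k * theta)]])

for pos in [0, 7, 20]:
    rotated = rot @ pe[pos, pair:pair + 2]
    target = pe[pos + k, pair:pair + 2]
    print(f"pos={pos:2d}: rot @ PE(pos) = {rotated.tolist()}  vs  PE(pos+{k}) = {target.tolist()}")

The same 2x2 rotation maps PE(pos) to PE(pos+3) for every position, which is what makes relative offsets linearly recoverable.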

Combined Transformer Embedding

In practice, we add token embeddings and positional embeddings together. The implementation in embeddings.py includes several important details; the sketch right after this list covers the padding and initialization conventions, and the simplified class that follows implements the scaling and dropout:

  1. Scaling by sqrt(embed_dim): Token embeddings are multiplied by sqrt(embed_dim) before adding positional embeddings. This prevents the positional signal from dominating when embed_dim is large (since embeddings are typically initialized with small values like std=0.02).

  2. Initialization: Embeddings are initialized from a normal distribution with small standard deviation (0.02). This is crucial for stable training - large initial values can cause exploding gradients.

  3. Padding token handling: The embedding for the padding token (usually ID 0) is set to zeros and excluded from gradient updates.

  4. Dropout: Applied after combining embeddings for regularization.
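A minimal sketch of the padding and initialization details (items 2 and 3), assuming PyTorch's padding_idx mechanism; the exact embeddings.py code may differ:

# Items 2 and 3: small init plus a zeroed, gradient-free padding row
pad_id = 0
tok = nn.Embedding(1000, 64, padding_idx=pad_id)   # row pad_id never receives gradient updates
nn.init.normal_(tok.weight, mean=0.0, std=0.02)    # small std for stable training
with torch.no_grad():
    tok.weight[pad_id].zero_()                     # re-zero the padding row after re-init
print(f"Padding row all zeros: {bool((tok.weight[pad_id] == 0).all())}")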

class TransformerEmbedding(nn.Module):
    """Combined token + positional embedding."""

    def __init__(self, vocab_size: int, embed_dim: int, max_seq_len: int, dropout: float = 0.1):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        self.position_embedding = nn.Embedding(max_seq_len, embed_dim)
        self.dropout = nn.Dropout(dropout)
        self.scale = math.sqrt(embed_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        seq_len = token_ids.shape[1]

        # Get token embeddings and scale
        token_emb = self.token_embedding(token_ids) * self.scale

        # Get positional embeddings
        positions = torch.arange(seq_len, device=token_ids.device)
        pos_emb = self.position_embedding(positions)

        # Combine and apply dropout
        return self.dropout(token_emb + pos_emb)

# Create embedding layer
emb = TransformerEmbedding(
    vocab_size=1000,
    embed_dim=64,
    max_seq_len=128,
    dropout=0.0  # Disable for visualization
)

# Process some tokens
tokens = torch.randint(0, 1000, (1, 10))
output = emb(tokens)

print(f"Input tokens: {tokens[0].tolist()}")
print(f"Output shape: {tuple(output.shape)}")
# Show that same token at different positions has different embeddings
# Put token 42 at positions 0, 5, and 9
tokens = torch.tensor([[42, 1, 2, 3, 4, 42, 6, 7, 8, 42]])
output = emb(tokens)

# Get the embeddings for token 42 at each position
pos_0 = output[0, 0].detach()
pos_5 = output[0, 5].detach()
pos_9 = output[0, 9].detach()

print("Token 42 at different positions:")
print(f"  Position 0: {pos_0[:4].tolist()}")
print(f"  Position 5: {pos_5[:4].tolist()}")
print(f"  Position 9: {pos_9[:4].tolist()}")
print(f"\nAll different due to positional encoding!")
# Visualize the combination
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Token embedding only (without scale for visualization)
token_only = emb.token_embedding(tokens)[0].detach().numpy()
axes[0].imshow(token_only, aspect='auto', cmap='RdBu')
axes[0].set_xlabel('Dimension')
axes[0].set_ylabel('Position')
axes[0].set_title('Token Embeddings Only')

# Position embedding only
positions = torch.arange(10)
pos_only = emb.position_embedding(positions).detach().numpy()
axes[1].imshow(pos_only, aspect='auto', cmap='RdBu')
axes[1].set_xlabel('Dimension')
axes[1].set_ylabel('Position')
axes[1].set_title('Positional Embeddings Only')

# Combined
combined = output[0].detach().numpy()
axes[2].imshow(combined, aspect='auto', cmap='RdBu')
axes[2].set_xlabel('Dimension')
axes[2].set_ylabel('Position')
axes[2].set_title('Token + Position (Combined)')

plt.tight_layout()
plt.show()

Embedding Similarity

Embeddings capture meaning - similar tokens should have similar embeddings:

# Let's simulate "training" by manually setting some embeddings to be similar
# In practice, these patterns emerge from training on real text

vocab_size = 20
embed_dim = 16

token_emb = nn.Embedding(vocab_size, embed_dim)

# Manually set some tokens to have similar embeddings
# (simulating what would happen after training on related words)
with torch.no_grad():
    # Tokens 0-4: "numbers" (similar to each other)
    base_number = torch.randn(embed_dim)
    for i in range(5):
        token_emb.weight[i] = base_number + torch.randn(embed_dim) * 0.1

    # Tokens 5-9: "letters" (similar to each other, different from numbers)
    base_letter = torch.randn(embed_dim)
    for i in range(5, 10):
        token_emb.weight[i] = base_letter + torch.randn(embed_dim) * 0.1

# Compute all pairwise similarities
all_embeds = token_emb.weight[:10]
all_embeds_norm = all_embeds / all_embeds.norm(dim=1, keepdim=True)
similarity = (all_embeds_norm @ all_embeds_norm.T).detach().numpy()

plt.figure(figsize=(8, 6))
plt.imshow(similarity, cmap='RdBu', vmin=-1, vmax=1)
plt.colorbar(label='Cosine Similarity')
plt.xlabel('Token ID')
plt.ylabel('Token ID')
plt.title('Token Embedding Similarity\n(0-4: "numbers", 5-9: "letters")')

# Add labels
labels = ['N0', 'N1', 'N2', 'N3', 'N4', 'L0', 'L1', 'L2', 'L3', 'L4']
plt.xticks(range(10), labels)
plt.yticks(range(10), labels)
plt.show()

print("Notice: Numbers (N) are similar to each other, letters (L) are similar to each other,")
print("but numbers and letters are different from each other.")

2D Visualization with PCA

from sklearn.decomposition import PCA

# Get embeddings for all 10 tokens
embeddings = all_embeds.detach().numpy()

# Reduce to 2D
pca = PCA(n_components=2)
emb_2d = pca.fit_transform(embeddings)

# Plot
plt.figure(figsize=(10, 8))

# Numbers in blue
plt.scatter(emb_2d[:5, 0], emb_2d[:5, 1], c='blue', s=100, label='Numbers')
for i in range(5):
    plt.annotate(f'N{i}', (emb_2d[i, 0], emb_2d[i, 1]), fontsize=12)

# Letters in red
plt.scatter(emb_2d[5:, 0], emb_2d[5:, 1], c='red', s=100, label='Letters')
for i in range(5, 10):
    plt.annotate(f'L{i-5}', (emb_2d[i, 0], emb_2d[i, 1]), fontsize=12)

plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('Token Embeddings in 2D\n(Similar tokens cluster together)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Interactive Exploration

Explore how sinusoidal position encodings create unique patterns for each position. The key insight: low dimensions change rapidly (high frequency), while high dimensions change slowly (low frequency).

Tip: Try This
  1. Frequency gradient: Look at the heatmap from left to right. Low dimensions (left) have rapid oscillation, high dimensions (right) change slowly.

  2. Adjacent positions: Set positions to 0 and 1. Notice high similarity (≈0.99+). The encodings are almost identical, differing only slightly.

  3. Distant positions: Compare positions 0 and 32. Similarity drops significantly because more dimension-waves have cycled.

  4. Unique fingerprints: Slide through different positions in the line plot. Each position has a unique “fingerprint” pattern.

  5. Sin/Cos pairs: In the line plot, blue dots are sin (even dims), orange dots are cos (odd dims). They’re 90° out of phase.

Exercises

Exercise 1: Compare Learned vs Sinusoidal Positional Embeddings

Learned and sinusoidal embeddings have different tradeoffs:

Aspect          Learned                                Sinusoidal
Training        Updated via backprop                   Fixed (no parameters)
Extrapolation   Cannot generalize beyond max_seq_len   Can theoretically extrapolate
Memory          Adds parameters                        Zero parameter overhead
Performance     Often slightly better in practice      Good baseline

# Compare learned vs sinusoidal positional embeddings

learned = nn.Embedding(50, 32)  # Learned (random initialization)
sinusoidal = create_sinusoidal_encoding(50, 32)  # Fixed pattern

fig, axes = plt.subplots(1, 2, figsize=(14, 4))

axes[0].imshow(learned.weight.detach().numpy(), aspect='auto', cmap='RdBu')
axes[0].set_title('Learned Positional Embeddings\n(Random initialization)')
axes[0].set_xlabel('Dimension')
axes[0].set_ylabel('Position')

axes[1].imshow(sinusoidal.numpy(), aspect='auto', cmap='RdBu')
axes[1].set_title('Sinusoidal Positional Embeddings\n(Fixed pattern)')
axes[1].set_xlabel('Dimension')
axes[1].set_ylabel('Position')

plt.tight_layout()
plt.show()

print("Learned embeddings start random but are trained to capture position.")
print("Sinusoidal embeddings have a fixed pattern that encodes relative positions.")

Exercise 2: Effect of Embedding Dimension

The embedding dimension affects both model capacity and computational cost. Larger dimensions can represent more nuanced semantic distinctions but require more memory and computation in every layer of the model.

The sqrt(embed_dim) scaling factor works together with initialization: token embeddings are initialized with a small per-element standard deviation, and multiplying by sqrt(embed_dim) brings the token signal back to a scale comparable to the positional signal regardless of dimension. The demo below initializes the token table with std = 1/sqrt(embed_dim), so this balance holds exactly.

# What happens with different embedding dimensions?
# The sqrt(embed_dim) scale assumes a small init (std ~ 1/sqrt(embed_dim)),
# so we re-initialize the token table that way before measuring.

for dim in [8, 32, 128, 512]:
    emb = TransformerEmbedding(
        vocab_size=1000,
        embed_dim=dim,
        max_seq_len=128,
        dropout=0.0
    )
    # Xavier-style init: per-element std shrinks as the dimension grows
    nn.init.normal_(emb.token_embedding.weight, mean=0.0, std=dim ** -0.5)

    tokens = torch.randint(0, 1000, (1, 32))
    output = emb(tokens)

    # Compute variance of output
    variance = output.var().item()
    print(f"Embed dim {dim:3d}: output variance = {variance:.4f}")

print("\nWith this init, the scale factor (sqrt(embed_dim)) keeps the output variance stable!")

Exercise 3: Memory Usage of Embeddings

Understanding embedding memory is crucial for model sizing. With weight tying (sharing embeddings between input and output layers), you only pay this cost once. Without it, you pay twice.

# Memory usage of embeddings

configs = [
    {"vocab": 1000, "dim": 64, "name": "Tiny"},
    {"vocab": 8000, "dim": 256, "name": "Small"},
    {"vocab": 32000, "dim": 512, "name": "Medium"},
    {"vocab": 50000, "dim": 768, "name": "Large (GPT-2)"},
    {"vocab": 100000, "dim": 4096, "name": "Large (LLaMA)"},
]

print("Embedding Table Memory Usage:")
print("=" * 60)

for cfg in configs:
    params = cfg["vocab"] * cfg["dim"]
    memory_mb = params * 4 / (1024 * 1024)  # 4 bytes per float32
    memory_fp16 = memory_mb / 2  # fp16/bf16 halves memory
    print(f"{cfg['name']:15s}: {cfg['vocab']:6d} vocab x {cfg['dim']:4d} dim = "
          f"{params:>12,} params ({memory_mb:>7.1f} MB fp32, {memory_fp16:>6.1f} MB fp16)")

Note: Modern models typically use fp16 or bf16, which halves the memory requirement. Quantization (int8, int4) can reduce it further.

Weight tying shares the embedding matrix between the input layer and output projection, halving the embedding parameter count:

# Weight tying: share embedding weights with output projection
# This is what GPT-2, LLaMA, and most modern LLMs do

import torch.nn as nn

class SimpleLMWithWeightTying(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Output projection shares weights with embedding (transposed)
        self.output_proj = nn.Linear(embed_dim, vocab_size, bias=False)
        # Tie weights: output projection uses the same weights as embedding
        self.output_proj.weight = self.embedding.weight

    def forward(self, x):
        emb = self.embedding(x)  # (batch, seq, embed_dim)
        logits = self.output_proj(emb)  # (batch, seq, vocab_size)
        return logits

model = SimpleLMWithWeightTying(vocab_size=1000, embed_dim=256)
print(f"Embedding params: {model.embedding.weight.numel():,}")
print(f"Output proj params: {model.output_proj.weight.numel():,}")
print(f"Are weights shared? {model.embedding.weight is model.output_proj.weight}")

Using the Module’s Embeddings

The embeddings.py file contains production-ready embedding classes:

from embeddings import (
    TokenEmbedding,
    LearnedPositionalEmbedding,
    SinusoidalPositionalEmbedding,
    TransformerEmbedding as ModuleTransformerEmbedding,
    demonstrate_embeddings
)

# Run the demonstration
demo_emb = demonstrate_embeddings(
    vocab_size=100,
    embed_dim=32,
    seq_len=8,
    verbose=True
)

Summary

Key takeaways:

  1. Token embeddings are lookup tables that convert token IDs to vectors
  2. Positional embeddings add information about where tokens are in the sequence
  3. Sinusoidal positional embeddings use fixed sin/cos patterns - no parameters, can theoretically extrapolate
  4. Learned positional embeddings are trained like any other parameter - often slightly better in practice
  5. Modern approaches (RoPE, ALiBi) handle position differently and extrapolate better to long sequences
  6. Similar tokens end up with similar embeddings after training - capturing semantic relationships
  7. Scaling by sqrt(embed_dim) helps maintain stable gradients when dimensions vary
  8. Weight tying between input embeddings and output layer is common and reduces parameters

Common Pitfalls

  • Forgetting the sqrt scale: Without it, positional embeddings can dominate or be ignored depending on embed_dim
  • Exceeding max_seq_len: Learned positional embeddings fail hard on longer sequences than training
  • Ignoring padding: Padding tokens should be zero vectors and excluded from gradients
  • Poor initialization: Large initial values cause training instability; use small std (0.01-0.02)
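For instance, the max_seq_len pitfall shows up as an immediate indexing error (a small sketch with a hypothetical 16-position table):

# A learned position table has no rows past max_seq_len
pos_table = nn.Embedding(num_embeddings=16, embedding_dim=8)   # trained for sequences up to 16
try:
    pos_table(torch.arange(20))    # length 20: positions 16..19 do not exist
except IndexError as err:
    print(f"IndexError: {err}")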

What’s Next

In Module 05: Attention, we’ll learn the core mechanism that allows tokens to “look at” each other. Embeddings are the input to attention - the vectors that get attended to. The positional information encoded here becomes crucial when attention computes relationships between tokens.