Module 04: Embeddings
Introduction
Converting token IDs into dense vectors. Embeddings give meaning to numbers - similar tokens end up close together in vector space.
Embeddings are learned vector representations. Instead of treating token ID 42 as just the number 42, we give it a rich vector like [0.2, -0.5, 0.8, ...] that captures its meaning.
Why embeddings matter for LLMs:
- Similarity: Similar words have similar vectors (“cat” and “dog” are close)
- Composition: Vectors can be combined meaningfully
- Learning: The model learns these representations during training
- Position: We also embed WHERE tokens are in the sequence
Two types of embeddings in transformers:
- Token embeddings: What the token means
- Positional embeddings: Where the token is in the sequence
What You’ll Learn
By the end of this module, you will be able to:
- Explain how embedding lookups work as matrix multiplication
- Implement token and positional embeddings from scratch
- Understand how gradients flow through embedding layers
- Choose between learned and sinusoidal positional embeddings
- Recognize the role of embeddings in the transformer architecture
Memory and Scale Considerations
Embeddings are often the largest single component of a language model. The parameter count is simply vocab_size x embed_dim:
| Model | Vocab Size | Embed Dim | Embedding Params | Memory (fp32) |
|---|---|---|---|---|
| GPT-2 Small | 50,257 | 768 | 38.6M | 147 MB |
| LLaMA 7B | 32,000 | 4,096 | 131M | 500 MB |
| LLaMA 70B | 32,000 | 8,192 | 262M | 1 GB |
This is why vocabulary size is a critical design decision. A larger vocabulary means each token carries more information (fewer tokens per text), but the embedding table grows proportionally.
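For example, the GPT-2 Small row can be verified directly:
vocab_size, embed_dim = 50_257, 768
params = vocab_size * embed_dim
print(f"{params:,} parameters")                # 38,597,376 (~38.6M)
print(f"{params * 4 / 2**20:.0f} MB in fp32")  # ~147 MB (4 bytes per value, 1 MB = 1024*1024 bytes here)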
Intuition: Coordinates in Meaning Space
Think of embeddings as coordinates in “meaning space”:
Token: "cat" -> [0.8, 0.1, 0.9, ...] <- captures "animal", "pet", etc.
Token: "dog" -> [0.7, 0.2, 0.8, ...] <- similar to cat
Token: "code" -> [0.1, 0.9, 0.2, ...] <- very different
Distance("cat", "dog") < Distance("cat", "code")
Positional embeddings add “where in the sequence” information:
Position 0: [1.0, 0.0, 0.5, ...] <- "I'm first"
Position 1: [0.9, 0.1, 0.4, ...] <- "I'm second"
Position 2: [0.8, 0.2, 0.3, ...] <- "I'm third"
The model combines both:
Final embedding = Token embedding + Positional embedding
Embedding Architecture
Here’s how embeddings work in a transformer. Step through the pipeline to see how token IDs become rich vector representations:
Tip: Try This
Use the slider to step through the embedding pipeline. Notice how token IDs become vectors through table lookup, then get combined with position information.
Embeddings Are Just Lookup Tables
The key insight: embedding lookup is sparse matrix multiplication. When we say “look up embedding for token 3”, we’re actually doing:
- Create a one-hot vector: token 3 in a vocab of 10 becomes [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
- Multiply by the weight matrix: one_hot @ W
- Only one row of W participates (the row for token 3)
This isn’t just conceptual - it’s exactly what happens mathematically. PyTorch optimizes this by skipping the one-hot creation, but understanding the matrix multiplication view is essential for understanding gradients.
From Scratch: One-Hot Embedding Lookup
Let’s build an embedding layer using explicit one-hot vectors and matrix multiplication:
import numpy as np
import torch
import torch.nn as nn
class ScratchEmbedding:
"""Embedding layer using explicit one-hot multiplication.
This shows what's really happening: embedding lookup is
just sparse matrix multiplication.
"""
def __init__(self, vocab_size: int, embed_dim: int):
self.vocab_size = vocab_size
self.embed_dim = embed_dim
# Weight matrix: each row is the embedding for a token
self.W = np.random.randn(vocab_size, embed_dim) * 0.02
def __call__(self, token_ids: np.ndarray) -> np.ndarray:
"""
token_ids: shape (batch, seq_len) - integer token IDs
returns: shape (batch, seq_len, embed_dim)
"""
batch_size, seq_len = token_ids.shape
# Create one-hot encodings: (batch, seq_len, vocab_size)
one_hot = np.zeros((batch_size, seq_len, self.vocab_size))
# Set the appropriate positions to 1
for b in range(batch_size):
for t in range(seq_len):
one_hot[b, t, token_ids[b, t]] = 1.0
# Matrix multiply: (batch, seq_len, vocab_size) @ (vocab_size, embed_dim)
# = (batch, seq_len, embed_dim)
embeddings = one_hot @ self.W
return embeddings
# Test it
vocab_size, embed_dim = 10, 4
scratch_emb = ScratchEmbedding(vocab_size, embed_dim)
# Sample tokens: batch of 2, sequence length 3
token_ids = np.array([[3, 7, 1],
[5, 3, 9]])
result = scratch_emb(token_ids)
print(f"Token IDs shape: {token_ids.shape}")
print(f"Embeddings shape: {result.shape}")
print(f"\nToken 3's embedding (row 3 of W):")
print(f" From lookup: {result[0, 0]}")
print(f" Direct W[3]: {scratch_emb.W[3]}")
print(f" Match: {np.allclose(result[0, 0], scratch_emb.W[3])}")The one-hot multiplication selects exactly one row from W. Watch the math:
# Visualize the one-hot multiplication
token_id = 3
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0
print("One-hot vector for token 3:")
print(f" {one_hot}")
print(f"\nWeight matrix W (10 x 4):")
print(f" Row 0: {scratch_emb.W[0]}")
print(f" Row 1: {scratch_emb.W[1]}")
print(f" Row 2: {scratch_emb.W[2]}")
print(f" Row 3: {scratch_emb.W[3]} <-- selected")
print(f" ...")
print(f"\none_hot @ W = {one_hot @ scratch_emb.W}")
print(f"W[3] directly = {scratch_emb.W[3]}")PyTorch’s nn.Embedding
PyTorch provides the same functionality but optimized - it skips creating the one-hot vector entirely:
# PyTorch equivalent
torch_emb = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim)
# Copy weights from scratch version for comparison
with torch.no_grad():
torch_emb.weight.copy_(torch.from_numpy(scratch_emb.W).float())
# Same token IDs
token_ids_torch = torch.tensor(token_ids)
result_torch = torch_emb(token_ids_torch)
print(f"Scratch result (token 3): {result[0, 0]}")
print(f"PyTorch result (token 3): {result_torch[0, 0].detach().numpy()}")
print(f"Match: {np.allclose(result, result_torch.detach().numpy())}")
Tip: Key Insight
nn.Embedding is just an optimized lookup - no one-hot materialization. But mathematically, it’s identical to one-hot times weight matrix. Understanding this helps when debugging gradient flow.
Making Lookups Differentiable
How do gradients flow through an embedding lookup? The answer comes directly from the matrix multiplication view.
From Scratch: Gradient Flow
When we compute output = one_hot @ W, the gradient with respect to W follows standard matrix calculus:
dL/dW = one_hot.T @ dL/doutput
This means only the selected rows receive gradients. If we looked up tokens [3, 7, 1], only rows 3, 7, and 1 of W get updated during training.
class ScratchEmbeddingWithGrad:
"""Embedding with gradient computation."""
def __init__(self, vocab_size: int, embed_dim: int):
self.vocab_size = vocab_size
self.embed_dim = embed_dim
self.W = np.random.randn(vocab_size, embed_dim) * 0.02
self.grad_W = None
self._last_one_hot = None # Store for backward
def forward(self, token_ids: np.ndarray) -> np.ndarray:
batch_size, seq_len = token_ids.shape
# Create one-hot: (batch, seq_len, vocab_size)
one_hot = np.zeros((batch_size, seq_len, self.vocab_size))
for b in range(batch_size):
for t in range(seq_len):
one_hot[b, t, token_ids[b, t]] = 1.0
self._last_one_hot = one_hot
return one_hot @ self.W
def backward(self, grad_output: np.ndarray):
"""
grad_output: shape (batch, seq_len, embed_dim) - gradient from next layer
"""
# dL/dW = one_hot.T @ grad_output
# Reshape for batch matmul: (batch, vocab_size, seq_len) @ (batch, seq_len, embed_dim)
one_hot_T = self._last_one_hot.transpose(0, 2, 1) # (batch, vocab, seq_len)
# Accumulate gradients across batch
self.grad_W = np.zeros_like(self.W)
for b in range(grad_output.shape[0]):
self.grad_W += one_hot_T[b] @ grad_output[b]
return self.grad_W
# Demonstrate gradient flow
emb = ScratchEmbeddingWithGrad(vocab_size=10, embed_dim=4)
token_ids = np.array([[3, 7, 1]]) # batch=1, seq_len=3
# Forward
output = emb.forward(token_ids)
# Simulate gradient from loss (all ones for simplicity)
grad_from_loss = np.ones_like(output)
# Backward
grad_W = emb.backward(grad_from_loss)
print("Gradient magnitude per row of W:")
for i in range(10):
magnitude = np.abs(grad_W[i]).sum()
marker = " <-- used" if i in [3, 7, 1] else ""
print(f" Row {i}: {magnitude:.4f}{marker}")
print("\nOnly rows 1, 3, 7 received gradients!")PyTorch: Automatic Gradient Tracking
PyTorch handles this automatically when requires_grad=True:
# PyTorch does this automatically
torch_emb = nn.Embedding(10, 4)
token_ids = torch.tensor([[3, 7, 1]])
output = torch_emb(token_ids)
# Fake loss: sum of embeddings
loss = output.sum()
loss.backward()
print("PyTorch gradient magnitude per row:")
for i in range(10):
magnitude = torch_emb.weight.grad[i].abs().sum().item()
marker = " <-- used" if i in [3, 7, 1] else ""
print(f" Row {i}: {magnitude:.4f}{marker}")
Note: Sparse Updates
This “sparse gradient” property is why embedding layers can have millions of parameters but train efficiently - each batch only updates a small subset of rows.
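To see this sparsity explicitly, nn.Embedding accepts a sparse=True flag: the backward pass then produces a sparse gradient tensor holding only the rows that were actually looked up (note that only some optimizers, such as SGD, Adagrad, and SparseAdam, accept sparse gradients):
import torch
import torch.nn as nn

# With sparse=True, weight.grad stores only the rows that appeared in the batch
sparse_emb = nn.Embedding(10_000, 64, sparse=True)
out = sparse_emb(torch.tensor([[3, 7, 1]]))
out.sum().backward()

grad = sparse_emb.weight.grad
print(f"Gradient is sparse: {grad.is_sparse}")
print(f"Rows with gradients: {grad.coalesce().indices().tolist()}")  # rows 1, 3, 7 only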
Positional Information
Attention (without positional embeddings) is permutation-equivariant: if you reorder the input tokens, the attention scores simply reorder to match. The relationship between “cat” and “sat” is the same regardless of whether they’re at positions [0,1] or [1,0]. This means the model can’t distinguish “the cat sat” from “sat the cat” — a critical limitation since word order carries meaning.
Position embeddings solve this by giving each position a learnable vector that gets added to the token embedding.
From Scratch: Learnable Position Embeddings
Position embeddings are just another lookup table, indexed by position instead of token ID:
class ScratchPositionEmbedding:
"""Learnable position embeddings - same as token embeddings but indexed by position."""
def __init__(self, max_seq_len: int, embed_dim: int):
self.max_seq_len = max_seq_len
self.embed_dim = embed_dim
# Each position gets its own learnable vector
self.W = np.random.randn(max_seq_len, embed_dim) * 0.02
def __call__(self, seq_len: int) -> np.ndarray:
"""
seq_len: how many positions to return
returns: shape (seq_len, embed_dim)
"""
# Just slice the first seq_len positions
return self.W[:seq_len]
class ScratchCombinedEmbedding:
"""Token embeddings + position embeddings."""
def __init__(self, vocab_size: int, embed_dim: int, max_seq_len: int):
self.token_emb = ScratchEmbedding(vocab_size, embed_dim)
self.pos_emb = ScratchPositionEmbedding(max_seq_len, embed_dim)
def __call__(self, token_ids: np.ndarray) -> np.ndarray:
"""
token_ids: shape (batch, seq_len)
returns: shape (batch, seq_len, embed_dim)
"""
batch_size, seq_len = token_ids.shape
# Get token embeddings: (batch, seq_len, embed_dim)
tok_emb = self.token_emb(token_ids)
# Get position embeddings: (seq_len, embed_dim)
pos_emb = self.pos_emb(seq_len)
# Add position embeddings (broadcasts over batch dimension)
return tok_emb + pos_emb
# Test combined embedding
combined = ScratchCombinedEmbedding(vocab_size=100, embed_dim=8, max_seq_len=32)
# Same token (ID=42) at different positions
tokens = np.array([[42, 42, 42, 42]]) # Same token, 4 positions
embeddings = combined(tokens)
print("Same token (42) at different positions:")
for pos in range(4):
print(f" Position {pos}: {embeddings[0, pos, :4]}...")
print("\nAll different due to position embeddings!")Why Position Matters: Attention is Permutation-Invariant
Without position embeddings, attention treats tokens as an unordered set:
import matplotlib.pyplot as plt
# Demonstrate permutation invariance
def simple_attention_scores(embeddings):
"""Compute raw attention scores (Q @ K.T) without position."""
# In real attention, Q = emb @ W_q, K = emb @ W_k
# For simplicity, use embeddings directly
return embeddings @ embeddings.T
# Create two orderings of the same tokens
token_emb = ScratchEmbedding(vocab_size=10, embed_dim=8)
# "the cat sat" = tokens [1, 5, 7]
order1 = np.array([[1, 5, 7]])
# "sat the cat" = tokens [7, 1, 5]
order2 = np.array([[7, 1, 5]])
emb1 = token_emb(order1)[0] # (3, 8)
emb2 = token_emb(order2)[0] # (3, 8)
scores1 = simple_attention_scores(emb1)
scores2 = simple_attention_scores(emb2)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Key insight: The pairwise relationships (attention scores) are the same,
# just permuted. Without position info, the model can't tell order apart.
im1 = axes[0].imshow(scores1, cmap='Blues')
axes[0].set_title('Order: [1, 5, 7]\n"the cat sat"')
axes[0].set_xlabel('Key position')
axes[0].set_ylabel('Query position')
plt.colorbar(im1, ax=axes[0])
im2 = axes[1].imshow(scores2, cmap='Blues')
axes[1].set_title('Order: [7, 1, 5]\n"sat the cat"')
axes[1].set_xlabel('Key position')
axes[1].set_ylabel('Query position')
plt.colorbar(im2, ax=axes[1])
plt.suptitle('Attention scores are just permuted\n(without position embeddings, order is lost)')
plt.tight_layout()
plt.show()

PyTorch: Combined Token + Position Embedding
class PyTorchCombinedEmbedding(nn.Module):
"""Standard transformer embedding: token + position."""
def __init__(self, vocab_size: int, embed_dim: int, max_seq_len: int):
super().__init__()
self.token_emb = nn.Embedding(vocab_size, embed_dim)
self.pos_emb = nn.Embedding(max_seq_len, embed_dim)
def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
batch_size, seq_len = token_ids.shape
# Token embeddings
tok_emb = self.token_emb(token_ids)
# Position embeddings (create position indices)
positions = torch.arange(seq_len, device=token_ids.device)
pos_emb = self.pos_emb(positions)
return tok_emb + pos_emb
# Compare scratch vs PyTorch
pytorch_combined = PyTorchCombinedEmbedding(vocab_size=100, embed_dim=8, max_seq_len=32)
tokens_torch = torch.tensor([[42, 42, 42, 42]])
embeddings_torch = pytorch_combined(tokens_torch)
print("PyTorch: Same token (42) at different positions:")
for pos in range(4):
print(f" Position {pos}: {embeddings_torch[0, pos, :4].tolist()}")
Tip: Key Insight
Position embeddings are just another embedding table - they work identically to token embeddings but are indexed by position. The “magic” is simply: final = token_emb[token_id] + pos_emb[position].
The Math
Token Embeddings
Simple lookup table: E[token_id] = embedding_vector
Mathematically equivalent to one-hot multiplication:
one_hot = [0, 0, 0, 1, 0, ...] # 1 at position token_id
embedding = one_hot @ E # selects row token_id from E
Positional Embeddings
There are several approaches to encoding position:
1. Learned positional embeddings (GPT-2, BERT):
# Position table: (max_seq_len, embed_dim)
P = torch.randn(max_seq_len, embed_dim)
positions = P[:seq_len] # Get positions for current sequence

Each position gets a trainable vector. Simple and effective, but cannot generalize to positions beyond max_seq_len.
2. Sinusoidal positional embeddings (original Transformer):
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
This creates a unique pattern for each position using waves of different frequencies. The key insight: PE(pos+k) can be represented as a linear function of PE(pos), allowing the model to learn relative positions.
3. Rotary Position Embedding - RoPE (LLaMA, Mistral): Rather than adding position embeddings to token embeddings, RoPE rotates the query and key vectors based on position. The rotation angle depends on both position and dimension, encoding relative positions naturally in the attention computation. This approach extrapolates well to longer sequences than seen during training (see the sketch after this list).
4. ALiBi - Attention with Linear Biases (BLOOM): Instead of adding position information to embeddings, ALiBi adds a position-dependent bias directly to the attention scores: closer tokens get higher scores. This is applied during attention, not in the embedding layer.
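To make idea 3 concrete, here is a minimal NumPy sketch of the RoPE rotation (an illustration, not the actual LLaMA implementation; the function name and dimension pairing are choices made here). It shows the key property: once queries and keys are rotated according to their positions, their dot product depends only on the relative offset between them.
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive (even, odd) dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    theta = base ** (-np.arange(d // 2) * 2.0 / d)  # one rotation frequency per pair
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (2) at different absolute positions -> same attention score
score_a = rope_rotate(q, 5) @ rope_rotate(k, 3)
score_b = rope_rotate(q, 105) @ rope_rotate(k, 103)
print(f"Score at positions (5, 3):     {score_a:.6f}")
print(f"Score at positions (105, 103): {score_b:.6f}")
print(f"Equal: {np.isclose(score_a, score_b)}")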
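Idea 4 is even simpler to sketch. For a single attention head with an arbitrary slope (the ALiBi paper assigns each head a slope from a geometric sequence), the bias is a linear penalty on distance, added to the raw attention scores before the softmax:
import numpy as np

def alibi_bias(seq_len: int, slope: float) -> np.ndarray:
    """Causal ALiBi-style bias: the further back a key is, the larger the penalty."""
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]   # query index minus key index
    return -slope * np.maximum(dist, 0)  # 0 on the diagonal, more negative further back

# scores = q @ k.T / sqrt(d) + alibi_bias(seq_len, slope), then softmax as usual
print(alibi_bias(5, slope=0.5))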
Combined Embeddings
# Input: token_ids of shape (batch, seq_len)
token_emb = token_embedding[token_ids] # (batch, seq_len, embed_dim)
pos_emb = position_embedding[:seq_len] # (seq_len, embed_dim)
x = token_emb + pos_emb # (batch, seq_len, embed_dim)

Same Token, Different Positions
The same token (“the”) appears at multiple positions in a sentence. Even though it has the same token embedding, the final embedding differs because of position.
Tip: Try This
Same position: Set both positions to the same value (e.g., 0 and 0). The similarity becomes 1.0 (identical).
Adjacent positions: Compare positions 0 and 1. They are very similar (>0.95) because position embeddings change gradually.
Distant positions: Compare positions 0 and 5. The similarity drops because position embeddings diverge.
The key insight: same token ID + different position = different final embedding. This is how the model knows word order matters.
Code Walkthrough
Let’s explore embeddings interactively:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np
print(f"PyTorch version: {torch.__version__}")Token Embeddings Basics
A token embedding is just a lookup table: token ID -> vector
# Create a simple token embedding
vocab_size = 100
embed_dim = 32
# nn.Embedding is PyTorch's lookup table
token_emb = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim)
print(f"Vocabulary size: {vocab_size}")
print(f"Embedding dimension: {embed_dim}")
print(f"Total parameters: {vocab_size * embed_dim:,}")
print(f"Embedding table shape: {token_emb.weight.shape}")# Look up embeddings for some tokens
token_ids = torch.tensor([[5, 10, 15, 20]])
embeddings = token_emb(token_ids)
print(f"Input token IDs: {token_ids[0].tolist()}")
print(f"Output shape: {tuple(embeddings.shape)}")
print(f"\nToken 5's embedding (first 8 dims):")
print(f" {embeddings[0, 0, :8].tolist()}")# Same token always gets the same embedding
e1 = token_emb(torch.tensor([[42]]))
e2 = token_emb(torch.tensor([[42]]))
print(f"Token 42 embedding (call 1): {e1[0, 0, :4].tolist()}")
print(f"Token 42 embedding (call 2): {e2[0, 0, :4].tolist()}")
print(f"Equal: {torch.allclose(e1, e2)}")Sinusoidal Positional Encoding
The original Transformer uses sin/cos functions to encode position. The key idea is to create a unique “fingerprint” for each position using waves of different frequencies:
- Low-frequency components (high dimensions): Change slowly across positions, capturing coarse position
- High-frequency components (low dimensions): Change rapidly, capturing fine-grained position
This is analogous to how Fourier series can represent any periodic function as a sum of sines and cosines:
import math
def create_sinusoidal_encoding(max_seq_len: int, embed_dim: int) -> torch.Tensor:
"""Create sinusoidal positional encoding matrix."""
pe = torch.zeros(max_seq_len, embed_dim)
position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
# Compute div_term = 1 / 10000^(2i/d) = exp(-2i * log(10000) / d)
div_term = torch.exp(
torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim)
)
# Apply sin to even indices, cos to odd indices
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
return pe
# Create positional encoding
pe = create_sinusoidal_encoding(max_seq_len=128, embed_dim=64)
print(f"Positional encoding shape: {pe.shape}")# Visualize as heatmap
plt.figure(figsize=(14, 6))
plt.imshow(pe[:50].numpy(), aspect='auto', cmap='RdBu', vmin=-1, vmax=1)
plt.colorbar(label='Value')
plt.xlabel('Embedding Dimension')
plt.ylabel('Position')
plt.title('Sinusoidal Positional Encoding')
plt.show()

# Plot individual dimensions to see the patterns
plt.figure(figsize=(14, 4))
for dim in [0, 1, 10, 11, 30, 31]:
plt.plot(pe[:50, dim].numpy(), label=f'Dim {dim} ({"sin" if dim % 2 == 0 else "cos"})')
plt.xlabel('Position')
plt.ylabel('Value')
plt.title('Positional Encoding by Dimension\n(Lower dims = higher frequency)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Compare positions by computing similarity
pe_subset = pe[:20]
# Compute cosine similarity between all position pairs
pe_norm = pe_subset / pe_subset.norm(dim=1, keepdim=True)
similarity = pe_norm @ pe_norm.T
plt.figure(figsize=(8, 6))
plt.imshow(similarity.numpy(), cmap='Blues')
plt.colorbar(label='Cosine Similarity')
plt.xlabel('Position')
plt.ylabel('Position')
plt.title('Positional Encoding Similarity Matrix\n(Nearby positions are more similar)')
plt.show()

Notice that:
- Nearby positions are similar: Positions 5 and 6 are more similar than positions 5 and 15
- The pattern is symmetric: sim(i, j) = sim(j, i)
- Each position is unique: No two positions have identical encodings
The sinusoidal encoding also has a key mathematical property: for any fixed offset k, the encoding PE(pos+k) can be expressed as a linear transformation of PE(pos). This helps the model learn relative positions (e.g., “this token is 3 positions before that token”).
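We can check this numerically using create_sinusoidal_encoding from above: for each (sin, cos) pair, the values at position pos + k are the values at position pos rotated by a fixed angle (k times that pair's frequency). The pair index, position, and offset below are arbitrary choices:
import math
import torch

embed_dim = 64
pe = create_sinusoidal_encoding(max_seq_len=128, embed_dim=embed_dim)

pos, k, i = 10, 7, 4                      # position, offset, sin/cos pair index
freq = 10000.0 ** (-(2 * i) / embed_dim)  # frequency of pair i

pair_at_pos = pe[pos, 2*i : 2*i + 2]      # [sin(pos*freq), cos(pos*freq)]
rotation = torch.tensor([[ math.cos(k * freq), math.sin(k * freq)],
                         [-math.sin(k * freq), math.cos(k * freq)]])

print(f"Rotated pair at pos:  {(rotation @ pair_at_pos).tolist()}")
print(f"Actual pair at pos+k: {pe[pos + k, 2*i : 2*i + 2].tolist()}")
print(f"Match: {torch.allclose(rotation @ pair_at_pos, pe[pos + k, 2*i : 2*i + 2], atol=1e-4)}")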
Combined Transformer Embedding
In practice, we add token embeddings and positional embeddings together. The implementation in embeddings.py includes several important details:
- Scaling by sqrt(embed_dim): Token embeddings are multiplied by sqrt(embed_dim) before adding positional embeddings. This prevents the positional signal from dominating when embed_dim is large (since embeddings are typically initialized with small values like std=0.02).
- Initialization: Embeddings are initialized from a normal distribution with small standard deviation (0.02). This is crucial for stable training - large initial values can cause exploding gradients.
- Padding token handling: The embedding for the padding token (usually ID 0) is set to zeros and excluded from gradient updates (see the sketch below).
- Dropout: Applied after combining embeddings for regularization.
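The padding behavior is easy to see in isolation. Here is a minimal sketch using nn.Embedding's padding_idx argument, assuming the padding token has ID 0 (whether embeddings.py uses exactly this mechanism is an assumption):
import torch
import torch.nn as nn

# padding_idx zeroes that row at initialization and keeps its gradient at zero
pad_emb = nn.Embedding(num_embeddings=100, embedding_dim=8, padding_idx=0)
print(f"Padding row: {pad_emb.weight[0].tolist()}")  # all zeros

out = pad_emb(torch.tensor([[0, 5, 0, 7]]))          # a sequence containing padding
out.sum().backward()
print(f"Grad on padding row: {pad_emb.weight.grad[0].abs().sum().item()}")  # 0.0
print(f"Grad on row 5:       {pad_emb.weight.grad[5].abs().sum().item()}")  # non-zero

The combined class below shows the scaling and dropout in code: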
class TransformerEmbedding(nn.Module):
"""Combined token + positional embedding."""
def __init__(self, vocab_size: int, embed_dim: int, max_seq_len: int, dropout: float = 0.1):
super().__init__()
self.token_embedding = nn.Embedding(vocab_size, embed_dim)
self.position_embedding = nn.Embedding(max_seq_len, embed_dim)
self.dropout = nn.Dropout(dropout)
self.scale = math.sqrt(embed_dim)
def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
seq_len = token_ids.shape[1]
# Get token embeddings and scale
token_emb = self.token_embedding(token_ids) * self.scale
# Get positional embeddings
positions = torch.arange(seq_len, device=token_ids.device)
pos_emb = self.position_embedding(positions)
# Combine and apply dropout
return self.dropout(token_emb + pos_emb)
# Create embedding layer
emb = TransformerEmbedding(
vocab_size=1000,
embed_dim=64,
max_seq_len=128,
dropout=0.0 # Disable for visualization
)
# Process some tokens
tokens = torch.randint(0, 1000, (1, 10))
output = emb(tokens)
print(f"Input tokens: {tokens[0].tolist()}")
print(f"Output shape: {tuple(output.shape)}")# Show that same token at different positions has different embeddings
# Put token 42 at positions 0, 5, and 9
tokens = torch.tensor([[42, 1, 2, 3, 4, 42, 6, 7, 8, 42]])
output = emb(tokens)
# Get the embeddings for token 42 at each position
pos_0 = output[0, 0].detach()
pos_5 = output[0, 5].detach()
pos_9 = output[0, 9].detach()
print("Token 42 at different positions:")
print(f" Position 0: {pos_0[:4].tolist()}")
print(f" Position 5: {pos_5[:4].tolist()}")
print(f" Position 9: {pos_9[:4].tolist()}")
print(f"\nAll different due to positional encoding!")# Visualize the combination
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Token embedding only (without scale for visualization)
token_only = emb.token_embedding(tokens)[0].detach().numpy()
axes[0].imshow(token_only, aspect='auto', cmap='RdBu')
axes[0].set_xlabel('Dimension')
axes[0].set_ylabel('Position')
axes[0].set_title('Token Embeddings Only')
# Position embedding only
positions = torch.arange(10)
pos_only = emb.position_embedding(positions).detach().numpy()
axes[1].imshow(pos_only, aspect='auto', cmap='RdBu')
axes[1].set_xlabel('Dimension')
axes[1].set_ylabel('Position')
axes[1].set_title('Positional Embeddings Only')
# Combined
combined = output[0].detach().numpy()
axes[2].imshow(combined, aspect='auto', cmap='RdBu')
axes[2].set_xlabel('Dimension')
axes[2].set_ylabel('Position')
axes[2].set_title('Token + Position (Combined)')
plt.tight_layout()
plt.show()

Embedding Similarity
Embeddings capture meaning - similar tokens should have similar embeddings:
# Let's simulate "training" by manually setting some embeddings to be similar
# In practice, these patterns emerge from training on real text
vocab_size = 20
embed_dim = 16
token_emb = nn.Embedding(vocab_size, embed_dim)
# Manually set some tokens to have similar embeddings
# (simulating what would happen after training on related words)
with torch.no_grad():
# Tokens 0-4: "numbers" (similar to each other)
base_number = torch.randn(embed_dim)
for i in range(5):
token_emb.weight[i] = base_number + torch.randn(embed_dim) * 0.1
# Tokens 5-9: "letters" (similar to each other, different from numbers)
base_letter = torch.randn(embed_dim)
for i in range(5, 10):
token_emb.weight[i] = base_letter + torch.randn(embed_dim) * 0.1
# Compute all pairwise similarities
all_embeds = token_emb.weight[:10]
all_embeds_norm = all_embeds / all_embeds.norm(dim=1, keepdim=True)
similarity = (all_embeds_norm @ all_embeds_norm.T).detach().numpy()
plt.figure(figsize=(8, 6))
plt.imshow(similarity, cmap='RdBu', vmin=-1, vmax=1)
plt.colorbar(label='Cosine Similarity')
plt.xlabel('Token ID')
plt.ylabel('Token ID')
plt.title('Token Embedding Similarity\n(0-4: "numbers", 5-9: "letters")')
# Add labels
labels = ['N0', 'N1', 'N2', 'N3', 'N4', 'L0', 'L1', 'L2', 'L3', 'L4']
plt.xticks(range(10), labels)
plt.yticks(range(10), labels)
plt.show()
print("Notice: Numbers (N) are similar to each other, letters (L) are similar to each other,")
print("but numbers and letters are different from each other.")2D Visualization with PCA
from sklearn.decomposition import PCA
# Get embeddings for all 10 tokens
embeddings = all_embeds.detach().numpy()
# Reduce to 2D
pca = PCA(n_components=2)
emb_2d = pca.fit_transform(embeddings)
# Plot
plt.figure(figsize=(10, 8))
# Numbers in blue
plt.scatter(emb_2d[:5, 0], emb_2d[:5, 1], c='blue', s=100, label='Numbers')
for i in range(5):
plt.annotate(f'N{i}', (emb_2d[i, 0], emb_2d[i, 1]), fontsize=12)
# Letters in red
plt.scatter(emb_2d[5:, 0], emb_2d[5:, 1], c='red', s=100, label='Letters')
for i in range(5, 10):
plt.annotate(f'L{i-5}', (emb_2d[i, 0], emb_2d[i, 1]), fontsize=12)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('Token Embeddings in 2D\n(Similar tokens cluster together)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Interactive Exploration
Explore how sinusoidal position encodings create unique patterns for each position. The key insight: low dimensions change rapidly (high frequency), while high dimensions change slowly (low frequency).
Tip: Try This
Frequency gradient: Look at the heatmap from left to right. Low dimensions (left) have rapid oscillation, high dimensions (right) change slowly.
Adjacent positions: Set positions to 0 and 1. Notice high similarity (≈0.99+). The encodings are almost identical, differing only slightly.
Distant positions: Compare positions 0 and 32. Similarity drops significantly because more dimension-waves have cycled.
Unique fingerprints: Slide through different positions in the line plot. Each position has a unique “fingerprint” pattern.
Sin/Cos pairs: In the line plot, blue dots are sin (even dims), orange dots are cos (odd dims). They’re 90° out of phase.
Exercises
Exercise 1: Compare Learned vs Sinusoidal Positional Embeddings
Learned and sinusoidal embeddings have different tradeoffs:
| Aspect | Learned | Sinusoidal |
|---|---|---|
| Training | Updated via backprop | Fixed (no parameters) |
| Extrapolation | Cannot generalize beyond max_seq_len | Can theoretically extrapolate |
| Memory | Adds parameters | Zero parameter overhead |
| Performance | Often slightly better in practice | Good baseline |
# Compare learned vs sinusoidal positional embeddings
learned = nn.Embedding(50, 32) # Learned (random initialization)
sinusoidal = create_sinusoidal_encoding(50, 32) # Fixed pattern
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
axes[0].imshow(learned.weight.detach().numpy(), aspect='auto', cmap='RdBu')
axes[0].set_title('Learned Positional Embeddings\n(Random initialization)')
axes[0].set_xlabel('Dimension')
axes[0].set_ylabel('Position')
axes[1].imshow(sinusoidal.numpy(), aspect='auto', cmap='RdBu')
axes[1].set_title('Sinusoidal Positional Embeddings\n(Fixed pattern)')
axes[1].set_xlabel('Dimension')
axes[1].set_ylabel('Position')
plt.tight_layout()
plt.show()
print("Learned embeddings start random but are trained to capture position.")
print("Sinusoidal embeddings have a fixed pattern that encodes relative positions.")Exercise 2: Effect of Embedding Dimension
The embedding dimension affects both model capacity and computational cost. Larger dimensions can represent more nuanced semantic distinctions but require more memory and computation in every layer of the model.
The sqrt(embed_dim) scaling factor is crucial: without it, the magnitude of embeddings would vary significantly with dimension, since randomly initialized vectors of higher dimension have larger expected norms.
# What happens with different embedding dimensions?
for dim in [8, 32, 128, 512]:
emb = TransformerEmbedding(
vocab_size=1000,
embed_dim=dim,
max_seq_len=128,
dropout=0.0
)
tokens = torch.randint(0, 1000, (1, 32))
output = emb(tokens)
# Compute variance of output
variance = output.var().item()
print(f"Embed dim {dim:3d}: output variance = {variance:.4f}")
print("\nThe scale factor (sqrt(embed_dim)) helps keep variance stable!")Exercise 3: Memory Usage of Embeddings
Understanding embedding memory is crucial for model sizing. With weight tying (sharing embeddings between input and output layers), you only pay this cost once. Without it, you pay twice.
# Memory usage of embeddings
configs = [
{"vocab": 1000, "dim": 64, "name": "Tiny"},
{"vocab": 8000, "dim": 256, "name": "Small"},
{"vocab": 32000, "dim": 512, "name": "Medium"},
{"vocab": 50000, "dim": 768, "name": "Large (GPT-2)"},
{"vocab": 100000, "dim": 4096, "name": "Large (LLaMA)"},
]
print("Embedding Table Memory Usage:")
print("=" * 60)
for cfg in configs:
params = cfg["vocab"] * cfg["dim"]
memory_mb = params * 4 / (1024 * 1024) # 4 bytes per float32
memory_fp16 = memory_mb / 2 # fp16/bf16 halves memory
print(f"{cfg['name']:15s}: {cfg['vocab']:6d} vocab x {cfg['dim']:4d} dim = "
f"{params:>12,} params ({memory_mb:>7.1f} MB fp32, {memory_fp16:>6.1f} MB fp16)")Note: Modern models typically use fp16 or bf16, which halves the memory requirement. Quantization (int8, int4) can reduce it further.
Weight tying shares the embedding matrix between the input layer and output projection, halving the embedding parameter count:
# Weight tying: share embedding weights with output projection
# This is what GPT-2, LLaMA, and most modern LLMs do
import torch.nn as nn
class SimpleLMWithWeightTying(nn.Module):
def __init__(self, vocab_size, embed_dim):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim)
# Output projection shares weights with embedding (transposed)
self.output_proj = nn.Linear(embed_dim, vocab_size, bias=False)
# Tie weights: output projection uses the same weights as embedding
self.output_proj.weight = self.embedding.weight
def forward(self, x):
emb = self.embedding(x) # (batch, seq, embed_dim)
logits = self.output_proj(emb) # (batch, seq, vocab_size)
return logits
model = SimpleLMWithWeightTying(vocab_size=1000, embed_dim=256)
print(f"Embedding params: {model.embedding.weight.numel():,}")
print(f"Output proj params: {model.output_proj.weight.numel():,}")
print(f"Are weights shared? {model.embedding.weight is model.output_proj.weight}")Using the Module’s Embeddings
The embeddings.py file contains production-ready embedding classes:
from embeddings import (
TokenEmbedding,
LearnedPositionalEmbedding,
SinusoidalPositionalEmbedding,
TransformerEmbedding as ModuleTransformerEmbedding,
demonstrate_embeddings
)
# Run the demonstration
demo_emb = demonstrate_embeddings(
vocab_size=100,
embed_dim=32,
seq_len=8,
verbose=True
)

Summary
Key takeaways:
- Token embeddings are lookup tables that convert token IDs to vectors
- Positional embeddings add information about where tokens are in the sequence
- Sinusoidal positional embeddings use fixed sin/cos patterns - no parameters, can theoretically extrapolate
- Learned positional embeddings are trained like any other parameter - often slightly better in practice
- Modern approaches (RoPE, ALiBi) handle position differently and extrapolate better to long sequences
- Similar tokens end up with similar embeddings after training - capturing semantic relationships
- Scaling by sqrt(embed_dim) helps maintain stable gradients when dimensions vary
- Weight tying between input embeddings and output layer is common and reduces parameters
Common Pitfalls
- Forgetting the sqrt scale: Without it, positional embeddings can dominate or be ignored depending on embed_dim
- Exceeding max_seq_len: Learned positional embeddings fail hard on sequences longer than the table they were built with (see the sketch below)
- Ignoring padding: Padding tokens should be zero vectors and excluded from gradients
- Poor initialization: Large initial values cause training instability; use small std (0.01-0.02)
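The max_seq_len pitfall is easy to reproduce (a minimal sketch):
import torch
import torch.nn as nn

pos_emb = nn.Embedding(num_embeddings=16, embedding_dim=8)  # position table sized for 16 positions

try:
    pos_emb(torch.arange(32))    # ask for 32 positions
except IndexError as err:
    print(f"IndexError: {err}")  # index out of range in self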
Going Deeper
- Word2Vec - Original word embeddings paper (Mikolov et al., 2013)
- Attention Is All You Need - Section 3.5 on positional encoding
- RoFormer: Rotary Position Embedding (RoPE) - Used in LLaMA, Mistral
- ALiBi: Train Short, Test Long - Position as attention bias
- Using the Output Embedding to Improve Language Models - Weight tying paper
What’s Next
In Module 05: Attention, we’ll learn the core mechanism that allows tokens to “look at” each other. Embeddings are the input to attention - the vectors that get attended to. The positional information encoded here becomes crucial when attention computes relationships between tokens.