---
title: "Module 03: Tokenization"
format:
html:
code-fold: false
toc: true
ipynb: default
jupyter: python3
---
{{< include ../_diagram-lib.qmd >}}
## Introduction
Language models operate on numbers, not text. Before a model can process text, we need to break it into tokens and map each token to a number.
**Tokenization** is how we convert raw text into a sequence of integers that the model can process. Modern LLMs use **subword tokenization** - they break text into pieces smaller than words but larger than characters.
Why subword tokenization?
- **Word-level**: Can't handle new words (OOV problem), huge vocabulary needed (millions for multilingual)
- **Character-level**: Sequences become 4-5x longer, attention cost grows as O(n^2), and the model must learn spelling from scratch
- **Subword**: Best of both worlds - handles new words via decomposition, reasonable sequence length
The most common algorithm is **BPE (Byte Pair Encoding)**, originally invented for data compression in 1994 and adapted for NLP in 2016:
1. Start with individual characters as the initial vocabulary
2. Count all adjacent token pairs in the training corpus
3. Merge the most frequent pair into a new token
4. Add the merged token to the vocabulary
5. Repeat until vocabulary size reached
### What You'll Learn
By the end of this module, you will be able to:
- Build a character-level tokenizer from scratch
- Understand why subword tokenization is necessary
- Implement the BPE algorithm for training and encoding
- Handle special tokens (PAD, UNK, BOS, EOS)
- Recognize trade-offs in vocabulary size
But before diving into BPE, let's build the simplest possible tokenizer from scratch.
## The Simplest Tokenizer
The most straightforward approach: treat each character as a token.
```{python}
# Build vocabulary from text
text = "hello world"
chars = sorted(set(text))
print(f"Unique characters: {chars}")
print(f"Vocabulary size: {len(chars)}")
```
```{python}
# The core of any tokenizer: two lookup tables
stoi = {ch: i for i, ch in enumerate(chars)} # string to integer
itos = {i: ch for i, ch in enumerate(chars)} # integer to string
print("stoi (encode):", stoi)
print("itos (decode):", itos)
```
```{python}
# Encode: text -> integers
def encode(text):
return [stoi[ch] for ch in text]
# Decode: integers -> text
def decode(ids):
return ''.join(itos[i] for i in ids)
# Try it out
encoded = encode("hello")
print(f"'hello' -> {encoded}")
print(f"{encoded} -> '{decode(encoded)}'")
```
```{python}
# Round-trip test
original = "hello world"
reconstructed = decode(encode(original))
print(f"Original: '{original}'")
print(f"Reconstructed: '{reconstructed}'")
print(f"Perfect round-trip: {original == reconstructed}")
```
That's it! A complete tokenizer in about 10 lines of Python. Every tokenizer — no matter how sophisticated — has these same two operations:
- **encode**: text to token IDs
- **decode**: token IDs back to text
### The Key Insight
Tokenization is really about **compression** and **semantic grouping**:
| Tokenization | Vocabulary Size | Sequence Length | Semantics |
|-------------|-----------------|-----------------|-----------|
| Character | ~100 (ASCII) | Very long | None (individual letters) |
| Word | ~1,000,000+ | Short | Strong (whole words) |
| Subword | ~30,000-100,000 | Medium | Moderate (meaningful pieces) |
## Why Characters Aren't Enough
Our character tokenizer works, but has serious problems at scale.
### Problem 1: Long Sequences
```{python}
sample_text = "The transformer architecture revolutionized natural language processing."
char_tokens = list(sample_text)
print(f"Text length: {len(sample_text)} characters")
print(f"Token count: {len(char_tokens)} tokens")
print(f"Compression ratio: {len(sample_text) / len(char_tokens):.2f}x (no compression!)")
```
Since attention is O(n^2) in sequence length, doubling the sequence length quadruples the compute cost. Character-level tokenization produces the longest possible sequences.
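To make the quadratic cost concrete, here is a back-of-the-envelope comparison. The ~4 characters per subword token is an assumed typical ratio for English text, not a measured value:

```python
# Rough illustration of the O(n^2) attention cost (assumes ~4 chars per subword token)
n_chars = 1000                 # sequence length at character level
n_subwords = n_chars // 4      # the same text, subword-tokenized
print(f"char-level:    ~{n_chars**2:,} pairwise interactions")
print(f"subword-level: ~{n_subwords**2:,} pairwise interactions")
print(f"character-level attention does ~{n_chars**2 // n_subwords**2}x more work")
```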
### Problem 2: No Semantic Units
```{python}
# The model sees this:
word = "transformer"
char_view = list(word)
print(f"Characters: {char_view}")
print(f"Token count: {len(char_view)}")
```
The model must learn from scratch that `t-r-a-n-s-f-o-r-m-e-r` is a meaningful unit. It gets no help from the tokenization. Compare this to word-level where "transformer" would be a single token with its own learned representation.
### Problem 3: Vocabulary Explosion for Bytes
```{python}
# If we go to byte-level (handling all Unicode)
text_with_emoji = "Hello! \U0001F60A"
byte_view = text_with_emoji.encode('utf-8')
print(f"Text: {text_with_emoji}")
print(f"Bytes: {list(byte_view)}")
print(f"Byte count: {len(byte_view)} (emoji = 4 bytes!)")
```
Byte-level tokenization can represent anything, but sequences become even longer. A single emoji becomes 4 tokens.
### The Tradeoff
This is the fundamental tradeoff in tokenization:
```
Characters: Small vocab, long sequences, no semantics
Words: Huge vocab, short sequences, good semantics, can't handle new words
Subwords: Medium vocab, medium sequences, some semantics, handles new words
```
**BPE finds a sweet spot** by learning which character sequences appear frequently together and merging them into single tokens.
## Intuition: Learning Patterns Through Merging
Think of BPE as compression that learns common patterns:
```{ojs}
//| echo: false
// Interactive BPE merging visualization for "hello"
viewof bpeMergeStep = Inputs.range([0, 4], {
value: 0,
step: 1,
label: "Merge step"
})
```
```{ojs}
//| echo: false
bpeMergeDiagram = {
const width = 620;
const height = 180;
// Merge states: each state is an array of tokens
const mergeStates = [
{ tokens: ['h', 'e', 'l', 'l', 'o'], label: 'Initial: Character Tokens', merge: null },
{ tokens: ['h', 'e', 'll', 'o'], label: "Merge 1: 'l' + 'l' → 'll'", merge: ['l', 'l', 'll'] },
{ tokens: ['he', 'll', 'o'], label: "Merge 2: 'h' + 'e' → 'he'", merge: ['h', 'e', 'he'] },
{ tokens: ['he', 'llo'], label: "Merge 3: 'll' + 'o' → 'llo'", merge: ['ll', 'o', 'llo'] },
{ tokens: ['hello'], label: "Merge 4: 'he' + 'llo' → 'hello'", merge: ['he', 'llo', 'hello'] }
];
const state = mergeStates[bpeMergeStep];
const tokens = state.tokens;
const svg = d3.create('svg')
.attr('width', width)
.attr('height', height)
.attr('viewBox', `0 0 ${width} ${height}`);
// Background
svg.append('rect')
.attr('width', width)
.attr('height', height)
.attr('fill', diagramTheme.bg)
.attr('rx', 8);
// Title
svg.append('text')
.attr('x', width / 2)
.attr('y', 28)
.attr('text-anchor', 'middle')
.attr('fill', diagramTheme.nodeText)
.attr('font-size', '14px')
.attr('font-weight', '600')
.text(state.label);
// Token display area
const tokenY = 90;
const tokenH = 50;
const gap = 8;
// Calculate total width needed for tokens
const tokenWidths = tokens.map(t => Math.max(50, t.length * 22 + 24));
const totalWidth = tokenWidths.reduce((a, b) => a + b, 0) + gap * (tokens.length - 1);
let startX = (width - totalWidth) / 2;
// Draw tokens
tokens.forEach((token, i) => {
const tokenW = tokenWidths[i];
const x = startX + tokenW / 2;
// Check if this token was just merged
const justMerged = state.merge && token === state.merge[2];
const g = svg.append('g')
.attr('transform', `translate(${x}, ${tokenY})`);
// Token box with animation effect for merged tokens
const rect = g.append('rect')
.attr('x', -tokenW / 2)
.attr('y', -tokenH / 2)
.attr('width', tokenW)
.attr('height', tokenH)
.attr('rx', 8)
.attr('fill', justMerged ? diagramTheme.highlight : diagramTheme.nodeFill)
.attr('stroke', justMerged ? diagramTheme.highlight : diagramTheme.nodeStroke)
.attr('stroke-width', justMerged ? 2.5 : 1.5);
if (justMerged) {
rect.attr('filter', `drop-shadow(0 0 8px ${diagramTheme.highlightGlow})`);
}
// Token text
g.append('text')
.attr('x', 0)
.attr('y', 0)
.attr('text-anchor', 'middle')
.attr('dominant-baseline', 'central')
.attr('fill', justMerged ? diagramTheme.textOnHighlight : diagramTheme.nodeText)
.attr('font-size', '18px')
.attr('font-family', 'monospace')
.attr('font-weight', '500')
.text(`'${token}'`);
startX += tokenW + gap;
});
// Show merge indicator if applicable
if (state.merge) {
svg.append('text')
.attr('x', width / 2)
.attr('y', height - 25)
.attr('text-anchor', 'middle')
.attr('fill', diagramTheme.accent)
.attr('font-size', '12px')
.attr('font-family', 'monospace')
.text(`Merged: '${state.merge[0]}' + '${state.merge[1]}' → '${state.merge[2]}'`);
} else {
svg.append('text')
.attr('x', width / 2)
.attr('y', height - 25)
.attr('text-anchor', 'middle')
.attr('fill', diagramTheme.nodeText)
.attr('font-size', '12px')
.attr('opacity', 0.7)
.text(`${tokens.length} tokens`);
}
return svg.node();
}
```
```{ojs}
//| echo: false
md`**Token count:** 5 characters → ${bpeMergeStep === 4 ? '1 token (fully merged)' : `${5 - bpeMergeStep} tokens`}`
```
For code, BPE learns things like:
- `def ` (function definition with space)
- `self.` (common in Python classes)
- `return ` (return statement)
- ` ` (4-space indent)
## The BPE Training Algorithm
Here's how BPE learns to tokenize:
```{ojs}
//| echo: false
// Interactive BPE training algorithm visualization
viewof bpeTrainStep = Inputs.range([0, 6], {
value: 0,
step: 1,
label: "Training iteration"
})
```
```{ojs}
//| echo: false
bpeTrainingDiagram = {
const width = 700;
const height = 320;
// Training states showing the BPE algorithm on "low lower lowest"
const trainStates = [
{
phase: 'start',
tokens: ['l', 'o', 'w', ' ', 'l', 'o', 'w', 'e', 'r', ' ', 'l', 'o', 'w', 'e', 's', 't'],
pairs: [["('l','o')", 3], ["('o','w')", 3], ["('w',' ')", 2], ["('w','e')", 2]],
highlight: null,
description: 'Start with individual characters'
},
{
phase: 'count',
tokens: ['l', 'o', 'w', ' ', 'l', 'o', 'w', 'e', 'r', ' ', 'l', 'o', 'w', 'e', 's', 't'],
pairs: [["('l','o')", 3], ["('o','w')", 3], ["('w',' ')", 2], ["('w','e')", 2]],
highlight: "('l','o')",
description: "Count pairs: ('l','o') appears 3 times (most frequent)"
},
{
phase: 'merge',
tokens: ['lo', 'w', ' ', 'lo', 'w', 'e', 'r', ' ', 'lo', 'w', 'e', 's', 't'],
pairs: [["('lo','w')", 3], ["('w',' ')", 2], ["('w','e')", 2]],
highlight: 'lo',
description: "Merge ('l','o') → 'lo' everywhere"
},
{
phase: 'count',
tokens: ['lo', 'w', ' ', 'lo', 'w', 'e', 'r', ' ', 'lo', 'w', 'e', 's', 't'],
pairs: [["('lo','w')", 3], ["('w',' ')", 2], ["('w','e')", 2]],
highlight: "('lo','w')",
description: "Count pairs: ('lo','w') appears 3 times"
},
{
phase: 'merge',
tokens: ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't'],
pairs: [["('low',' ')", 2], ["('low','e')", 2], ["(' ','low')", 2]],
highlight: 'low',
description: "Merge ('lo','w') → 'low'"
},
{
phase: 'count',
tokens: ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't'],
pairs: [["('low','e')", 2], ["('low',' ')", 2]],
highlight: "('low','e')",
description: "Count pairs: ('low','e') appears 2 times"
},
{
phase: 'merge',
tokens: ['low', ' ', 'lowe', 'r', ' ', 'lowe', 's', 't'],
pairs: [["('lowe','r')", 1], ["('lowe','s')", 1]],
highlight: 'lowe',
description: "Merge ('low','e') → 'lowe' — Continue until vocab size reached"
}
];
const state = trainStates[bpeTrainStep];
const svg = d3.create('svg')
.attr('width', width)
.attr('height', height)
.attr('viewBox', `0 0 ${width} ${height}`);
// Background
svg.append('rect')
.attr('width', width)
.attr('height', height)
.attr('fill', diagramTheme.bg)
.attr('rx', 8);
// Title / Phase indicator
const phaseColors = {
'start': diagramTheme.nodeStroke,
'count': diagramTheme.accent,
'merge': diagramTheme.highlight
};
svg.append('text')
.attr('x', width / 2)
.attr('y', 28)
.attr('text-anchor', 'middle')
.attr('fill', phaseColors[state.phase])
.attr('font-size', '14px')
.attr('font-weight', '600')
.text(state.phase === 'start' ? 'BPE Training Algorithm' :
state.phase === 'count' ? 'Phase: Count Pairs' : 'Phase: Merge');
// Token display area
const tokenY = 85;
const tokenH = 36;
const gap = 3;
// Calculate token layout
const tokens = state.tokens;
const tokenWidths = tokens.map(t => t === ' ' ? 28 : Math.max(28, t.length * 14 + 16));
const totalWidth = tokenWidths.reduce((a, b) => a + b, 0) + gap * (tokens.length - 1);
const scale = totalWidth > width - 40 ? (width - 40) / totalWidth : 1;
let startX = (width - totalWidth * scale) / 2;
// Draw tokens
tokens.forEach((token, i) => {
const tokenW = tokenWidths[i] * scale;
const x = startX + tokenW / 2;
const isHighlighted = state.highlight === token;
const isSpace = token === ' ';
const g = svg.append('g')
.attr('transform', `translate(${x}, ${tokenY})`);
const rect = g.append('rect')
.attr('x', -tokenW / 2)
.attr('y', -tokenH / 2)
.attr('width', tokenW)
.attr('height', tokenH)
.attr('rx', 5)
.attr('fill', isHighlighted ? diagramTheme.highlight :
isSpace ? diagramTheme.bgSecondary : diagramTheme.nodeFill)
.attr('stroke', isHighlighted ? diagramTheme.highlight : diagramTheme.nodeStroke)
.attr('stroke-width', isHighlighted ? 2 : 1);
if (isHighlighted) {
rect.attr('filter', `drop-shadow(0 0 6px ${diagramTheme.highlightGlow})`);
}
g.append('text')
.attr('x', 0)
.attr('y', 0)
.attr('text-anchor', 'middle')
.attr('dominant-baseline', 'central')
.attr('fill', isHighlighted ? diagramTheme.textOnHighlight : diagramTheme.nodeText)
.attr('font-size', `${11 * scale}px`)
.attr('font-family', 'monospace')
.text(isSpace ? '␣' : token);
startX += tokenW + gap * scale;
});
// Pair frequencies section
const pairY = 170;
svg.append('text')
.attr('x', 20)
.attr('y', pairY)
.attr('fill', diagramTheme.nodeText)
.attr('font-size', '12px')
.attr('font-weight', '600')
.text('Pair frequencies:');
const pairs = state.pairs;
const pairGap = 150;
pairs.forEach((pair, i) => {
const x = 20 + i * pairGap;
const isHighlightedPair = state.highlight === pair[0];
svg.append('text')
.attr('x', x)
.attr('y', pairY + 24)
.attr('fill', isHighlightedPair ? diagramTheme.highlight : diagramTheme.nodeText)
.attr('font-size', '12px')
.attr('font-family', 'monospace')
.attr('font-weight', isHighlightedPair ? '700' : '400')
.text(`${pair[0]}: ${pair[1]}`);
});
// Description
svg.append('rect')
.attr('x', 20)
.attr('y', height - 65)
.attr('width', width - 40)
.attr('height', 45)
.attr('rx', 6)
.attr('fill', diagramTheme.bgSecondary)
.attr('stroke', diagramTheme.nodeStroke)
.attr('stroke-width', 1);
svg.append('text')
.attr('x', width / 2)
.attr('y', height - 38)
.attr('text-anchor', 'middle')
.attr('fill', diagramTheme.nodeText)
.attr('font-size', '13px')
.text(state.description);
// Token count
svg.append('text')
.attr('x', width - 20)
.attr('y', 28)
.attr('text-anchor', 'end')
.attr('fill', diagramTheme.nodeText)
.attr('font-size', '11px')
.attr('opacity', 0.7)
.text(`${tokens.length} tokens`);
return svg.node();
}
```
## The Math
BPE is simple - just counting and merging:
```python
# Count pair frequencies
pairs = count_pairs(tokens) # {'he': 50, 'el': 30, 'll': 80, ...}
# Find most frequent
best_pair = max(pairs, key=pairs.get) # ('l', 'l')
# Merge everywhere
tokens = merge(tokens, best_pair, 'll')
```
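The pseudocode leaves `count_pairs` and `merge` undefined. A minimal version of the whole loop might look like the sketch below (illustrative only - not the code inside the `tokenizer` module used later in this lesson):

```python
from collections import Counter

def count_pairs(tokens):
    """Count adjacent token pairs, e.g. ['l','o','w'] -> {('l','o'): 1, ('o','w'): 1}."""
    return Counter(zip(tokens, tokens[1:]))

def merge(tokens, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# One training iteration on a toy corpus
tokens = list("low lower lowest")
pairs = count_pairs(tokens)
best = max(pairs, key=pairs.get)               # most frequent pair, e.g. ('l', 'o')
tokens = merge(tokens, best, ''.join(best))
print(best, tokens)
```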
**Vocabulary size** is a hyperparameter:
- Too small: Sequences too long, less meaning per token
- Too large: Many rare tokens, harder to learn
- Typical: 8K-50K tokens for LLMs
## Encoding New Text
Once trained, encoding applies merges in the order they were learned:
```{ojs}
//| echo: false
// Interactive encoding demonstration for "lower"
viewof encodeStep = Inputs.range([0, 5], {
value: 0,
step: 1,
label: "Encoding step"
})
```
```{ojs}
//| echo: false
encodingDiagram = {
const width = 620;
const height = 240;
// Encoding steps showing how merges are applied in order
const encodeSteps = [
{ tokens: ['l', 'o', 'w', 'e', 'r'], label: 'Split to characters', merge: null, ids: null },
{ tokens: ['lo', 'w', 'e', 'r'], label: "Apply merge 1: 'l' + 'o' → 'lo'", merge: ['l', 'o', 'lo'], ids: null },
{ tokens: ['low', 'e', 'r'], label: "Apply merge 2: 'lo' + 'w' → 'low'", merge: ['lo', 'w', 'low'], ids: null },
{ tokens: ['lowe', 'r'], label: "Apply merge 3: 'low' + 'e' → 'lowe'", merge: ['low', 'e', 'lowe'], ids: null },
{ tokens: ['lower'], label: "Apply merge 4: 'lowe' + 'r' → 'lower'", merge: ['lowe', 'r', 'lower'], ids: null },
{ tokens: ['lower'], label: "Look up token IDs", merge: null, ids: [15] }
];
const state = encodeSteps[encodeStep];
const tokens = state.tokens;
const svg = d3.create('svg')
.attr('width', width)
.attr('height', height)
.attr('viewBox', `0 0 ${width} ${height}`);
// Background
svg.append('rect')
.attr('width', width)
.attr('height', height)
.attr('fill', diagramTheme.bg)
.attr('rx', 8);
// Step indicator
svg.append('text')
.attr('x', 20)
.attr('y', 28)
.attr('fill', diagramTheme.accent)
.attr('font-size', '12px')
.attr('font-weight', '600')
.text(`Step ${encodeStep + 1}/6`);
// Title
svg.append('text')
.attr('x', width / 2)
.attr('y', 28)
.attr('text-anchor', 'middle')
.attr('fill', diagramTheme.nodeText)
.attr('font-size', '14px')
.attr('font-weight', '600')
.text(state.label);
// Arrow showing merge progression
if (encodeStep > 0 && encodeStep < 5) {
// Show the "before" state faded
const prevTokens = encodeSteps[encodeStep - 1].tokens;
const prevY = 70;
const prevGap = 6;
const prevWidths = prevTokens.map(t => Math.max(40, t.length * 16 + 20));
const prevTotal = prevWidths.reduce((a, b) => a + b, 0) + prevGap * (prevTokens.length - 1);
let prevX = (width - prevTotal) / 2;
prevTokens.forEach((token, i) => {
const w = prevWidths[i];
const x = prevX + w / 2;
const isMerging = state.merge && (token === state.merge[0] || token === state.merge[1]);
svg.append('rect')
.attr('x', x - w / 2)
.attr('y', prevY - 16)
.attr('width', w)
.attr('height', 32)
.attr('rx', 5)
.attr('fill', diagramTheme.bgSecondary)
.attr('stroke', isMerging ? diagramTheme.accent : diagramTheme.nodeStroke)
.attr('stroke-width', isMerging ? 2 : 1)
.attr('opacity', 0.6);
svg.append('text')
.attr('x', x)
.attr('y', prevY)
.attr('text-anchor', 'middle')
.attr('dominant-baseline', 'central')
.attr('fill', diagramTheme.nodeText)
.attr('font-size', '13px')
.attr('font-family', 'monospace')
.attr('opacity', 0.5)
.text(`'${token}'`);
prevX += w + prevGap;
});
// Arrow down
svg.append('path')
.attr('d', `M${width/2},${prevY + 22} L${width/2},${prevY + 45}`)
.attr('stroke', diagramTheme.accent)
.attr('stroke-width', 2)
.attr('marker-end', 'url(#encode-arrow)');
// Arrow marker
const defs = svg.append('defs');
defs.append('marker')
.attr('id', 'encode-arrow')
.attr('viewBox', '0 -5 10 10')
.attr('refX', 8)
.attr('refY', 0)
.attr('markerWidth', 6)
.attr('markerHeight', 6)
.attr('orient', 'auto')
.append('path')
.attr('d', 'M0,-5L10,0L0,5')
.attr('fill', diagramTheme.accent);
}
// Current tokens (main display)
const tokenY = encodeStep > 0 && encodeStep < 5 ? 150 : 110;
const tokenH = 50;
const gap = 10;
const tokenWidths = tokens.map(t => Math.max(60, t.length * 20 + 28));
const totalWidth = tokenWidths.reduce((a, b) => a + b, 0) + gap * (tokens.length - 1);
let startX = (width - totalWidth) / 2;
tokens.forEach((token, i) => {
const tokenW = tokenWidths[i];
const x = startX + tokenW / 2;
const justMerged = state.merge && token === state.merge[2];
const showId = state.ids !== null;
const g = svg.append('g')
.attr('transform', `translate(${x}, ${tokenY})`);
const rect = g.append('rect')
.attr('x', -tokenW / 2)
.attr('y', -tokenH / 2)
.attr('width', tokenW)
.attr('height', tokenH)
.attr('rx', 8)
.attr('fill', justMerged ? diagramTheme.highlight :
showId ? diagramTheme.accent : diagramTheme.nodeFill)
.attr('stroke', justMerged ? diagramTheme.highlight :
showId ? diagramTheme.accent : diagramTheme.nodeStroke)
.attr('stroke-width', justMerged || showId ? 2.5 : 1.5);
if (justMerged || showId) {
rect.attr('filter', `drop-shadow(0 0 8px ${justMerged ? diagramTheme.highlightGlow : diagramTheme.accentGlow})`);
}
// Token text
g.append('text')
.attr('x', 0)
.attr('y', showId ? -8 : 0)
.attr('text-anchor', 'middle')
.attr('dominant-baseline', 'central')
.attr('fill', justMerged ? diagramTheme.textOnHighlight :
showId ? diagramTheme.textOnAccent : diagramTheme.nodeText)
.attr('font-size', '16px')
.attr('font-family', 'monospace')
.attr('font-weight', '500')
.text(`'${token}'`);
// ID display
if (showId && state.ids[i] !== undefined) {
g.append('text')
.attr('x', 0)
.attr('y', 12)
.attr('text-anchor', 'middle')
.attr('dominant-baseline', 'central')
.attr('fill', diagramTheme.textOnAccent)
.attr('font-size', '13px')
.attr('opacity', 0.9)
.text(`ID: ${state.ids[i]}`);
}
startX += tokenW + gap;
});
// Bottom info
const infoY = height - 30;
if (state.ids) {
svg.append('text')
.attr('x', width / 2)
.attr('y', infoY)
.attr('text-anchor', 'middle')
.attr('fill', diagramTheme.accent)
.attr('font-size', '14px')
.attr('font-weight', '600')
.text(`Output: [${state.ids.join(', ')}]`);
} else {
svg.append('text')
.attr('x', width / 2)
.attr('y', infoY)
.attr('text-anchor', 'middle')
.attr('fill', diagramTheme.nodeText)
.attr('font-size', '12px')
.attr('opacity', 0.7)
.text(`${tokens.length} token${tokens.length > 1 ? 's' : ''}`);
}
return svg.node();
}
```
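In code, the greedy encoding loop looks roughly like the sketch below. The `merge_ranks` table is hypothetical - its entries mirror the "low lower lowest" example above - and real tokenizers store the same information as an ordered list of merges:

```python
def bpe_encode(word, merge_ranks):
    """Greedy BPE encoding: always apply the earliest-learned merge first."""
    tokens = list(word)
    while len(tokens) > 1:
        # Rank every adjacent pair; pairs never learned get rank infinity
        ranked = [(merge_ranks.get(pair, float('inf')), i)
                  for i, pair in enumerate(zip(tokens, tokens[1:]))]
        rank, i = min(ranked)
        if rank == float('inf'):   # no learned merge applies anymore
            break
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

# Hypothetical merges, in the order they were learned during training
ranks = {('l', 'o'): 0, ('lo', 'w'): 1, ('low', 'e'): 2, ('lowe', 'r'): 3}
print(bpe_encode("lower", ranks))   # ['lower']
print(bpe_encode("lows", ranks))    # ['low', 's'] - an unseen word still encodes
```

The second call previews the next section: "lows" never appeared in training, but its pieces did.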
## Handling Unknown Words
BPE can handle words it has never seen:
```{ojs}
//| echo: false
// Toggle between known and unknown word handling
viewof wordType = Inputs.radio(["Known Word: 'lowest'", "Unknown Word: 'lows'"], {
value: "Known Word: 'lowest'",
label: "Word type"
})
```
```{ojs}
//| echo: false
unknownWordsDiagram = {
const width = 650;
const height = 280;
const isKnown = wordType.includes('lowest');
const svg = d3.create('svg')
.attr('width', width)
.attr('height', height)
.attr('viewBox', `0 0 ${width} ${height}`);
// Background
svg.append('rect')
.attr('width', width)
.attr('height', height)
.attr('fill', diagramTheme.bg)
.attr('rx', 8);
// Title
svg.append('text')
.attr('x', width / 2)
.attr('y', 28)
.attr('text-anchor', 'middle')
.attr('fill', isKnown ? diagramTheme.accent : diagramTheme.highlight)
.attr('font-size', '15px')
.attr('font-weight', '700')
.text(isKnown ? "Known Word: 'lowest'" : "Unknown Word: 'lows'");
if (isKnown) {
// Known word path: direct lookup
const centerY = 100;
// Input word
svg.append('text')
.attr('x', 80)
.attr('y', centerY)
.attr('text-anchor', 'middle')
.attr('fill', diagramTheme.nodeText)
.attr('font-size', '18px')
.attr('font-family', 'monospace')
.attr('font-weight', '600')
.text("'lowest'");
// Arrow
svg.append('path')
.attr('d', `M140,${centerY} L260,${centerY}`)
.attr('stroke', diagramTheme.accent)
.attr('stroke-width', 3)
.attr('marker-end', 'url(#known-arrow)');
// Result box
const resultG = svg.append('g')
.attr('transform', `translate(350, ${centerY})`);
resultG.append('rect')
.attr('x', -70)
.attr('y', -28)
.attr('width', 140)
.attr('height', 56)
.attr('rx', 8)
.attr('fill', diagramTheme.accent)
.attr('filter', `drop-shadow(0 0 10px ${diagramTheme.accentGlow})`);
resultG.append('text')
.attr('x', 0)
.attr('y', -6)
.attr('text-anchor', 'middle')
.attr('fill', diagramTheme.textOnAccent)
.attr('font-size', '16px')
.attr('font-family', 'monospace')
.attr('font-weight', '600')
.text('[16]');
resultG.append('text')
.attr('x', 0)
.attr('y', 14)
.attr('text-anchor', 'middle')
.attr('fill', diagramTheme.textOnAccent)
.attr('font-size', '11px')
.attr('opacity', 0.9)
.text('Single token');
// Efficiency note
svg.append('text')
.attr('x', width / 2)
.attr('y', centerY + 60)
.attr('text-anchor', 'middle')
.attr('fill', diagramTheme.nodeText)
.attr('font-size', '12px')
.attr('opacity', 0.8)
.text('Direct vocabulary lookup — maximum efficiency');
// Arrow marker
const defs = svg.append('defs');
defs.append('marker')
.attr('id', 'known-arrow')
.attr('viewBox', '0 -5 10 10')
.attr('refX', 8)
.attr('refY', 0)
.attr('markerWidth', 8)
.attr('markerHeight', 8)
.attr('orient', 'auto')
.append('path')
.attr('d', 'M0,-5L10,0L0,5')
.attr('fill', diagramTheme.accent);
} else {
// Unknown word path: split and apply merges
const steps = [
{ y: 70, label: "Input", tokens: ["'lows'"], note: null },
{ y: 120, label: "Split", tokens: ["'l'", "'o'", "'w'", "'s'"], note: "Character-level" },
{ y: 170, label: "Merge", tokens: ["'low'", "'s'"], note: "Apply learned merges" },
{ y: 220, label: "IDs", tokens: ["[13, 9]"], note: "Subword tokens" }
];
// Arrow marker
const defs = svg.append('defs');
defs.append('marker')
.attr('id', 'unknown-arrow')
.attr('viewBox', '0 -5 10 10')
.attr('refX', 8)
.attr('refY', 0)
.attr('markerWidth', 6)
.attr('markerHeight', 6)
.attr('orient', 'auto')
.append('path')
.attr('d', 'M0,-5L10,0L0,5')
.attr('fill', diagramTheme.highlight);
steps.forEach((step, i) => {
// Step label
svg.append('text')
.attr('x', 50)
.attr('y', step.y)
.attr('text-anchor', 'middle')
.attr('fill', diagramTheme.nodeText)
.attr('font-size', '11px')
.attr('font-weight', '600')
.attr('opacity', 0.7)
.text(step.label);
// Tokens
const tokenGap = 10;
const tokenWidths = step.tokens.map(t => t.startsWith('[') ? 100 : Math.max(40, t.length * 14 + 16));
const totalW = tokenWidths.reduce((a, b) => a + b, 0) + tokenGap * (step.tokens.length - 1);
let startX = 200;
step.tokens.forEach((token, j) => {
const w = tokenWidths[j];
const x = startX + w / 2;
const isResult = i === steps.length - 1;
const rect = svg.append('rect')
.attr('x', x - w / 2)
.attr('y', step.y - 16)
.attr('width', w)
.attr('height', 32)
.attr('rx', 6)
.attr('fill', isResult ? diagramTheme.highlight : diagramTheme.nodeFill)
.attr('stroke', isResult ? diagramTheme.highlight : diagramTheme.nodeStroke)
.attr('stroke-width', isResult ? 2 : 1.5);
if (isResult) {
rect.attr('filter', `drop-shadow(0 0 6px ${diagramTheme.highlightGlow})`);
}
svg.append('text')
.attr('x', x)
.attr('y', step.y)
.attr('text-anchor', 'middle')
.attr('dominant-baseline', 'central')
.attr('fill', isResult ? diagramTheme.textOnHighlight : diagramTheme.nodeText)
.attr('font-size', '13px')
.attr('font-family', 'monospace')
.attr('font-weight', '500')
.text(token);
startX += w + tokenGap;
});
// Note
if (step.note) {
svg.append('text')
.attr('x', 480)
.attr('y', step.y)
.attr('text-anchor', 'start')
.attr('fill', diagramTheme.nodeText)
.attr('font-size', '11px')
.attr('opacity', 0.6)
.text(step.note);
}
// Arrow to next step
if (i < steps.length - 1) {
svg.append('path')
.attr('d', `M200,${step.y + 18} L200,${steps[i+1].y - 18}`)
.attr('stroke', diagramTheme.highlight)
.attr('stroke-width', 2)
.attr('marker-end', 'url(#unknown-arrow)');
}
});
}
// Why BPE Works section
const whyY = height - 38;
const reasons = isKnown ?
["Common words → single tokens (efficient)"] :
["Rare words → split into subwords (still encodable)", "Never out of vocabulary"];
svg.append('text')
.attr('x', width / 2)
.attr('y', whyY)
.attr('text-anchor', 'middle')
.attr('fill', diagramTheme.nodeText)
.attr('font-size', '11px')
.attr('font-style', 'italic')
.attr('opacity', 0.7)
.text(reasons.join(' • '));
return svg.node();
}
```
## Special Tokens
Before diving into code, let's understand **special tokens** - reserved tokens with specific meanings in the LLM pipeline:
| Token | Purpose | When Used |
|-------|---------|-----------|
| `<PAD>` (ID 0) | Padding | Batch processing requires same-length sequences. Padding fills shorter sequences. |
| `<UNK>` (ID 1) | Unknown | Characters not seen during training. Production tokenizers avoid this with byte-level BPE. |
| `<BOS>` (ID 2) | Beginning of Sequence | Signals the start of text. Helps model distinguish context boundaries. |
| `<EOS>` (ID 3) | End of Sequence | Signals text completion. Model generates this to stop. Critical for generation. |
These tokens are reserved in the vocabulary before training begins, so they always receive the same IDs (0-3) no matter what data the tokenizer is trained on.
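Conceptually, the reservation looks like the sketch below (illustrative only; the `tokenizer` module used in the next section may build its vocabulary differently):

```python
SPECIAL = ["<PAD>", "<UNK>", "<BOS>", "<EOS>"]

# Special tokens claim IDs 0-3 first; everything BPE learns comes after them
vocab = {tok: i for i, tok in enumerate(SPECIAL)}
for ch in sorted(set("hello world")):
    vocab.setdefault(ch, len(vocab))
print(vocab)   # {'<PAD>': 0, '<UNK>': 1, '<BOS>': 2, '<EOS>': 3, ' ': 4, 'd': 5, ...}
```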
## Code Walkthrough
Let's explore tokenization interactively:
```{python}
# Import our BPE tokenizer
from tokenizer import BPETokenizer, SPECIAL_TOKENS
print("Special tokens:", SPECIAL_TOKENS)
print("\nThese tokens are reserved at IDs 0-3 before training begins.")
```
### Training a BPE Tokenizer
The `BPETokenizer` class has key parameters:
- `vocab_size`: Target vocabulary size (including special tokens)
- `min_frequency`: Minimum times a pair must appear to be merged (default: 2). This prevents rare pairs from being merged — if a pair only appears once, it's likely noise rather than a useful pattern. Higher values create more conservative, generalizable vocabularies.
- `verbose`: Print detailed training progress
```{python}
# Simple text to train on
simple_text = "ab cd ab cd ab cd ab cd " * 20
# Create and train tokenizer
# vocab_size includes the 4 special tokens, so effective learned tokens = vocab_size - 4
tokenizer = BPETokenizer(vocab_size=30, verbose=False)
stats = tokenizer.train(simple_text, show_progress=True)
print(f"\nVocab size: {stats['vocab_size']}")
print(f"Merges learned: {stats['num_merges']}")
print(f"Special tokens: {stats['num_special_tokens']}")
```
```{python}
# See what patterns were learned
print("Learned merges:")
for i, ((a, b), merged) in enumerate(list(tokenizer.merges.items())[:10]):
print(f" {i+1}. {repr(a)} + {repr(b)} -> {repr(merged)}")
```
### Encoding and Decoding
Encoding applies learned merges in the order they were learned. This is crucial - the merge order determines how text is split.
```{python}
# Encode some text
test_text = "ab cd"
ids = tokenizer.encode(test_text)
tokens = [tokenizer.id_to_token(i) for i in ids]
print(f"Text: '{test_text}'")
print(f"Token IDs: {ids}")
print(f"Tokens: {tokens}")
# Decode back
decoded = tokenizer.decode(ids)
print(f"Decoded: '{decoded}'")
print(f"Round-trip successful: {test_text == decoded}")
```
```{python}
# With special tokens (used during actual LLM training/inference)
ids_with_special = tokenizer.encode(test_text, add_special_tokens=True)
print(f"\nWith special tokens: {ids_with_special}")
print(f"Tokens: {[tokenizer.id_to_token(i) for i in ids_with_special]}")
# Decoding skips special tokens by default
decoded = tokenizer.decode(ids_with_special, skip_special_tokens=True)
print(f"Decoded (skip special): '{decoded}'")
```
### Training on Python Code
```{python}
python_code = '''
def fibonacci(n):
"""Calculate the nth Fibonacci number."""
if n <= 1:
return n
return fibonacci(n - 1) + fibonacci(n - 2)
def factorial(n):
"""Calculate n factorial."""
if n <= 1:
return 1
return n * factorial(n - 1)
class Calculator:
def __init__(self):
self.result = 0
def add(self, x):
self.result += x
return self
def subtract(self, x):
self.result -= x
return self
# Main execution
if __name__ == "__main__":
print(fibonacci(10))
print(factorial(5))
'''
print(f"Training on {len(python_code)} characters of Python code")
```
```{python}
# Train tokenizer on code
code_tokenizer = BPETokenizer(vocab_size=200, verbose=False)
stats = code_tokenizer.train(python_code * 3, show_progress=True)
print(f"\nFinal vocab size: {stats['vocab_size']}")
print(f"Merges learned: {stats['num_merges']}")
```
```{python}
# Look at what code patterns were learned
print("Interesting tokens learned (longest first):")
print("=" * 40)
interesting_patterns = []
for token, id in code_tokenizer.vocab.items():
if len(token) >= 2 and not token.startswith('<'):
interesting_patterns.append((token, id))
# Sort by length (longer = more merged)
interesting_patterns.sort(key=lambda x: len(x[0]), reverse=True)
for token, id in interesting_patterns[:15]:
print(f" {id:3d}: {repr(token)}")
```
### Visualizing Tokenization
```{python}
def visualize_tokens(tokenizer, text):
"""Show how text is split into tokens with colors."""
ids = tokenizer.encode(text)
tokens = [tokenizer.id_to_token(i) for i in ids]
print(f"Original: {repr(text)}")
print(f"Tokens ({len(tokens)}): {tokens}")
print(f"IDs: {ids}")
print(f"Compression: {len(text)/len(ids):.2f} chars/token")
print()
# Try different code patterns
patterns = [
"def fibonacci(n):",
"self.result = 0",
"return self",
" for i in range(10):",
]
for pattern in patterns:
visualize_tokens(code_tokenizer, pattern)
```
### Vocabulary Size Tradeoffs
Vocabulary size is one of the most important hyperparameters in tokenization:
**Larger vocabulary:**
- (+) Shorter sequences = faster training, more context in fixed window
- (+) Common words as single tokens = better semantic units
- (-) Larger embedding table = more parameters, more memory
- (-) Rare tokens get few training examples = poor representations
**Smaller vocabulary:**
- (+) Smaller model, faster embedding lookups
- (+) Every token well-trained on many examples
- (-) Longer sequences = slower training, less context
- (-) Words split into less meaningful pieces
```{python}
test_text = "def calculate_fibonacci(number):\n return fibonacci(number)"
vocab_sizes = [50, 100, 200, 500]
print(f"Text: {repr(test_text)}")
print(f"Text length: {len(test_text)} characters")
print()
for vocab_size in vocab_sizes:
tok = BPETokenizer(vocab_size=vocab_size, verbose=False)
tok.train(python_code * 5, show_progress=False)
ids = tok.encode(test_text)
tokens = [tok.id_to_token(i) for i in ids]
print(f"Vocab size {vocab_size}:")
print(f" Tokens: {len(ids)}")
print(f" Ratio: {len(test_text)/len(ids):.1f} chars/token")
print(f" Sample: {[tok.id_to_token(i) for i in ids[:5]]}...")
print()
```
**Real-world vocabulary sizes:**
- GPT-2: 50,257 tokens
- GPT-4: ~100,000 tokens
- Llama 2: 32,000 tokens
- Claude: ~100,000 tokens
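For a point of comparison with a production tokenizer, the snippet below (not executed here, since it requires the third-party `tiktoken` package) counts tokens with the `cl100k_base` encoding used by GPT-4-era models:

```python
# Optional comparison against a production byte-level BPE (pip install tiktoken)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("def calculate_fibonacci(number):\n    return fibonacci(number)")
print(len(ids), "tokens, vocabulary size", enc.n_vocab)
```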
### Saving and Loading
```{python}
import tempfile
import os
# Save tokenizer (mkstemp avoids the deprecated, race-prone mktemp)
fd, save_path = tempfile.mkstemp(suffix='.json')
os.close(fd)  # we only need the path; save() will overwrite the empty file
code_tokenizer.save(save_path)
# Load it back
loaded = BPETokenizer.load(save_path)
# Verify it works the same
test = "def test():"
original_ids = code_tokenizer.encode(test)
loaded_ids = loaded.encode(test)
print(f"\nOriginal encoding: {original_ids}")
print(f"Loaded encoding: {loaded_ids}")
print(f"Match: {original_ids == loaded_ids}")
# Cleanup
os.unlink(save_path)
```
## Interactive Exploration
Watch BPE tokenization in action. Type text and see how it gets broken into tokens through iterative pair merging.
**Note**: This demo uses a simplified, pre-defined set of common English merge rules (not dynamically computed). A real tokenizer would learn merges from a training corpus, but the mechanism is identical.
```{ojs}
//| echo: false
// Pre-trained BPE merges (curated for demo purposes)
// These are ordered by frequency - common pairs first
bpeMerges = [
// Common letter pairs
["t", "h", "th"],
["h", "e", "he"],
["i", "n", "in"],
["e", "r", "er"],
["a", "n", "an"],
["r", "e", "re"],
["o", "n", "on"],
["e", "s", "es"],
["o", "r", "or"],
["t", "i", "ti"],
["e", "n", "en"],
["a", "t", "at"],
["e", "d", "ed"],
["o", "u", "ou"],
["i", "s", "is"],
["i", "t", "it"],
["a", "l", "al"],
["a", "r", "ar"],
["s", "t", "st"],
["l", "l", "ll"],
["l", "e", "le"],
["n", "d", "nd"],
// Common trigrams
["th", "e", "the"],
["in", "g", "ing"],
["an", "d", "and"],
["ti", "on", "tion"],
["er", "s", "ers"],
["he", "r", "her"],
["ll", "o", "llo"],
["he", "ll", "hell"],
["hell", "o", "hello"],
["w", "or", "wor"],
["wor", "l", "worl"],
["worl", "d", "world"]
]
// Apply a single merge to token list
function applyMerge(tokens, left, right, merged) {
const result = [];
let i = 0;
while (i < tokens.length) {
if (i < tokens.length - 1 && tokens[i] === left && tokens[i + 1] === right) {
result.push(merged);
i += 2;
} else {
result.push(tokens[i]);
i += 1;
}
}
return result;
}
// Apply merges up to a certain step
function tokenizeWithSteps(text, maxStep) {
// Start with character-level tokens (preserve spaces)
let tokens = text.split('');
const steps = [{ tokens: [...tokens], mergeApplied: null }];
for (let i = 0; i < Math.min(maxStep, bpeMerges.length); i++) {
const [left, right, merged] = bpeMerges[i];
const newTokens = applyMerge(tokens, left, right, merged);
// Only record step if something changed
if (newTokens.length !== tokens.length) {
tokens = newTokens;
steps.push({
tokens: [...tokens],
mergeApplied: `"${left}" + "${right}" → "${merged}"`
});
}
}
return { finalTokens: tokens, steps };
}
// Fully tokenize (all merges)
function tokenize(text) {
let tokens = text.split('');
for (const [left, right, merged] of bpeMerges) {
tokens = applyMerge(tokens, left, right, merged);
}
return tokens;
}
```
```{ojs}
//| echo: false
// Dark mode detection
isDark = {
const check = () => document.body.classList.contains('quarto-dark');
return check();
}
// Theme colors for light/dark mode
theme = isDark ? {
// Dark mode colors
textPrimary: '#e8e6e3',
textMuted: '#a8a6a2',
spaceBg: '#3a3c3d',
spaceBorder: '#5a5c5d',
tokenBorder: 50,
tokenLightness: 25,
historyBg: '#2b2d2e',
stepBg: '#1f2122',
stepBorderInitial: '#6b7280',
stepBorderMerge: '#6b8cae',
stepTextMuted: '#a8a6a2',
tokenStepBg: '#1e3a5f',
spaceStepBg: '#3a3c3d'
} : {
// Light mode colors
textPrimary: '#1e293b',
textMuted: '#6b7280',
spaceBg: '#e5e7eb',
spaceBorder: '#9ca3af',
tokenBorder: 50,
tokenLightness: 85,
historyBg: '#f9fafb',
stepBg: '#ffffff',
stepBorderInitial: '#6b7280',
stepBorderMerge: '#3b82f6',
stepTextMuted: '#9ca3af',
tokenStepBg: '#dbeafe',
spaceStepBg: '#e5e7eb'
}
```
```{ojs}
//| echo: false
viewof inputText = Inputs.text({
label: "Enter text",
value: "hello world",
placeholder: "Type something...",
width: 400
})
viewof showSteps = Inputs.toggle({
label: "Show step-by-step",
value: true
})
viewof maxMergeStep = Inputs.range([0, bpeMerges.length], {
value: bpeMerges.length,
step: 1,
label: "Merge steps to apply",
disabled: !showSteps
})
```
```{ojs}
//| echo: false
// Tokenization results
result = tokenizeWithSteps(inputText.toLowerCase(), showSteps ? maxMergeStep : bpeMerges.length)
finalTokens = result.finalTokens
tokenizationSteps = result.steps
// Stats
charCount = inputText.length
tokenCount = finalTokens.length
compressionRatio = charCount > 0 ? (charCount / tokenCount).toFixed(2) : 0
```
```{ojs}
//| echo: false
// Token visualization as colored boxes
html`
<div style="margin: 20px 0; color: ${theme.textPrimary};">
<strong>Tokens (${tokenCount}):</strong>
<div style="display: flex; flex-wrap: wrap; gap: 4px; margin-top: 8px;">
${finalTokens.map((token, i) => {
// Color based on token length (longer = more merged)
const hue = Math.min(token.length * 30, 200);
const color = `hsl(${hue}, 70%, ${theme.tokenLightness}%)`;
const isSpace = token === ' ';
return html`<span style="
background: ${isSpace ? theme.spaceBg : color};
padding: 4px 8px;
border-radius: 4px;
font-family: monospace;
font-size: 14px;
color: ${theme.textPrimary};
border: 1px solid ${isSpace ? theme.spaceBorder : `hsl(${hue}, ${theme.tokenBorder}%, ${isDark ? 50 : 60}%)`};
">${isSpace ? '␣' : token}</span>`;
})}
</div>
</div>
`
```
```{ojs}
//| echo: false
md`**Stats:** ${charCount} characters → ${tokenCount} tokens | **Compression:** ${compressionRatio} chars/token`
```
```{ojs}
//| echo: false
// Step-by-step view (when enabled)
showSteps && maxMergeStep > 0 ? html`
<div style="margin-top: 20px; padding: 15px; background: ${theme.historyBg}; border-radius: 8px; color: ${theme.textPrimary};">
<strong>Merge History:</strong>
<div style="font-family: monospace; font-size: 13px; margin-top: 10px;">
${tokenizationSteps.map((step, i) => html`
<div style="margin: 8px 0; padding: 8px; background: ${theme.stepBg}; border-radius: 4px; border-left: 3px solid ${i === 0 ? theme.stepBorderInitial : theme.stepBorderMerge};">
<div style="color: ${theme.textMuted}; font-size: 11px; margin-bottom: 4px;">
${i === 0 ? 'Initial (characters)' : `Step ${i}: ${step.mergeApplied}`}
</div>
<div style="display: flex; flex-wrap: wrap; gap: 2px;">
${step.tokens.map(t => html`<span style="background: ${t === ' ' ? theme.spaceStepBg : theme.tokenStepBg}; padding: 2px 6px; border-radius: 3px; color: ${theme.textPrimary};">${t === ' ' ? '␣' : t}</span>`)}
</div>
<div style="color: ${theme.stepTextMuted}; font-size: 11px; margin-top: 4px;">${step.tokens.length} tokens</div>
</div>
`)}
</div>
</div>
` : html``
```
::: {.callout-tip}
## Try This
1. **Common words merge well**: Type "the" or "and" - they become single tokens quickly due to high-frequency merges.
2. **Step through merges**: Enable "Show step-by-step" and slide the merge steps from 0 to max. Watch how character pairs combine into larger tokens.
3. **Rare words stay split**: Type "xyz" or uncommon words - they remain as characters because those patterns weren't in the training data.
4. **Compression varies**: Compare "the the the" (high compression) vs "qxz qxz qxz" (low compression). Common patterns compress better.
5. **Spaces are preserved**: Notice that spaces remain as separate tokens (shown as ␣). This is typical BPE behavior.
:::
## Exercises
### Exercise 1: Compression Efficiency
BPE achieves better compression on repetitive text. This matters because better compression = shorter sequences = more context in the model's window.
```{python}
# Train on repetitive vs varied text and compare compression
texts = {
"repetitive": "the the the " * 100,
"varied": " ".join([f"word{i}" for i in range(100)]),
"code": python_code,
}
print("Compression comparison:")
print("=" * 40)
for name, text in texts.items():
tok = BPETokenizer(vocab_size=200, verbose=False)
tok.train(text, show_progress=False)
ids = tok.encode(text)
ratio = len(text) / len(ids)
print(f"{name:12s}: {ratio:.2f} chars/token")
print("\nNote: Repetitive text compresses best because BPE learns")
print("common patterns. Code has structure but more variety.")
```
### Exercise 2: Analyze the First Merges
The first merges reveal the most frequent patterns in your data. For English text, you'll often see common letter pairs like 'th', 'he', 'in'.
```{python}
# What patterns are learned first?
sample_text = "hello world hello world hello world " * 10
tok = BPETokenizer(vocab_size=50, verbose=False)
tok.train(sample_text, show_progress=False)
print("First 10 merges (most frequent patterns):")
for i, ((a, b), merged) in enumerate(list(tok.merges.items())[:10]):
print(f" {i+1}. '{a}' + '{b}' = '{merged}'")
print("\nNotice: Common substrings merge first, eventually")
print("forming complete words like 'hello' and 'world'.")
```
### Exercise 3: Observe Unknown Character Behavior
Our simple tokenizer can only encode characters it saw during training. Characters not in the vocabulary become `<UNK>` tokens. This exercise demonstrates the problem — and why production tokenizers use byte-level BPE to solve it.
```{python}
# What happens with characters not in training?
tokenizer = BPETokenizer(vocab_size=50, verbose=False)
tokenizer.train("hello world", show_progress=False)
# Try encoding text with emoji
test = "hello world" # Safe text
try:
ids = tokenizer.encode(test)
print(f"'{test}' -> {ids}")
print(f"Decoded: '{tokenizer.decode(ids)}'")
except Exception as e:
print(f"Error: {e}")
# Now try with a character not in training
test2 = "hello 123"
ids = tokenizer.encode(test2)
tokens = [tokenizer.id_to_token(i) for i in ids]
print(f"\n'{test2}' -> {ids}")
print(f"Tokens: {tokens}")
print("\nNotice: '1', '2', '3' become <UNK> (ID 1) because they")
print("weren't in the training data.")
print("\nThis is why production tokenizers use BYTE-LEVEL BPE:")
print("- Operate on UTF-8 bytes (0-255) instead of Unicode characters")
print("- Any byte sequence can be represented -> no UNK tokens")
print("- tiktoken and SentencePiece both use this approach")
```
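The sketch below shows why byte-level BPE never needs `<UNK>`: every string reduces to byte values 0-255, so a 256-token base vocabulary can represent anything losslessly.

```python
# Any text - including emoji - maps to bytes in range(256) and back
text = "hello 😊"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)                                   # every value is in range(256)
print(bytes(byte_ids).decode("utf-8") == text)    # lossless round-trip: True
```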
### Exercise 4: Whitespace Handling
Whitespace is tricky in tokenization. Our tokenizer preserves it, but notice how spaces can be part of tokens.
```{python}
# Whitespace is significant in tokenization
code_tok = BPETokenizer(vocab_size=100, verbose=False)
code_tok.train("def foo():\n return 1\ndef bar():\n return 2", show_progress=False)
# See how indentation is tokenized
samples = [
"def foo():",
" return", # 4 spaces
" x", # 8 spaces
]
for sample in samples:
ids = code_tok.encode(sample)
tokens = [code_tok.id_to_token(i) for i in ids]
print(f"{repr(sample):20s} -> {tokens}")
print("\nIn production tokenizers, leading spaces often attach to")
print("the following word: ' hello' is one token, not ' ' + 'hello'")
```
## Tokenization in the LLM Pipeline
Here's where tokenization fits in the full pipeline:
```{ojs}
//| echo: false
// Interactive step-through of tokenization in LLM pipeline
viewof pipelineStep = Inputs.range([0, 5], {
value: 0,
step: 1,
label: "Pipeline stage"
})
```
```{ojs}
//| echo: false
llmPipelineDiagram = {
const width = 720;
const height = 200;
// Pipeline stages
const stages = [
{ id: 'input', x: 60, label: 'Raw Text', sublabel: "'def hello():'", group: 'Input' },
{ id: 'split', x: 180, label: 'Split', sublabel: "['def',' ','hello',...]", group: 'Tokenization' },
{ id: 'lookup', x: 300, label: 'Look up IDs', sublabel: '[42, 5, 128, ...]', group: 'Tokenization' },
{ id: 'embed', x: 430, label: 'Embeddings', sublabel: 'Module 04', group: 'Model' },
{ id: 'transform', x: 550, label: 'Transformer', sublabel: 'Module 06', group: 'Model' },
{ id: 'decode', x: 670, label: 'Decode', sublabel: 'Back to text', group: 'Output' }
];
// Stage descriptions
const descriptions = [
"Start: Raw source code text as input",
"Tokenization: Split text into subword tokens using BPE",
"Tokenization: Convert tokens to integer IDs via vocabulary lookup",
"Model: Map token IDs to dense vector embeddings",
"Model: Process embeddings through transformer layers",
"Output: Decode predicted token IDs back to readable text"
];
const svg = d3.create('svg')
.attr('width', width)
.attr('height', height)
.attr('viewBox', `0 0 ${width} ${height}`);
// Background
svg.append('rect')
.attr('width', width)
.attr('height', height)
.attr('fill', diagramTheme.bg)
.attr('rx', 8);
// Group backgrounds
const groups = [
{ name: 'Input', x1: 20, x2: 120, color: diagramTheme.nodeStroke },
{ name: 'Tokenization', x1: 130, x2: 360, color: diagramTheme.highlight },
{ name: 'Model', x1: 370, x2: 610, color: diagramTheme.accent },
{ name: 'Output', x1: 620, x2: 710, color: diagramTheme.nodeStroke }
];
groups.forEach(group => {
const isActive = stages.filter(s => s.group === group.name)
.some((s, i) => stages.indexOf(s) === pipelineStep);
svg.append('rect')
.attr('x', group.x1)
.attr('y', 25)
.attr('width', group.x2 - group.x1)
.attr('height', 95)
.attr('rx', 6)
.attr('fill', 'transparent')
.attr('stroke', isActive ? group.color : diagramTheme.nodeStroke)
.attr('stroke-width', isActive ? 2 : 1)
.attr('stroke-dasharray', isActive ? 'none' : '4,2')
.attr('opacity', isActive ? 1 : 0.4);
svg.append('text')
.attr('x', (group.x1 + group.x2) / 2)
.attr('y', 40)
.attr('text-anchor', 'middle')
.attr('fill', isActive ? group.color : diagramTheme.nodeText)
.attr('font-size', '10px')
.attr('font-weight', isActive ? '600' : '400')
.attr('opacity', isActive ? 1 : 0.5)
.text(group.name);
});
// Arrow marker
const defs = svg.append('defs');
defs.append('marker')
.attr('id', 'pipeline-arrow')
.attr('viewBox', '0 -5 10 10')
.attr('refX', 8)
.attr('refY', 0)
.attr('markerWidth', 5)
.attr('markerHeight', 5)
.attr('orient', 'auto')
.append('path')
.attr('d', 'M0,-5L10,0L0,5')
.attr('fill', diagramTheme.edgeStroke);
defs.append('marker')
.attr('id', 'pipeline-arrow-active')
.attr('viewBox', '0 -5 10 10')
.attr('refX', 8)
.attr('refY', 0)
.attr('markerWidth', 5)
.attr('markerHeight', 5)
.attr('orient', 'auto')
.append('path')
.attr('d', 'M0,-5L10,0L0,5')
.attr('fill', diagramTheme.highlight);
// Draw nodes
const nodeY = 85;
const nodeW = 90;
const nodeH = 48;
stages.forEach((stage, i) => {
const isActive = i === pipelineStep;
const isPast = i < pipelineStep;
const g = svg.append('g')
.attr('transform', `translate(${stage.x}, ${nodeY})`);
const rect = g.append('rect')
.attr('x', -nodeW / 2)
.attr('y', -nodeH / 2)
.attr('width', nodeW)
.attr('height', nodeH)
.attr('rx', 6)
.attr('fill', isActive ? diagramTheme.highlight :
isPast ? diagramTheme.bgSecondary : diagramTheme.nodeFill)
.attr('stroke', isActive ? diagramTheme.highlight :
isPast ? diagramTheme.accent : diagramTheme.nodeStroke)
.attr('stroke-width', isActive ? 2.5 : 1.5);
if (isActive) {
rect.attr('filter', `drop-shadow(0 0 10px ${diagramTheme.highlightGlow})`);
}
// Main label
g.append('text')
.attr('x', 0)
.attr('y', -6)
.attr('text-anchor', 'middle')
.attr('fill', isActive ? diagramTheme.textOnHighlight : diagramTheme.nodeText)
.attr('font-size', '11px')
.attr('font-weight', '600')
.text(stage.label);
// Sublabel
g.append('text')
.attr('x', 0)
.attr('y', 10)
.attr('text-anchor', 'middle')
.attr('fill', isActive ? diagramTheme.textOnHighlight : diagramTheme.nodeText)
.attr('font-size', '9px')
.attr('font-family', 'monospace')
.attr('opacity', isActive ? 0.9 : 0.6)
.text(stage.sublabel);
// Draw arrow to next stage
if (i < stages.length - 1) {
const nextStage = stages[i + 1];
const arrowActive = i === pipelineStep - 1;
svg.append('path')
.attr('d', `M${stage.x + nodeW/2 + 5},${nodeY} L${nextStage.x - nodeW/2 - 10},${nodeY}`)
.attr('stroke', arrowActive ? diagramTheme.highlight : diagramTheme.edgeStroke)
.attr('stroke-width', arrowActive ? 2 : 1.5)
.attr('marker-end', `url(#${arrowActive ? 'pipeline-arrow-active' : 'pipeline-arrow'})`);
}
});
// Description
svg.append('rect')
.attr('x', 20)
.attr('y', height - 50)
.attr('width', width - 40)
.attr('height', 35)
.attr('rx', 6)
.attr('fill', diagramTheme.bgSecondary);
svg.append('text')
.attr('x', width / 2)
.attr('y', height - 28)
.attr('text-anchor', 'middle')
.attr('fill', diagramTheme.nodeText)
.attr('font-size', '12px')
.text(descriptions[pipelineStep]);
return svg.node();
}
```
## Summary
Key takeaways:
1. **BPE learns subword units** by iteratively merging the most frequent adjacent token pairs
2. **Vocabulary size** is a tradeoff: larger = shorter sequences but more parameters and sparse token usage
3. **Special tokens** (BOS, EOS, PAD, UNK) serve critical roles in the LLM pipeline
4. **Code patterns** emerge naturally (def, self., return, indentation) when trained on code
5. **Round-trip guarantee**: encode -> decode should perfectly reconstruct the original text
6. **Production tokenizers** use byte-level BPE to handle any Unicode character without UNK tokens
### What We Simplified
Our implementation differs from production tokenizers in several ways:
| Our Tokenizer | Production Tokenizers |
|---------------|----------------------|
| Character-level BPE | Byte-level BPE (handles any UTF-8) |
| Python dict lookups | Optimized Rust/C++ (tiktoken is 10x+ faster) |
| No regex pre-tokenization | Regex pre-tokenization (splits off contractions, numbers, punctuation) |
| Simple word splitting | Careful handling of whitespace, punctuation |
**Pre-tokenization** is a critical step we simplified. Production tokenizers first split text into "words" using regex patterns before applying BPE. This prevents merges across word boundaries — for example, preventing "the" at the end of one word from merging with "e" at the start of the next. GPT-2's tokenizer uses a carefully crafted regex that handles contractions ("don't" → "don", "'t"), numbers, and punctuation specially. This pre-split ensures more linguistically meaningful merges.
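As a rough illustration, a much-simplified pre-tokenization pattern might look like the sketch below (this is not GPT-2's actual regex, which uses Unicode property classes via the `regex` package):

```python
import re

# Simplified stand-in for GPT-2-style pre-tokenization: split off contractions,
# keep a leading space attached to words and numbers, group punctuation runs
PRETOK = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

print(PRETOK.findall("don't tokenize 2 words badly!"))
# ['don', "'t", ' tokenize', ' 2', ' words', ' badly', '!']
```

BPE merges are then learned and applied within each of these pieces, never across their boundaries.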
### Practical Implications
- **Context length**: A 4096-token context window holds varying amounts of text depending on tokenization efficiency
- **Cost**: API pricing is per-token, so tokenization directly affects cost
- **Multilingual**: Tokenizers trained on English use more tokens for other languages (2-3x for some)
- **Code vs prose**: Code often tokenizes inefficiently (many single-character tokens for syntax)
## What's Next
In [Module 04: Embeddings](../m04_embeddings/lesson.qmd), we'll learn how to convert token IDs into dense vectors that capture meaning. Each token becomes a learnable vector in high-dimensional space.