Module 03: Tokenization
Introduction
A language model requires numbers, not text. Tokenization converts raw text into a sequence of tokens and maps each token to an integer the model can process. Modern LLMs use subword tokenization: they break text into pieces smaller than words but larger than characters.
Why subword tokenization?
- Word-level: Cannot handle new words (OOV problem), huge vocabulary needed (millions for multilingual)
- Character-level: Sequences become 4-5x longer, attention cost explodes O(n^2), model learns spelling from scratch
- Subword: Best of both worlds - handles new words via decomposition, reasonable sequence length
BPE (Byte Pair Encoding) dominates modern tokenization. Philip Gage invented it for data compression in 1994; researchers adapted it for NLP in 2016:
- Start with individual characters as the initial vocabulary
- Count all adjacent token pairs in the training corpus
- Merge the most frequent pair into a new token
- Add the merged token to the vocabulary
- Repeat until vocabulary size reached
What You’ll Learn
After this module, you can:
- Build a character-level tokenizer from scratch
- Understand why subword tokenization outperforms alternatives
- Implement the BPE algorithm for training and encoding
- Handle special tokens (PAD, UNK, BOS, EOS)
- Recognize trade-offs in vocabulary size
Prerequisites
This module requires familiarity with:
- Module 01: Tensors — Basic tensor operations and shapes
First, build the simplest tokenizer from scratch.
The Simplest Tokenizer
The simplest approach treats each character as a token.
# Build vocabulary from text
text = "hello world"
chars = sorted(set(text))
print(f"Unique characters: {chars}")
print(f"Vocabulary size: {len(chars)}")Unique characters: [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
Vocabulary size: 8
# The core of any tokenizer: two lookup tables
stoi = {ch: i for i, ch in enumerate(chars)} # string to integer
itos = {i: ch for i, ch in enumerate(chars)} # integer to string
print("stoi (encode):", stoi)
print("itos (decode):", itos)stoi (encode): {' ': 0, 'd': 1, 'e': 2, 'h': 3, 'l': 4, 'o': 5, 'r': 6, 'w': 7}
itos (decode): {0: ' ', 1: 'd', 2: 'e', 3: 'h', 4: 'l', 5: 'o', 6: 'r', 7: 'w'}
# Encode: text -> integers
def encode(text):
return [stoi[ch] for ch in text]
# Decode: integers -> text
def decode(ids):
return ''.join(itos[i] for i in ids)
# Try it out
encoded = encode("hello")
print(f"'hello' -> {encoded}")
print(f"{encoded} -> '{decode(encoded)}'")'hello' -> [3, 2, 4, 4, 5]
[3, 2, 4, 4, 5] -> 'hello'
# Round-trip test
original = "hello world"
reconstructed = decode(encode(original))
print(f"Original: '{original}'")
print(f"Reconstructed: '{reconstructed}'")
print(f"Perfect round-trip: {original == reconstructed}")Original: 'hello world'
Reconstructed: 'hello world'
Perfect round-trip: True
Ten lines of Python produce a complete tokenizer. Every tokenizer — no matter how sophisticated — has these same two operations:
- encode: text to token IDs
- decode: token IDs back to text
The Key Insight
Tokenization achieves compression and semantic grouping:
| Tokenization | Vocabulary Size | Sequence Length | Semantics |
|---|---|---|---|
| Character | ~100 (ASCII) | Very long | None (individual letters) |
| Word | ~1,000,000+ | Short | Strong (whole words) |
| Subword | ~30,000-100,000 | Medium | Moderate (meaningful pieces) |
Why Characters Aren’t Enough
Our character tokenizer works — but fails at scale.
Problem 1: Long Sequences
sample_text = "The transformer architecture revolutionized natural language processing."
char_tokens = list(sample_text)
print(f"Text length: {len(sample_text)} characters")
print(f"Token count: {len(char_tokens)} tokens")
print(f"Compression ratio: {len(sample_text) / len(char_tokens):.2f}x (no compression!)")Text length: 72 characters
Token count: 72 tokens
Compression ratio: 1.00x (no compression!)
Since attention is O(n^2) in sequence length, doubling the sequence length quadruples the compute cost. Character-level tokenization produces the longest possible sequences.
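To make that cost concrete, here is a rough back-of-the-envelope comparison; the 4-characters-per-token figure is an assumed, illustrative average rather than a measured statistic:
# Attention cost grows with the square of sequence length.
chars = 1_000                    # a 1,000-character document
char_tokens = chars              # character-level: one token per character
subword_tokens = chars // 4      # assume ~4 chars/token for a subword tokenizer

ratio = (char_tokens ** 2) / (subword_tokens ** 2)
print(f"Character-level pays roughly {ratio:.0f}x the attention cost")   # ~16x under this assumption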
Problem 2: No Semantic Units
# The model sees this:
word = "transformer"
char_view = list(word)
print(f"Characters: {char_view}")
print(f"Token count: {len(char_view)}")Characters: ['t', 'r', 'a', 'n', 's', 'f', 'o', 'r', 'm', 'e', 'r']
Token count: 11
The model must discover on its own that t-r-a-n-s-f-o-r-m-e-r forms a meaningful unit. Character tokenization provides no semantic guidance. At word-level, “transformer” occupies one token with its own learned representation.
Problem 3: Handling All of Unicode
# If we go to byte-level (handling all Unicode)
text_with_emoji = "Hello! \U0001F60A"
byte_view = text_with_emoji.encode('utf-8')
print(f"Text: {text_with_emoji}")
print(f"Bytes: {list(byte_view)}")
print(f"Byte count: {len(byte_view)} (emoji = 4 bytes!)")Text: Hello! 😊
Bytes: [72, 101, 108, 108, 111, 33, 32, 240, 159, 152, 138]
Byte count: 11 (emoji = 4 bytes!)
Byte-level tokenization can represent anything, but sequences become even longer: a single emoji alone takes 4 tokens.
The Tradeoff
This is the fundamental tradeoff in tokenization:
- Characters: Small vocab, long sequences, no semantics
- Words: Huge vocab, short sequences, good semantics, can't handle new words
- Subwords: Medium vocab, medium sequences, some semantics, handles new words
BPE merges frequently co-occurring character sequences into single tokens — the sweet spot between characters and words.
Intuition: Learning Patterns Through Merging
Think of BPE as compression that learns common patterns:
For code, BPE learns patterns like:
- 'def ' (function definition with space)
- 'self.' (common in Python classes)
- 'return' (return statement)
- '    ' (4-space indent)
The BPE Training Algorithm
BPE learns to tokenize by repeating one loop: count adjacent pairs, merge the most frequent pair into a new token, add it to the vocabulary, and stop once the target vocabulary size is reached.
The Math
BPE is simple - just counting and merging:
# Count pair frequencies
pairs = count_pairs(tokens) # {'he': 50, 'el': 30, 'll': 80, ...}
# Find most frequent
best_pair = max(pairs, key=pairs.get) # ('l', 'l')
# Merge everywhere
tokens = merge(tokens, best_pair, 'll')
Vocabulary size is a hyperparameter:
- Too small: Sequences too long, less meaning per token
- Too large: Many rare tokens, harder to learn
- Typical: 8K-50K tokens for LLMs
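The pseudocode above leaves count_pairs and merge undefined. Here is a minimal, self-contained sketch of the full training loop under those names; it illustrates the algorithm and is not the BPETokenizer class used in the walkthrough below:
from collections import Counter

def count_pairs(tokens):
    """Count adjacent token pairs in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

def merge(tokens, pair, new_token):
    """Replace every occurrence of the adjacent pair with new_token."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn num_merges merge rules, starting from individual characters."""
    tokens = list(text)
    merges = {}
    for _ in range(num_merges):
        pairs = count_pairs(tokens)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)       # most frequent adjacent pair
        new_token = best_pair[0] + best_pair[1]
        tokens = merge(tokens, best_pair, new_token)
        merges[best_pair] = new_token
    return merges

print(train_bpe("ab cd ab cd ab cd", 2))            # merge rules in learned order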
Encoding New Text
Once trained, encoding applies merges in the order they were learned:
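A minimal sketch of that step, reusing the merge helper and the merges table from the training sketch above (illustrative, not the actual BPETokenizer.encode):
def bpe_encode(text, merges):
    """Start from characters and apply each learned merge in learned order."""
    tokens = list(text)
    for pair, new_token in merges.items():   # dict preserves the learned order
        tokens = merge(tokens, pair, new_token)
    return tokens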
Handling Unknown Words
BPE can handle words it has never seen:
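For example, with a hypothetical merge table learned from text containing "low" and "est", the unseen word "lowest" still encodes by falling back to known pieces:
# Hypothetical merges for illustration
merges = {('l', 'o'): 'lo', ('lo', 'w'): 'low', ('e', 's'): 'es', ('es', 't'): 'est'}

print(bpe_encode("lowest", merges))   # ['low', 'est'] -- never seen whole, still encoded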
Special Tokens
Before examining the code, understand special tokens - reserved tokens with specific meanings in the LLM pipeline:
| Token | Purpose | When Used |
|---|---|---|
| <PAD> (ID 0) | Padding | Batch processing requires same-length sequences. Padding fills shorter sequences. |
| <UNK> (ID 1) | Unknown | Characters not seen during training. Production tokenizers avoid this with byte-level BPE. |
| <BOS> (ID 2) | Beginning of Sequence | Signals the start of text. Helps model distinguish context boundaries. |
| <EOS> (ID 3) | End of Sequence | Signals text completion. Model generates this to stop. Critical for generation. |
The vocabulary reserves these tokens before training begins, ensuring consistent IDs across all tokenizers.
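A sketch of how that reservation works (the idea, not necessarily the internals of BPETokenizer):
SPECIAL_TOKENS = {'<PAD>': 0, '<UNK>': 1, '<BOS>': 2, '<EOS>': 3}

# Fix IDs 0-3 first, then assign every other token an ID after them
vocab = dict(SPECIAL_TOKENS)
for ch in sorted(set("hello world")):
    vocab[ch] = len(vocab)        # regular tokens start at ID 4
print(vocab)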
Code Walkthrough
Explore tokenization interactively:
# Import our BPE tokenizer
from tokenizer import BPETokenizer, SPECIAL_TOKENS
print("Special tokens:", SPECIAL_TOKENS)
print("\nThese tokens are reserved at IDs 0-3 before training begins.")Special tokens: {'<PAD>': 0, '<UNK>': 1, '<BOS>': 2, '<EOS>': 3}
These tokens are reserved at IDs 0-3 before training begins.
Training a BPE Tokenizer
The BPETokenizer class has key parameters:
- vocab_size: Target vocabulary size (including special tokens)
- min_frequency: Minimum times a pair must appear to be merged (default: 2). This prevents rare pairs from being merged — if a pair only appears once, it’s likely noise rather than a useful pattern. Higher values create more conservative, generalizable vocabularies.
- verbose: Print detailed training progress
# Simple text to train on
simple_text = "ab cd ab cd ab cd ab cd " * 20
# Create and train tokenizer
# vocab_size includes the 4 special tokens, so effective learned tokens = vocab_size - 4
tokenizer = BPETokenizer(vocab_size=30, verbose=False)
stats = tokenizer.train(simple_text, show_progress=True)
print(f"\nVocab size: {stats['vocab_size']}")
print(f"Merges learned: {stats['num_merges']}")
print(f"Special tokens: {stats['num_special_tokens']}")============================================================
BPE TOKENIZER TRAINING
============================================================
Text length: 480 characters
Target vocab size: 30
Training complete!
Final vocab size: 11
Merges learned: 2
Vocab size: 11
Merges learned: 2
Special tokens: 4
# See what patterns were learned
print("Learned merges:")
for i, ((a, b), merged) in enumerate(list(tokenizer.merges.items())[:10]):
print(f" {i+1}. {repr(a)} + {repr(b)} -> {repr(merged)}")Learned merges:
1. 'a' + 'b' -> 'ab'
2. 'c' + 'd' -> 'cd'
Encoding and Decoding
Encoding applies merges in their learned order. This is crucial - the merge order determines how text is split.
# Encode some text
test_text = "ab cd"
ids = tokenizer.encode(test_text)
tokens = [tokenizer.id_to_token(i) for i in ids]
print(f"Text: '{test_text}'")
print(f"Token IDs: {ids}")
print(f"Tokens: {tokens}")
# Decode back
decoded = tokenizer.decode(ids)
print(f"Decoded: '{decoded}'")
print(f"Round-trip successful: {test_text == decoded}")Text: 'ab cd'
Token IDs: [9, 4, 10]
Tokens: ['ab', ' ', 'cd']
Decoded: 'ab cd'
Round-trip successful: True
# With special tokens (used during actual LLM training/inference)
ids_with_special = tokenizer.encode(test_text, add_special_tokens=True)
print(f"\nWith special tokens: {ids_with_special}")
print(f"Tokens: {[tokenizer.id_to_token(i) for i in ids_with_special]}")
# Decoding skips special tokens by default
decoded = tokenizer.decode(ids_with_special, skip_special_tokens=True)
print(f"Decoded (skip special): '{decoded}'")
With special tokens: [2, 9, 4, 10, 3]
Tokens: ['<BOS>', 'ab', ' ', 'cd', '<EOS>']
Decoded (skip special): 'ab cd'
Training on Python Code
python_code = '''
def fibonacci(n):
"""Calculate the nth Fibonacci number."""
if n <= 1:
return n
return fibonacci(n - 1) + fibonacci(n - 2)
def factorial(n):
"""Calculate n factorial."""
if n <= 1:
return 1
return n * factorial(n - 1)
class Calculator:
def __init__(self):
self.result = 0
def add(self, x):
self.result += x
return self
def subtract(self, x):
self.result -= x
return self
# Main execution
if __name__ == "__main__":
print(fibonacci(10))
print(factorial(5))
'''
print(f"Training on {len(python_code)} characters of Python code")Training on 562 characters of Python code
# Train tokenizer on code
code_tokenizer = BPETokenizer(vocab_size=200, verbose=False)
stats = code_tokenizer.train(python_code * 3, show_progress=True)
print(f"\nFinal vocab size: {stats['vocab_size']}")
print(f"Merges learned: {stats['num_merges']}")============================================================
BPE TOKENIZER TRAINING
============================================================
Text length: 1,686 characters
Target vocab size: 200
Merge 50: 'self.res' + 'ul' → 'self.resul' (count: 9, progress: 32.1%)
Merge 100: 'sub' + 't' → 'subt' (count: 3, progress: 64.1%)
Training complete!
Final vocab size: 177
Merges learned: 133
Final vocab size: 177
Merges learned: 133
# Look at what code patterns were learned
print("Interesting tokens learned (longest first):")
print("=" * 40)
interesting_patterns = []
for token, id in code_tokenizer.vocab.items():
if len(token) >= 2 and not token.startswith('<'):
interesting_patterns.append((token, id))
# Sort by length (longer = more merged)
interesting_patterns.sort(key=lambda x: len(x[0]), reverse=True)
for token, id in interesting_patterns[:15]:
print(f" {id:3d}: {repr(token)}")Interesting tokens learned (longest first):
========================================
172: 'print(fibonacci(10))'
176: 'print(factorial(5))'
171: 'print(fibonacci(10'
170: 'print(fibonacci(1'
175: 'print(factorial(5'
169: 'print(fibonacci('
174: 'print(factorial('
136: '__init__(self):'
168: 'print(fibonacci'
173: 'print(factorial'
146: 'subtract(self,'
110: 'fibonacci(n):'
122: 'factorial(n):'
123: 'factorial."""'
135: '__init__(self'
Visualizing Tokenization
def visualize_tokens(tokenizer, text):
"""Show how text is split into tokens with colors."""
ids = tokenizer.encode(text)
tokens = [tokenizer.id_to_token(i) for i in ids]
print(f"Original: {repr(text)}")
print(f"Tokens ({len(tokens)}): {tokens}")
print(f"IDs: {ids}")
print(f"Compression: {len(text)/len(ids):.2f} chars/token")
print()
# Try different code patterns
patterns = [
"def fibonacci(n):",
"self.result = 0",
"return self",
" for i in range(10):",
]
for pattern in patterns:
    visualize_tokens(code_tokenizer, pattern)
Original: 'def fibonacci(n):'
Tokens (3): ['def', ' ', 'fibonacci(n):']
IDs: [64, 5, 110]
Compression: 5.67 chars/token
Original: 'self.result = 0'
Tokens (5): ['self.result', ' ', '=', ' ', '0']
IDs: [94, 5, 21, 5, 15]
Compression: 3.00 chars/token
Original: 'return self'
Tokens (3): ['return', ' ', 'self']
IDs: [60, 5, 52]
Compression: 3.67 chars/token
Original: ' for i in range(10):'
Tokens (18): [' ', 'f', 'o', 'r', ' ', 'i', ' ', 'in', ' ', 'r', 'a', 'n', '<UNK>', 'e', '(', '1', '0', '):']
IDs: [45, 31, 37, 39, 5, 33, 5, 88, 5, 39, 26, 36, 1, 30, 8, 16, 15, 71]
Compression: 1.28 chars/token
Vocabulary Size Tradeoffs
Vocabulary size is one of the most important hyperparameters in tokenization:
Larger vocabulary:
- (+) Shorter sequences = faster training, more context in a fixed window
- (+) Common words as single tokens = better semantic units
- (-) Larger embedding table = more parameters, more memory
- (-) Rare tokens get few training examples = poor representations
Smaller vocabulary:
- (+) Smaller model, faster embedding lookups
- (+) Every token well-trained on many examples
- (-) Longer sequences = slower training, less context
- (-) Words split into less meaningful pieces
test_text = "def calculate_fibonacci(number):\n return fibonacci(number)"
vocab_sizes = [50, 100, 200, 500]
print(f"Text: {repr(test_text)}")
print(f"Text length: {len(test_text)} characters")
print()
for vocab_size in vocab_sizes:
tok = BPETokenizer(vocab_size=vocab_size, verbose=False)
tok.train(python_code * 5, show_progress=False)
ids = tok.encode(test_text)
tokens = [tok.id_to_token(i) for i in ids]
print(f"Vocab size {vocab_size}:")
print(f" Tokens: {len(ids)}")
print(f" Ratio: {len(test_text)/len(ids):.1f} chars/token")
print(f" Sample: {[tok.id_to_token(i) for i in ids[:5]]}...")
    print()
Text: 'def calculate_fibonacci(number):\n return fibonacci(number)'
Text length: 61 characters
Vocab size 50:
Tokens: 54
Ratio: 1.1 chars/token
Sample: ['d', 'e', 'f', ' ', 'c']...
Vocab size 100:
Tokens: 27
Ratio: 2.3 chars/token
Sample: ['def', ' ', 'c', 'al', 'c']...
Vocab size 200:
Tokens: 27
Ratio: 2.3 chars/token
Sample: ['def', ' ', 'c', 'al', 'c']...
Vocab size 500:
Tokens: 27
Ratio: 2.3 chars/token
Sample: ['def', ' ', 'c', 'al', 'c']...
Real-world vocabulary sizes:
- GPT-2: 50,257 tokens
- GPT-4: ~100,000 tokens
- Llama 2: 32,000 tokens
- Claude: ~100,000 tokens
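If the tiktoken package is installed, you can check the GPT-2 figure directly (package availability is an assumption of this snippet):
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)   # 50257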
Saving and Loading
import tempfile
import os
# Save tokenizer
save_path = tempfile.mktemp(suffix='.json')
code_tokenizer.save(save_path)
# Load it back
loaded = BPETokenizer.load(save_path)
# Verify it works the same
test = "def test():"
original_ids = code_tokenizer.encode(test)
loaded_ids = loaded.encode(test)
print(f"\nOriginal encoding: {original_ids}")
print(f"Loaded encoding: {loaded_ids}")
print(f"Match: {original_ids == loaded_ids}")
# Cleanup
os.unlink(save_path)
Tokenizer saved to /tmp/tmpdsmcysu0.json
Tokenizer loaded from /tmp/tmpdsmcysu0.json
Vocab size: 177
Original encoding: [64, 5, 41, 30, 40, 41, 8, 71]
Loaded encoding: [64, 5, 41, 30, 40, 41, 8, 71]
Match: True
Interactive Exploration
Watch BPE tokenization step by step. Type text and see how it gets broken into tokens through iterative pair merging.
Warning: Demo Uses Pre-defined Merge Rules
This interactive demo uses a simplified, pre-defined set of common English merge rules — not dynamically computed merges. A real tokenizer learns merges from a training corpus, but the mechanism shown here is identical. The Python implementation above (BPETokenizer) demonstrates actual BPE training.
Tip: Try This
Common words merge well: Type “the” or “and” - they become single tokens quickly due to high-frequency merges.
Step through merges: Enable “Show step-by-step” and slide the merge steps from 0 to max. Watch how character pairs combine into larger tokens.
Rare words stay split: Type “xyz” or uncommon words - they remain as characters because those patterns weren’t in the training data.
Compression varies: Compare “the the the” (high compression) vs “qxz qxz qxz” (low compression). Common patterns compress better.
Spaces are preserved: Notice that spaces remain as separate tokens (shown as ␣). This is typical BPE behavior.
Exercises
Exercise 1: Compression Efficiency
BPE achieves better compression on repetitive text. This matters because better compression = shorter sequences = more context in the model’s window.
# Train on repetitive vs varied text and compare compression
texts = {
"repetitive": "the the the " * 100,
"varied": " ".join([f"word{i}" for i in range(100)]),
"code": python_code,
}
print("Compression comparison:")
print("=" * 40)
for name, text in texts.items():
tok = BPETokenizer(vocab_size=200, verbose=False)
tok.train(text, show_progress=False)
ids = tok.encode(text)
ratio = len(text) / len(ids)
print(f"{name:12s}: {ratio:.2f} chars/token")
print("\nNote: Repetitive text compresses best because BPE learns")
print("common patterns. Code has structure but more variety.")Compression comparison:
========================================
repetitive : 2.00 chars/token
varied : 2.38 chars/token
code : 2.70 chars/token
Note: Repetitive text compresses best because BPE learns
common patterns. Code has structure but more variety.
Exercise 2: Analyze the First Merges
The first merges reveal the most frequent patterns in your data. For English text, you’ll often see common letter pairs like ‘th’, ‘he’, ‘in’.
# What patterns are learned first?
sample_text = "hello world hello world hello world " * 10
tok = BPETokenizer(vocab_size=50, verbose=False)
tok.train(sample_text, show_progress=False)
print("First 10 merges (most frequent patterns):")
for i, ((a, b), merged) in enumerate(list(tok.merges.items())[:10]):
print(f" {i+1}. '{a}' + '{b}' = '{merged}'")
print("\nNotice: Common substrings merge first, eventually")
print("forming complete words like 'hello' and 'world'.")First 10 merges (most frequent patterns):
1. 'h' + 'e' = 'he'
2. 'he' + 'l' = 'hel'
3. 'hel' + 'l' = 'hell'
4. 'hell' + 'o' = 'hello'
5. 'w' + 'o' = 'wo'
6. 'wo' + 'r' = 'wor'
7. 'wor' + 'l' = 'worl'
8. 'worl' + 'd' = 'world'
Notice: Common substrings merge first, eventually
forming complete words like 'hello' and 'world'.
Exercise 3: Observe Unknown Character Behavior
Our simple tokenizer can only encode characters it saw during training. Characters not in the vocabulary become <UNK> tokens. This exercise demonstrates the problem — and why production tokenizers use byte-level BPE to solve it.
# What happens with characters not in training?
tokenizer = BPETokenizer(vocab_size=50, verbose=False)
tokenizer.train("hello world", show_progress=False)
# Try encoding text with emoji
test = "hello world" # Safe text
try:
ids = tokenizer.encode(test)
print(f"'{test}' -> {ids}")
print(f"Decoded: '{tokenizer.decode(ids)}'")
except Exception as e:
print(f"Error: {e}")
# Now try with a character not in training
test2 = "hello 123"
ids = tokenizer.encode(test2)
tokens = [tokenizer.id_to_token(i) for i in ids]
print(f"\n'{test2}' -> {ids}")
print(f"Tokens: {tokens}")
print("\nNotice: '1', '2', '3' become <UNK> (ID 1) because they")
print("weren't in the training data.")
print("\nThis is why production tokenizers use BYTE-LEVEL BPE:")
print("- Operate on UTF-8 bytes (0-255) instead of Unicode characters")
print("- Any byte sequence can be represented -> no UNK tokens")
print("- tiktoken and SentencePiece both use this approach")'hello world' -> [7, 6, 8, 8, 9, 4, 11, 9, 10, 8, 5]
Decoded: 'hello world'
'hello 123' -> [7, 6, 8, 8, 9, 4, 1, 1, 1]
Tokens: ['h', 'e', 'l', 'l', 'o', ' ', '<UNK>', '<UNK>', '<UNK>']
Notice: '1', '2', '3' become <UNK> (ID 1) because they
weren't in the training data.
This is why production tokenizers use BYTE-LEVEL BPE:
- Operate on UTF-8 bytes (0-255) instead of Unicode characters
- Any byte sequence can be represented -> no UNK tokens
- tiktoken and SentencePiece both use this approach
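For comparison, a byte-level production tokenizer encodes the same digits without any <UNK>; this snippet assumes the tiktoken package is installed:
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # GPT-2's byte-level BPE
ids = enc.encode("hello 123")
print(ids)
print(enc.decode(ids))                # 'hello 123' -- lossless round trip, no UNK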
Exercise 4: Whitespace Handling
Whitespace is tricky in tokenization. Our tokenizer preserves it, but notice how spaces can be part of tokens.
# Whitespace is significant in tokenization
code_tok = BPETokenizer(vocab_size=100, verbose=False)
code_tok.train("def foo():\n return 1\ndef bar():\n return 2", show_progress=False)
# See how indentation is tokenized
samples = [
"def foo():",
" return", # 4 spaces
" x", # 8 spaces
]
for sample in samples:
ids = code_tok.encode(sample)
tokens = [code_tok.id_to_token(i) for i in ids]
print(f"{repr(sample):20s} -> {tokens}")
print("\nIn production tokenizers, leading spaces often attach to")
print("the following word: ' hello' is one token, not ' ' + 'hello'")'def foo():' -> ['def', ' ', 'f', 'o', 'o', '():']
' return' -> [' ', ' ', 'return']
' x' -> [' ', ' ', ' ', ' ', '<UNK>']
In production tokenizers, leading spaces often attach to
the following word: ' hello' is one token, not ' ' + 'hello'
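You can see this with a production tokenizer as well (again assuming tiktoken is available; ' hello' is expected to be a single GPT-2 token, though the exact ID is not asserted here):
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.encode(" hello"))   # leading space attaches to the word (typically one token)
print(enc.encode("hello"))    # compare with the no-space version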
Tokenization in the LLM Pipeline
Tokenization occupies the first stage of the LLM pipeline: raw text is encoded into token IDs, those IDs are mapped to embeddings, the transformer processes the embeddings, and generated token IDs are decoded back into text.
Summary
Key takeaways:
- BPE learns subword units by iteratively merging the most frequent adjacent token pairs
- Vocabulary size is a tradeoff: larger = shorter sequences but more parameters and sparse token usage
- Special tokens (BOS, EOS, PAD, UNK) serve critical roles in the LLM pipeline
- Code patterns emerge naturally (def, self., return, indentation) when trained on code
- Round-trip guarantee: encode -> decode should perfectly reconstruct the original text
- Production tokenizers use byte-level BPE to handle any Unicode character without UNK tokens
What We Simplified
Our implementation differs from production tokenizers in several ways:
| Our Tokenizer | Production Tokenizers |
|---|---|
| Character-level BPE | Byte-level BPE (handles any UTF-8) |
| Python dict lookups | Optimized Rust/C++ (tiktoken is 10x+ faster) |
| No regex pre-tokenization | GPT uses regex to split contractions, numbers specially |
| Simple word splitting | Careful handling of whitespace, punctuation |
Pre-tokenization is a critical step we simplified. Production tokenizers first split text into “words” using regex patterns before applying BPE. This prevents merges across word boundaries — for example, preventing “the” at the end of one word from merging with “e” at the start of the next. GPT-2’s tokenizer uses a carefully crafted regex that handles contractions (“don’t” → “don”, “’t”), numbers, and punctuation specially. This pre-split ensures more linguistically meaningful merges.
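A sketch of this pre-split using the third-party regex module (needed for the \p{L} and \p{N} Unicode classes); the pattern is GPT-2-style but simplified, so treat it as an approximation rather than the exact production pattern:
import regex as re   # pip install regex

# Simplified GPT-2-style pre-tokenization pattern (approximation)
PRETOKENIZE = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+")

print(PRETOKENIZE.findall("don't panic, it's 2024!"))
# -> ['don', "'t", ' panic', ',', ' it', "'s", ' 2024', '!']
# BPE merges are then learned inside each piece, never across piece boundaries.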
Practical Implications
- Context length: A 4096-token context window holds varying amounts of text depending on tokenization efficiency
- Cost: API pricing is per-token, so tokenization directly affects cost
- Multilingual: Tokenizers trained on English use more tokens for other languages (2-3x for some)
- Code vs prose: Code often tokenizes inefficiently (many single-character tokens for syntax)
What’s Next
Module 04: Embeddings converts token IDs into dense vectors that capture meaning. Each token becomes a learnable vector in high-dimensional space.