Module 03: Tokenization

Introduction

Converting text into numbers: before a language model can process text, we need to break it into tokens and map each token to a number.

Tokenization is how we convert raw text into a sequence of integers that the model can process. Modern LLMs use subword tokenization - they break text into pieces smaller than words but larger than characters.

Why subword tokenization?

  • Word-level: Can’t handle new words (OOV problem), huge vocabulary needed (millions for multilingual)
  • Character-level: Sequences become 4-5x longer, the O(n^2) attention cost explodes, and the model must learn spelling from scratch
  • Subword: Best of both worlds - handles new words via decomposition, reasonable sequence length

The most common algorithm is BPE (Byte Pair Encoding), originally invented for data compression in 1994 and adapted for NLP in 2016:

  1. Start with individual characters as the initial vocabulary
  2. Count all adjacent token pairs in the training corpus
  3. Merge the most frequent pair into a new token
  4. Add the merged token to the vocabulary
  5. Repeat until vocabulary size reached

What You’ll Learn

By the end of this module, you will be able to:

  • Build a character-level tokenizer from scratch
  • Understand why subword tokenization is necessary
  • Implement the BPE algorithm for training and encoding
  • Handle special tokens (PAD, UNK, BOS, EOS)
  • Recognize trade-offs in vocabulary size

But before diving into BPE, let’s build the simplest possible tokenizer from scratch.

The Simplest Tokenizer

The most straightforward approach: treat each character as a token.

# Build vocabulary from text
text = "hello world"
chars = sorted(set(text))
print(f"Unique characters: {chars}")
print(f"Vocabulary size: {len(chars)}")
# The core of any tokenizer: two lookup tables
stoi = {ch: i for i, ch in enumerate(chars)}  # string to integer
itos = {i: ch for i, ch in enumerate(chars)}  # integer to string

print("stoi (encode):", stoi)
print("itos (decode):", itos)
# Encode: text -> integers
def encode(text):
    return [stoi[ch] for ch in text]

# Decode: integers -> text
def decode(ids):
    return ''.join(itos[i] for i in ids)

# Try it out
encoded = encode("hello")
print(f"'hello' -> {encoded}")
print(f"{encoded} -> '{decode(encoded)}'")
# Round-trip test
original = "hello world"
reconstructed = decode(encode(original))
print(f"Original:      '{original}'")
print(f"Reconstructed: '{reconstructed}'")
print(f"Perfect round-trip: {original == reconstructed}")

That’s it! A complete tokenizer in about 10 lines of Python. Every tokenizer — no matter how sophisticated — has these same two operations:

  • encode: text to token IDs
  • decode: token IDs back to text

The Key Insight

Tokenization is really about compression and semantic grouping:

Tokenization   Vocabulary Size    Sequence Length   Semantics
Character      ~100 (ASCII)       Very long         None (individual letters)
Word           ~1,000,000+        Short             Strong (whole words)
Subword        ~30,000-100,000    Medium            Moderate (meaningful pieces)

Why Characters Aren’t Enough

Our character tokenizer works, but has serious problems at scale.

Problem 1: Long Sequences

sample_text = "The transformer architecture revolutionized natural language processing."
char_tokens = list(sample_text)
print(f"Text length: {len(sample_text)} characters")
print(f"Token count: {len(char_tokens)} tokens")
print(f"Compression ratio: {len(sample_text) / len(char_tokens):.2f}x (no compression!)")

Since attention is O(n^2) in sequence length, doubling the sequence length quadruples the compute cost. Character-level tokenization produces the longest possible sequences.

Problem 2: No Semantic Units

# The model sees this:
word = "transformer"
char_view = list(word)
print(f"Characters: {char_view}")
print(f"Token count: {len(char_view)}")

The model must learn from scratch that t-r-a-n-s-f-o-r-m-e-r is a meaningful unit. It gets no help from the tokenization. Compare this to word-level where “transformer” would be a single token with its own learned representation.

Problem 3: Vocabulary Explosion for Bytes

# If we go to byte-level (handling all Unicode)
text_with_emoji = "Hello! \U0001F60A"
byte_view = text_with_emoji.encode('utf-8')
print(f"Text: {text_with_emoji}")
print(f"Bytes: {list(byte_view)}")
print(f"Byte count: {len(byte_view)} (emoji = 4 bytes!)")

Byte-level tokenization can represent anything, but sequences become even longer. A single emoji becomes 4 tokens.

The Tradeoff

This is the fundamental tradeoff in tokenization:

Characters: Small vocab, long sequences, no semantics
Words:      Huge vocab, short sequences, good semantics, can't handle new words
Subwords:   Medium vocab, medium sequences, some semantics, handles new words

BPE finds a sweet spot by learning which character sequences appear frequently together and merging them into single tokens.

Intuition: Learning Patterns Through Merging

Think of BPE as compression that learns common patterns:

For code, BPE learns things like:

  • def (function definition with space)
  • self. (common in Python classes)
  • return (return statement)
  • '    ' (a run of four spaces - the standard Python indent)

The BPE Training Algorithm

Here’s how BPE learns to tokenize.

The Math

BPE is simple - just counting and merging:

# Count how often each adjacent pair occurs
pairs = count_pairs(tokens)  # {('h', 'e'): 50, ('e', 'l'): 30, ('l', 'l'): 80, ...}

# Find the most frequent pair
best_pair = max(pairs, key=pairs.get)  # ('l', 'l')

# Merge it everywhere it occurs
tokens = merge(tokens, best_pair, 'll')
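
The snippet above glosses over the helpers, so here is a minimal runnable sketch of a single training step. The count_pairs and merge functions below are illustrative stand-ins, not the implementations inside our BPETokenizer:

from collections import Counter

def count_pairs(tokens):
    """Count how often each adjacent pair of tokens occurs."""
    return Counter(zip(tokens, tokens[1:]))

def merge(tokens, pair, new_token):
    """Replace every occurrence of pair with new_token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_token)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# One training step on a toy corpus
tokens = list("hello hello hello")
pairs = count_pairs(tokens)
best_pair = max(pairs, key=pairs.get)  # ('h', 'e') here (ties broken by first occurrence)
tokens = merge(tokens, best_pair, ''.join(best_pair))
print(best_pair, '->', tokens)

Repeating this loop until the vocabulary reaches its target size is the entire training algorithm.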

Vocabulary size is a hyperparameter:

  • Too small: Sequences too long, less meaning per token
  • Too large: Many rare tokens, harder to learn
  • Typical: roughly 32K-100K tokens for modern LLMs (see the real-world sizes later in this module)

Encoding New Text

Once trained, encoding applies merges in the order they were learned:
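
Here is a minimal sketch of that process (not the BPETokenizer API - the merge list below is made up for illustration):

# Hypothetical merges, in the order they were learned during training
merges = [(('h', 'e'), 'he'), (('l', 'l'), 'll'),
          (('he', 'll'), 'hell'), (('hell', 'o'), 'hello')]

def bpe_encode(text, merges):
    """Start from characters, then apply each learned merge in order."""
    tokens = list(text)
    for pair, new_token in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(new_token)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(bpe_encode("hello", merges))  # ['hello']
print(bpe_encode("help", merges))   # ['he', 'l', 'p']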

Handling Unknown Words

BPE can handle words it has never seen: an unfamiliar word simply decomposes into smaller known subwords and, in the worst case, into individual characters (or bytes, for byte-level BPE). In the sketch above, 'help' was never merged into a single token, yet it still encodes as ['he', 'l', 'p'] - nothing is out of vocabulary as long as its characters are.

Special Tokens

Before diving into code, let’s understand special tokens - reserved tokens with specific meanings in the LLM pipeline:

Token          Purpose                 When Used
<PAD> (ID 0)   Padding                 Batch processing requires same-length sequences; padding fills the shorter ones.
<UNK> (ID 1)   Unknown                 Characters not seen during training. Production tokenizers avoid this with byte-level BPE.
<BOS> (ID 2)   Beginning of Sequence   Signals the start of text; helps the model distinguish context boundaries.
<EOS> (ID 3)   End of Sequence         Signals text completion; the model generates this to stop. Critical for generation.

These tokens are reserved in the vocabulary before training begins, ensuring consistent IDs across all tokenizers.
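
As a quick illustration of why <PAD> exists, here is a minimal sketch of batching two sequences of different lengths. The special-token IDs match the table above; the other IDs are made up:

seq_a = [2, 10, 11, 12, 3]   # <BOS> ... <EOS>
seq_b = [2, 10, 3]
max_len = max(len(seq_a), len(seq_b))

# Pad the shorter sequence with <PAD> (ID 0) so both can be stacked into one batch
batch = [seq + [0] * (max_len - len(seq)) for seq in (seq_a, seq_b)]
print(batch)  # [[2, 10, 11, 12, 3], [2, 10, 3, 0, 0]]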

Code Walkthrough

Let’s explore tokenization interactively:

# Import our BPE tokenizer
from tokenizer import BPETokenizer, SPECIAL_TOKENS

print("Special tokens:", SPECIAL_TOKENS)
print("\nThese tokens are reserved at IDs 0-3 before training begins.")

Training a BPE Tokenizer

The BPETokenizer class has key parameters:

  • vocab_size: Target vocabulary size (including special tokens)
  • min_frequency: Minimum times a pair must appear to be merged (default: 2). This prevents rare pairs from being merged — if a pair only appears once, it’s likely noise rather than a useful pattern. Higher values create more conservative, generalizable vocabularies.
  • verbose: Print detailed training progress

# Simple text to train on
simple_text = "ab cd ab cd ab cd ab cd " * 20

# Create and train tokenizer
# vocab_size includes the 4 special tokens, so effective learned tokens = vocab_size - 4
tokenizer = BPETokenizer(vocab_size=30, verbose=False)
stats = tokenizer.train(simple_text, show_progress=True)

print(f"\nVocab size: {stats['vocab_size']}")
print(f"Merges learned: {stats['num_merges']}")
print(f"Special tokens: {stats['num_special_tokens']}")
# See what patterns were learned
print("Learned merges:")
for i, ((a, b), merged) in enumerate(list(tokenizer.merges.items())[:10]):
    print(f"  {i+1}. {repr(a)} + {repr(b)} -> {repr(merged)}")

Encoding and Decoding

Encoding applies the merges in the exact order they were learned during training. This is crucial - the merge order determines how text is split.

# Encode some text
test_text = "ab cd"
ids = tokenizer.encode(test_text)
tokens = [tokenizer.id_to_token(i) for i in ids]

print(f"Text: '{test_text}'")
print(f"Token IDs: {ids}")
print(f"Tokens: {tokens}")

# Decode back
decoded = tokenizer.decode(ids)
print(f"Decoded: '{decoded}'")
print(f"Round-trip successful: {test_text == decoded}")
# With special tokens (used during actual LLM training/inference)
ids_with_special = tokenizer.encode(test_text, add_special_tokens=True)
print(f"\nWith special tokens: {ids_with_special}")
print(f"Tokens: {[tokenizer.id_to_token(i) for i in ids_with_special]}")

# Decoding skips special tokens by default
decoded = tokenizer.decode(ids_with_special, skip_special_tokens=True)
print(f"Decoded (skip special): '{decoded}'")

Training on Python Code

python_code = '''
def fibonacci(n):
    """Calculate the nth Fibonacci number."""
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

def factorial(n):
    """Calculate n factorial."""
    if n <= 1:
        return 1
    return n * factorial(n - 1)

class Calculator:
    def __init__(self):
        self.result = 0

    def add(self, x):
        self.result += x
        return self

    def subtract(self, x):
        self.result -= x
        return self

# Main execution
if __name__ == "__main__":
    print(fibonacci(10))
    print(factorial(5))
'''

print(f"Training on {len(python_code)} characters of Python code")
# Train tokenizer on code
code_tokenizer = BPETokenizer(vocab_size=200, verbose=False)
stats = code_tokenizer.train(python_code * 3, show_progress=True)

print(f"\nFinal vocab size: {stats['vocab_size']}")
print(f"Merges learned: {stats['num_merges']}")
# Look at what code patterns were learned
print("Interesting tokens learned (longest first):")
print("=" * 40)

interesting_patterns = []
for token, token_id in code_tokenizer.vocab.items():  # avoid shadowing the builtin id
    if len(token) >= 2 and not token.startswith('<'):
        interesting_patterns.append((token, token_id))

# Sort by length (longer = more merged)
interesting_patterns.sort(key=lambda x: len(x[0]), reverse=True)

for token, token_id in interesting_patterns[:15]:
    print(f"  {token_id:3d}: {repr(token)}")

Visualizing Tokenization

def visualize_tokens(tokenizer, text):
    """Show how text is split into tokens with colors."""
    ids = tokenizer.encode(text)
    tokens = [tokenizer.id_to_token(i) for i in ids]

    print(f"Original: {repr(text)}")
    print(f"Tokens ({len(tokens)}): {tokens}")
    print(f"IDs: {ids}")
    print(f"Compression: {len(text)/len(ids):.2f} chars/token")
    print()

# Try different code patterns
patterns = [
    "def fibonacci(n):",
    "self.result = 0",
    "return self",
    "    for i in range(10):",
]

for pattern in patterns:
    visualize_tokens(code_tokenizer, pattern)

Vocabulary Size Tradeoffs

Vocabulary size is one of the most important hyperparameters in tokenization:

Larger vocabulary:

  • (+) Shorter sequences = faster training, more context in a fixed window
  • (+) Common words as single tokens = better semantic units
  • (-) Larger embedding table = more parameters, more memory
  • (-) Rare tokens get few training examples = poor representations

Smaller vocabulary:

  • (+) Smaller model, faster embedding lookups
  • (+) Every token well-trained on many examples
  • (-) Longer sequences = slower training, less context
  • (-) Words split into less meaningful pieces

test_text = "def calculate_fibonacci(number):\n    return fibonacci(number)"

vocab_sizes = [50, 100, 200, 500]

print(f"Text: {repr(test_text)}")
print(f"Text length: {len(test_text)} characters")
print()

for vocab_size in vocab_sizes:
    tok = BPETokenizer(vocab_size=vocab_size, verbose=False)
    tok.train(python_code * 5, show_progress=False)

    ids = tok.encode(test_text)
    tokens = [tok.id_to_token(i) for i in ids]

    print(f"Vocab size {vocab_size}:")
    print(f"  Tokens: {len(ids)}")
    print(f"  Ratio: {len(test_text)/len(ids):.1f} chars/token")
    print(f"  Sample: {[tok.id_to_token(i) for i in ids[:5]]}...")
    print()

Real-world vocabulary sizes:

  • GPT-2: 50,257 tokens
  • GPT-4: ~100,000 tokens
  • Llama 2: 32,000 tokens
  • Claude: ~100,000 tokens

Saving and Loading

import tempfile
import os

# Save tokenizer (mkstemp avoids the deprecated, race-prone mktemp)
fd, save_path = tempfile.mkstemp(suffix='.json')
os.close(fd)
code_tokenizer.save(save_path)

# Load it back
loaded = BPETokenizer.load(save_path)

# Verify it works the same
test = "def test():"
original_ids = code_tokenizer.encode(test)
loaded_ids = loaded.encode(test)

print(f"\nOriginal encoding: {original_ids}")
print(f"Loaded encoding:   {loaded_ids}")
print(f"Match: {original_ids == loaded_ids}")

# Cleanup
os.unlink(save_path)

Interactive Exploration

Watch BPE tokenization in action. Type text and see how it gets broken into tokens through iterative pair merging.

Note: This demo uses a simplified, pre-defined set of common English merge rules (not dynamically computed). A real tokenizer would learn merges from a training corpus, but the mechanism is identical.

Tip: Try This
  1. Common words merge well: Type “the” or “and” - they become single tokens quickly due to high-frequency merges.

  2. Step through merges: Enable “Show step-by-step” and slide the merge steps from 0 to max. Watch how character pairs combine into larger tokens.

  3. Rare words stay split: Type “xyz” or uncommon words - they remain as characters because those patterns weren’t in the training data.

  4. Compression varies: Compare “the the the” (high compression) vs “qxz qxz qxz” (low compression). Common patterns compress better.

  5. Spaces are preserved: Notice that spaces remain as separate tokens (shown as ␣). This is typical BPE behavior.

Exercises

Exercise 1: Compression Efficiency

BPE achieves better compression on repetitive text. This matters because better compression = shorter sequences = more context in the model’s window.

# Train on repetitive vs varied text and compare compression

texts = {
    "repetitive": "the the the " * 100,
    "varied": " ".join([f"word{i}" for i in range(100)]),
    "code": python_code,
}

print("Compression comparison:")
print("=" * 40)

for name, text in texts.items():
    tok = BPETokenizer(vocab_size=200, verbose=False)
    tok.train(text, show_progress=False)
    ids = tok.encode(text)
    ratio = len(text) / len(ids)
    print(f"{name:12s}: {ratio:.2f} chars/token")

print("\nNote: Repetitive text compresses best because BPE learns")
print("common patterns. Code has structure but more variety.")

Exercise 2: Analyze the First Merges

The first merges reveal the most frequent patterns in your data. For English text, you’ll often see common letter pairs like ‘th’, ‘he’, ‘in’.

# What patterns are learned first?

sample_text = "hello world hello world hello world " * 10
tok = BPETokenizer(vocab_size=50, verbose=False)
tok.train(sample_text, show_progress=False)

print("First 10 merges (most frequent patterns):")
for i, ((a, b), merged) in enumerate(list(tok.merges.items())[:10]):
    print(f"  {i+1}. '{a}' + '{b}' = '{merged}'")

print("\nNotice: Common substrings merge first, eventually")
print("forming complete words like 'hello' and 'world'.")

Exercise 3: Observe Unknown Character Behavior

Our simple tokenizer can only encode characters it saw during training. Characters not in the vocabulary become <UNK> tokens. This exercise demonstrates the problem — and why production tokenizers use byte-level BPE to solve it.

# What happens with characters not in training?

tokenizer = BPETokenizer(vocab_size=50, verbose=False)
tokenizer.train("hello world", show_progress=False)

# First, encode text containing only characters seen during training
test = "hello world"  # safe: every character appeared in the training text
try:
    ids = tokenizer.encode(test)
    print(f"'{test}' -> {ids}")
    print(f"Decoded: '{tokenizer.decode(ids)}'")
except Exception as e:
    print(f"Error: {e}")

# Now try with a character not in training
test2 = "hello 123"
ids = tokenizer.encode(test2)
tokens = [tokenizer.id_to_token(i) for i in ids]
print(f"\n'{test2}' -> {ids}")
print(f"Tokens: {tokens}")
print("\nNotice: '1', '2', '3' become <UNK> (ID 1) because they")
print("weren't in the training data.")
print("\nThis is why production tokenizers use BYTE-LEVEL BPE:")
print("- Operate on UTF-8 bytes (0-255) instead of Unicode characters")
print("- Any byte sequence can be represented -> no UNK tokens")
print("- tiktoken and SentencePiece both use this approach")

Exercise 4: Whitespace Handling

Whitespace is tricky in tokenization. Our tokenizer preserves it, but notice how spaces can be part of tokens.

# Whitespace is significant in tokenization
code_tok = BPETokenizer(vocab_size=100, verbose=False)
code_tok.train("def foo():\n    return 1\ndef bar():\n    return 2", show_progress=False)

# See how indentation is tokenized
samples = [
    "def foo():",
    "    return",  # 4 spaces
    "        x",   # 8 spaces
]

for sample in samples:
    ids = code_tok.encode(sample)
    tokens = [code_tok.id_to_token(i) for i in ids]
    print(f"{repr(sample):20s} -> {tokens}")

print("\nIn production tokenizers, leading spaces often attach to")
print("the following word: ' hello' is one token, not ' ' + 'hello'")

Tokenization in the LLM Pipeline

Here’s where tokenization fits in the full pipeline:
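
In rough terms: text -> encode -> token IDs -> model -> new token IDs -> decode -> text. The sketch below shows only the tokenizer's role; model and sample are hypothetical stand-ins for components built in later modules.

prompt = "def fibonacci("
ids = code_tokenizer.encode(prompt, add_special_tokens=True)  # text -> token IDs

# Generation loop (hypothetical - model and sample are built in later modules):
#     logits = model(ids)        # token IDs -> scores for the next token
#     next_id = sample(logits)   # choose the next token ID
#     ids.append(next_id)
#     ...repeat until next_id is <EOS>...

completion = code_tokenizer.decode(ids)                       # token IDs -> text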

Summary

Key takeaways:

  1. BPE learns subword units by iteratively merging the most frequent adjacent token pairs
  2. Vocabulary size is a tradeoff: larger = shorter sequences but more parameters and sparse token usage
  3. Special tokens (BOS, EOS, PAD, UNK) serve critical roles in the LLM pipeline
  4. Code patterns emerge naturally (def, self., return, indentation) when trained on code
  5. Round-trip guarantee: encode -> decode should perfectly reconstruct the original text
  6. Production tokenizers use byte-level BPE to handle any Unicode character without UNK tokens

What We Simplified

Our implementation differs from production tokenizers in several ways:

Our Tokenizer               Production Tokenizers
Character-level BPE         Byte-level BPE (handles any UTF-8)
Python dict lookups         Optimized Rust/C++ (tiktoken is 10x+ faster)
No regex pre-tokenization   Regex pre-tokenization (contractions, numbers, punctuation handled specially)
Simple word splitting       Careful handling of whitespace and punctuation

Pre-tokenization is a critical step we simplified. Production tokenizers first split text into “words” using regex patterns before applying BPE. This prevents merges across word boundaries - for example, it keeps the end of one word from merging with the start of the next. GPT-2’s tokenizer uses a carefully crafted regex that handles contractions (“don’t” → “don” + “’t”), numbers, and punctuation specially. This pre-split ensures more linguistically meaningful merges.
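
Here is a rough sketch of the idea with a deliberately simplified pattern. This is not GPT-2’s actual regex (which relies on Unicode property classes via the third-party regex module); it only shows how contraction suffixes, numbers, and punctuation get split off before BPE runs:

import re

# Simplified pre-tokenization: contraction suffixes, words (with an optional
# leading space attached), numbers, punctuation runs, and leftover whitespace
pattern = re.compile(r"'[a-z]+| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

print(pattern.findall("don't split 123 tokens, please!"))
# ['don', "'t", ' split', ' 123', ' tokens', ',', ' please', '!']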

Practical Implications

  • Context length: A 4096-token context window holds varying amounts of text depending on tokenization efficiency
  • Cost: API pricing is per-token, so tokenization directly affects cost
  • Multilingual: Tokenizers trained on English use more tokens for other languages (2-3x for some)
  • Code vs prose: Code often tokenizes inefficiently (many single-character tokens for syntax)

What’s Next

In Module 04: Embeddings, we’ll learn how to convert token IDs into dense vectors that capture meaning. Each token becomes a learnable vector in high-dimensional space.