Module 08: Generation

Introduction

After training a language model, we want to generate text. At each step the model gives us a probability distribution over the next token - but how do we choose which token to use? This module explores decoding strategies, each of which trades off determinism, diversity, and coherence differently.

Text generation is the process of producing new text from a trained model. The model predicts probability distributions over the vocabulary, and we must decide how to select the next token from these probabilities.

Why does the decoding strategy matter?

  • Different strategies, different outputs: Greedy decoding gives deterministic results, sampling gives variety
  • Control creativity vs coherence: Temperature and filtering parameters let us tune this tradeoff
  • Application-specific needs: Code generation wants precision, creative writing wants diversity
  • Avoid degenerate outputs: Poor settings lead to repetition, incoherence, or nonsense

Understanding generation is essential for building LLM applications.

What You’ll Learn

By the end of this module, you will be able to:

  • Implement the autoregressive generation loop from scratch
  • Apply and combine decoding strategies (greedy, temperature, top-k, top-p)
  • Use repetition penalties to prevent degenerate outputs
  • Understand KV-caching for efficient generation
  • Choose appropriate generation parameters for different use cases

The Generation Loop

Text generation is autoregressive - we generate one token at a time, feeding previous tokens back to the model:

Each step:

  1. Feed current tokens to the model
  2. Get probability distribution over vocabulary for next position
  3. Apply decoding strategy to select next token
  4. Append selected token to sequence
  5. Repeat until stopping criterion met

From Scratch: The Generation Loop

Let’s build generation from the ground up. At its core, generation is just repeated next-token prediction with sampling.

import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Stable softmax: subtract max to prevent overflow."""
    x_max = x.max(axis=-1, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def generate_scratch(get_logits, context: np.ndarray, max_new_tokens: int = 10, temperature: float = 1.0) -> np.ndarray:
    """
    Generate tokens autoregressively from scratch.

    Args:
        get_logits: Function that takes context (1, seq_len) and returns logits (1, vocab_size)
        context: Starting token ids, shape (1, seq_len)
        max_new_tokens: How many tokens to generate
        temperature: Sampling temperature (higher = more random)

    Returns:
        Extended context with generated tokens
    """
    ctx = context.copy()

    for _ in range(max_new_tokens):
        # 1. Get logits for last position
        logits = get_logits(ctx)  # (1, vocab_size)

        # 2. Apply temperature (scale before softmax)
        logits = logits / temperature

        # 3. Convert to probabilities
        probs = softmax(logits)[0]  # (vocab_size,)

        # 4. Sample next token
        next_token = np.random.choice(len(probs), p=probs)

        # 5. Append to context
        ctx = np.concatenate([ctx, [[next_token]]], axis=1)

    return ctx

Key insight: Generation is surprisingly simple. The model predicts, we sample, and we feed the result back in. That’s it.

From Logits to Tokens

The logits-to-token pipeline is the heart of generation:

# Step-by-step: logits -> probabilities -> token

# Simulate model output (logits for 8-token vocabulary)
logits = np.array([[2.0, 1.5, 0.5, 0.0, -0.5, -1.0, -1.5, -2.0]])
token_names = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast"]

print("Step 1: Raw logits from model")
for i, (name, logit) in enumerate(zip(token_names, logits[0])):
    print(f"  {name:>4}: {logit:+.1f}")

print("\nStep 2: Apply softmax to get probabilities")
probs = softmax(logits)[0]
for name, prob in zip(token_names, probs):
    bar = "█" * int(prob * 40)
    print(f"  {name:>4}: {prob:.3f} {bar}")

print("\nStep 3: Sample from the distribution")
np.random.seed(42)
sampled_idx = np.random.choice(len(probs), p=probs)
print(f"  Sampled token: '{token_names[sampled_idx]}' (index {sampled_idx})")

Temperature controls the randomness by scaling logits before softmax:

print("Effect of temperature on the same logits:\n")

for temp in [0.5, 1.0, 2.0]:
    scaled_logits = logits / temp
    probs = softmax(scaled_logits)[0]

    print(f"Temperature = {temp}:")
    for name, prob in zip(token_names[:4], probs[:4]):  # Show top 4
        bar = "█" * int(prob * 30)
        print(f"  {name:>4}: {prob:.3f} {bar}")
    print()

print("Lower temp = sharper (more deterministic)")
print("Higher temp = flatter (more random)")

Now let’s see it in action with a real model.

Code Walkthrough

Let’s explore generation interactively:

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np

print(f"PyTorch version: {torch.__version__}")

Setting Up

import sys
sys.path.insert(0, '..')

from generation import (
    top_k_filtering,
    top_p_filtering,
    apply_repetition_penalty,
    generate,
    generate_greedy,
    generate_sample,
    get_token_probabilities,
    get_top_tokens,
)
from m06_transformer.transformer import create_gpt_tiny

# Create a small model for demonstration
vocab_size = 50
model = create_gpt_tiny(vocab_size=vocab_size)

# Create a sample prompt
prompt = torch.randint(0, vocab_size, (1, 5))
print(f"Prompt tokens: {prompt[0].tolist()}")

Understanding Model Output

A language model outputs logits (unnormalized scores) that become probabilities after softmax:

# Get probability distribution for next token
probs = get_token_probabilities(model, prompt)

print(f"Probability distribution shape: {probs.shape}")
print(f"Sum of probabilities: {probs.sum().item():.4f}")

# Visualize the distribution
plt.figure(figsize=(12, 4))
plt.bar(range(vocab_size), probs[0].numpy())
plt.xlabel('Token ID')
plt.ylabel('Probability')
plt.title('Next Token Probability Distribution')
plt.grid(True, alpha=0.3)
plt.show()

# Show top tokens
top = get_top_tokens(probs, k=5)
print("\nTop 5 most likely next tokens:")
for token_id, prob in top:
    print(f"  Token {token_id}: {prob*100:.2f}%")

Decoding Strategies

1. Greedy Decoding

Always pick the token with the highest probability - simple but can be repetitive.

Pros: Deterministic, coherent output
Cons: Boring, repetitive, can get stuck in loops
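
To see the rule in isolation, here is a minimal illustration of greedy selection versus sampling on standalone logits (the values below are made up for illustration, not output from the model above):

# Greedy vs sampling on hypothetical logits
demo_logits = torch.tensor([[2.0, 1.5, 0.5, 0.0]])
greedy_token = demo_logits.argmax(dim=-1)                                # always token 0
sampled_token = torch.multinomial(F.softmax(demo_logits, dim=-1), 1)     # varies run to run
print(f"Greedy pick: {greedy_token.item()}, sampled pick: {sampled_token.item()}")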

# Generate with greedy decoding
output_greedy = generate_greedy(model, prompt, max_new_tokens=15)

print(f"Prompt: {prompt[0].tolist()}")
print(f"Generated: {output_greedy[0, 5:].tolist()}")
# Greedy is deterministic - same output every time
print("Multiple greedy generations (should all be identical):")
for i in range(3):
    out = generate_greedy(model, prompt, max_new_tokens=10)
    print(f"  Run {i+1}: {out[0, 5:].tolist()}")

2. Temperature Sampling

Temperature controls the “sharpness” of the probability distribution before sampling:

\[P_{\text{new}} = \text{softmax}(\text{logits} / T)\]

  • Temperature < 1.0: Sharper distribution (more like greedy)
  • Temperature = 1.0: Original distribution
  • Temperature > 1.0: Flatter distribution (more random)

Note: Temperature = 0 would cause division by zero. In practice, very low temperatures (e.g., 0.01) approximate greedy decoding, and many implementations treat temperature = 0 as an alias for greedy mode.
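
One common way to implement that guard, shown as a hedged sketch (pick_next is a hypothetical helper, not part of this module's generation.py):

def pick_next(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Treat temperature == 0 as greedy instead of dividing by zero."""
    if temperature == 0.0:
        return logits.argmax(dim=-1, keepdim=True)
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)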

# Visualize temperature effects
logits = torch.tensor([2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -2.0])

fig, axes = plt.subplots(1, 4, figsize=(16, 3))

for ax, temp in zip(axes, [0.3, 0.7, 1.0, 2.0]):
    scaled_logits = logits / temp
    probs = F.softmax(scaled_logits, dim=0)

    ax.bar(range(7), probs.numpy())
    ax.set_xlabel('Token')
    ax.set_ylabel('Probability')
    ax.set_title(f'Temperature = {temp}')
    ax.set_ylim(0, 1)

plt.suptitle('Effect of Temperature on Probability Distribution', fontsize=14)
plt.tight_layout()
plt.show()
# Generate with different temperatures
print("Generating with different temperatures:\n")

for temp in [0.3, 0.7, 1.0, 1.5]:
    print(f"Temperature = {temp}:")
    for i in range(3):
        torch.manual_seed(42 + i)
        out = generate(model, prompt, max_new_tokens=10, temperature=temp, do_sample=True)
        print(f"  Sample {i+1}: {out[0, 5:].tolist()}")
    print()

3. Top-k Sampling

Only sample from the k most likely tokens - filters out unlikely tokens:

# Demonstrate top-k filtering
logits = torch.tensor([[1.0, 3.0, 0.5, 2.5, 0.0, 2.0, -1.0, 1.5]])
original_probs = F.softmax(logits, dim=-1)

print("Original probabilities:")
for i, p in enumerate(original_probs[0]):
    print(f"  Token {i}: {p.item():.3f}")

# Apply top-k filtering
for k in [3, 5]:
    filtered = top_k_filtering(logits.clone(), k)
    filtered_probs = F.softmax(filtered, dim=-1)

    print(f"\nAfter top-k = {k}:")
    for i, p in enumerate(filtered_probs[0]):
        if p > 0:
            print(f"  Token {i}: {p.item():.3f}")
# Visualize top-k effect
fig, axes = plt.subplots(1, 4, figsize=(16, 3))

logits = torch.randn(1, 20) * 2  # Random logits

for ax, k in zip(axes, [1, 3, 5, 20]):
    filtered = top_k_filtering(logits.clone(), k)
    probs = F.softmax(filtered, dim=-1)[0]

    ax.bar(range(20), probs.numpy())
    ax.set_xlabel('Token')
    ax.set_ylabel('Probability')
    ax.set_title(f'Top-k = {k}')

plt.suptitle('Effect of Top-k Filtering', fontsize=14)
plt.tight_layout()
plt.show()
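
For completeness, here is a from-scratch sketch of top-k filtering; the imported top_k_filtering may differ in details such as how it handles ties at the k-th value:

def top_k_filtering_scratch(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Set every logit outside the k largest to -inf so softmax gives it zero probability."""
    kth_value = torch.topk(logits, k, dim=-1).values[..., -1, None]  # smallest logit we keep
    return logits.masked_fill(logits < kth_value, float('-inf'))

demo = torch.tensor([[1.0, 3.0, 0.5, 2.5, 0.0]])
print(F.softmax(top_k_filtering_scratch(demo, 2), dim=-1))  # only tokens 1 and 3 survive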

4. Top-p (Nucleus) Sampling

Keep the smallest set of tokens whose cumulative probability exceeds p. This adapts to the distribution - keeps more tokens when uncertain, fewer when confident.

Key advantage: Top-p adapts to the distribution shape:

  • Peaked (confident): Keeps fewer tokens
  • Flat (uncertain): Keeps more tokens

# Demonstrate top-p filtering
logits = torch.tensor([[3.0, 2.0, 1.5, 1.0, 0.5, 0.0, -0.5, -1.0]])
probs = F.softmax(logits, dim=-1)[0]

# Sort and show cumulative probabilities
sorted_probs, sorted_idx = torch.sort(probs, descending=True)
cumulative = torch.cumsum(sorted_probs, dim=0)

print("Tokens sorted by probability:")
print(f"{'Token':<8} {'Prob':<10} {'Cumulative':<10}")
print("-" * 28)
for i, (idx, p, c) in enumerate(zip(sorted_idx, sorted_probs, cumulative)):
    marker = " <- cutoff (p=0.9)" if c.item() > 0.9 and (i == 0 or cumulative[i-1].item() <= 0.9) else ""
    print(f"{idx.item():<8} {p.item():<10.3f} {c.item():<10.3f}{marker}")
# Compare top-p on peaked vs flat distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Peaked distribution (confident model)
peaked_logits = torch.tensor([[5.0, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5, -2.0]])

# Flat distribution (uncertain model)
flat_logits = torch.tensor([[1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]])

for row, (logits, name) in enumerate([(peaked_logits, "Peaked (confident)"), (flat_logits, "Flat (uncertain)")]):
    for col, p in enumerate([0.5, 0.7, 0.9]):
        filtered = top_p_filtering(logits.clone(), p)
        probs = F.softmax(filtered, dim=-1)[0]

        ax = axes[row, col]
        ax.bar(range(8), probs.numpy())
        ax.set_xlabel('Token')
        ax.set_ylabel('Probability')
        ax.set_title(f'{name}\np={p}, tokens kept: {(probs > 0).sum().item()}')

plt.suptitle('Top-p Adapts to Distribution Shape', fontsize=14)
plt.tight_layout()
plt.show()

print("Notice: Top-p keeps more tokens when the model is uncertain (flat dist)")
print("and fewer tokens when confident (peaked dist)!")

Combining Strategies

Each strategy has trade-offs: temperature affects the overall distribution shape, top-k provides a hard cutoff, and top-p adapts to model confidence. In practice, combining them often works better than any single approach:

# The typical generation pipeline
logits_example = torch.randn(1, vocab_size) * 2

# Step 1: Apply temperature
temperature = 0.7
logits_temp = logits_example / temperature

# Step 2: Apply top-k filtering
logits_topk = top_k_filtering(logits_temp.clone(), top_k=20)

# Step 3: Apply top-p filtering
logits_topp = top_p_filtering(logits_topk.clone(), top_p=0.9)

# Step 4: Sample from the distribution
probs = F.softmax(logits_topp, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)

print("Combined filtering pipeline:")
print(f"  Original vocab: {vocab_size} tokens")
print(f"  After top-k=20: {(F.softmax(logits_topk, dim=-1) > 0).sum().item()} tokens")
print(f"  After top-p=0.9: {(probs > 0).sum().item()} tokens")
print(f"  Sampled token: {next_token.item()}")
# Compare different strategy combinations
strategies = [
    ("Greedy", {"do_sample": False}),
    ("Temperature=0.5", {"temperature": 0.5, "do_sample": True}),
    ("Temperature=1.0", {"temperature": 1.0, "do_sample": True}),
    ("Top-k=5", {"top_k": 5, "do_sample": True}),
    ("Top-p=0.9", {"top_p": 0.9, "do_sample": True}),
    ("Combined (T=0.7, k=20, p=0.9)", {"temperature": 0.7, "top_k": 20, "top_p": 0.9, "do_sample": True}),
]

print("Comparing strategies (3 samples each):\n")

for name, kwargs in strategies:
    print(f"{name}:")
    for i in range(3):
        torch.manual_seed(100 + i)
        out = generate(model, prompt, max_new_tokens=10, **kwargs)
        tokens = out[0, 5:].tolist()
        print(f"  {tokens}")
    print()

Measuring Output Diversity

Let’s quantify how different strategies affect output diversity:

def measure_diversity(model, prompt, num_samples=20, **kwargs):
    """Measure how diverse the generated outputs are."""
    outputs = []
    for i in range(num_samples):
        torch.manual_seed(i)
        out = generate(model, prompt, max_new_tokens=15, **kwargs)
        outputs.append(tuple(out[0].tolist()))

    unique = len(set(outputs))
    return unique / num_samples

# Compare diversity across settings
settings = [
    ("Greedy", {"do_sample": False}),
    ("Temp=0.3", {"temperature": 0.3, "do_sample": True}),
    ("Temp=0.7", {"temperature": 0.7, "do_sample": True}),
    ("Temp=1.0", {"temperature": 1.0, "do_sample": True}),
    ("Temp=1.5", {"temperature": 1.5, "do_sample": True}),
]

diversities = []
for name, kwargs in settings:
    div = measure_diversity(model, prompt, num_samples=20, **kwargs)
    diversities.append((name, div))
    print(f"{name}: {div*100:.0f}% unique outputs")
# Visualize diversity
names = [d[0] for d in diversities]
values = [d[1] * 100 for d in diversities]

plt.figure(figsize=(10, 5))
colors = ['gray'] + list(plt.cm.viridis(np.linspace(0.2, 0.8, len(names)-1)))
plt.bar(names, values, color=colors)
plt.ylabel('% Unique Outputs')
plt.title('Output Diversity vs Temperature')
plt.ylim(0, 105)
for i, v in enumerate(values):
    plt.text(i, v + 2, f'{v:.0f}%', ha='center')
plt.show()

Choosing Parameters

Recommended settings for different use cases:

Goal                    Temperature   Top-k    Top-p
----------------------------------------------------
Code generation         0.2-0.4       10-20    0.8-0.9
Factual/deterministic   0.3-0.5       5-10     0.5-0.7
Coherent responses      0.7-0.9       20-50    0.85-0.92
Creative writing        0.8-1.2       40-100   0.9-0.95

# Example settings for different applications
use_cases = {
    "Code generation": {"temperature": 0.2, "top_p": 0.9, "do_sample": True},
    "Balanced chat": {"temperature": 0.7, "top_p": 0.9, "do_sample": True},
    "Creative writing": {"temperature": 1.0, "top_p": 0.95, "do_sample": True},
    "Brainstorming": {"temperature": 1.5, "top_p": 0.95, "do_sample": True},
}

print("Sample outputs for different use cases:\n")

for name, kwargs in use_cases.items():
    print(f"{name}:")
    for i in range(2):
        torch.manual_seed(42 + i)
        out = generate(model, prompt, max_new_tokens=12, **kwargs)
        print(f"  {out[0, 5:].tolist()}")
    print()

Repetition Penalty

A common problem with text generation is repetition - the model gets stuck repeating the same tokens or phrases. Repetition penalties address this by reducing the probability of tokens that have already appeared.

The repetition penalty works as follows:

  • For tokens that have appeared before:
    • If the logit is positive, divide by the penalty (reduces probability)
    • If the logit is negative, multiply by the penalty (makes it more negative)
  • Penalty = 1.0 means no change
  • Penalty > 1.0 discourages repetition (common values: 1.1 - 1.5)
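
The rule translates almost directly into code. A minimal sketch for a batch of one sequence (the imported apply_repetition_penalty used in the demo below may differ in details):

def repetition_penalty_scratch(logits: torch.Tensor, prev_tokens: torch.Tensor, penalty: float) -> torch.Tensor:
    """Penalize the logits of tokens that already appeared in the sequence (batch size 1)."""
    logits = logits.clone()
    for token in set(prev_tokens.flatten().tolist()):
        if logits[0, token] > 0:
            logits[0, token] /= penalty   # positive logit: shrink it toward zero
        else:
            logits[0, token] *= penalty   # negative logit: push it further down
    return logits
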
# Demonstrate repetition penalty
logits = torch.tensor([[2.0, 1.5, 1.0, 0.5, -0.5, -1.0]])
previous_tokens = torch.tensor([[0, 1, 4]])  # Tokens 0, 1, and 4 appeared

print("Original logits:")
for i, l in enumerate(logits[0]):
    marker = " (appeared)" if i in [0, 1, 4] else ""
    print(f"  Token {i}: {l.item():.2f}{marker}")

# Apply penalty
penalized = apply_repetition_penalty(logits, previous_tokens, penalty=1.5)

print("\nAfter repetition penalty (1.5):")
for i, l in enumerate(penalized[0]):
    marker = " (appeared)" if i in [0, 1, 4] else ""
    print(f"  Token {i}: {l.item():.2f}{marker}")

# Compare probabilities
orig_probs = F.softmax(logits, dim=-1)
new_probs = F.softmax(penalized, dim=-1)

print("\nProbability changes:")
for i in [0, 1, 2]:
    print(f"  Token {i}: {orig_probs[0,i].item():.3f} -> {new_probs[0,i].item():.3f}")
# Generate with and without repetition penalty
print("Generation without repetition penalty:")
out = generate_greedy(model, prompt, max_new_tokens=30)
tokens = out[0].tolist()
from collections import Counter
counts = Counter(tokens)
print(f"  Tokens: {tokens[5:]}")
print(f"  Most common: {counts.most_common(3)}")

print("\nGeneration with repetition penalty (1.3):")
out = generate(model, prompt, max_new_tokens=30, do_sample=False, repetition_penalty=1.3)
tokens = out[0].tolist()
counts = Counter(tokens)
print(f"  Tokens: {tokens[5:]}")
print(f"  Most common: {counts.most_common(3)}")

When to use repetition penalty:

  • Always for open-ended generation (stories, chat)
  • Less critical for short, structured outputs (classification, extraction)
  • Typical values: 1.1 for mild effect, 1.3-1.5 for stronger effect
  • Too high (> 2.0) can make outputs incoherent

Stop Conditions

Generation needs to know when to stop. There are two main stopping conditions:

  1. Maximum length (max_new_tokens) - Hard limit on generated tokens
  2. EOS token (eos_token_id) - Stop when a special end-of-sequence token is generated
# Demonstrate EOS stopping
# In real models, EOS is a special token. Here we'll use token 42 as our "EOS"
eos_id = 42

print(f"Generating with EOS token = {eos_id}")
print(f"(Generation stops early if token {eos_id} is produced)")

# Without EOS
out_no_eos = generate_greedy(model, prompt, max_new_tokens=20)
print(f"\nWithout EOS check: {len(out_no_eos[0]) - 5} new tokens generated")
print(f"  Tokens: {out_no_eos[0, 5:].tolist()}")

# With EOS (may stop early if 42 is generated)
out_with_eos = generate_greedy(model, prompt, max_new_tokens=20, eos_token_id=eos_id)
print(f"\nWith EOS check: {len(out_with_eos[0]) - 5} new tokens generated")
print(f"  Tokens: {out_with_eos[0, 5:].tolist()}")

if len(out_with_eos[0]) < len(out_no_eos[0]):
    print(f"  (Stopped early due to EOS token)")

Practical notes on stopping:

  • Always set a reasonable max_new_tokens to prevent runaway generation
  • EOS tokens are essential for chat/instruction models to indicate response completion
  • Batched generation continues until ALL sequences hit a stop condition (see the sketch after this list)
  • Some APIs support multiple stop sequences (not just EOS)
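
A hedged sketch of per-sequence stop tracking in a batched greedy loop; it assumes model(ctx) returns logits of shape (batch, seq_len, vocab_size), and the real generate() may handle this differently:

@torch.no_grad()
def generate_batched_sketch(model, ctx: torch.Tensor, max_new_tokens: int, eos_token_id: int) -> torch.Tensor:
    finished = torch.zeros(ctx.size(0), dtype=torch.bool)             # one flag per sequence
    for _ in range(max_new_tokens):
        logits = model(ctx)[:, -1, :]                                 # assumed (batch, vocab_size)
        next_token = logits.argmax(dim=-1)                            # greedy for simplicity
        # Sequences that already finished just keep emitting EOS as padding.
        next_token = torch.where(finished, torch.full_like(next_token, eos_token_id), next_token)
        ctx = torch.cat([ctx, next_token.unsqueeze(1)], dim=1)
        finished |= next_token == eos_token_id
        if finished.all():                                            # stop only when ALL are done
            break
    return ctx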

KV-Cache Optimization

In generation, we process one new token at a time. Without optimization, we’d recompute attention for ALL previous tokens every step - wasting computation!

KV-Cache stores Key and Value projections from previous tokens:

  • Without cache: O(n^2) per token, O(n^3) total for n tokens
  • With cache: O(n) per token, O(n^2) total for n tokens

This is a crucial optimization for fast inference!

How KV-Cache Works

In attention, we compute: \[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

For autoregressive generation:

  1. First forward pass (prompt): Compute K, V for all prompt tokens and cache them
  2. Each new token: Only compute Q, K, V for the new token
  3. Attention: New Q attends to cached K, V plus new K, V
  4. Update cache: Append new K, V to the cache
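
To make steps 2-4 concrete, here is a hedged numpy sketch of a KV-cache for a single attention head and a single sequence (a real implementation keeps one cache per layer and per head; this module's code does not include one):

import numpy as np

def attend_with_cache(x_new, W_q, W_k, W_v, cache):
    """One decoding step for a single attention head using a KV-cache.

    x_new: (d_model,) embedding of the newly generated token only.
    cache: dict with 'K' and 'V' arrays of shape (t, d_head), or None before the first step.
    """
    q = x_new @ W_q                                   # only the new token's query is needed
    k = x_new @ W_k
    v = x_new @ W_v
    cache['K'] = k[None, :] if cache['K'] is None else np.vstack([cache['K'], k])
    cache['V'] = v[None, :] if cache['V'] is None else np.vstack([cache['V'], v])
    scores = cache['K'] @ q / np.sqrt(q.shape[0])     # scores against every cached key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache['V']                       # attention output for the new token

# Three decoding steps: the cache grows by one row of K and V per step.
rng = np.random.default_rng(0)
d_model, d_head = 8, 4
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
cache = {'K': None, 'V': None}
for _ in range(3):
    out = attend_with_cache(rng.normal(size=d_model), W_q, W_k, W_v, cache)
print(f"Cached keys after 3 steps: {cache['K'].shape}")   # (3, 4)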

Memory Tradeoff

KV-cache trades memory for speed:

Aspect                    Without Cache   With Cache
----------------------------------------------------
Computation per token     O(n^2)          O(n)
Memory                    O(1) extra      O(n × layers × d)
Total time for n tokens   O(n^3)          O(n^2)

For a model with:

  • 32 layers, d_model = 4096, 8K context
  • KV cache size = 2 (K and V) × 32 × 4096 × 8192 × 2 bytes (float16) = ~4GB per sequence

This is why long-context models need significant GPU memory!
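
A quick back-of-the-envelope check of that figure (assuming K and V each store the full d_model per layer, i.e. no grouped-query attention):

layers, d_model, context_len = 32, 4096, 8192
bytes_per_value = 2                                                # float16
kv_bytes = 2 * layers * d_model * context_len * bytes_per_value    # leading 2 = one K and one V
print(f"KV-cache per sequence: {kv_bytes / 1024**3:.1f} GiB")      # ~4.0 GiB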

Practical Considerations

  • Prompt processing: First pass processes entire prompt (can be batched efficiently)
  • Generation: Subsequent tokens are generated one at a time (memory-bound)
  • Batch size tradeoff: Larger batches amortize overhead but need more KV-cache memory
  • Context length: Longer contexts need more cache memory per sequence

Note: Our simple generate() function recomputes everything each step for clarity. Production implementations use KV-caching for efficiency.

Interactive Exploration

Experiment with sampling strategies in real-time. Adjust temperature, top-k, and top-p to see how each reshapes the probability distribution before sampling.

Tip: Try This
  1. Temperature effect: Set top-k=0, top-p=1.0, then slide temperature from 0.1 to 2.0. Watch how low temperature makes “the” dominate, while high temperature flattens the distribution.

  2. Top-k sharpness: Set temperature=1.0, top-p=1.0, then increase top-k from 1 to 10. Notice how exactly k tokens are kept, regardless of their probabilities.

  3. Top-p adaptation: Set temperature=1.0, top-k=0, then decrease top-p from 1.0 to 0.5. Notice how top-p keeps more tokens when the distribution is flat, fewer when peaked.

  4. Combined filtering: Try temperature=0.7, top-k=10, top-p=0.9. This is a realistic configuration for balanced text generation.

Exercises

Exercise 1: Temperature Exploration

Experiment with extreme temperatures and observe the output behavior:

# Try very low and very high temperatures
print("Extreme temperature exploration:\n")

for temp in [0.1, 0.5, 1.0, 2.0, 5.0]:
    print(f"Temperature = {temp}:")
    outputs = set()
    for i in range(5):
        torch.manual_seed(i)
        out = generate(model, prompt, max_new_tokens=8, temperature=temp, do_sample=True)
        outputs.add(tuple(out[0, 5:].tolist()))
    print(f"  {len(outputs)}/5 unique sequences")
    # Show one sample
    torch.manual_seed(42)
    sample = generate(model, prompt, max_new_tokens=8, temperature=temp, do_sample=True)
    print(f"  Sample: {sample[0, 5:].tolist()}")
    print()

Exercise 2: Top-k vs Top-p

Compare how top-k and top-p behave differently:

# Compare filtering approaches
print("Top-k vs Top-p filtering:\n")

# Create a bimodal distribution (two likely options)
bimodal_logits = torch.tensor([[3.0, 3.0, -1.0, -1.0, -2.0, -2.0, -3.0, -3.0, -4.0, -4.0]])

print("Original probabilities (bimodal - two equally likely tokens):")
bimodal_probs = F.softmax(bimodal_logits, dim=-1)
for i, p in enumerate(bimodal_probs[0][:5]):
    print(f"  Token {i}: {p.item():.3f}")

# Top-k=2 keeps exactly 2 tokens
topk_filtered = top_k_filtering(bimodal_logits.clone(), 2)
topk_probs = F.softmax(topk_filtered, dim=-1)

# Top-p=0.5 adapts to distribution
topp_filtered = top_p_filtering(bimodal_logits.clone(), 0.5)
topp_probs = F.softmax(topp_filtered, dim=-1)

print(f"\nTop-k=2 keeps: {(topk_probs > 0).sum().item()} tokens")
print(f"Top-p=0.5 keeps: {(topp_probs > 0).sum().item()} tokens")

print("\n** Key insight: Top-k always keeps exactly k tokens.")
print("   Top-p adapts: it may keep fewer tokens if one dominates.")

Exercise 3: Repetition Penalty Effects

Explore how different repetition penalty values affect generation:

# Compare different repetition penalties
print("Repetition penalty comparison (greedy decoding, 40 tokens):\n")

for penalty in [1.0, 1.1, 1.3, 1.5, 2.0]:
    out = generate(model, prompt, max_new_tokens=40, do_sample=False, repetition_penalty=penalty)
    tokens = out[0, 5:].tolist()

    # Count unique tokens
    unique_ratio = len(set(tokens)) / len(tokens)

    print(f"Penalty = {penalty}:")
    print(f"  Unique tokens: {len(set(tokens))}/{len(tokens)} ({unique_ratio*100:.0f}%)")
    print(f"  First 15: {tokens[:15]}")
    print()

Exercise 4: Observing Repetition in Long Generation

Observe how greedy decoding can lead to repetition:

# Generate longer sequences to see repetition patterns
print("Long greedy generation (may show repetition):\n")

# Generate more tokens
long_output = generate_greedy(model, prompt, max_new_tokens=50)
tokens = long_output[0].tolist()

print(f"Generated sequence ({len(tokens)} tokens):")
print(tokens)

# Count token frequency
from collections import Counter
token_counts = Counter(tokens)
print(f"\nMost common tokens:")
for token, count in token_counts.most_common(5):
    print(f"  Token {token}: {count} times ({count/len(tokens)*100:.1f}%)")

The Complete Generation Function

Here’s the main generation function from our codebase:

# Display the signature and key parts
import inspect
from generation import generate

print("generate() function signature:")
print(inspect.signature(generate))
print()
print("Key parameters:")
print("  - model: The language model")
print("  - prompt_tokens: Starting sequence (batch, seq_len)")
print("  - max_new_tokens: How many tokens to generate")
print("  - temperature: Distribution sharpness (default 1.0)")
print("  - top_k: Filter to top k tokens (optional)")
print("  - top_p: Nucleus sampling threshold (optional)")
print("  - do_sample: If False, use greedy decoding")
print("  - eos_token_id: Stop token (optional)")
print("  - repetition_penalty: Penalize repeated tokens (default 1.0)")

Summary

Key takeaways from this module:

  1. Autoregressive generation: Produce tokens one at a time, feeding each back as input
  2. Greedy decoding: Always pick the max - deterministic but can be repetitive
  3. Temperature: Controls randomness - lower is more focused, higher is more diverse
  4. Top-k sampling: Limits choices to k most likely tokens
  5. Top-p (nucleus) sampling: Adapts to distribution shape - keeps more tokens when uncertain
  6. Repetition penalty: Reduces probability of previously-generated tokens to prevent loops
  7. Stop conditions: Use EOS tokens and max length to control when generation ends
  8. Combine strategies: Temperature + top-p + repetition penalty is common in practice
  9. KV-cache: Essential optimization - trades memory for O(n) speedup per token

Common Pitfalls

Problem                 Cause                                Solution
---------------------------------------------------------------------
Repetitive output       Greedy decoding or low temperature   Use sampling, repetition penalty
Incoherent nonsense     Temperature too high                 Lower temperature, use top-p
Cuts off mid-sentence   max_new_tokens too low               Increase limit, ensure EOS handling
Slow generation         No KV-cache                          Implement caching (production)
Out of memory           Long context + large batch           Reduce batch size or context

Conclusion

Congratulations! You’ve completed the Learn LLM series. You now understand all the building blocks of a language model:

  1. Tensors: The fundamental data structure
  2. Autograd: Automatic differentiation for training
  3. Tokenization: Converting text to numbers
  4. Embeddings: Learned vector representations
  5. Attention: The mechanism that lets tokens interact
  6. Transformer: The complete architecture
  7. Training: How models learn from data
  8. Generation: How to produce text from trained models

What’s Next?

  • Check out the minigpt directory to see everything assembled into a working model
  • Train your own small language model on real data
  • Explore the Going Deeper resources for advanced topics

Going Deeper

Advanced Topics (not covered here):

  • Beam Search: Maintain k best partial sequences; better for translation, worse for open-ended generation
  • Speculative Decoding: Use a small draft model to propose tokens, verify with large model in parallel
  • Structured Generation: Constrain outputs to valid JSON, code syntax, or grammar rules
  • Contrastive Decoding: Compare probabilities from expert and amateur models
