Module 00: What Is a Language Model?

~15 minutes · No prerequisites

Note: What You’ll Learn

After this module, you will:

  • Understand what language models do: next-token prediction
  • Know why simple statistical approaches fail at this task
  • Grasp how neural networks learn from data (training loop intuition)
  • Have a mental model of transformer architecture
  • See the roadmap of what you’ll build in this course

A language model predicts the next token.

Given the text def hello(, what comes next? A language model outputs a probability distribution over all possible tokens:

Token   Probability
)       31%
name    18%
self    12%
x       8%

Notice three points:

  1. Tokens aren’t words. The model works with subword pieces - fragments smaller than words. hello might become hel + lo. This keeps the vocabulary manageable - typically 30,000 to 50,000 tokens - yet handles any text, including rare words and unfamiliar code.

  2. The output is probabilities, not a single answer. The model expresses uncertainty. Sometimes ) is clearly right; sometimes several options make sense.

  3. This simple task scales remarkably. Predicting the next token well requires understanding syntax, semantics, context, and even reasoning. A model that excels at this task can write code, answer questions, and hold conversations. This is how GitHub Copilot suggests completions, how ChatGPT generates responses, and how your IDE’s autocomplete works - all next-token prediction at scale.
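
The distribution itself comes from turning raw model scores (called logits) into probabilities with a softmax. A minimal sketch, using made-up scores for a four-token vocabulary (the numbers are illustrative, not from a real model):

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    m = max(logits.values())                       # subtract max for numerical stability
    exps = {tok: math.exp(s - m) for tok, s in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical raw scores a model might assign after seeing "def hello("
logits = {")": 2.0, "name": 1.5, "self": 1.1, "x": 0.7}
probs = softmax(logits)

print(max(probs, key=probs.get))  # ")" - the highest-probability token
```

Higher scores get exponentially more probability mass, but every token keeps a nonzero share - that is the model expressing uncertainty.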

Try It

Type a function definition like def add( and notice how the model predicts argument names. Then try for i in - the predictions shift based on what typically follows loop constructs.

Predicting the next token sounds simple. How do you do it well?

Why Simple Approaches Fail

The obvious approach: count what tokens typically follow other tokens. After seeing for i in a million times, you learn that range follows.

This is called an n-gram model. It looks at the last N tokens to predict the next one. Simple, fast, and works surprisingly well for common patterns.

But n-grams have a fundamental limitation: they see only a fixed window of tokens.
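
A trigram version of this idea fits in a few lines. The token stream and contexts here are toy examples:

```python
from collections import Counter, defaultdict

def train_trigram(tokens):
    """Count which token follows each pair of preceding tokens."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - 2):
        counts[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
    return counts

def predict(counts, context):
    """Most frequent follower of a 2-token context, or None if unseen."""
    followers = counts.get(tuple(context))
    return followers.most_common(1)[0][0] if followers else None

tokens = "for i in range ( 10 ) : for j in range ( 5 ) :".split()
model = train_trigram(tokens)

print(predict(model, ["i", "in"]))     # "range" - a pattern seen in training
print(predict(model, ["while", "x"]))  # None - unseen context, no guidance
```

The second call shows the failure mode: any context not seen verbatim during training yields no statistics at all.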

The Context Problem

Consider this code:

def calculate_sum(numbers):
    total = 0
    for n in numbers:
        total +=

What token comes next? A human immediately knows the answer is n - it’s the loop variable. An n-gram model looking at just total += guesses blindly - perhaps 1, x, or value, tokens that commonly follow +=.


The answer depends on context from ten or more tokens earlier. N-grams cannot reach that far. Increase the window size and another problem emerges: this exact sequence is new, so statistics offer no guidance.

Language is full of these long-range dependencies:

  • Matching brackets and parentheses
  • Variable references spanning multiple lines
  • Pronouns referring to earlier nouns
  • Comments describing code that follows

A good language model must consider all previous tokens and learn which ones matter for each prediction.

Neural Networks: Learning from Data

N-grams count patterns; neural networks learn them. The difference is fundamental.

A neural network is a function with adjustable parameters. Feed it an input, it produces an output. These adjustable parameters are called weights - numbers multiplied with inputs and summed. They determine which function the network computes, and we adjust them so the function does what we want.
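
As a sketch, the core computation - inputs multiplied by weights, then summed - looks like this (the numbers are arbitrary):

```python
def neuron(inputs, weights):
    """A weighted sum: each input is multiplied by its weight, then summed."""
    return sum(x * w for x, w in zip(inputs, weights))

# Arbitrary example values: 1.0*0.5 + 2.0*(-0.25) + 3.0*0.1 = 0.3
print(round(neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1]), 2))
```

Change the weights and the same inputs produce a different output - that is what "adjustable parameters" means in practice.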

How Neural Networks Learn

This process matters because transformers are neural networks. The same training loop you’ll see here is exactly how we’ll train our language model in Module 07.

Training follows a simple loop:

  1. Forward pass: Feed input through the network, get a prediction
  2. Compute loss: Measure how wrong the prediction is (the loss function quantifies prediction quality)
  3. Backward pass: Calculate how each weight contributed to the error (these sensitivity values are called gradients)
  4. Update weights: Nudge weights in the direction that reduces error

Repeat with millions of examples. For language models, this means feeding billions of tokens from books, code, and web text, learning from each wrong prediction. The weights shift until the network predicts well.
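
Here is that loop in miniature, fitting a single weight to toy data with the gradient worked out by hand (a real model has billions of weights, but the loop is the same):

```python
# Fit y = w * x to data generated with w_true = 3.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = 0.0          # a single weight, initialized to zero
lr = 0.05        # learning rate: how big each nudge is

for step in range(200):
    for x, y in data:
        pred = w * x                 # 1. forward pass
        loss = (pred - y) ** 2       # 2. compute loss (squared error)
        grad = 2 * (pred - y) * x    # 3. backward pass: d(loss)/dw by hand
        w -= lr * grad               # 4. update weight to reduce error

print(round(w, 3))  # 3.0 - the weight has converged to the true value
```

Module 02 builds the machinery that computes step 3 automatically, so you never have to derive gradients by hand for a full network.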

(You’ll implement the backward pass yourself in Module 02: Autograd. PyTorch computes gradients automatically, but implementing them sharpens intuition.)

Why This Matters for Language

Neural networks do not merely memorize patterns as n-grams do. They learn representations - internal encodings where similar concepts cluster together.

Consider the word “cat”:

  • An n-gram sees only the characters c-a-t
  • A neural network learns a vector where “cat” is close to “dog”, “kitten”, and “pet” but far from “quantum” and “derivative”

In transformers, these representations become even richer: the same word gets different vectors depending on context. “Bank” near “river” differs from “bank” near “money.”

N-grams lack this capacity entirely. They treat “cat sat on the mat” and “dog sat on the rug” as completely unrelated sequences. Neural networks recognize the structural similarity and generalize patterns from one to the other.

This representation learning lets neural networks:

  • Generalize to unseen word combinations
  • Handle context by learning which parts of the input matter
  • Scale with more data and compute
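
You can see what “close” means with toy vectors and cosine similarity. These 3-dimensional vectors are invented for illustration; real learned embeddings have hundreds of dimensions:

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, near 0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up vectors purely for illustration - not real embeddings.
vec = {
    "cat":     [0.90, 0.80, 0.10],
    "kitten":  [0.85, 0.90, 0.15],
    "quantum": [0.10, 0.20, 0.95],
}

print(cosine(vec["cat"], vec["kitten"]) > cosine(vec["cat"], vec["quantum"]))  # True
```

In a trained model, this geometry emerges from the training loop alone - no one tells the network that cats and kittens are related.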

Transformers are neural networks distinguished by how they handle context: through attention.

The Transformer Solution

Attention examines all previous tokens and learns which ones matter.

The architecture flows like this: input tokens are converted to vectors (embeddings), passed through a stack of transformer layers - each applying attention followed by a feed-forward network - and finally projected to a probability distribution over the next token.

Three key insights:

  1. Full context visibility. Older models read left-to-right, one token at a time. Transformers see the entire input at once. Each layer processes all positions simultaneously, making them highly parallelizable and efficient.

  2. Learned relevance. The attention mechanism learns which tokens matter for each prediction. When predicting after total +=, it learns to focus on the loop variable n, not the function name.

  3. Stacked layers. Multiple transformer layers build increasingly abstract representations. Early layers recognize syntax; later layers grasp meaning.
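
The learned-relevance idea in point 2 can be sketched as scaled dot-product attention, the core operation inside a transformer layer. The vectors below are toy values; in a real model, queries, keys, and values are produced by learned weight matrices:

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query: score every previous
    token, softmax the scores into weights, and mix the values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # attention weights, sum to 1
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return output, weights

# Toy 2-d vectors: the query points the same way as the first key,
# so most of the attention weight lands on the first value.
query  = [1.0, 0.0]
keys   = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[5.0, 5.0], [1.0, 1.0], [1.0, 1.0]]
output, weights = attention(query, keys, values)
print([round(w, 2) for w in weights])  # [0.5, 0.25, 0.25]
```

The weights are computed fresh for every query, which is exactly what lets the model focus on the loop variable n in one context and on something else entirely in another.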

The modules ahead build each component from scratch.

What You’ll Build

Each module tackles one piece of this architecture, from tensors and autograd through attention and training to text generation.

Each module builds on the previous ones. By the end, you’ll have a working language model you fully understand - something you built piece by piece.

Note: Key Takeaways
  • Language models predict the next token from a probability distribution over the vocabulary
  • Simple counting (n-grams) fails because language has long-range dependencies that demand variable context
  • Neural networks learn patterns through iterative weight updates - they don’t just memorize, they generalize
  • Transformers use attention to dynamically focus on relevant context, however distant
  • You’ll build each component from scratch in the modules ahead, from tensors to text generation

Start with Tensors →