Module 00: What Is a Language Model?

~10 minutes · No prerequisites

What You’ll Learn

By the end of this module, you will:

  • Understand what language models actually do (next-token prediction)
  • Know why simple statistical approaches fail at this task
  • Have a mental model of transformer architecture
  • See the roadmap of what you’ll build in this course

A language model predicts the next token.

Given the text def hello(, what comes next? A language model outputs a probability distribution over all possible tokens:

Token     Probability
)         31%
name      18%
self      12%
x         8%
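
Where do those percentages come from? Internally, the model assigns a raw score (a logit) to every token in its vocabulary and turns those scores into probabilities with a softmax. A minimal sketch in Python - the scores here are made up for illustration, and a real model scores its entire vocabulary rather than four tokens:

import math

# Made-up raw scores (logits) for a few candidate tokens after "def hello(".
logits = {")": 2.1, "name": 1.55, "self": 1.15, "x": 0.75}

# Softmax: exponentiate each score, then normalize so the results sum to 1.
exp_scores = {tok: math.exp(s) for tok, s in logits.items()}
total = sum(exp_scores.values())
probs = {tok: e / total for tok, e in exp_scores.items()}

for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok!r}: {p:.0%}")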

Three things to notice:

  1. Tokens aren’t words. The model works with subword pieces - fragments smaller than words. hello might become hel + lo. This keeps the vocabulary manageable (typically 30-50k tokens) while handling any text, including rare words and code the model has never seen before. (A short tokenizer sketch follows this list.)

  2. The output is probabilities, not a single answer. The model expresses uncertainty. Sometimes ) is clearly right; sometimes several options make sense.

  3. This simple task scales remarkably. Predicting the next token well requires understanding syntax, semantics, context, and even reasoning. A model that excels at this task can write code, answer questions, and hold conversations. This is how GitHub Copilot suggests completions, how ChatGPT generates responses, and how AI-powered autocomplete in your IDE works - all next-token prediction at scale.
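
To see real subword pieces, you can run any BPE tokenizer. This sketch assumes the tiktoken package is installed and uses GPT-2's vocabulary; the exact pieces you get depend on the tokenizer:

import tiktoken  # assumes the tiktoken package is installed (pip install tiktoken)

enc = tiktoken.get_encoding("gpt2")            # GPT-2's BPE vocabulary
ids = enc.encode("def hello(")
print(ids)                                      # integer token ids
print([enc.decode([i]) for i in ids])           # the subword pieces those ids stand for
print(enc.n_vocab)                              # vocabulary size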

Try It

Type a function definition like def add( and notice how the model predicts argument names. Then try for i in - the predictions shift based on what typically follows loop constructs.

Predicting the next token sounds simple. But how do you do it well?

Why Simple Approaches Fail

The obvious approach: count what tokens typically follow other tokens. After seeing for i in thousands of times, you learn that range often comes next.

This is called an n-gram model. It looks at the last N tokens to predict the next one. Simple, fast, and works surprisingly well for common patterns.
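
A bigram model (N = 2) is just a few lines of counting. A toy sketch, with whitespace splitting standing in for a real tokenizer:

from collections import Counter, defaultdict

# Count, for every token, which tokens follow it and how often.
corpus = "for i in range ( 10 ) : for j in range ( 5 ) :".split()

follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def predict(prev):
    """Probability of each next token, given only the previous one."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.most_common()}

print(predict("in"))      # {'range': 1.0} - "range" always followed "in" in this corpus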

But n-grams have a fundamental limitation that no amount of data can fix: they can only see a fixed window of tokens.

The Context Problem

Consider this code:

def calculate_sum(numbers):
    total = 0
    for n in numbers:
        total +=    # what goes here?

What goes in the blank? A human immediately knows the answer is n - it’s the loop variable. But an n-gram model looking at just total += has no idea. It might guess 1 or x or value - common things that follow +=.
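
To see why, here is a toy count over three loop bodies, conditioning on only the two tokens before the blank (again with whitespace splitting standing in for a tokenizer). From the model's two-token window, the snippets are indistinguishable:

from collections import Counter

snippets = [
    "for n in numbers : total += n",
    "for row in rows : total += row",
    "for x in xs : total += x",
]

# Count what follows the two-token context ("total", "+=") across the corpus.
counts = Counter()
for snippet in snippets:
    tokens = snippet.split()
    for i in range(2, len(tokens)):
        if tokens[i - 2 : i] == ["total", "+="]:
            counts[tokens[i]] += 1

print(counts)   # Counter({'n': 1, 'row': 1, 'x': 1}) - an even split, no way to choose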


The answer depends on context from 10+ tokens back. N-grams can’t reach that far. Increase the window size and you run into another problem: you’ve never seen this exact sequence before, so you have no statistics to rely on.

Language is full of these long-range dependencies:

  • Matching brackets and parentheses
  • Variable references spanning multiple lines
  • Pronouns referring to earlier nouns
  • Comments describing code that follows

A good language model needs to consider all previous tokens and learn which ones matter for the current prediction.

The Transformer Solution

Transformers solve this with a mechanism called attention: a way to look at all previous tokens and learn which ones matter.

The architecture flows like this: token IDs are turned into embedding vectors, a stack of transformer layers (attention plus a small feed-forward network) refines them, and a final projection maps the result to a probability distribution over the vocabulary.

Three key insights:

  1. Full context visibility. Unlike older models that read left-to-right one token at a time, transformers can see the entire input at once. Each layer processes all positions simultaneously, making them highly parallelizable and efficient.

  2. Learned relevance. The attention mechanism learns which tokens matter for each prediction. When predicting after total +=, it learns to focus on the loop variable n, not the function name.

  3. Stacked layers. Multiple transformer layers build increasingly abstract representations. Early layers might recognize syntax; later layers understand meaning.
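
To make the attention idea concrete, here is a minimal sketch of causal scaled dot-product attention using NumPy. The query/key/value vectors are random here - in a real transformer they are learned projections of the token embeddings, and there are many attention heads:

import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 8                      # 5 tokens, 8-dimensional vectors
Q = rng.normal(size=(seq_len, d))      # queries: what each position is looking for
K = rng.normal(size=(seq_len, d))      # keys:    what each position offers
V = rng.normal(size=(seq_len, d))      # values:  what each position passes along

scores = Q @ K.T / np.sqrt(d)          # compare every query with every key

# Causal mask: a position may not attend to tokens that come after it.
mask = np.triu(np.ones((seq_len, seq_len)), 1)
scores = np.where(mask == 1, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1

output = weights @ V                   # each position becomes a weighted mix of values
print(weights.round(2))                # row i: how much token i attends to tokens 0..i

The weights matrix is the important part: it is computed from the input itself, so which earlier tokens matter can change with every prediction.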

That’s the high-level overview. In the modules ahead, you’ll build each component from scratch.

What You’ll Build

Each module ahead tackles one piece of this architecture, starting with tensors and ending with text generation.

The path is sequential: each module builds on the previous ones. By the end, you’ll have a working language model that you fully understand - not magic, but something you built piece by piece.

Key Takeaways
  • Language models predict the next token from a probability distribution over the vocabulary
  • Simple counting (n-grams) fails because language has long-range dependencies that require more than a fixed window
  • Transformers use attention to dynamically focus on relevant context, no matter how far back
  • You’ll build each component from scratch in the modules ahead, from tensors to text generation

Start with Tensors →