Module 00: What Is a Language Model?

~10 minutes · No prerequisites

Note: What You’ll Learn

After this module, you will:

  • Understand what language models actually do (next-token prediction)
  • Know why simple statistical approaches fail at this task
  • Have a mental model of transformer architecture
  • See the roadmap of what you’ll build in this course

A language model predicts the next token.

Given the text def hello(, what comes next? A language model outputs a probability distribution over all possible tokens:

Token    Probability
)        31%
name     18%
self     12%
x        8%
Notice three points:

  1. Tokens aren’t words. The model works with subword pieces - fragments smaller than words. hello might become hel + lo. This keeps the vocabulary manageable (typically 30-50k tokens) while handling any text, including rare words and unfamiliar code. (A toy tokenizer sketch follows this list.)

  2. The output is probabilities, not a single answer. The model expresses uncertainty. Sometimes ) is clearly right; sometimes several options make sense.

  3. This simple task scales remarkably. Predicting the next token well requires understanding syntax, semantics, context, and even reasoning. A model that excels at this task can write code, answer questions, and hold conversations. This is how GitHub Copilot suggests completions, how ChatGPT generates responses, and how AI-powered autocomplete in your editor works - all next-token prediction at scale.
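
For point 1, here is a toy illustration of subword tokenization: a greedy longest-match tokenizer over a tiny invented vocabulary. Real tokenizers (BPE, WordPiece, and friends) learn their vocabularies from data, but the effect is the same - any input string breaks down into known pieces.

# Toy subword vocabulary, invented for illustration; real vocabularies hold 30-50k entries.
vocab = {"def", " ", "hel", "lo", "(", ")", "add", "name", "print"}

def tokenize(text, vocab):
    """Greedy longest-match tokenization: repeatedly take the longest known prefix."""
    tokens = []
    while text:
        for end in range(len(text), 0, -1):
            piece = text[:end]
            if piece in vocab:
                tokens.append(piece)
                text = text[end:]
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(text[0])
            text = text[1:]
    return tokens

print(tokenize("def hello(", vocab))   # ['def', ' ', 'hel', 'lo', '(']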

Try It

Type a function definition like def add( and notice how the model predicts argument names. Then try for i in - the predictions shift based on what typically follows loop constructs.

Predicting the next token sounds simple. But how do you do it well?

Why Simple Approaches Fail

The obvious approach: count which tokens typically follow which others. After seeing for i in thousands of times, you learn that range often comes next.

This is called an n-gram model. It conditions on only the previous N-1 tokens to predict the next one. Simple, fast, and it works surprisingly well for common patterns.
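
Here is a sketch of that counting idea - a tiny trigram model over a made-up, whitespace-tokenized corpus. It is not what any production system ships, but it captures the approach exactly: for each context of N-1 = 2 tokens, tally which token came next.

from collections import Counter, defaultdict

# Toy training text, already split into tokens (whitespace tokenization for simplicity).
corpus = ("for i in range ( 10 ) : print ( i ) "
          "for i in items : total += i").split()

N = 3  # trigram model: condition on the previous N - 1 = 2 tokens
counts = defaultdict(Counter)

# Count how often each token follows each (N - 1)-token context.
for idx in range(len(corpus) - N + 1):
    context = tuple(corpus[idx:idx + N - 1])
    next_token = corpus[idx + N - 1]
    counts[context][next_token] += 1

def predict(context):
    """Return next-token probabilities for a given context, or {} if the context is unseen."""
    followers = counts[tuple(context)]
    total = sum(followers.values())
    return {tok: c / total for tok, c in followers.items()}

print(predict(["for", "i"]))   # {'in': 1.0} - in the toy corpus, 'in' always followed 'for i'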

But n-grams have a fundamental limitation that no amount of data can fix: they see only a fixed window of tokens.

The Context Problem

Consider this code:

def calculate_sum(numbers):
    total = 0
    for n in numbers:
        total +=

What token comes next? A human immediately knows the answer is n - it’s the loop variable. But an n-gram model looking at just total += has no idea. It might guess 1 or x or value - tokens that commonly follow +=.
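
To see the failure concretely, here is the same kind of counting applied to a toy corpus (invented for illustration) in which different loops use different variable names. Conditioned only on the two tokens total +=, the counts mix every loop variable ever seen; nothing tells the model that this function’s loop variable is n.

from collections import Counter

# Toy corpus: loop bodies with different loop variables, whitespace-tokenized.
corpus = ("for x in items : total += x "
          "for value in rows : total += value "
          "for n in numbers : total += n").split()

# Count what follows the 2-token context ('total', '+=') anywhere in the corpus.
followers = Counter(
    corpus[i + 2]
    for i in range(len(corpus) - 2)
    if corpus[i:i + 2] == ["total", "+="]
)

total = sum(followers.values())
print({tok: f"{c / total:.0%}" for tok, c in followers.items()})
# {'x': '33%', 'value': '33%', 'n': '33%'} - the window is too small to pick out the right variable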

The answer depends on context from ten or more tokens earlier. N-grams cannot reach that far. Increase the window size and you run into a different problem: long token sequences are almost never repeated exactly, so there are no counts to guide the prediction.

Language is full of these long-range dependencies:

  • Matching brackets and parentheses
  • Variable references spanning multiple lines
  • Pronouns referring to earlier nouns
  • Comments describing code that follows

A good language model must consider all previous tokens and learn which ones matter for each prediction.

The Transformer Solution

Transformers solve this with attention: the mechanism that examines all previous tokens and learns which ones matter.

The architecture flows like this: the input text is split into tokens, each token becomes a vector (an embedding), the vectors pass through a stack of transformer layers that use attention to mix information between positions, and a final projection turns the last position’s vector into a probability distribution over the vocabulary.
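
The sketch below is a deliberately minimal version of that flow, written with NumPy and random weights, so the numbers mean nothing - what matters is the shape of the computation. It embeds a handful of token ids, runs one causal self-attention step over all positions at once, and turns the last position into next-token probabilities. All sizes and weights here are made up.

import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model, seq_len = 100, 16, 6                # tiny, made-up sizes
token_ids = rng.integers(0, vocab_size, size=seq_len)    # stand-in for a tokenized prompt

# 1. Embedding: every token id becomes a d_model-dimensional vector.
embedding = rng.normal(size=(vocab_size, d_model))
x = embedding[token_ids]                                  # shape (seq_len, d_model)

# 2. One self-attention step: queries, keys, and values are linear maps of x.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d_model)                       # how strongly each position attends to each other
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                                    # causal mask: no peeking at future tokens

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)            # softmax over the visible positions
attended = weights @ V                                    # each position = weighted mix of earlier positions

# 3. Project the last position back to vocabulary size and softmax into probabilities.
W_out = rng.normal(size=(d_model, vocab_size))
logits = attended[-1] @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs.shape, round(probs.sum(), 6))                 # (100,) 1.0 - a distribution over the toy vocabulary

A real transformer adds positional information, multiple attention heads, feed-forward sublayers, normalization, and many stacked layers - components the modules ahead build from scratch.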

Three key insights:

  1. Full context visibility. Older models read left-to-right, one token at a time. Transformers see the entire input at once. Each layer processes all positions simultaneously, making them highly parallelizable and efficient.

  2. Learned relevance. The attention mechanism learns which tokens matter for each prediction. When predicting after total +=, it learns to focus on the loop variable n, not the function name.

  3. Stacked layers. Multiple transformer layers build increasingly abstract representations. Early layers recognize syntax; later layers grasp meaning.

That’s the overview. The modules ahead build each component from scratch.

What You’ll Build

Each module tackles one piece of this architecture, from tensors all the way to text generation.

Each module builds on the previous ones. By the end, you’ll have a working language model you fully understand - something you built piece by piece.

Note: Key Takeaways
  • Language models predict the next token from a probability distribution over the vocabulary
  • Simple counting (n-grams) fails because language has long-range dependencies that demand variable context
  • Transformers use attention to dynamically focus on relevant context, however distant
  • You’ll build each component from scratch in the modules ahead, from tensors to text generation

Start with Tensors →