Module 02: Autograd
Introduction
The magic that makes neural networks trainable. Automatic differentiation computes gradients through any computation - the foundation of backpropagation.
Autograd (automatic differentiation) is how we compute gradients automatically. When you do loss.backward() in PyTorch, autograd figures out how to adjust every parameter to reduce the loss.
Why is this essential for LLMs?
- Millions of parameters: An LLM has millions (or billions) of numbers to adjust. We can’t compute gradients by hand.
- Complex computations: Attention, embeddings, layer norms - the gradient must flow through all of them.
- Training loop: Every training step computes gradients to update weights.
Without autograd, deep learning wouldn’t be practical.
What You’ll Learn
By the end of this module, you will be able to:
- Understand how computational graphs enable automatic gradient computation
- Build a working scalar autograd engine from scratch
- Extend these principles to tensor operations
- Use PyTorch’s autograd for gradient computation
- Recognize common autograd pitfalls and how to avoid them
Intuition: The Computational Graph
Every computation builds a directed acyclic graph (DAG) dynamically as operations execute. This is called “define-by-run” - the graph structure is determined by the actual code path taken, which can differ each forward pass (useful for control flow like conditionals and loops).
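A small PyTorch illustration of what define-by-run means: ordinary Python control flow decides which operations join the graph.
import torch
def forward(x):
    # A plain Python if: only the branch actually taken is recorded
    if x.sum() > 0:
        return (x * 3).sum()
    return (x * -1).sum()
x = torch.tensor([1.0, 2.0], requires_grad=True)
forward(x).backward()
print(x.grad)  # tensor([3., 3.]) - gradient of the branch that ran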
Think of a computation as a graph of operations. As a concrete example, take a = 3x, b = a + 1, c = b^2, evaluated at x = 2 (so a = 6, b = 7, c = 49):
Forward pass: Compute values flowing down (x -> a -> b -> c)
Backward pass: Compute gradients flowing up (c -> b -> a -> x)
- dc/dc = 1 (the output's gradient with respect to itself is always 1)
- dc/db = 2b = 14 (derivative of b^2)
- dc/da = dc/db × db/da = 14 × 1 = 14
- dc/dx = dc/da × da/dx = 14 × 3 = 42
The chain rule connects everything: multiply local gradients as you go back.
Computational Graph: Forward and Backward
Here’s a more detailed example showing both passes:
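A runnable PyTorch version of the same graph (a = 3x, b = a + 1, c = b^2), using retain_grad() to inspect gradients on the intermediate nodes:
import torch
x = torch.tensor(2.0, requires_grad=True)
a = 3 * x        # forward: a = 6
a.retain_grad()  # keep gradients on intermediate nodes for inspection
b = a + 1        # forward: b = 7
b.retain_grad()
c = b ** 2       # forward: c = 49
c.backward()     # backward: applies the chain rule from c down to x
print(b.grad)    # dc/db = 2b = 14
print(a.grad)    # dc/da = 14 * 1 = 14
print(x.grad)    # dc/dx = 14 * 3 = 42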
The Math
Chain Rule
For composed functions f(g(x)):
d/dx[f(g(x))] = f'(g(x)) × g'(x)
This extends to any depth - just multiply local derivatives along the path.
Common Gradients
| Operation | Forward | Local Gradient |
|---|---|---|
| c = a + b | sum | dc/da = 1, dc/db = 1 |
| c = a * b | product | dc/da = b, dc/db = a |
| c = a ** n | power | dc/da = n × a^(n-1) |
| c = exp(a) | exp | dc/da = exp(a) |
| c = log(a) | log | dc/da = 1/a |
| c = tanh(a) | tanh | dc/da = 1 - tanh^2(a) |
| c = relu(a) | ReLU | dc/da = 1 if a > 0 else 0 |
Matrix Multiplication Gradients
For C = A @ B:
- dL/dA = dL/dC @ B.T
- dL/dB = A.T @ dL/dC
This is why matrix shapes matter so much!
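These formulas are easy to verify numerically. A quick PyTorch check (shapes chosen arbitrarily):
import torch
A = torch.randn(3, 4, requires_grad=True)
B = torch.randn(4, 5, requires_grad=True)
loss = (A @ B).sum()
loss.backward()
# loss = sum(C), so dL/dC is all ones
dC = torch.ones(3, 5)
print(torch.allclose(A.grad, dC @ B.T))  # True
print(torch.allclose(B.grad, A.T @ dC))  # True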
Gradient Accumulation
When a value is used multiple times, gradients ADD:
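A minimal PyTorch example - x feeds two paths, and the path gradients sum:
import torch
x = torch.tensor(3.0, requires_grad=True)
a = x * 2        # path 1: da/dx = 2
b = x * 5        # path 2: db/dx = 5
(a + b).backward()
print(x.grad)    # tensor(7.) - contributions from both paths add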
Code Walkthrough
Let’s explore autograd interactively:
import torch
print(f"PyTorch version: {torch.__version__}")Building a Simple Autograd Engine
We’ll build a simple autograd engine (inspired by Andrej Karpathy’s micrograd). This handles scalar values only - PyTorch’s autograd extends these same principles to tensors of any shape, which is what makes it powerful for real neural networks.
class Value:
"""A scalar value that tracks its gradient."""
def __init__(self, data, children=(), op='', label=''):
self.data = data
self.grad = 0.0
self._backward = lambda: None
self._prev = set(children)
self._op = op
self.label = label
def __repr__(self):
return f"Value(data={self.data}, grad={self.grad})"
def __add__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data + other.data, (self, other), '+')
def _backward():
self.grad += out.grad # d(a+b)/da = 1
other.grad += out.grad # d(a+b)/db = 1
out._backward = _backward
return out
def __mul__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data * other.data, (self, other), '*')
def _backward():
self.grad += other.data * out.grad # d(a*b)/da = b
other.grad += self.data * out.grad # d(a*b)/db = a
out._backward = _backward
return out
def __pow__(self, other):
assert isinstance(other, (int, float)), "only supporting int/float powers"
out = Value(self.data ** other, (self,), f'**{other}')
def _backward():
self.grad += (other * self.data ** (other - 1)) * out.grad
out._backward = _backward
return out
def __neg__(self):
return self * -1
def __sub__(self, other):
return self + (-other)
def __radd__(self, other):
return self + other
def __rmul__(self, other):
return self * other
def tanh(self):
import math
t = math.tanh(self.data)
out = Value(t, (self,), 'tanh')
def _backward():
self.grad += (1 - t ** 2) * out.grad
out._backward = _backward
return out
def relu(self):
out = Value(max(0, self.data), (self,), 'relu')
def _backward():
self.grad += (self.data > 0) * out.grad
out._backward = _backward
return out
def exp(self):
import math
out = Value(math.exp(self.data), (self,), 'exp')
def _backward():
self.grad += out.data * out.grad
out._backward = _backward
return out
def log(self):
import math
out = Value(math.log(self.data), (self,), 'log')
def _backward():
self.grad += (1.0 / self.data) * out.grad
out._backward = _backward
return out
def __truediv__(self, other):
return self * (other ** -1)
def backward(self):
# Topological sort to process in correct order
topo = []
visited = set()
def build_topo(v):
if v not in visited:
visited.add(v)
for child in v._prev:
build_topo(child)
topo.append(v)
build_topo(self)
# Go backwards, applying chain rule
self.grad = 1.0
for v in reversed(topo):
v._backward()
Testing Our Value Class
# Create some Values
a = Value(2.0, label='a')
b = Value(3.0, label='b')
print(f"a = {a}")
print(f"b = {b}")
print(f"\nInitially, gradients are 0.0")# Perform a computation
c = a * b
c.label = 'c'
print(f"c = a * b = {c.data}")
print(f"\nNow let's compute gradients with backward():")
c.backward()
print(f"\ndc/da = {a.grad} (should be b = 3.0)")
print(f"dc/db = {b.grad} (should be a = 2.0)")Verifying Gradients Numerically
# Let's verify: increase 'a' by a tiny amount
epsilon = 0.0001
a_original = 2.0
b_val = 3.0
c_original = a_original * b_val
c_perturbed = (a_original + epsilon) * b_val
numerical_gradient = (c_perturbed - c_original) / epsilon
print(f"Numerical gradient dc/da = {numerical_gradient:.4f}")
print(f"Our computed gradient: {a.grad}")
print(f"\nThey match! (The tiny difference is numerical precision)")The Chain Rule in Action
# Let's trace through: c = (a + b) * b
a = Value(2.0, label='a')
b = Value(3.0, label='b')
sum_ab = a + b
sum_ab.label = 'sum'
c = sum_ab * b
c.label = 'c'
c.backward()
print(f"Expression: c = (a + b) * b")
print(f"a = {a.data}, b = {b.data}")
print(f"sum = a + b = {sum_ab.data}")
print(f"c = sum * b = {c.data}")
print(f"\nGradients:")
print(f"dc/da = {a.grad} (expected: b = 3)")
print(f"dc/db = {b.grad} (expected: sum + a = 5 + 3 = 8)")Training a Neuron
Let’s actually train a neuron to output a target value!
import matplotlib.pyplot as plt
# Training loop
x_val = 2.0
target_val = 0.8
# Learnable parameters (start with small random values)
w = 0.3
b = 0.1
learning_rate = 0.5
losses = []
print("Training a neuron to output 0.8 when input is 2.0")
print("=" * 50)
for step in range(20):
# Create fresh Values (gradients reset)
x = Value(x_val, label='x')
w_v = Value(w, label='w')
b_v = Value(b, label='b')
target = Value(target_val, label='target')
# Forward pass
y = (w_v * x + b_v).tanh()
loss = (y - target) ** 2
# Backward pass
loss.backward()
# Gradient descent update
w = w - learning_rate * w_v.grad
b = b - learning_rate * b_v.grad
losses.append(loss.data)
if step % 4 == 0:
print(f"Step {step:2d}: y={y.data:.4f}, loss={loss.data:.6f}, w={w:.4f}, b={b:.4f}")
print(f"\nFinal output: {y.data:.4f}")
print(f"Target: {target_val}")
print("Pretty close!")# Visualize the training
plt.figure(figsize=(10, 4))
plt.plot(losses, 'b-o')
plt.xlabel('Training Step')
plt.ylabel('Loss')
plt.title('Loss Decreasing During Training')
plt.grid(True, alpha=0.3)
plt.show()
print("\nThe loss decreases because gradients tell us which way to adjust w and b!")The Evolutionary Leap: Scalar to Tensor
We’ve now mastered scalar autograd with our Value class. The chain rule, computational graphs, and backward passes all work the same way at any scale. But real neural networks operate on tensors with thousands or millions of elements. Let’s build a second autograd engine — this time for tensors — to see how the same principles scale up.
Why Scalars Don’t Scale
Our Value class works beautifully for understanding autograd. But try to imagine training a real neural network with it:
- A small MLP with 10,000 parameters creates 10,000 Value objects
- Each forward pass creates thousands more intermediate Value nodes
- Python loops run for every dot product (catastrophically slow)
- Memory explodes with millions of tiny objects
The fix isn’t “optimize Python.” The fix is stop pretending a neural network is a pile of scalars.
# The problem: dot product with scalar Values
def slow_dot_product(values_a, values_b):
"""This is what we're doing now - Python loop over scalars."""
result = values_a[0] * values_b[0]
for a, b in zip(values_a[1:], values_b[1:]):
result = result + a * b
return result
# With 1000-dimensional vectors, that's 1000 Python operations
# A GPU can do this in ONE operation
The solution: upgrade from the scalar Value to a tensor-backed Tensor class.
Building a Tensor Autograd
Here’s the evolutionary leap: a Tensor class backed by NumPy arrays instead of Python floats.
import numpy as np
from typing import Optional, Tuple, Callable, Set
class Tensor:
"""
A NumPy-backed tensor with reverse-mode autodiff.
This is what PyTorch's tensors do internally (but in C++/CUDA).
"""
def __init__(self, data, requires_grad: bool = False, _prev: Optional[Set["Tensor"]] = None, _op: str = ""):
# Convert to numpy array if needed
if isinstance(data, np.ndarray):
self.data = data.astype(np.float32)
else:
self.data = np.array(data, dtype=np.float32)
self.requires_grad = requires_grad
self.grad: Optional[np.ndarray] = None
self._prev = _prev or set()
self._op = _op
self._backward: Callable[[], None] = lambda: None
def __repr__(self) -> str:
return f"Tensor(shape={self.data.shape}, requires_grad={self.requires_grad})"
@property
def shape(self) -> Tuple[int, ...]:
return self.data.shape
def zero_grad(self) -> None:
self.grad = None
Already we see the key difference: self.data is a NumPy array, not a float. This means operations work on entire arrays at once.
The Tricky Part: Broadcasting Gradients
When you add a (batch, dim) tensor to a (dim,) bias, NumPy broadcasts the bias across all batches. But during backprop, we need to undo that broadcasting - the gradient for the bias should be summed across the batch dimension.
def _unbroadcast(grad: np.ndarray, target_shape: Tuple[int, ...]) -> np.ndarray:
"""
Undo NumPy broadcasting for gradients.
Example:
Forward: y = x + b where x is (B, D) and b is (D,)
Backward: grad wrt b should be sum over batch axis -> (D,)
Rules:
1. While grad has extra leading dims, sum them away
2. For dims where target had size 1, sum over that axis
"""
g = grad
# Remove leading dims added by broadcasting
while len(g.shape) > len(target_shape):
g = g.sum(axis=0)
# Sum over axes where target had size 1
for axis, (gdim, tdim) in enumerate(zip(g.shape, target_shape)):
if tdim == 1 and gdim != 1:
g = g.sum(axis=axis, keepdims=True)
return g.reshape(target_shape)
# Test it
grad = np.ones((3, 4)) # Gradient has batch dimension
bias_shape = (4,) # Bias was (4,) before broadcasting
result = _unbroadcast(grad, bias_shape)
print(f"Input grad shape: {grad.shape}")
print(f"Target shape: {bias_shape}")
print(f"Result shape: {result.shape}") # (4,) - summed over batch
print(f"Result values: {result}") # [3, 3, 3, 3] - sum of 3 ones per positionThis is the key insight PyTorch handles automatically. When you see RuntimeError: grad shape doesn't match, it’s usually a broadcasting gradient issue.
Tensor Arithmetic with Gradients
Now we add operations. Each operation stores a _backward function that knows how to push gradients to its inputs.
# Add these methods to our Tensor class (shown separately for clarity)
def tensor_add(self, other) -> "Tensor":
"""
Addition: c = a + b
Gradient: dc/da = 1, dc/db = 1 (with unbroadcasting)
"""
other = other if isinstance(other, Tensor) else Tensor(other)
out = Tensor(
self.data + other.data,
requires_grad=self.requires_grad or other.requires_grad,
_prev={self, other},
_op="+"
)
def _backward():
if out.grad is None:
return
if self.requires_grad:
g = _unbroadcast(out.grad, self.data.shape)
self.grad = g if self.grad is None else (self.grad + g)
if other.requires_grad:
g = _unbroadcast(out.grad, other.data.shape)
other.grad = g if other.grad is None else (other.grad + g)
out._backward = _backward
return out
def tensor_mul(self, other) -> "Tensor":
"""
Multiplication: c = a * b
Gradient: dc/da = b, dc/db = a (with unbroadcasting)
"""
other = other if isinstance(other, Tensor) else Tensor(other)
out = Tensor(
self.data * other.data,
requires_grad=self.requires_grad or other.requires_grad,
_prev={self, other},
_op="*"
)
def _backward():
if out.grad is None:
return
if self.requires_grad:
g = _unbroadcast(out.grad * other.data, self.data.shape)
self.grad = g if self.grad is None else (self.grad + g)
if other.requires_grad:
g = _unbroadcast(out.grad * self.data, other.data.shape)
other.grad = g if other.grad is None else (other.grad + g)
out._backward = _backward
return out
# Attach to Tensor class
Tensor.__add__ = tensor_add
Tensor.__radd__ = lambda self, other: tensor_add(self, other)
Tensor.__mul__ = tensor_mul
Tensor.__rmul__ = lambda self, other: tensor_mul(self, other)
Tensor.__neg__ = lambda self: tensor_mul(self, -1.0)
Tensor.__sub__ = lambda self, other: tensor_add(self, -other)
Notice how similar this is to our scalar Value - just with arrays and _unbroadcast.
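A quick end-to-end check of the unbroadcasting logic, seeding the output gradient by hand (our full backward() arrives in a later section):
# Bias broadcast over a batch: its gradient is summed back to shape (4,)
x = Tensor(np.ones((3, 4)), requires_grad=True)
b = Tensor(np.zeros(4), requires_grad=True)
y = x + b                  # b broadcasts to (3, 4)
y.grad = np.ones((3, 4))   # seed the output gradient manually
y._backward()
print(b.grad)              # [3. 3. 3. 3.] - summed over the batch axis
print(x.grad.shape)        # (3, 4)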
Matrix Multiplication: The Core of Neural Networks
Matrix multiplication is where tensors really shine. This single operation replaces thousands of scalar multiplications and additions.
def tensor_matmul(self, other) -> "Tensor":
"""
Matrix multiplication: C = A @ B
Forward: C[i,j] = sum_k A[i,k] * B[k,j]
Backward:
dA = dC @ B.T (gradient flows back through B transposed)
dB = A.T @ dC (gradient flows back through A transposed)
"""
other = other if isinstance(other, Tensor) else Tensor(other)
out = Tensor(
np.matmul(self.data, other.data),
requires_grad=self.requires_grad or other.requires_grad,
_prev={self, other},
_op="matmul"
)
def _backward():
if out.grad is None:
return
if self.requires_grad:
# dA = dC @ B.T
dA = np.matmul(out.grad, np.swapaxes(other.data, -1, -2))
dA = _unbroadcast(dA, self.data.shape)
self.grad = dA if self.grad is None else (self.grad + dA)
if other.requires_grad:
# dB = A.T @ dC
dB = np.matmul(np.swapaxes(self.data, -1, -2), out.grad)
dB = _unbroadcast(dB, other.data.shape)
other.grad = dB if other.grad is None else (other.grad + dB)
out._backward = _backward
return out
Tensor.matmul = tensor_matmul
# Test: simple 2x2 matmul
A = Tensor([[1, 2], [3, 4]], requires_grad=True)
B = Tensor([[5, 6], [7, 8]], requires_grad=True)
C = A.matmul(B)
print(f"A @ B =\n{C.data}")
# Backward pass
C.grad = np.ones_like(C.data) # Seed gradient
C._backward()
print(f"\ndA =\n{A.grad}")
print(f"\ndB =\n{B.grad}")This is the fundamental operation of neural networks. Every linear layer is just x @ W + b.
Backward Pass: Topological Sort
Just like our scalar Value, we need to traverse the graph in reverse order.
def tensor_backward(self) -> None:
"""
Reverse-mode autodiff: topologically sort and backprop.
"""
# Build topological order
topo = []
visited = set()
def build(v: Tensor):
if v not in visited:
visited.add(v)
for p in v._prev:
build(p)
topo.append(v)
build(self)
# Seed gradient and propagate
self.grad = np.ones_like(self.data, dtype=np.float32)
for v in reversed(topo):
v._backward()
Tensor.backward = tensor_backward
Activation Functions
Neural networks need nonlinearities. Here are the common ones:
def tensor_relu(self) -> "Tensor":
"""ReLU: max(0, x). Gradient: 1 if x > 0, else 0."""
out = Tensor(
np.maximum(self.data, 0.0),
requires_grad=self.requires_grad,
_prev={self},
_op="relu"
)
def _backward():
if out.grad is None or not self.requires_grad:
return
g = out.grad * (self.data > 0.0)
self.grad = g if self.grad is None else (self.grad + g)
out._backward = _backward
return out
def tensor_tanh(self) -> "Tensor":
"""Tanh: gradient is (1 - tanh^2)."""
t = np.tanh(self.data)
out = Tensor(t, requires_grad=self.requires_grad, _prev={self}, _op="tanh")
def _backward():
if out.grad is None or not self.requires_grad:
return
g = out.grad * (1.0 - t * t)
self.grad = g if self.grad is None else (self.grad + g)
out._backward = _backward
return out
Tensor.relu = tensor_relu
Tensor.tanh = tensor_tanh
Reduction Operations
We need sum and mean for computing losses.
def tensor_sum(self, axis=None, keepdims=False) -> "Tensor":
"""Sum over axis. Gradient broadcasts back to input shape."""
out = Tensor(
self.data.sum(axis=axis, keepdims=keepdims),
requires_grad=self.requires_grad,
_prev={self},
_op="sum"
)
def _backward():
if out.grad is None or not self.requires_grad:
return
g = out.grad
# Expand reduced axes back to input shape
if axis is None:
g = np.ones_like(self.data) * g
else:
axes = axis if isinstance(axis, tuple) else (axis,)
if not keepdims:
for ax in sorted(axes):
g = np.expand_dims(g, axis=ax)
g = np.ones_like(self.data) * g
self.grad = g if self.grad is None else (self.grad + g)
out._backward = _backward
return out
def tensor_mean(self, axis=None, keepdims=False) -> "Tensor":
"""Mean = sum / count."""
if axis is None:
denom = self.data.size
else:
axes = axis if isinstance(axis, tuple) else (axis,)
denom = np.prod([self.data.shape[ax] for ax in axes])
return self.sum(axis=axis, keepdims=keepdims) * (1.0 / float(denom))
Tensor.sum = tensor_sum
Tensor.mean = tensor_mean
# Test
x = Tensor([[1, 2, 3], [4, 5, 6]], requires_grad=True)
loss = x.mean()
print(f"Mean: {loss.data}")
loss.backward()
print(f"Gradient (each element contributes 1/6): \n{x.grad}")Putting It Together: A Tiny Neural Network
Let’s use our tensor autograd to train a small network:
# Complete example: train a 2-layer network on XOR
np.random.seed(42)
# XOR dataset
X = Tensor([[0, 0], [0, 1], [1, 0], [1, 1]], requires_grad=False)
y = Tensor([[0], [1], [1], [0]], requires_grad=False)
# Weights (small random init)
W1 = Tensor(np.random.randn(2, 8) * 0.5, requires_grad=True)
b1 = Tensor(np.zeros((1, 8)), requires_grad=True)
W2 = Tensor(np.random.randn(8, 1) * 0.5, requires_grad=True)
b2 = Tensor(np.zeros((1, 1)), requires_grad=True)
params = [W1, b1, W2, b2]
lr = 0.5
for step in range(200):
# Forward
h = X.matmul(W1) + b1 # (4, 8)
h = h.tanh() # activation
out = h.matmul(W2) + b2 # (4, 1)
# MSE loss
diff = out + (y * -1.0) # out - y
loss = (diff * diff).mean()
# Backward
for p in params:
p.grad = None
loss.backward()
# SGD update
for p in params:
p.data -= lr * p.grad
if step % 50 == 0:
print(f"Step {step}: loss = {loss.data:.4f}")
print(f"\nFinal predictions:")
print(f" [0,0] -> {out.data[0,0]:.3f} (target: 0)")
print(f" [0,1] -> {out.data[1,0]:.3f} (target: 1)")
print(f" [1,0] -> {out.data[2,0]:.3f} (target: 1)")
print(f" [1,1] -> {out.data[3,0]:.3f} (target: 0)")We just trained a neural network using only NumPy and our ~100 lines of autograd code!
PyTorch’s Tensor Autograd
Now let’s see the same XOR problem in PyTorch:
import torch
# Same XOR problem
X_pt = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32)
y_pt = torch.tensor([[0], [1], [1], [0]], dtype=torch.float32)
torch.manual_seed(42)
# Note: create tensor, multiply, THEN set requires_grad to keep as leaf tensors
W1_pt = (torch.randn(2, 8) * 0.5).requires_grad_(True)
b1_pt = torch.zeros(1, 8, requires_grad=True)
W2_pt = (torch.randn(8, 1) * 0.5).requires_grad_(True)
b2_pt = torch.zeros(1, 1, requires_grad=True)
params_pt = [W1_pt, b1_pt, W2_pt, b2_pt]
for step in range(200):
# Forward - identical logic!
h = (X_pt @ W1_pt + b1_pt).tanh()
out = h @ W2_pt + b2_pt
loss = ((out - y_pt) ** 2).mean()
# Backward - one line
loss.backward()
# Update - use .data to modify in-place, keeping tensors as leaves
with torch.no_grad():
for p in params_pt:
p.data -= 0.5 * p.grad
p.grad = None
if step % 50 == 0:
print(f"Step {step}: loss = {loss.item():.4f}")Key Insight
The logic is identical. PyTorch just:
- Handles _unbroadcast automatically
- Uses optimized C++/CUDA kernels
- Provides convenient loss.backward() without manual graph traversal
- Has a torch.no_grad() context for updates
You now understand what happens inside requires_grad=True and backward().
PyTorch Autograd (Scalar Examples)
Now let’s see how PyTorch does the same thing with scalars:
# Create tensors that track gradients
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
# Forward pass
y = w * x + b # Linear function: y = 3*2 + 1 = 7
loss = y ** 2 # Square loss: loss = 49
# Backward pass - PyTorch computes all gradients automatically
loss.backward()
# Derivation using chain rule:
# dloss/dy = 2*y = 2*7 = 14
# dloss/dx = dloss/dy * dy/dx = 14 * w = 14 * 3 = 42
# dloss/dw = dloss/dy * dy/dw = 14 * x = 14 * 2 = 28
# dloss/db = dloss/dy * dy/db = 14 * 1 = 14
print(f"y = {y.item():.1f}, loss = {loss.item():.1f}")
print(f"dloss/dx = {x.grad.item():.1f} (= 2*y*w = 2*7*3)")
print(f"dloss/dw = {w.grad.item():.1f} (= 2*y*x = 2*7*2)")
print(f"dloss/db = {b.grad.item():.1f} (= 2*y*1 = 2*7)")# With vectors
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
w = torch.tensor([0.1, 0.2, 0.3], requires_grad=True)
# Forward pass
y = (x * w).sum() # Dot product
print(f"y = x . w = {y.item():.4f}")
# Backward pass
y.backward()
print(f"\ndy/dx = {x.grad}") # Should be w
print(f"dy/dw = {w.grad}") # Should be x# With matrices (like in attention)
Q = torch.randn(2, 3, requires_grad=True) # Query
K = torch.randn(2, 3, requires_grad=True) # Key
# Attention scores: Q @ K^T
scores = Q @ K.T
loss = scores.sum()
loss.backward()
print(f"Q shape: {Q.shape}")
print(f"K shape: {K.shape}")
print(f"scores shape: {scores.shape}")
print(f"\ndloss/dQ shape: {Q.grad.shape}")
print(f"dloss/dK shape: {K.grad.shape}")
print("\nGradients have the same shape as the original tensors!")Interactive Exploration
Now that you’ve seen gradient computation in code, let’s explore it interactively. The widget below lets you modify input values and see how gradients change in real-time — demonstrating the chain rule in action.
Tip: Try This
Gradient depends on values: Change a from 2 to 4. Watch ∂c/∂b change (it equals a + 2b).
b has two paths: Notice ∂c/∂b is the sum of two contributions - one from the multiplication (= sum) and one via the addition (= b).
Zero gradients: Set b = 0. Now ∂c/∂a = 0 because the multiplication by b kills the gradient.
Negative gradients: Try negative values. Gradients can be negative, indicating the output decreases when the input increases.
Verify the formula: For any a, b values: ∂c/∂a should equal b, and ∂c/∂b should equal a + 2b.
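If you’re reading a static copy without the widget, you can reproduce it with the Value class from earlier - the expression is c = (a + b) * b, matching the formulas above:
def explore(a_val, b_val):
    a, b = Value(a_val), Value(b_val)
    c = (a + b) * b
    c.backward()
    print(f"a={a_val}, b={b_val}: c={c.data}, dc/da={a.grad}, dc/db={b.grad}")
explore(2.0, 3.0)  # dc/da = 3, dc/db = 8
explore(4.0, 3.0)  # dc/db becomes a + 2b = 10
explore(2.0, 0.0)  # dc/da = 0 - multiplying by b kills the gradient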
Exercises
Exercise 1: Verify the Gradient of x^3
# Verify the gradient of x^3 at x=2
# Expected: d/dx[x^3] = 3x^2 = 12
x = Value(2.0, label='x')
y = x ** 3
y.backward()
print(f"x = {x.data}")
print(f"y = x^3 = {y.data}")
print(f"dy/dx = {x.grad} (expected: 12.0)")Exercise 2: Softmax Gradient
# Compute gradient of softmax numerator
# If y = exp(x) / (exp(x) + exp(z)), what is dy/dx?
x = Value(1.0, label='x')
z = Value(2.0, label='z')
exp_x = x.exp()
exp_z = z.exp()
y = exp_x / (exp_x + exp_z)
y.backward()
print(f"x = {x.data}, z = {z.data}")
print(f"y = softmax(x)[0] = {y.data:.4f}")
print(f"dy/dx = {x.grad:.4f}")
print(f"\nThis is the gradient that flows back through softmax!")Exercise 3: ReLU vs Tanh Gradients
# Compare ReLU vs tanh gradients
for x_val in [-2.0, -0.5, 0.5, 2.0]:
x_relu = Value(x_val)
x_tanh = Value(x_val)
y_relu = x_relu.relu()
y_tanh = x_tanh.tanh()
y_relu.backward()
y_tanh.backward()
print(f"x={x_val:5.1f}: ReLU grad={x_relu.grad:.3f}, tanh grad={x_tanh.grad:.3f}")
print("\nNotice: ReLU has 0 or 1, tanh is smooth but saturates for large |x|")Backpropagation in Neural Networks
Here’s how gradients flow backward through a neural network layer:
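A minimal PyTorch sketch of one linear layer (shapes are illustrative) makes the flow concrete:
import torch
x = torch.randn(32, 64)                      # batch of 32 inputs
W = torch.randn(64, 10, requires_grad=True)  # weights
b = torch.zeros(10, requires_grad=True)      # bias
y = x @ W + b                                # forward: (32, 10)
loss = (y ** 2).mean()                       # stand-in for a real loss
loss.backward()                              # gradients flow back through +, @
print(W.grad.shape)  # (64, 10) - same shape as W, as needed for the update
print(b.grad.shape)  # (10,)    - broadcast undone: summed over the batch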
Key insights:
- Gradients flow backward: Starting from the loss, we trace back through every operation
- Chain rule connects layers: Multiply local gradients along the path
- Accumulation: If a value is used multiple times, gradients add up
- Shape matters: dL/dW must have the same shape as W for the update
Detaching Tensors and Stopping Gradients
Sometimes you want to stop gradient flow. PyTorch provides several mechanisms:
Using detach()
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2
z = y.detach() # z has no gradient history
print(f"y.requires_grad: {y.requires_grad}")
print(f"z.requires_grad: {z.requires_grad}")
# z is now a "constant" - no gradients flow through it
loss = z.sum()
# loss.backward() would NOT compute gradients for x
Using no_grad()
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2
# Compute without building graph (saves memory)
with torch.no_grad():
z = y * 3
print(f"Inside no_grad, z.requires_grad: {z.requires_grad}")
# Common use: inference
model_output = y # pretend this is model output
with torch.no_grad():
prediction = model_output.argmax() # no gradients needed
When to Stop Gradients
- Inference: No training, no gradients needed
- Frozen layers: Transfer learning with some layers fixed (see the sketch after this list)
- Metrics: Computing accuracy, loss for logging (not backprop)
- Target values: In losses like MSE, the target is a constant
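To make the frozen-layers case concrete, here’s a minimal sketch with a throwaway nn.Sequential model:
import torch
import torch.nn as nn
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))
for p in model[0].parameters():  # freeze the first layer
    p.requires_grad_(False)
loss = model(torch.randn(4, 10)).sum()
loss.backward()
print(model[0].weight.grad)              # None - no gradient computed
print(model[2].weight.grad is not None)  # True - still trainable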
Memory Implications
Why Autograd Uses Memory
During the forward pass, autograd stores intermediate values needed for gradient computation:
# Each operation stores data for backward pass
x = torch.randn(1000, 1000, requires_grad=True)
y = x @ x.T # Stores x for gradient computation
z = y.relu() # Stores y (to know which elements > 0)
out = z.mean() # Stores z
# All of x, y, z stay in memory until backward() completes
out.backward()
# Now intermediate tensors can be freed
Memory Grows with Depth
For a model with L layers:
- Forward pass: O(L) memory for activations
- Each activation tensor can be large (batch_size × hidden_dim)
- LLMs with 100+ layers and large hidden dims = huge memory
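A back-of-envelope estimate with hypothetical dimensions shows the scale:
# Hypothetical config - numbers chosen for illustration only
batch, seq_len, hidden, layers = 8, 2048, 4096, 32
bytes_per_float32 = 4
per_layer = batch * seq_len * hidden * bytes_per_float32  # one activation tensor
print(f"Per layer: {per_layer / 1e9:.2f} GB")             # ~0.27 GB
print(f"All {layers} layers: {per_layer * layers / 1e9:.1f} GB")  # ~8.6 GB, before attention internals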
Gradient Checkpointing
Trade compute for memory by recomputing activations during backward:
from torch.utils.checkpoint import checkpoint
def expensive_layer(x):
"""A layer we want to checkpoint."""
return x.relu().pow(2)
x = torch.randn(100, 100, requires_grad=True)
# Without checkpointing: stores intermediate activations
y_normal = expensive_layer(x)
# With checkpointing: discards intermediates, recomputes during backward
y_checkpoint = checkpoint(expensive_layer, x, use_reentrant=False)
print(f"Both produce same result: {torch.allclose(y_normal, y_checkpoint)}")In practice, checkpoint every few layers to reduce memory by ~sqrt(L).
Common Pitfalls
1. Forgetting to Zero Gradients
x = torch.tensor(2.0, requires_grad=True)
# First backward
y = x ** 2
y.backward()
print(f"After first backward: x.grad = {x.grad}")
# Second backward without zeroing - gradients ACCUMULATE!
y = x ** 2
y.backward()
print(f"After second backward: x.grad = {x.grad}") # 8.0, not 4.0!
# Always zero gradients before backward in training loops
x.grad.zero_()
y = x ** 2
y.backward()
print(f"After zeroing: x.grad = {x.grad}") # Back to 4.02. In-place Operations
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2
# This would break the graph (commented to avoid error):
# y.add_(1) # In-place operation on a tensor needed for gradient
# Instead, use out-of-place operations:
y = y + 1 # Creates new tensor, graph preserved
y.sum().backward()
print(f"x.grad: {x.grad}")3. Losing requires_grad
x = torch.tensor([1.0, 2.0], requires_grad=True)
# Can't call .numpy() directly on a tensor requiring grad:
# y = x.numpy() # RuntimeError: Can't call numpy() on tensor that requires grad
# Must detach first - this explicitly breaks the graph
y = x.detach().numpy()
print(f"y is now numpy array: {type(y)}")
# Converting back loses gradient tracking:
z = torch.from_numpy(y)
print(f"z.requires_grad: {z.requires_grad}") # False
# Be careful when mixing numpy and autograd
4. Backward Through Non-Scalar
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2 # Vector output
# This fails (commented to avoid error):
# y.backward() # RuntimeError: need gradient argument
# For non-scalar outputs, provide a gradient tensor:
y.backward(torch.ones_like(y)) # Equivalent to y.sum().backward()
print(f"x.grad: {x.grad}")Summary
Key takeaways:
- Gradients flow backward through the computational graph
- Chain rule connects everything: multiply local gradients
- Gradients accumulate when a value is used multiple times
- Training = gradient descent: use gradients to adjust parameters
- PyTorch autograd does this automatically for any computation
- Memory matters: intermediate activations consume memory; use detach(), no_grad(), or gradient checkpointing
- Zero gradients: always zero gradients before each backward pass in training loops
What’s Next
In Module 03: Tokenization, we’ll learn how text gets converted into numbers that our model can process. Autograd will be working behind the scenes during training, but we need input data first!