Module 01: Tensors
Introduction
Tensors are the foundation of deep learning. Master them before building a language model.
A tensor is a multi-dimensional array of numbers. NumPy users already know tensors. PyTorch tensors match NumPy’s interface but add GPU acceleration and automatic differentiation.
Why do we need them for LLMs?
- Text becomes numbers: The tokenizer converts every word to a list of numbers (a vector)
- Batching: Processing multiple sequences simultaneously improves efficiency
- Matrix operations: Attention, embeddings, and neural network layers are all matrix multiplications
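Those three points can be sketched in a few lines of PyTorch. The token IDs and sizes below are made-up placeholders, not output from a real tokenizer:

```python
import torch

# Hypothetical tokenizer output: each token becomes an integer ID,
# and two sequences are batched together
token_ids = torch.tensor([[101, 2054, 2003, 1037],
                          [101, 7592, 2088, 1012]])
print(token_ids.shape)  # (batch=2, seq_len=4)

# An embedding table maps each ID to a vector (random values here)
vocab_size, embed_dim = 30000, 8
embedding = torch.randn(vocab_size, embed_dim)
vectors = embedding[token_ids]  # (2, 4, 8): one vector per token
print(vectors.shape)

# Layers are matrix multiplications applied to those vectors
W = torch.randn(embed_dim, embed_dim)
print((vectors @ W).shape)  # still (2, 4, 8)
```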
What You’ll Learn
After this module, you can:
- Understand tensor shapes and what each dimension represents
- Perform element-wise operations, matrix multiplication, and broadcasting
- Convert between NumPy arrays and PyTorch tensors
- Move tensors between CPU and GPU for acceleration
- Recognize common LLM tensor shapes and their meanings
Tensor Dimensions
Tensors are classified by their number of dimensions: 0D scalars, 1D vectors, 2D matrices, and stacks of matrices beyond that. Shape reveals what a tensor represents. In an LLM:
- `(vocab_size,)` - A 1D tensor: scores for each word in the vocabulary
- `(seq_len, embed_dim)` - A 2D tensor: one embedding vector per token
- `(batch, seq_len, embed_dim)` - A 3D tensor: multiple sequences at once
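As a sanity check, here are those three shapes built directly (the sizes are arbitrary placeholders):

```python
import torch

vocab_size, seq_len, embed_dim, batch = 50000, 32, 64, 4

logits = torch.zeros(vocab_size)                       # 1D: a score per vocabulary word
token_embeds = torch.zeros(seq_len, embed_dim)         # 2D: one embedding per token
batch_embeds = torch.zeros(batch, seq_len, embed_dim)  # 3D: several sequences at once

print(logits.ndim, token_embeds.ndim, batch_embeds.ndim)  # 1 2 3
```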
Tensors Are Just Arrays
Start with NumPy: understanding NumPy arrays means understanding tensors.
import numpy as np
# A tensor is just a multi-dimensional array of numbers
scalar = np.array(5.0) # 0D: a single number
vector = np.array([1, 2, 3]) # 1D: a list of numbers
matrix = np.array([[1, 2], # 2D: a grid of numbers
[3, 4]])
tensor_3d = np.zeros((2, 3, 4)) # 3D: a stack of grids
print(f"Scalar: shape={scalar.shape}, ndim={scalar.ndim}")
print(f"Vector: shape={vector.shape}, ndim={vector.ndim}")
print(f"Matrix: shape={matrix.shape}, ndim={matrix.ndim}")
print(f"3D Tensor: shape={tensor_3d.shape}, ndim={tensor_3d.ndim}")
Scalar: shape=(), ndim=0
Vector: shape=(3,), ndim=1
Matrix: shape=(2, 2), ndim=2
3D Tensor: shape=(2, 3, 4), ndim=3
Shape and Dtype
Every array has two fundamental properties:
- Shape: The size of each dimension, `(rows, cols, ...)`
- Dtype: The data type of elements (`float32`, `int64`, etc.)
# Shape tells you what the data represents
embeddings = np.random.randn(4, 8) # 4 tokens, each with 8-dim embedding
print(f"Shape: {embeddings.shape}")
print(f"Dtype: {embeddings.dtype}")
print(f"Total elements: {embeddings.size}")
print(f"Memory: {embeddings.nbytes} bytes")
Shape: (4, 8)
Dtype: float64
Total elements: 32
Memory: 256 bytes
Indexing and Slicing
NumPy’s powerful indexing works identically in PyTorch:
# Create a batch of sequences
batch = np.arange(24).reshape(2, 3, 4) # (batch=2, seq=3, features=4)
print(f"Full shape: {batch.shape}")
print(f"Original:\n{batch}\n")
# Get first sequence in first batch
print(f"batch[0, 0]: {batch[0, 0]}")
# Get all batches, first token only
print(f"batch[:, 0, :] shape: {batch[:, 0, :].shape}")
# Negative indexing: last element
print(f"batch[0, -1, :]: {batch[0, -1, :]}")
# Boolean indexing
mask = batch > 10
print(f"Elements > 10: {batch[mask]}")
Full shape: (2, 3, 4)
Original:
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]
batch[0, 0]: [0 1 2 3]
batch[:, 0, :] shape: (2, 4)
batch[0, -1, :]: [ 8 9 10 11]
Elements > 10: [11 12 13 14 15 16 17 18 19 20 21 22 23]
The Core Operations
Neural networks depend on four fundamental operations: element-wise arithmetic, matrix multiplication, broadcasting, and reshaping. We examine each in NumPy, then in PyTorch.
Element-wise Operations
Apply the same operation to every element:
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
# NumPy: element-wise arithmetic
print(f"a + b = {a + b}")
print(f"a * b = {a * b}")
print(f"a ** 2 = {a ** 2}")
print(f"np.exp(a) = {np.exp(a)}")
a + b = [5. 7. 9.]
a * b = [ 4. 10. 18.]
a ** 2 = [1. 4. 9.]
np.exp(a) = [ 2.71828183 7.3890561 20.08553692]
PyTorch works identically:
import torch
a_pt = torch.tensor([1.0, 2.0, 3.0])
b_pt = torch.tensor([4.0, 5.0, 6.0])
# PyTorch: same operations, same syntax
print(f"a + b = {a_pt + b_pt}")
print(f"a * b = {a_pt * b_pt}")
print(f"a ** 2 = {a_pt ** 2}")
print(f"torch.exp(a) = {torch.exp(a_pt)}")
a + b = tensor([5., 7., 9.])
a * b = tensor([ 4., 10., 18.])
a ** 2 = tensor([1., 4., 9.])
torch.exp(a) = tensor([ 2.7183, 7.3891, 20.0855])
Matrix Multiplication
The workhorse of neural networks. For matrices A (m x n) and B (n x p), the result C = A @ B is (m x p):
# NumPy matrix multiplication
A = np.array([[1, 2],
[3, 4]]) # (2, 2)
B = np.array([[5, 6],
[7, 8]]) # (2, 2)
# Three equivalent ways
result1 = np.matmul(A, B)
result2 = A @ B
result3 = np.dot(A, B) # same for 2D arrays
print(f"A @ B =\n{result1}")
print(f"Result shape: {result1.shape}")
A @ B =
[[19 22]
[43 50]]
Result shape: (2, 2)
The @ operator also works for batched operations:
# Batched matrix multiplication in NumPy
batch_A = np.random.randn(4, 3, 2) # 4 matrices of shape (3, 2)
batch_B = np.random.randn(4, 2, 5) # 4 matrices of shape (2, 5)
result = batch_A @ batch_B
print(f"Batch matmul: {batch_A.shape} @ {batch_B.shape} = {result.shape}")
Batch matmul: (4, 3, 2) @ (4, 2, 5) = (4, 3, 5)
PyTorch’s @ and torch.matmul behave the same:
# PyTorch matrix multiplication
A_pt = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
B_pt = torch.tensor([[5.0, 6.0], [7.0, 8.0]])
result_pt = A_pt @ B_pt
print(f"PyTorch A @ B =\n{result_pt}")
# Batched
batch_A_pt = torch.randn(4, 3, 2)
batch_B_pt = torch.randn(4, 2, 5)
print(f"Batched: {(batch_A_pt @ batch_B_pt).shape}")
PyTorch A @ B =
tensor([[19., 22.],
[43., 50.]])
Batched: torch.Size([4, 3, 5])
Broadcasting
When shapes differ, broadcasting expands the smaller array. This is crucial for adding biases, scaling, and many other operations.
The rules are simple:
1. Align shapes from the right
2. Dimensions must be equal OR one of them is 1
3. Missing dimensions are treated as 1
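Those three rules are mechanical enough to implement in a few lines. This helper is just an illustration, not a NumPy or PyTorch API:

```python
def broadcast_shape(shape_a, shape_b):
    """Compute the broadcast result of two shapes, or raise if incompatible."""
    result = []
    # Rule 1: align from the right; Rule 3: missing dimensions count as 1
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        a = shape_a[-i] if i <= len(shape_a) else 1
        b = shape_b[-i] if i <= len(shape_b) else 1
        # Rule 2: dimensions must be equal, or one of them must be 1
        if a == b or a == 1 or b == 1:
            result.append(max(a, b))
        else:
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(reversed(result))

print(broadcast_shape((4, 3), (3,)))     # (4, 3)
print(broadcast_shape((4, 1), (1, 3)))   # (4, 3)
print(broadcast_shape((2, 4, 3), (3,)))  # (2, 4, 3)
```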
# NumPy broadcasting examples
x = np.ones((4, 3)) # (4, 3)
bias = np.array([1, 2, 3]) # (3,) - broadcasts to (4, 3)
result = x + bias
print(f"Shape {x.shape} + {bias.shape} = {result.shape}")
print(f"Result:\n{result}")
Shape (4, 3) + (3,) = (4, 3)
Result:
[[2. 3. 4.]
[2. 3. 4.]
[2. 3. 4.]
[2. 3. 4.]]
# More broadcasting examples
batch = np.ones((2, 4, 3)) # (2, 4, 3)
scale = np.array([[[2]]]) # (1, 1, 1) - broadcasts to (2, 4, 3)
vector = np.array([1, 2, 3]) # (3,) - broadcasts to (2, 4, 3)
print(f"{batch.shape} * {scale.shape} = {(batch * scale).shape}")
print(f"{batch.shape} + {vector.shape} = {(batch + vector).shape}")
(2, 4, 3) * (1, 1, 1) = (2, 4, 3)
(2, 4, 3) + (3,) = (2, 4, 3)
PyTorch broadcasting follows the exact same rules:
# PyTorch broadcasting
embeddings = torch.randn(4, 32, 64) # (batch, seq, embed)
bias = torch.randn(64) # (embed,)
result = embeddings + bias # broadcasts!
print(f"PyTorch: {embeddings.shape} + {bias.shape} = {result.shape}")
PyTorch: torch.Size([4, 32, 64]) + torch.Size([64]) = torch.Size([4, 32, 64])
Key Insight: NumPy to PyTorch
PyTorch tensors are NumPy arrays with superpowers. The API is nearly identical:
| NumPy | PyTorch | Notes |
|---|---|---|
| `np.array([1,2,3])` | `torch.tensor([1,2,3])` | Creation |
| `arr.shape` | `tensor.shape` | Same attribute |
| `arr.dtype` | `tensor.dtype` | Same attribute |
| `np.matmul(a, b)` | `torch.matmul(a, b)` | Or use `@` |
| `np.exp(x)` | `torch.exp(x)` | Element-wise ops |
| `arr.reshape(2,3)` | `tensor.reshape(2,3)` | Reshaping |
| `arr.T` | `tensor.T` | Transpose |
Converting between them requires one function call:
# NumPy <-> PyTorch conversion
np_array = np.array([1.0, 2.0, 3.0])
pt_tensor = torch.from_numpy(np_array) # Shares memory!
back_to_np = pt_tensor.numpy() # Shares memory!
print(f"NumPy: {np_array}")
print(f"PyTorch: {pt_tensor}")
print(f"Back to NumPy: {back_to_np}")
NumPy: [1. 2. 3.]
PyTorch: tensor([1., 2., 3.], dtype=torch.float64)
Back to NumPy: [1. 2. 3.]
Why PyTorch?
What does PyTorch add beyond NumPy? Three capabilities:
1. GPU Acceleration
NumPy runs only on CPU. PyTorch runs on both CPU and GPU, delivering 10-100x speedups on large matrices:
import time
# Create large matrices
size = 2000
np_a = np.random.randn(size, size).astype(np.float32)
np_b = np.random.randn(size, size).astype(np.float32)
# NumPy (CPU)
start = time.time()
np_result = np_a @ np_b
np_time = time.time() - start
print(f"NumPy (CPU): {np_time*1000:.1f} ms")
# PyTorch (CPU for comparison)
pt_a = torch.from_numpy(np_a)
pt_b = torch.from_numpy(np_b)
start = time.time()
pt_result = pt_a @ pt_b
pt_cpu_time = time.time() - start
print(f"PyTorch (CPU): {pt_cpu_time*1000:.1f} ms")
# PyTorch (GPU if available)
if torch.backends.mps.is_available() or torch.cuda.is_available():
device = "mps" if torch.backends.mps.is_available() else "cuda"
pt_a_gpu = pt_a.to(device)
pt_b_gpu = pt_b.to(device)
# Warm up: first GPU operation incurs overhead (kernel compilation, memory allocation)
_ = pt_a_gpu @ pt_b_gpu
# Synchronize: GPU operations are async, so we wait for completion before timing
torch.mps.synchronize() if device == "mps" else torch.cuda.synchronize()
start = time.time()
pt_result_gpu = pt_a_gpu @ pt_b_gpu
# Must synchronize again to ensure operation completes before stopping timer
torch.mps.synchronize() if device == "mps" else torch.cuda.synchronize()
pt_gpu_time = time.time() - start
print(f"PyTorch ({device.upper()}): {pt_gpu_time*1000:.1f} ms")
print(f"GPU speedup: {np_time/pt_gpu_time:.1f}x faster")
NumPy (CPU): 164.3 ms
PyTorch (CPU): 189.7 ms
2. Automatic Differentiation
PyTorch tracks operations to compute gradients automatically. Automatic differentiation makes neural networks trainable:
# NumPy: you'd have to compute gradients by hand
x_np = np.array([2.0])
y_np = x_np ** 2 + 3 * x_np
# dy/dx = 2x + 3 = 7 at x=2... but you have to derive and code this yourself!
# PyTorch: automatic!
x_pt = torch.tensor([2.0], requires_grad=True)
y_pt = x_pt ** 2 + 3 * x_pt
y_pt.backward() # Compute gradient automatically
print(f"x = {x_pt.item()}")
print(f"y = x^2 + 3x = {y_pt.item()}")
print(f"dy/dx (computed automatically) = {x_pt.grad.item()}")
x = 2.0
y = x^2 + 3x = 10.0
dy/dx (computed automatically) = 7.0
This automatic differentiation scales to millions of parameters. Module 02: Autograd explores this deeply.
3. Optimized Kernels
PyTorch uses optimized backends (cuBLAS, cuDNN, MPS) that outperform naive implementations, even on CPU. Operations like convolutions, attention, and batch normalization have specialized implementations.
# PyTorch's optimized softmax vs manual
x = torch.randn(1000, 1000)
# Manual softmax (correct but slower)
def manual_softmax(x):
exp_x = torch.exp(x - x.max(dim=-1, keepdim=True).values)
return exp_x / exp_x.sum(dim=-1, keepdim=True)
# PyTorch's optimized version
import time
start = time.time()
for _ in range(100):
_ = manual_softmax(x)
manual_time = time.time() - start
start = time.time()
for _ in range(100):
_ = torch.softmax(x, dim=-1)
pytorch_time = time.time() - start
print(f"Manual softmax: {manual_time*1000:.1f} ms")
print(f"PyTorch softmax: {pytorch_time*1000:.1f} ms")
Manual softmax: 246.5 ms
PyTorch softmax: 104.1 ms
The Key Insight
PyTorch tensors are NumPy arrays with GPU acceleration and automatic differentiation:
- Same API, same intuition
- GPU acceleration for speed
- Automatic gradients for training
- Optimized kernels under the hood
Use NumPy for learning concepts; use PyTorch for production models.
Code Walkthrough
Explore tensors interactively:
import torch
print(f"PyTorch version: {torch.__version__}")
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Device: {device}")
PyTorch version: 2.10.0+cu128
Device: cpu
Creating Tensors
# From a list
x = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(f"Shape: {x.shape}")
print(f"Dtype: {x.dtype}")
print(f"Device: {x.device}")
print(x)
Shape: torch.Size([2, 3])
Dtype: torch.float32
Device: cpu
tensor([[1., 2., 3.],
[4., 5., 6.]])
# Random tensors (common for initialization)
random_tensor = torch.randn(2, 3, 4) # Normal distribution (mean=0, std=1)
print(f"Random tensor shape: {random_tensor.shape}")
print(f"Mean: {random_tensor.mean():.4f}, Std: {random_tensor.std():.4f}")
Random tensor shape: torch.Size([2, 3, 4])
Mean: -0.0696, Std: 0.9891
Data Types (dtypes)
Dtype choice determines numerical precision and memory usage:
# Default is float32 (32 bits = 4 bytes per number)
t32 = torch.randn(1000, 1000)
print(f"float32: {t32.element_size()} bytes per element, total: {t32.numel() * t32.element_size() / 1e6:.1f} MB")
# float16 uses half the memory but lower precision
t16 = torch.randn(1000, 1000, dtype=torch.float16)
print(f"float16: {t16.element_size()} bytes per element, total: {t16.numel() * t16.element_size() / 1e6:.1f} MB")
# bfloat16: same exponent bits as float32 (8 bits) for better dynamic range,
# but fewer mantissa bits than float16, trading precision for stability
tbf16 = torch.randn(1000, 1000, dtype=torch.bfloat16)
print(f"bfloat16: {tbf16.element_size()} bytes per element")
float32: 4 bytes per element, total: 4.0 MB
float16: 2 bytes per element, total: 2.0 MB
bfloat16: 2 bytes per element
When to use each:
- float32: Default, good for learning and debugging
- float16: Inference on GPUs with Tensor Cores, half memory
- bfloat16: Training large models, better numerical stability than float16
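The stability difference comes from the exponent bits: float16's largest finite value is 65504, while bfloat16 keeps float32's range. A quick check (the value 70000.0 is arbitrary):

```python
import torch

big = torch.tensor(70000.0)    # fits comfortably in float32

print(big.to(torch.float16))   # inf: 70000 exceeds float16's max of 65504
print(big.to(torch.bfloat16))  # finite, but rounded (bfloat16 has few mantissa bits)
```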
Reshaping
Multi-head attention requires reshaping to split the embedding dimension across heads:
# Reshape for multi-head attention
batch, seq, embed = 4, 32, 64
num_heads = 8
head_dim = embed // num_heads
x = torch.randn(batch, seq, embed)
print(f"Original: {x.shape}")
# Split into heads
x_heads = x.view(batch, seq, num_heads, head_dim)
print(f"After view: {x_heads.shape}")
# Transpose for attention computation
x_heads = x_heads.transpose(1, 2) # (batch, heads, seq, head_dim)
print(f"After transpose: {x_heads.shape}")
Original: torch.Size([4, 32, 64])
After view: torch.Size([4, 32, 8, 8])
After transpose: torch.Size([4, 8, 32, 8])
Memory Layout: view vs reshape vs contiguous
Understanding memory layout prevents contiguity errors. Tensors store data in a flat 1D array; strides specify how many elements to skip when traversing each dimension. Operations like transpose change the logical order without moving data. The result: a non-contiguous tensor whose strides no longer match row-major layout:
# view() requires contiguous memory - it's a zero-copy operation
x = torch.randn(3, 4)
print(f"Original is contiguous: {x.is_contiguous()}")
# Transpose creates a non-contiguous view (same memory, different strides)
x_t = x.transpose(0, 1)
print(f"Transposed is contiguous: {x_t.is_contiguous()}")
# view() fails on non-contiguous tensors
try:
x_t.view(12) # This will fail
except RuntimeError as e:
print(f"Error: {e}")
# contiguous() makes a copy with proper memory layout
x_t_contig = x_t.contiguous()
print(f"After contiguous(): {x_t_contig.is_contiguous()}")
x_t_contig.view(12) # Now it works
print("view() works after contiguous()")
# reshape() handles this automatically (but may copy)
reshaped = x_t.reshape(12) # Always works
print(f"reshape() auto-handles non-contiguous: {reshaped.shape}")
Original is contiguous: True
Transposed is contiguous: False
Error: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
After contiguous(): True
view() works after contiguous()
reshape() auto-handles non-contiguous: torch.Size([12])
Rule of thumb: Use reshape() by default; use view() when you need zero-copy behavior.
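Strides make the contiguity story concrete: printing them shows that transpose only swaps the per-dimension step sizes while the underlying storage stays put:

```python
import torch

x = torch.randn(3, 4)
print(x.stride())      # (4, 1): advance 4 elements per row, 1 per column
x_t = x.transpose(0, 1)
print(x_t.stride())    # (1, 4): same storage, strides swapped
print(x.data_ptr() == x_t.data_ptr())  # True: no data was copied
```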
Matrix Multiplication
Matrix multiplication dominates neural network computation. The @ operator (or torch.matmul) handles batched operations automatically:
# Simulating Q @ K^T in attention
Q = torch.randn(2, 8, 32, 8) # (batch, heads, seq, head_dim)
K = torch.randn(2, 8, 32, 8)
# Attention scores
scores = Q @ K.transpose(-2, -1) # (batch, heads, seq, seq)
print(f"Q shape: {Q.shape}")
print(f"K^T shape: {K.transpose(-2, -1).shape}")
print(f"Scores shape: {scores.shape}")Q shape: torch.Size([2, 8, 32, 8])
K^T shape: torch.Size([2, 8, 8, 32])
Scores shape: torch.Size([2, 8, 32, 32])
Key insight: Leading dimensions broadcast automatically; the last two dimensions follow matrix multiplication rules: (m, k) @ (k, n) -> (m, n).
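For instance, a batch dimension of 1 broadcasts against a larger one, just as in element-wise operations (a quick check with arbitrary sizes):

```python
import torch

A = torch.randn(1, 3, 4)   # leading batch dim of 1
B = torch.randn(5, 4, 2)   # leading batch dim of 5

C = A @ B                  # batch dims broadcast: 1 vs 5 -> 5
print(C.shape)             # torch.Size([5, 3, 2])
```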
Preview: Softmax in LLMs
Softmax converts raw scores (logits) into a probability distribution. You’ll use it constantly in attention weights and next-token prediction.
\[\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}\]
import torch
logits = torch.tensor([2.0, 1.0, 0.1])
probs = torch.softmax(logits, dim=-1)
print(f"Logits: {logits.tolist()}")
print(f"Probs: {[f'{p:.3f}' for p in probs.tolist()]}")
print(f"Sum: {probs.sum():.3f}")
Logits: [2.0, 1.0, 0.10000000149011612]
Probs: ['0.659', '0.242', '0.099']
Sum: 1.000
The highest logit (2.0) gets ~65% of the probability mass. The dim parameter specifies which dimension sums to 1.
Tip: Always use torch.softmax() — it handles numerical stability automatically. Module 05: Attention and Module 08: Generation cover softmax in depth.
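What "handles numerical stability" means in practice: subtracting the row maximum before exponentiating leaves the result mathematically unchanged but avoids overflow. A sketch of the trick, not PyTorch's actual kernel:

```python
import torch

logits = torch.tensor([1000.0, 999.0, 998.0])   # large enough to overflow exp()

naive = torch.exp(logits) / torch.exp(logits).sum()
print(naive)  # tensor([nan, nan, nan]): exp(1000) is inf, and inf/inf is nan

shifted = logits - logits.max()                 # subtract the max first
stable = torch.exp(shifted) / torch.exp(shifted).sum()
print(stable)

print(torch.allclose(stable, torch.softmax(logits, dim=-1)))  # True
```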
Common Operations in LLMs
These operations appear everywhere in transformer models:
# Layer Normalization (normalizes features, not batch)
x = torch.randn(4, 32, 64) # (batch, seq, embed)
mean = x.mean(dim=-1, keepdim=True)
std = x.std(dim=-1, keepdim=True)
x_norm = (x - mean) / (std + 1e-5)
print(f"LayerNorm output shape: {x_norm.shape}")
print(f"Mean per token: {x_norm.mean(dim=-1)[0, :3]}") # Should be ~0
# Linear projection (the most common operation)
W = torch.randn(64, 256) # (in_features, out_features)
b = torch.randn(256) # (out_features,)
x = torch.randn(4, 32, 64) # (batch, seq, in_features)
out = x @ W + b # Broadcasting adds bias
print(f"Linear output shape: {out.shape}")
LayerNorm output shape: torch.Size([4, 32, 64])
Mean per token: tensor([-1.8626e-08, -1.4901e-08, -2.0489e-08])
Linear output shape: torch.Size([4, 32, 256])
Broadcasting in Action
# Adding bias to all tokens in a batch
embeddings = torch.randn(4, 32, 64) # (batch, seq, embed)
bias = torch.randn(64) # (embed,)
result = embeddings + bias # Broadcasts!
print(f"Embeddings: {embeddings.shape}")
print(f"Bias: {bias.shape}")
print(f"Result: {result.shape}")
Embeddings: torch.Size([4, 32, 64])
Bias: torch.Size([64])
Result: torch.Size([4, 32, 64])
Device Management (CPU vs GPU)
Moving tensors between devices is essential for GPU acceleration:
# Check available devices
print(f"MPS available: {torch.backends.mps.is_available()}")
print(f"CUDA available: {torch.cuda.is_available()}")
# Create tensor on specific device
device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.randn(1000, 1000, device=device)
print(f"Tensor device: {x.device}")
# Move existing tensor to device
y = torch.randn(1000, 1000) # Created on CPU by default
y = y.to(device) # Move to GPU
print(f"After .to(): {y.device}")
MPS available: False
CUDA available: False
Tensor device: cpu
After .to(): cpu
Common pitfall: PyTorch requires tensors to share the same device for operations:
# This would fail if devices differ:
# z = x_cpu + x_gpu # RuntimeError!
# Always ensure tensors are on the same device
a = torch.randn(100, device=device)
b = torch.randn(100, device=device)
c = a + b # Works!
print(f"Both on {device}: operation succeeded")
Both on cpu: operation succeeded
Interactive Exploration
Enter two tensor shapes below. The widget shows how broadcasting aligns and expands each dimension.
Tip: Try This
- Simple broadcast: Try `(4, 1)` and `(1, 3)`. Both have a 1, so they broadcast to `(4, 3)`.
- Scalar broadcast: Try `(3, 4)` and `(1)`. A scalar broadcasts to any shape.
- Same shapes: Try `(2, 3)` and `(2, 3)`. No broadcasting needed - shapes are identical.
- Incompatible shapes: Try `(3, 4)` and `(2, 4)`. The first dimension (3 vs 2) can’t broadcast because neither is 1.
- Real-world example: Try `(32, 10, 64)` (a batch of sequences) and `(64)` (a bias vector). The bias broadcasts across the batch and sequence dimensions.
Exercises
Exercise 1: Create an Embedding Lookup
# Create a vocabulary embedding table
vocab_size = 100
embed_dim = 32
embedding_table = torch.randn(vocab_size, embed_dim)
print(f"Embedding table: {embedding_table.shape}")
# Look up embeddings for token IDs
token_ids = torch.tensor([5, 23, 7, 42])
embeddings = embedding_table[token_ids]
print(f"Token IDs: {token_ids}")
print(f"Embeddings shape: {embeddings.shape}")
Embedding table: torch.Size([100, 32])
Token IDs: tensor([ 5, 23, 7, 42])
Embeddings shape: torch.Size([4, 32])
Exercise 2: Simulate Simple Attention
seq_len = 6
embed_dim = 8
# Token embeddings
tokens = torch.randn(seq_len, embed_dim)
# Compute attention scores (dot product similarity)
scores = tokens @ tokens.T
print(f"Attention scores shape: {scores.shape}")
# Apply softmax to get weights
attention_weights = torch.softmax(scores, dim=-1)
print(f"Attention weights shape: {attention_weights.shape}")
Attention scores shape: torch.Size([6, 6])
Attention weights shape: torch.Size([6, 6])
Exercise 3: Apply Attention
# Weighted combination of values
output = attention_weights @ tokens
print(f"Input: {tokens.shape}")
print(f"Weights: {attention_weights.shape}")
print(f"Output: {output.shape}")
print("\nEach output token is a weighted average of ALL input tokens!")
Input: torch.Size([6, 8])
Weights: torch.Size([6, 6])
Output: torch.Size([6, 8])
Each output token is a weighted average of ALL input tokens!
Common Pitfalls
Avoid these common mistakes:
1. Shape Mismatches
Always print shapes when debugging. Most errors come from unexpected dimensions.
# BAD: Silent broadcasting can hide bugs
a = torch.randn(4, 3)
b = torch.randn(3) # Did you mean (4, 3)?
result = a + b # Works due to broadcasting, but may not be intended!
# GOOD: Verify shapes explicitly
print(f"a: {a.shape}, b: {b.shape}, result: {result.shape}")
assert a.shape == (4, 3), f"Expected (4, 3), got {a.shape}"
a: torch.Size([4, 3]), b: torch.Size([3]), result: torch.Size([4, 3])
2. Device Mismatches
Tensors must be on the same device for operations.
device = "mps" if torch.backends.mps.is_available() else "cpu"
x_cpu = torch.randn(3)
x_gpu = torch.randn(3, device=device)
# BAD: This fails if device != cpu
# result = x_cpu + x_gpu # RuntimeError!
# GOOD: Ensure same device
x_cpu = x_cpu.to(device)
result = x_cpu + x_gpu
print(f"Both on {device}: operation succeeded")
Both on cpu: operation succeeded
3. In-place Operations Break Gradients
Methods ending in _ modify tensors in-place and break gradient computation.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2
# BAD: In-place modification of a tensor that autograd saved for backward
# y.add_(1) # can raise: "a variable needed for gradient computation has been modified by an inplace operation"
# GOOD: Create new tensor
y = y + 1
loss = y.sum()
loss.backward()
print(f"Gradient computed: {x.grad}")
Gradient computed: tensor([2., 2., 2.])
4. Forgetting contiguous()
After transpose/permute, the tensor may not be contiguous in memory.
x = torch.randn(3, 4)
x_t = x.transpose(0, 1) # Shape (4, 3), but not contiguous
print(f"Original contiguous: {x.is_contiguous()}")
print(f"Transposed contiguous: {x_t.is_contiguous()}")
# BAD: view() requires contiguous memory
# x_t.view(12) # RuntimeError!
# GOOD: Make contiguous first, or use reshape
x_t_contig = x_t.contiguous().view(12) # Works
x_t_reshaped = x_t.reshape(12) # Also works (may copy)
Original is contiguous: True
Transposed contiguous: False
5. dtype Mismatches
Operations between different dtypes may silently upcast or fail.
a = torch.tensor([1.0, 2.0], dtype=torch.float32)
b = torch.tensor([1.0, 2.0], dtype=torch.float16)
# Mixed dtypes promote silently: float16 + float32 gives float32 (may waste memory)
result = a + b
print(f"Result dtype: {result.dtype}")
# GOOD: Be explicit about dtype
a_fp16 = a.to(torch.float16)
result = a_fp16 + b # Both float16
print(f"Explicit dtype: {result.dtype}")
Result dtype: torch.float32
Explicit dtype: torch.float16
Summary
Key takeaways:
- Tensors are multi-dimensional arrays - their shape tells you what they represent
- Broadcasting automatically expands smaller tensors to match larger ones
- Matrix multiplication is the core operation - inner dimensions must match
- Reshaping reorganizes dimensions without changing total elements
- Memory layout matters - understand contiguous vs strided for efficient operations
- Device placement - use GPU (MPS/CUDA) for 10-100x speedup on large tensors
- Data types - float32 for learning, float16/bfloat16 for production
What’s Next
Module 02: Autograd shows how PyTorch automatically computes gradients through all these operations - the mechanism that makes neural networks trainable.