Module 01: Tensors

Introduction

Tensors are the foundation of deep learning. Master them before building a language model.

A tensor is a multi-dimensional array of numbers. NumPy users already know tensors. PyTorch tensors match NumPy’s interface but add GPU acceleration and automatic differentiation.

Why do we need them for LLMs?

  • Text becomes numbers: The tokenizer converts text into integer token IDs, which an embedding layer maps to vectors (see the toy sketch after this list)
  • Batching: Processing multiple sequences simultaneously improves efficiency
  • Matrix operations: Attention, embeddings, and neural network layers are built from matrix multiplications
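
As a toy illustration of the first point, here is a hypothetical five-word vocabulary (real tokenizers are learned from data and usually operate on subwords, not whole words):

# Hypothetical toy vocabulary, not a real tokenizer
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
token_ids = [vocab[word] for word in "the cat sat on the mat".split()]
print(token_ids)  # [0, 1, 2, 3, 0, 4]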

What You’ll Learn

After this module, you will be able to:

  • Understand tensor shapes and what each dimension represents
  • Perform element-wise operations, matrix multiplication, and broadcasting
  • Convert between NumPy arrays and PyTorch tensors
  • Move tensors between CPU and GPU for acceleration
  • Recognize common LLM tensor shapes and their meanings

Tensor Dimensions

Tensors are classified by dimension count: a 0D tensor is a scalar, 1D a vector, 2D a matrix; 3D and above are simply higher-dimensional tensors.

Shape reveals what a tensor represents. In an LLM:

  • (vocab_size,) - A 1D tensor: scores for each word in the vocabulary
  • (seq_len, embed_dim) - A 2D tensor: one embedding vector per token
  • (batch, seq_len, embed_dim) - A 3D tensor: multiple sequences at once

LLM Tensor Shapes
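
A quick sketch of these three shapes, with made-up sizes (a 10-word vocabulary, 4 tokens, 8-dimensional embeddings) for illustration:

import numpy as np

vocab_scores = np.zeros(10)         # (vocab_size,): one score per vocabulary word
token_embeds = np.zeros((4, 8))     # (seq_len, embed_dim): one embedding per token
batch_embeds = np.zeros((2, 4, 8))  # (batch, seq_len, embed_dim): two sequences at once

for name, t in [("1D", vocab_scores), ("2D", token_embeds), ("3D", batch_embeds)]:
    print(f"{name}: shape={t.shape}, ndim={t.ndim}")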

Tensors Are Just Arrays

Start with NumPy: understanding NumPy arrays means understanding tensors.

import numpy as np

# A tensor is just a multi-dimensional array of numbers
scalar = np.array(5.0)           # 0D: a single number
vector = np.array([1, 2, 3])     # 1D: a list of numbers
matrix = np.array([[1, 2],       # 2D: a grid of numbers
                   [3, 4]])
tensor_3d = np.zeros((2, 3, 4))  # 3D: a stack of grids

print(f"Scalar: shape={scalar.shape}, ndim={scalar.ndim}")
print(f"Vector: shape={vector.shape}, ndim={vector.ndim}")
print(f"Matrix: shape={matrix.shape}, ndim={matrix.ndim}")
print(f"3D Tensor: shape={tensor_3d.shape}, ndim={tensor_3d.ndim}")
Scalar: shape=(), ndim=0
Vector: shape=(3,), ndim=1
Matrix: shape=(2, 2), ndim=2
3D Tensor: shape=(2, 3, 4), ndim=3

Shape and Dtype

Every array has two fundamental properties:

  • Shape: The size of each dimension (rows, cols, ...)
  • Dtype: The data type of elements (float32, int64, etc.)

# Shape tells you what the data represents
embeddings = np.random.randn(4, 8)  # 4 tokens, each with 8-dim embedding
print(f"Shape: {embeddings.shape}")
print(f"Dtype: {embeddings.dtype}")
print(f"Total elements: {embeddings.size}")
print(f"Memory: {embeddings.nbytes} bytes")
Shape: (4, 8)
Dtype: float64
Total elements: 32
Memory: 256 bytes

Indexing and Slicing

NumPy’s powerful indexing works identically in PyTorch:

# Create a batch of sequences
batch = np.arange(24).reshape(2, 3, 4)  # (batch=2, seq=3, features=4)
print(f"Full shape: {batch.shape}")
print(f"Original:\n{batch}\n")

# Get first sequence in first batch
print(f"batch[0, 0]: {batch[0, 0]}")

# Get all batches, first token only
print(f"batch[:, 0, :] shape: {batch[:, 0, :].shape}")

# Negative indexing: last element
print(f"batch[0, -1, :]: {batch[0, -1, :]}")

# Boolean indexing
mask = batch > 10
print(f"Elements > 10: {batch[mask]}")
Full shape: (2, 3, 4)
Original:
[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]

batch[0, 0]: [0 1 2 3]
batch[:, 0, :] shape: (2, 4)
batch[0, -1, :]: [ 8  9 10 11]
Elements > 10: [11 12 13 14 15 16 17 18 19 20 21 22 23]

The Core Operations

Neural networks depend on four fundamental operations: element-wise arithmetic, matrix multiplication, broadcasting, and reshaping. We examine each in NumPy, then in PyTorch.

Element-wise Operations

Apply the same operation to every element:

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# NumPy: element-wise arithmetic
print(f"a + b = {a + b}")
print(f"a * b = {a * b}")
print(f"a ** 2 = {a ** 2}")
print(f"np.exp(a) = {np.exp(a)}")
a + b = [5. 7. 9.]
a * b = [ 4. 10. 18.]
a ** 2 = [1. 4. 9.]
np.exp(a) = [ 2.71828183  7.3890561  20.08553692]

PyTorch works identically:

import torch

a_pt = torch.tensor([1.0, 2.0, 3.0])
b_pt = torch.tensor([4.0, 5.0, 6.0])

# PyTorch: same operations, same syntax
print(f"a + b = {a_pt + b_pt}")
print(f"a * b = {a_pt * b_pt}")
print(f"a ** 2 = {a_pt ** 2}")
print(f"torch.exp(a) = {torch.exp(a_pt)}")
a + b = tensor([5., 7., 9.])
a * b = tensor([ 4., 10., 18.])
a ** 2 = tensor([1., 4., 9.])
torch.exp(a) = tensor([ 2.7183,  7.3891, 20.0855])

Matrix Multiplication

The workhorse of neural networks. For matrices A (m x n) and B (n x p), the result C = A @ B is (m x p); each entry is the dot product of a row of A with a column of B: C[i, j] = sum_k A[i, k] * B[k, j]. For example, below C[0, 0] = 1*5 + 2*7 = 19:

# NumPy matrix multiplication
A = np.array([[1, 2],
              [3, 4]])   # (2, 2)
B = np.array([[5, 6],
              [7, 8]])   # (2, 2)

# Three equivalent ways
result1 = np.matmul(A, B)
result2 = A @ B
result3 = np.dot(A, B)  # same for 2D arrays

print(f"A @ B =\n{result1}")
print(f"Result shape: {result1.shape}")
A @ B =
[[19 22]
 [43 50]]
Result shape: (2, 2)

The @ operator also works for batched operations:

# Batched matrix multiplication in NumPy
batch_A = np.random.randn(4, 3, 2)  # 4 matrices of shape (3, 2)
batch_B = np.random.randn(4, 2, 5)  # 4 matrices of shape (2, 5)

result = batch_A @ batch_B
print(f"Batch matmul: {batch_A.shape} @ {batch_B.shape} = {result.shape}")
Batch matmul: (4, 3, 2) @ (4, 2, 5) = (4, 3, 5)

PyTorch’s @ and torch.matmul behave the same:

# PyTorch matrix multiplication
A_pt = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
B_pt = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

result_pt = A_pt @ B_pt
print(f"PyTorch A @ B =\n{result_pt}")

# Batched
batch_A_pt = torch.randn(4, 3, 2)
batch_B_pt = torch.randn(4, 2, 5)
print(f"Batched: {(batch_A_pt @ batch_B_pt).shape}")
PyTorch A @ B =
tensor([[19., 22.],
        [43., 50.]])
Batched: torch.Size([4, 3, 5])

Broadcasting

When shapes differ, broadcasting expands the smaller array. This is crucial for adding biases, scaling, and many other operations.

The rules are simple:

  1. Align shapes from the right
  2. Each pair of dimensions must be equal, or one of them must be 1
  3. Missing dimensions are treated as 1
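
To make the rules concrete, here is a minimal sketch of the shape-resolution algorithm; broadcast_result_shape is a hypothetical helper written for this module, not a NumPy or PyTorch function:

def broadcast_result_shape(shape_a, shape_b):
    # Rule 3: left-pad the shorter shape with 1s
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    # Rules 1-2: walk the aligned dimensions; each pair must match or contain a 1
    for da, db in zip(a, b):
        if da != db and 1 not in (da, db):
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
        result.append(max(da, db))
    return tuple(result)

print(broadcast_result_shape((4, 3), (3,)))      # (4, 3)
print(broadcast_result_shape((2, 4, 3), (3,)))   # (2, 4, 3)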

# NumPy broadcasting examples
x = np.ones((4, 3))       # (4, 3)
bias = np.array([1, 2, 3]) # (3,) - broadcasts to (4, 3)

result = x + bias
print(f"Shape {x.shape} + {bias.shape} = {result.shape}")
print(f"Result:\n{result}")
Shape (4, 3) + (3,) = (4, 3)
Result:
[[2. 3. 4.]
 [2. 3. 4.]
 [2. 3. 4.]
 [2. 3. 4.]]

# More broadcasting examples
batch = np.ones((2, 4, 3))  # (2, 4, 3)
scale = np.array([[[2]]])   # (1, 1, 1) - broadcasts to (2, 4, 3)
vector = np.array([1, 2, 3]) # (3,) - broadcasts to (2, 4, 3)

print(f"{batch.shape} * {scale.shape} = {(batch * scale).shape}")
print(f"{batch.shape} + {vector.shape} = {(batch + vector).shape}")
(2, 4, 3) * (1, 1, 1) = (2, 4, 3)
(2, 4, 3) + (3,) = (2, 4, 3)

PyTorch broadcasting follows the exact same rules:

# PyTorch broadcasting
embeddings = torch.randn(4, 32, 64)  # (batch, seq, embed)
bias = torch.randn(64)               # (embed,)

result = embeddings + bias  # broadcasts!
print(f"PyTorch: {embeddings.shape} + {bias.shape} = {result.shape}")
PyTorch: torch.Size([4, 32, 64]) + torch.Size([64]) = torch.Size([4, 32, 64])

Key Insight: NumPy to PyTorch

PyTorch tensors are NumPy arrays with superpowers. The API is nearly identical:

NumPy                PyTorch                Notes
np.array([1,2,3])    torch.tensor([1,2,3])  Creation
arr.shape            tensor.shape           Same attribute
arr.dtype            tensor.dtype           Same attribute
np.matmul(a, b)      torch.matmul(a, b)     Or use @
np.exp(x)            torch.exp(x)           Element-wise ops
arr.reshape(2,3)     tensor.reshape(2,3)    Reshaping
arr.T                tensor.T               Transpose

Converting between them requires one function call:

# NumPy <-> PyTorch conversion
np_array = np.array([1.0, 2.0, 3.0])
pt_tensor = torch.from_numpy(np_array)  # Shares memory!
back_to_np = pt_tensor.numpy()          # Shares memory!

print(f"NumPy: {np_array}")
print(f"PyTorch: {pt_tensor}")
print(f"Back to NumPy: {back_to_np}")
NumPy: [1. 2. 3.]
PyTorch: tensor([1., 2., 3.], dtype=torch.float64)
Back to NumPy: [1. 2. 3.]
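
Because the conversion shares memory (for CPU tensors), an in-place change on either side is visible on the other; call .clone() or np.copy first if you need independent data. A quick sketch:

import numpy as np
import torch

np_array = np.array([1.0, 2.0, 3.0])
pt_tensor = torch.from_numpy(np_array)

np_array[0] = 99.0  # modify the NumPy array in place
print(pt_tensor)    # tensor([99., 2., 3.], dtype=torch.float64): the tensor sees the change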

Why PyTorch?

What does PyTorch add beyond NumPy? Three capabilities:

1. GPU Acceleration

NumPy runs only on CPU. PyTorch runs on both CPU and GPU, delivering 10-100x speedups on large matrices:

import time

# Create large matrices
size = 2000
np_a = np.random.randn(size, size).astype(np.float32)
np_b = np.random.randn(size, size).astype(np.float32)

# NumPy (CPU)
start = time.time()
np_result = np_a @ np_b
np_time = time.time() - start
print(f"NumPy (CPU): {np_time*1000:.1f} ms")

# PyTorch (CPU for comparison)
pt_a = torch.from_numpy(np_a)
pt_b = torch.from_numpy(np_b)
start = time.time()
pt_result = pt_a @ pt_b
pt_cpu_time = time.time() - start
print(f"PyTorch (CPU): {pt_cpu_time*1000:.1f} ms")

# PyTorch (GPU if available)
if torch.backends.mps.is_available() or torch.cuda.is_available():
    device = "mps" if torch.backends.mps.is_available() else "cuda"
    pt_a_gpu = pt_a.to(device)
    pt_b_gpu = pt_b.to(device)

    # Warm up: first GPU operation incurs overhead (kernel compilation, memory allocation)
    _ = pt_a_gpu @ pt_b_gpu
    # Synchronize: GPU operations are async, so we wait for completion before timing
    if device == "mps":
        torch.mps.synchronize()
    else:
        torch.cuda.synchronize()

    start = time.time()
    pt_result_gpu = pt_a_gpu @ pt_b_gpu
    # Must synchronize again to ensure the operation completes before stopping the timer
    if device == "mps":
        torch.mps.synchronize()
    else:
        torch.cuda.synchronize()
    pt_gpu_time = time.time() - start
    print(f"PyTorch ({device.upper()}): {pt_gpu_time*1000:.1f} ms")
    print(f"GPU speedup: {np_time/pt_gpu_time:.1f}x faster")
NumPy (CPU): 164.3 ms
PyTorch (CPU): 189.7 ms

2. Automatic Differentiation

PyTorch records every operation on a tensor so it can compute gradients automatically; this automatic differentiation is what makes neural networks trainable:

# NumPy: you'd have to compute gradients by hand
x_np = np.array([2.0])
y_np = x_np ** 2 + 3 * x_np
# dy/dx = 2x + 3 = 7 at x=2... but you have to derive and code this yourself!

# PyTorch: automatic!
x_pt = torch.tensor([2.0], requires_grad=True)
y_pt = x_pt ** 2 + 3 * x_pt
y_pt.backward()  # Compute gradient automatically
print(f"x = {x_pt.item()}")
print(f"y = x^2 + 3x = {y_pt.item()}")
print(f"dy/dx (computed automatically) = {x_pt.grad.item()}")
x = 2.0
y = x^2 + 3x = 10.0
dy/dx (computed automatically) = 7.0

This automatic differentiation scales to millions of parameters. Module 02: Autograd explores this deeply.

3. Optimized Kernels

PyTorch uses optimized backends (cuBLAS, cuDNN, MPS) that outperform naive implementations, even on CPU. Operations like convolutions, attention, and batch normalization have specialized implementations.

# PyTorch's optimized softmax vs manual
x = torch.randn(1000, 1000)

# Manual softmax (correct but slower)
def manual_softmax(x):
    exp_x = torch.exp(x - x.max(dim=-1, keepdim=True).values)
    return exp_x / exp_x.sum(dim=-1, keepdim=True)

# PyTorch's optimized version
import time

start = time.time()
for _ in range(100):
    _ = manual_softmax(x)
manual_time = time.time() - start

start = time.time()
for _ in range(100):
    _ = torch.softmax(x, dim=-1)
pytorch_time = time.time() - start

print(f"Manual softmax: {manual_time*1000:.1f} ms")
print(f"PyTorch softmax: {pytorch_time*1000:.1f} ms")
Manual softmax: 246.5 ms
PyTorch softmax: 104.1 ms

The Key Insight

PyTorch tensors are NumPy arrays with GPU acceleration and automatic differentiation:

  • Same API, same intuition
  • GPU acceleration for speed
  • Automatic gradients for training
  • Optimized kernels under the hood

Use NumPy for learning concepts; use PyTorch for production models.

Code Walkthrough

Explore tensors interactively:

import torch

print(f"PyTorch version: {torch.__version__}")
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Device: {device}")
PyTorch version: 2.10.0+cu128
Device: cpu

Creating Tensors

# From a list
x = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(f"Shape: {x.shape}")
print(f"Dtype: {x.dtype}")
print(f"Device: {x.device}")
print(x)
Shape: torch.Size([2, 3])
Dtype: torch.float32
Device: cpu
tensor([[1., 2., 3.],
        [4., 5., 6.]])

# Random tensors (common for initialization)
random_tensor = torch.randn(2, 3, 4)  # Normal distribution (mean=0, std=1)
print(f"Random tensor shape: {random_tensor.shape}")
print(f"Mean: {random_tensor.mean():.4f}, Std: {random_tensor.std():.4f}")
Random tensor shape: torch.Size([2, 3, 4])
Mean: -0.0696, Std: 0.9891

Data Types (dtypes)

Dtype choice determines numerical precision and memory usage:

# Default is float32 (32 bits = 4 bytes per number)
t32 = torch.randn(1000, 1000)
print(f"float32: {t32.element_size()} bytes per element, total: {t32.numel() * t32.element_size() / 1e6:.1f} MB")

# float16 uses half the memory but lower precision
t16 = torch.randn(1000, 1000, dtype=torch.float16)
print(f"float16: {t16.element_size()} bytes per element, total: {t16.numel() * t16.element_size() / 1e6:.1f} MB")

# bfloat16: same exponent bits as float32 (8 bits) for better dynamic range,
# but fewer mantissa bits than float16, trading precision for stability
tbf16 = torch.randn(1000, 1000, dtype=torch.bfloat16)
print(f"bfloat16: {tbf16.element_size()} bytes per element")
float32: 4 bytes per element, total: 4.0 MB
float16: 2 bytes per element, total: 2.0 MB
bfloat16: 2 bytes per element

When to use each:

  • float32: Default, good for learning and debugging
  • float16: Inference on GPUs with Tensor Cores, half the memory
  • bfloat16: Training large models; better numerical stability than float16 (see the range sketch after this list)
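
The range difference is easy to demonstrate: float16 overflows past its maximum (about 65504), while bfloat16's float32-sized exponent keeps large magnitudes finite, just coarsely rounded. A minimal sketch:

import torch

big = 70000.0
print(torch.tensor(big, dtype=torch.float16))   # tensor(inf, dtype=torch.float16)
print(torch.tensor(big, dtype=torch.bfloat16))  # finite but coarsely rounded (about 70000)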

Reshaping

Multi-head attention requires reshaping to split the embedding dimension across heads:

# Reshape for multi-head attention
batch, seq, embed = 4, 32, 64
num_heads = 8
head_dim = embed // num_heads

x = torch.randn(batch, seq, embed)
print(f"Original: {x.shape}")

# Split into heads
x_heads = x.view(batch, seq, num_heads, head_dim)
print(f"After view: {x_heads.shape}")

# Transpose for attention computation
x_heads = x_heads.transpose(1, 2)  # (batch, heads, seq, head_dim)
print(f"After transpose: {x_heads.shape}")
Original: torch.Size([4, 32, 64])
After view: torch.Size([4, 32, 8, 8])
After transpose: torch.Size([4, 8, 32, 8])
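
After attention, the heads are merged back with the inverse operations. A sketch continuing from the variables above (the contiguous() call is explained in the next subsection):

# Merge heads: undo the transpose, then collapse (heads, head_dim) back into embed
x_merged = x_heads.transpose(1, 2).contiguous().view(batch, seq, embed)
print(f"After merging heads: {x_merged.shape}")  # torch.Size([4, 32, 64])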

Memory Layout: view vs reshape vs contiguous

Understanding memory layout prevents contiguity errors. Tensors store data in a flat 1D array; strides specify how many elements to skip when traversing each dimension. Operations like transpose change the logical order without moving data. The result: a non-contiguous tensor whose strides no longer match row-major layout:
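
You can inspect strides directly; stride() reports, for each dimension, how many elements to skip in the flat buffer. A short sketch:

import torch

x = torch.randn(3, 4)
print(x.stride())                  # (4, 1): next row skips 4 elements, next column skips 1
print(x.transpose(0, 1).stride())  # (1, 4): same buffer, swapped strides, no data moved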

# view() requires contiguous memory - it's a zero-copy operation
x = torch.randn(3, 4)
print(f"Original is contiguous: {x.is_contiguous()}")

# Transpose creates a non-contiguous view (same memory, different strides)
x_t = x.transpose(0, 1)
print(f"Transposed is contiguous: {x_t.is_contiguous()}")

# view() fails on non-contiguous tensors
try:
    x_t.view(12)  # This will fail
except RuntimeError as e:
    print(f"Error: {e}")

# contiguous() makes a copy with proper memory layout
x_t_contig = x_t.contiguous()
print(f"After contiguous(): {x_t_contig.is_contiguous()}")
x_t_contig.view(12)  # Now it works
print("view() works after contiguous()")

# reshape() handles this automatically (but may copy)
reshaped = x_t.reshape(12)  # Always works
print(f"reshape() auto-handles non-contiguous: {reshaped.shape}")
Original is contiguous: True
Transposed is contiguous: False
Error: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
After contiguous(): True
view() works after contiguous()
reshape() auto-handles non-contiguous: torch.Size([12])

Rule of thumb: Use reshape() by default; use view() when you need zero-copy behavior.

Matrix Multiplication

Matrix multiplication dominates neural network computation. The @ operator (or torch.matmul) handles batched operations automatically:

# Simulating Q @ K^T in attention
Q = torch.randn(2, 8, 32, 8)  # (batch, heads, seq, head_dim)
K = torch.randn(2, 8, 32, 8)

# Attention scores
scores = Q @ K.transpose(-2, -1)  # (batch, heads, seq, seq)
print(f"Q shape: {Q.shape}")
print(f"K^T shape: {K.transpose(-2, -1).shape}")
print(f"Scores shape: {scores.shape}")
Q shape: torch.Size([2, 8, 32, 8])
K^T shape: torch.Size([2, 8, 8, 32])
Scores shape: torch.Size([2, 8, 32, 32])

Key insight: Leading dimensions broadcast automatically; the last two dimensions follow matrix multiplication rules: (m, k) @ (k, n) -> (m, n).
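
One consequence: a single unbatched matrix can multiply an entire batch, because its missing leading dimensions broadcast. A small sketch:

import torch

x = torch.randn(2, 8, 32, 8)  # (batch, heads, seq, head_dim)
W = torch.randn(8, 16)        # one projection matrix, shared across batch and heads

out = x @ W  # W broadcasts over the leading (2, 8) dimensions
print(out.shape)  # torch.Size([2, 8, 32, 16])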

Preview: Softmax in LLMs

Softmax converts raw scores (logits) into a probability distribution. You’ll use it constantly in attention weights and next-token prediction.

\[\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}\]

import torch

logits = torch.tensor([2.0, 1.0, 0.1])
probs = torch.softmax(logits, dim=-1)

print(f"Logits: {logits.tolist()}")
print(f"Probs:  {[f'{p:.3f}' for p in probs.tolist()]}")
print(f"Sum:    {probs.sum():.3f}")
Logits: [2.0, 1.0, 0.10000000149011612]
Probs:  ['0.659', '0.242', '0.099']
Sum:    1.000

The highest logit (2.0) gets ~65% of the probability mass. The dim parameter specifies which dimension sums to 1.

Tip: Always use torch.softmax() — it handles numerical stability automatically. Module 05: Attention and Module 08: Generation cover softmax in depth.
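
To see why stability matters, here is a quick sketch: a naive softmax overflows on large logits, while torch.softmax (implemented in a numerically stable way) stays finite:

import torch

logits = torch.tensor([1000.0, 1001.0])

naive = torch.exp(logits) / torch.exp(logits).sum()  # exp(1000) overflows to inf
print(naive)                          # tensor([nan, nan])
print(torch.softmax(logits, dim=-1))  # tensor([0.2689, 0.7311])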

Common Operations in LLMs

These operations appear everywhere in transformer models:

# Layer Normalization (normalizes features, not batch)
x = torch.randn(4, 32, 64)  # (batch, seq, embed)
mean = x.mean(dim=-1, keepdim=True)
std = x.std(dim=-1, keepdim=True)
x_norm = (x - mean) / (std + 1e-5)
print(f"LayerNorm output shape: {x_norm.shape}")
print(f"Mean per token: {x_norm.mean(dim=-1)[0, :3]}")  # Should be ~0

# Linear projection (the most common operation)
W = torch.randn(64, 256)  # (in_features, out_features)
b = torch.randn(256)       # (out_features,)
x = torch.randn(4, 32, 64) # (batch, seq, in_features)
out = x @ W + b            # Broadcasting adds bias
print(f"Linear output shape: {out.shape}")
LayerNorm output shape: torch.Size([4, 32, 64])
Mean per token: tensor([-1.8626e-08, -1.4901e-08, -2.0489e-08])
Linear output shape: torch.Size([4, 32, 256])

Broadcasting in Action

# Adding bias to all tokens in a batch
embeddings = torch.randn(4, 32, 64)  # (batch, seq, embed)
bias = torch.randn(64)               # (embed,)

result = embeddings + bias  # Broadcasts!
print(f"Embeddings: {embeddings.shape}")
print(f"Bias: {bias.shape}")
print(f"Result: {result.shape}")
Embeddings: torch.Size([4, 32, 64])
Bias: torch.Size([64])
Result: torch.Size([4, 32, 64])

Device Management (CPU vs GPU)

Moving tensors between devices is essential for GPU acceleration:

# Check available devices
print(f"MPS available: {torch.backends.mps.is_available()}")
print(f"CUDA available: {torch.cuda.is_available()}")

# Create tensor on specific device
device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.randn(1000, 1000, device=device)
print(f"Tensor device: {x.device}")

# Move existing tensor to device
y = torch.randn(1000, 1000)  # Created on CPU by default
y = y.to(device)             # Move to GPU
print(f"After .to(): {y.device}")
MPS available: False
CUDA available: False
Tensor device: cpu
After .to(): cpu

Common pitfall: PyTorch requires tensors to share the same device for operations:

# This would fail if devices differ:
# z = x_cpu + x_gpu  # RuntimeError!

# Always ensure tensors are on the same device
a = torch.randn(100, device=device)
b = torch.randn(100, device=device)
c = a + b  # Works!
print(f"Both on {device}: operation succeeded")
Both on cpu: operation succeeded

Interactive Exploration

Pick two tensor shapes and work out how broadcasting aligns and expands each dimension; you can check each case with torch.broadcast_shapes, as sketched after the list.

Tip: Try This
  1. Simple broadcast: Try (4, 1) and (1, 3). Both have a 1, so they broadcast to (4, 3).

  2. Scalar broadcast: Try (3, 4) and (1). A scalar broadcasts to any shape.

  3. Same shapes: Try (2, 3) and (2, 3). No broadcasting needed - shapes are identical.

  4. Incompatible shapes: Try (3, 4) and (2, 4). The first dimension (3 vs 2) can’t broadcast because neither is 1.

  5. Real-world example: Try (32, 10, 64) (a batch of sequences) and (64,) (a bias vector). The bias broadcasts across the batch and sequence dimensions.
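
To check these cases programmatically, torch.broadcast_shapes applies the same rules and raises an error for incompatible shapes:

import torch

print(torch.broadcast_shapes((4, 1), (1, 3)))       # torch.Size([4, 3])
print(torch.broadcast_shapes((3, 4), (1,)))         # torch.Size([3, 4])
print(torch.broadcast_shapes((32, 10, 64), (64,)))  # torch.Size([32, 10, 64])

try:
    torch.broadcast_shapes((3, 4), (2, 4))
except RuntimeError as e:
    print(f"Incompatible: {e}")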

Exercises

Exercise 1: Create an Embedding Lookup

# Create a vocabulary embedding table
vocab_size = 100
embed_dim = 32

embedding_table = torch.randn(vocab_size, embed_dim)
print(f"Embedding table: {embedding_table.shape}")

# Look up embeddings for token IDs
token_ids = torch.tensor([5, 23, 7, 42])
embeddings = embedding_table[token_ids]
print(f"Token IDs: {token_ids}")
print(f"Embeddings shape: {embeddings.shape}")
Embedding table: torch.Size([100, 32])
Token IDs: tensor([ 5, 23,  7, 42])
Embeddings shape: torch.Size([4, 32])

Exercise 2: Simulate Simple Attention

seq_len = 6
embed_dim = 8

# Token embeddings
tokens = torch.randn(seq_len, embed_dim)

# Compute attention scores (dot product similarity)
scores = tokens @ tokens.T
print(f"Attention scores shape: {scores.shape}")

# Apply softmax to get weights
attention_weights = torch.softmax(scores, dim=-1)
print(f"Attention weights shape: {attention_weights.shape}")
Attention scores shape: torch.Size([6, 6])
Attention weights shape: torch.Size([6, 6])

Exercise 3: Apply Attention

# Weighted combination of values
output = attention_weights @ tokens
print(f"Input: {tokens.shape}")
print(f"Weights: {attention_weights.shape}")
print(f"Output: {output.shape}")
print("\nEach output token is a weighted average of ALL input tokens!")
Input: torch.Size([6, 8])
Weights: torch.Size([6, 6])
Output: torch.Size([6, 8])

Each output token is a weighted average of ALL input tokens!

Common Pitfalls

Avoid these common mistakes:

1. Shape Mismatches

Always print shapes when debugging. Most errors come from unexpected dimensions.

# BAD: Silent broadcasting can hide bugs
a = torch.randn(4, 3)
b = torch.randn(3)     # Did you mean (4, 3)?
result = a + b         # Works due to broadcasting, but may not be intended!

# GOOD: Verify shapes explicitly
print(f"a: {a.shape}, b: {b.shape}, result: {result.shape}")
assert a.shape == (4, 3), f"Expected (4, 3), got {a.shape}"
a: torch.Size([4, 3]), b: torch.Size([3]), result: torch.Size([4, 3])

2. Device Mismatches

Tensors must be on the same device for operations.

device = "mps" if torch.backends.mps.is_available() else "cpu"
x_cpu = torch.randn(3)
x_gpu = torch.randn(3, device=device)

# BAD: This fails if device != cpu
# result = x_cpu + x_gpu  # RuntimeError!

# GOOD: Ensure same device
x_cpu = x_cpu.to(device)
result = x_cpu + x_gpu
print(f"Both on {device}: operation succeeded")
Both on cpu: operation succeeded

3. In-place Operations Break Gradients

Methods ending in _ modify tensors in-place and break gradient computation.

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2

# BAD: In-place modification of a leaf tensor that requires grad
# x.add_(1)  # RuntimeError: a leaf Variable that requires grad is being used in an in-place operation

# GOOD: Create new tensor
y = y + 1
loss = y.sum()
loss.backward()
print(f"Gradient computed: {x.grad}")
Gradient computed: tensor([2., 2., 2.])

4. Forgetting contiguous()

After transpose/permute, the tensor may not be contiguous in memory.

x = torch.randn(3, 4)
x_t = x.transpose(0, 1)  # Shape (4, 3), but not contiguous

print(f"Original contiguous: {x.is_contiguous()}")
print(f"Transposed contiguous: {x_t.is_contiguous()}")

# BAD: view() requires contiguous memory
# x_t.view(12)  # RuntimeError!

# GOOD: Make contiguous first, or use reshape
x_t_contig = x_t.contiguous().view(12)  # Works
x_t_reshaped = x_t.reshape(12)          # Also works (may copy)
Original contiguous: True
Transposed contiguous: False

5. dtype Mismatches

Operations between different dtypes may silently upcast or fail.

a = torch.tensor([1.0, 2.0], dtype=torch.float32)
b = torch.tensor([1.0, 2.0], dtype=torch.float16)

# Mixed dtypes are promoted automatically: float16 + float32 -> float32 (may waste memory)
result = a + b
print(f"Result dtype: {result.dtype}")

# GOOD: Be explicit about dtype
a_fp16 = a.to(torch.float16)
result = a_fp16 + b  # Both float16
print(f"Explicit dtype: {result.dtype}")
Result dtype: torch.float32
Explicit dtype: torch.float16

Summary

Key takeaways:

  1. Tensors are multi-dimensional arrays - their shape tells you what they represent
  2. Broadcasting automatically expands smaller tensors to match larger ones
  3. Matrix multiplication is the core operation - inner dimensions must match
  4. Reshaping reorganizes dimensions without changing total elements
  5. Memory layout matters - understand contiguous vs strided for efficient operations
  6. Device placement - use GPU (MPS/CUDA) for 10-100x speedup on large tensors
  7. Data types - float32 for learning, float16/bfloat16 for production

What’s Next

Module 02: Autograd shows how PyTorch automatically computes gradients through all these operations - the mechanism that makes neural networks trainable.