---
title: "Module 01: Tensors"
format:
html:
code-fold: false
toc: true
ipynb: default
jupyter: python3
---
{{< include ../_diagram-lib.qmd >}}
## Introduction
Tensors are the foundation of everything in deep learning. Before we can build a language model, we need to understand the data structure that holds all of our numbers.
A **tensor** is a multi-dimensional array of numbers. If you've used NumPy arrays, you already know what tensors are - PyTorch tensors are nearly identical, but with GPU acceleration and automatic differentiation built in.
Why do we need them for LLMs?
- **Text becomes numbers**: Every word/token gets converted to a list of numbers (a vector)
- **Batching**: We process multiple sequences at once for efficiency
- **Matrix operations**: Attention, embeddings, and neural network layers are all matrix multiplications
### What You'll Learn
By the end of this module, you will be able to:
- Understand tensor shapes and what each dimension represents
- Perform element-wise operations, matrix multiplication, and broadcasting
- Convert between NumPy arrays and PyTorch tensors
- Move tensors between CPU and GPU for acceleration
- Recognize common LLM tensor shapes and their meanings
## Tensor Dimensions
Think of tensors by their dimensions:
```{ojs}
//| echo: false
// Tensor dimension selector
viewof selectedDim = Inputs.range([0, 3], {
label: "Dimensions",
step: 1,
value: 0,
width: 300
})
```
```{ojs}
//| echo: false
// Dimension data
dimensionData = [
{
dim: 0,
name: "Scalar",
desc: "A single number",
shape: "()",
example: "5",
code: "torch.tensor(5.0)"
},
{
dim: 1,
name: "Vector",
desc: "A list of numbers",
shape: "(5,)",
example: "[1, 2, 3, 4, 5]",
code: "torch.tensor([1, 2, 3, 4, 5])"
},
{
dim: 2,
name: "Matrix",
desc: "A grid of numbers",
shape: "(2, 3)",
example: "[[1, 2, 3], [4, 5, 6]]",
code: "torch.tensor([[1,2,3], [4,5,6]])"
},
{
dim: 3,
name: "3D Tensor",
desc: "A stack of matrices",
shape: "(batch, seq, embed)",
example: "Multiple sequences in a batch",
code: "torch.randn(4, 32, 64)"
}
]
currentDim = dimensionData[selectedDim]
```
```{ojs}
//| echo: false
// Visual representation of tensor dimensions
{
const width = 580;
const height = 280;
const theme = diagramTheme;
const svg = d3.create("svg")
.attr("viewBox", `0 0 ${width} ${height}`)
.attr("width", "100%")
.attr("height", height)
.style("max-width", `${width}px`)
.style("font-family", "'JetBrains Mono', 'Fira Code', monospace");
// Background
svg.append("rect")
.attr("width", width)
.attr("height", height)
.attr("fill", theme.bg)
.attr("rx", 12);
const centerX = width / 2;
const vizY = 120;
// Draw progression indicators at top
const progressY = 30;
const stepWidth = 120;
const startX = (width - stepWidth * 4) / 2 + stepWidth / 2;
for (let i = 0; i <= 3; i++) {
const x = startX + i * stepWidth;
const isActive = i === selectedDim;
const isPast = i < selectedDim;
// Circle
svg.append("circle")
.attr("cx", x)
.attr("cy", progressY)
.attr("r", 16)
.attr("fill", isActive ? theme.highlight : isPast ? theme.accent : theme.nodeFill)
.attr("stroke", isActive ? theme.highlight : isPast ? theme.accent : theme.nodeStroke)
.attr("stroke-width", 2);
// Number
svg.append("text")
.attr("x", x)
.attr("y", progressY)
.attr("text-anchor", "middle")
.attr("dominant-baseline", "central")
.attr("fill", isActive || isPast ? theme.textOnHighlight : theme.nodeText)
.attr("font-size", "12px")
.attr("font-weight", "600")
.text(`${i}D`);
// Connector line
if (i < 3) {
svg.append("line")
.attr("x1", x + 20)
.attr("y1", progressY)
.attr("x2", x + stepWidth - 20)
.attr("y2", progressY)
.attr("stroke", i < selectedDim ? theme.accent : theme.nodeStroke)
.attr("stroke-width", 2)
.attr("stroke-dasharray", i < selectedDim ? "0" : "4,4");
}
}
// Draw the visual representation based on dimension
const vizGroup = svg.append("g")
.attr("transform", `translate(${centerX}, ${vizY})`);
if (selectedDim === 0) {
// Scalar: single glowing point
vizGroup.append("circle")
.attr("cx", 0)
.attr("cy", 0)
.attr("r", 30)
.attr("fill", theme.highlight)
.attr("filter", `drop-shadow(0 0 12px ${theme.highlightGlow})`);
vizGroup.append("text")
.attr("x", 0)
.attr("y", 0)
.attr("text-anchor", "middle")
.attr("dominant-baseline", "central")
.attr("fill", theme.textOnHighlight)
.attr("font-size", "24px")
.attr("font-weight", "bold")
.text("5");
} else if (selectedDim === 1) {
// Vector: row of cells
const cellSize = 40;
const values = [1, 2, 3, 4, 5];
const totalWidth = values.length * cellSize;
values.forEach((v, i) => {
const x = -totalWidth/2 + i * cellSize + cellSize/2;
vizGroup.append("rect")
.attr("x", x - cellSize/2 + 2)
.attr("y", -cellSize/2)
.attr("width", cellSize - 4)
.attr("height", cellSize)
.attr("fill", theme.accent)
.attr("rx", 4)
.attr("filter", `drop-shadow(0 0 6px ${theme.accentGlow})`);
vizGroup.append("text")
.attr("x", x)
.attr("y", 0)
.attr("text-anchor", "middle")
.attr("dominant-baseline", "central")
.attr("fill", theme.textOnAccent)
.attr("font-size", "18px")
.attr("font-weight", "600")
.text(v);
});
// Index labels
vizGroup.append("text")
.attr("x", -totalWidth/2 - 15)
.attr("y", 0)
.attr("text-anchor", "end")
.attr("dominant-baseline", "central")
.attr("fill", theme.nodeText)
.attr("font-size", "11px")
.attr("opacity", 0.7)
.text("idx:");
values.forEach((_, i) => {
const x = -totalWidth/2 + i * cellSize + cellSize/2;
vizGroup.append("text")
.attr("x", x)
.attr("y", cellSize/2 + 14)
.attr("text-anchor", "middle")
.attr("fill", theme.nodeText)
.attr("font-size", "10px")
.attr("opacity", 0.6)
.text(i);
});
} else if (selectedDim === 2) {
// Matrix: 2D grid
const cellSize = 36;
const data = [[1, 2, 3], [4, 5, 6]];
const rows = data.length;
const cols = data[0].length;
data.forEach((row, r) => {
row.forEach((v, c) => {
const x = (c - cols/2) * cellSize + cellSize/2;
const y = (r - rows/2) * cellSize + cellSize/2;
vizGroup.append("rect")
.attr("x", x - cellSize/2 + 2)
.attr("y", y - cellSize/2 + 2)
.attr("width", cellSize - 4)
.attr("height", cellSize - 4)
.attr("fill", theme.accent)
.attr("rx", 4)
.attr("filter", `drop-shadow(0 0 4px ${theme.accentGlow})`);
vizGroup.append("text")
.attr("x", x)
.attr("y", y)
.attr("text-anchor", "middle")
.attr("dominant-baseline", "central")
.attr("fill", theme.textOnAccent)
.attr("font-size", "16px")
.attr("font-weight", "600")
.text(v);
});
});
// Row/col labels
for (let r = 0; r < rows; r++) {
const y = (r - rows/2) * cellSize + cellSize/2;
vizGroup.append("text")
.attr("x", -cols/2 * cellSize - 12)
.attr("y", y)
.attr("text-anchor", "end")
.attr("dominant-baseline", "central")
.attr("fill", theme.nodeText)
.attr("font-size", "10px")
.attr("opacity", 0.6)
.text(`[${r}]`);
}
for (let c = 0; c < cols; c++) {
const x = (c - cols/2) * cellSize + cellSize/2;
vizGroup.append("text")
.attr("x", x)
.attr("y", -rows/2 * cellSize - 10)
.attr("text-anchor", "middle")
.attr("fill", theme.nodeText)
.attr("font-size", "10px")
.attr("opacity", 0.6)
.text(`[${c}]`);
}
} else {
// 3D Tensor: stacked matrices with depth effect
const cellSize = 28;
const layers = 3;
const rows = 2;
const cols = 3;
const depthOffset = 18;
// Draw back to front
for (let l = layers - 1; l >= 0; l--) {
const layerGroup = vizGroup.append("g")
.attr("transform", `translate(${l * depthOffset - (layers-1) * depthOffset/2}, ${-l * depthOffset + (layers-1) * depthOffset/2})`);
const opacity = l === 0 ? 1 : 0.5 + (layers - l) * 0.15;
const layerColors = [theme.highlight, theme.accent, theme.nodeFill];
const layerColor = layerColors[l] || theme.nodeFill;
const textColor = l === 0 ? theme.textOnHighlight : l === 1 ? theme.textOnAccent : theme.nodeText;
for (let r = 0; r < rows; r++) {
for (let c = 0; c < cols; c++) {
const x = (c - cols/2) * cellSize + cellSize/2;
const y = (r - rows/2) * cellSize + cellSize/2;
layerGroup.append("rect")
.attr("x", x - cellSize/2 + 2)
.attr("y", y - cellSize/2 + 2)
.attr("width", cellSize - 4)
.attr("height", cellSize - 4)
.attr("fill", layerColor)
.attr("stroke", theme.nodeStroke)
.attr("stroke-width", 0.5)
.attr("rx", 3)
.attr("opacity", opacity);
if (l === 0) {
layerGroup.append("text")
.attr("x", x)
.attr("y", y)
.attr("text-anchor", "middle")
.attr("dominant-baseline", "central")
.attr("fill", textColor)
.attr("font-size", "11px")
.attr("font-weight", "500")
.text((r * cols + c + 1));
}
}
}
}
// Labels for axes
vizGroup.append("text")
.attr("x", -80)
.attr("y", 50)
.attr("text-anchor", "middle")
.attr("fill", theme.nodeText)
.attr("font-size", "10px")
.attr("opacity", 0.7)
.text("batch");
vizGroup.append("text")
.attr("x", 0)
.attr("y", 65)
.attr("text-anchor", "middle")
.attr("fill", theme.nodeText)
.attr("font-size", "10px")
.attr("opacity", 0.7)
.text("seq");
vizGroup.append("text")
.attr("x", 80)
.attr("y", 50)
.attr("text-anchor", "middle")
.attr("fill", theme.nodeText)
.attr("font-size", "10px")
.attr("opacity", 0.7)
.text("embed");
}
// Info panel at bottom
const infoY = 210;
svg.append("text")
.attr("x", centerX)
.attr("y", infoY)
.attr("text-anchor", "middle")
.attr("fill", theme.highlight)
.attr("font-size", "18px")
.attr("font-weight", "700")
.text(`${currentDim.dim}D: ${currentDim.name}`);
svg.append("text")
.attr("x", centerX)
.attr("y", infoY + 22)
.attr("text-anchor", "middle")
.attr("fill", theme.nodeText)
.attr("font-size", "13px")
.text(currentDim.desc);
svg.append("text")
.attr("x", centerX)
.attr("y", infoY + 44)
.attr("text-anchor", "middle")
.attr("fill", theme.accent)
.attr("font-size", "12px")
.attr("font-family", "'JetBrains Mono', 'Fira Code', monospace")
.text(`shape: ${currentDim.shape}`);
return svg.node();
}
```
The shape of a tensor tells you what it represents. Some shapes you will see constantly in an LLM (see the sketch after this list):
- `(vocab_size,)` - A 1D tensor: scores for each word in vocabulary
- `(seq_len, embed_dim)` - A 2D tensor: one embedding vector per token
- `(batch, seq_len, embed_dim)` - A 3D tensor: multiple sequences at once
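As a preview of the PyTorch code later in this module, here is a minimal sketch that creates each of these shapes. The sizes are hypothetical, chosen only for illustration:

```{python}
import torch

# Hypothetical sizes, for illustration only
vocab_size, seq_len, embed_dim, batch = 50_000, 16, 64, 4

logits = torch.randn(vocab_size)                  # (vocab_size,): one score per vocabulary entry
token_embeds = torch.randn(seq_len, embed_dim)    # (seq_len, embed_dim): one vector per token
batched = torch.randn(batch, seq_len, embed_dim)  # (batch, seq_len, embed_dim): several sequences at once

print(logits.shape, token_embeds.shape, batched.shape)
```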
### LLM Tensor Shapes
```{ojs}
//| echo: false
// LLM pipeline step selector
viewof llmStep = Inputs.range([0, 2], {
label: "Pipeline Stage",
step: 1,
value: 0,
width: 300
})
```
```{ojs}
//| echo: false
// LLM pipeline stages data
llmStages = [
{
name: "Token IDs",
shape: "(batch, seq_len)",
dims: [2, 4],
dimLabels: ["batch=2", "seq_len=4"],
example: "[[101, 2054, 2003, 102], [101, 7592, 999, 102]]",
desc: "Raw token indices from vocabulary"
},
{
name: "Embeddings",
shape: "(batch, seq_len, embed_dim)",
dims: [2, 4, 8],
dimLabels: ["batch=2", "seq=4", "embed=8"],
example: "Each token ID becomes a dense vector",
desc: "Tokens converted to learned dense vectors"
},
{
name: "Attention",
shape: "(batch, heads, seq, seq)",
dims: [2, 2, 4, 4],
dimLabels: ["batch=2", "heads=2", "seq=4", "seq=4"],
example: "Query-Key similarity scores",
desc: "Each head attends from every position to every position"
}
]
currentStage = llmStages[llmStep]
```
```{ojs}
//| echo: false
// LLM tensor shapes visualization
{
const width = 620;
const height = 340;
const theme = diagramTheme;
const svg = d3.create("svg")
.attr("viewBox", `0 0 ${width} ${height}`)
.attr("width", "100%")
.attr("height", height)
.style("max-width", `${width}px`)
.style("font-family", "'JetBrains Mono', 'Fira Code', monospace");
// Background with subtle grid
const defs = svg.append("defs");
// Grid pattern
const patternId = `grid-${Math.random().toString(36).substr(2, 9)}`;
defs.append("pattern")
.attr("id", patternId)
.attr("width", 20)
.attr("height", 20)
.attr("patternUnits", "userSpaceOnUse")
.append("path")
.attr("d", "M 20 0 L 0 0 0 20")
.attr("fill", "none")
.attr("stroke", theme.nodeStroke)
.attr("stroke-width", 0.3)
.attr("opacity", 0.3);
svg.append("rect")
.attr("width", width)
.attr("height", height)
.attr("fill", theme.bg)
.attr("rx", 12);
svg.append("rect")
.attr("width", width)
.attr("height", height)
.attr("fill", `url(#${patternId})`)
.attr("rx", 12);
// Pipeline flow at top
const pipelineY = 35;
const stageWidth = 160;
const stageStartX = (width - stageWidth * 3) / 2;
llmStages.forEach((stage, i) => {
const x = stageStartX + i * stageWidth + stageWidth / 2;
const isActive = i === llmStep;
const isPast = i < llmStep;
// Stage box
svg.append("rect")
.attr("x", x - 65)
.attr("y", pipelineY - 18)
.attr("width", 130)
.attr("height", 36)
.attr("rx", 6)
.attr("fill", isActive ? theme.highlight : isPast ? theme.accent : theme.nodeFill)
.attr("stroke", isActive ? theme.highlight : isPast ? theme.accent : theme.nodeStroke)
.attr("stroke-width", isActive ? 2 : 1)
.attr("filter", isActive ? `drop-shadow(0 0 8px ${theme.highlightGlow})` : "none");
svg.append("text")
.attr("x", x)
.attr("y", pipelineY)
.attr("text-anchor", "middle")
.attr("dominant-baseline", "central")
.attr("fill", isActive || isPast ? theme.textOnHighlight : theme.nodeText)
.attr("font-size", "12px")
.attr("font-weight", "600")
.text(stage.name);
// Arrow to next stage
if (i < 2) {
const arrowX = x + 70;
svg.append("path")
.attr("d", `M${arrowX},${pipelineY} L${arrowX + 20},${pipelineY} L${arrowX + 15},${pipelineY - 5} M${arrowX + 20},${pipelineY} L${arrowX + 15},${pipelineY + 5}`)
.attr("stroke", i < llmStep ? theme.accent : theme.nodeStroke)
.attr("stroke-width", 2)
.attr("fill", "none");
}
});
// Visualization area
const vizY = 160;
const vizGroup = svg.append("g")
.attr("transform", `translate(${width / 2}, ${vizY})`);
if (llmStep === 0) {
// Token IDs: 2D grid of integer tokens
const cellW = 55;
const cellH = 32;
const tokens = [[101, 2054, 2003, 102], [101, 7592, 999, 102]];
const rows = tokens.length;
const cols = tokens[0].length;
// Draw cells
tokens.forEach((row, r) => {
row.forEach((tok, c) => {
const x = (c - cols/2) * cellW + cellW/2;
const y = (r - rows/2) * cellH + cellH/2;
vizGroup.append("rect")
.attr("x", x - cellW/2 + 2)
.attr("y", y - cellH/2 + 2)
.attr("width", cellW - 4)
.attr("height", cellH - 4)
.attr("fill", theme.accent)
.attr("rx", 4)
.attr("filter", `drop-shadow(0 0 4px ${theme.accentGlow})`);
vizGroup.append("text")
.attr("x", x)
.attr("y", y)
.attr("text-anchor", "middle")
.attr("dominant-baseline", "central")
.attr("fill", theme.textOnAccent)
.attr("font-size", "14px")
.attr("font-weight", "500")
.text(tok);
});
});
// Axis labels
vizGroup.append("text")
.attr("x", -cols/2 * cellW - 25)
.attr("y", 0)
.attr("text-anchor", "end")
.attr("dominant-baseline", "central")
.attr("fill", theme.highlight)
.attr("font-size", "11px")
.attr("font-weight", "600")
.text("batch");
vizGroup.append("text")
.attr("x", 0)
.attr("y", -rows/2 * cellH - 15)
.attr("text-anchor", "middle")
.attr("fill", theme.highlight)
.attr("font-size", "11px")
.attr("font-weight", "600")
.text("seq_len");
} else if (llmStep === 1) {
// Embeddings: 3D with depth for embed_dim
const cellW = 50;
const cellH = 28;
const embedSlices = 3; // Show 3 slices of embed dimension
const depthOffset = 12;
const rows = 2;
const cols = 4;
for (let e = embedSlices - 1; e >= 0; e--) {
const layerGroup = vizGroup.append("g")
.attr("transform", `translate(${e * depthOffset - embedSlices * depthOffset/2}, ${-e * depthOffset + embedSlices * depthOffset/2})`);
const opacity = e === 0 ? 1 : 0.4 + e * 0.15;
const layerColor = e === 0 ? theme.accent : theme.nodeFill;
for (let r = 0; r < rows; r++) {
for (let c = 0; c < cols; c++) {
const x = (c - cols/2) * cellW + cellW/2;
const y = (r - rows/2) * cellH + cellH/2;
layerGroup.append("rect")
.attr("x", x - cellW/2 + 2)
.attr("y", y - cellH/2 + 2)
.attr("width", cellW - 4)
.attr("height", cellH - 4)
.attr("fill", layerColor)
.attr("stroke", theme.nodeStroke)
.attr("stroke-width", 0.5)
.attr("rx", 3)
.attr("opacity", opacity);
if (e === 0) {
// Show sample embedding values
const val = (Math.random() * 2 - 1).toFixed(1);
layerGroup.append("text")
.attr("x", x)
.attr("y", y)
.attr("text-anchor", "middle")
.attr("dominant-baseline", "central")
.attr("fill", theme.textOnAccent)
.attr("font-size", "10px")
.text(val);
}
}
}
}
// Dimension labels
vizGroup.append("text")
.attr("x", -cols/2 * cellW - 35)
.attr("y", 10)
.attr("text-anchor", "end")
.attr("fill", theme.highlight)
.attr("font-size", "10px")
.attr("font-weight", "600")
.text("batch");
vizGroup.append("text")
.attr("x", 0)
.attr("y", -rows/2 * cellH - 25)
.attr("text-anchor", "middle")
.attr("fill", theme.highlight)
.attr("font-size", "10px")
.attr("font-weight", "600")
.text("seq");
vizGroup.append("text")
.attr("x", cols/2 * cellW + 40)
.attr("y", -20)
.attr("text-anchor", "start")
.attr("fill", theme.highlight)
.attr("font-size", "10px")
.attr("font-weight", "600")
.text("embed_dim");
// Arrow showing depth
vizGroup.append("path")
.attr("d", `M${cols/2 * cellW + 15},-5 L${cols/2 * cellW + 30},-20`)
.attr("stroke", theme.highlight)
.attr("stroke-width", 1.5)
.attr("fill", "none")
.attr("marker-end", "none");
} else {
// Attention: show two heads side by side, each as seq x seq matrix
const headSize = 70;
const cellSize = headSize / 4;
const headSpacing = 100;
// Draw two attention heads
[-1, 1].forEach((side, headIdx) => {
const headX = side * headSpacing / 2;
const headGroup = vizGroup.append("g")
.attr("transform", `translate(${headX}, 0)`);
// Head label
headGroup.append("text")
.attr("x", 0)
.attr("y", -headSize/2 - 18)
.attr("text-anchor", "middle")
.attr("fill", headIdx === 0 ? theme.highlight : theme.accent)
.attr("font-size", "11px")
.attr("font-weight", "600")
.text(`Head ${headIdx + 1}`);
// Draw 4x4 attention matrix
for (let r = 0; r < 4; r++) {
for (let c = 0; c < 4; c++) {
const x = (c - 2) * cellSize + cellSize/2;
const y = (r - 2) * cellSize + cellSize/2;
// Attention weight (stronger on diagonal and recent tokens for illustration)
const weight = c <= r ? Math.max(0.2, 1 - (r - c) * 0.25) : 0;
const color = headIdx === 0 ? theme.highlight : theme.accent;
headGroup.append("rect")
.attr("x", x - cellSize/2 + 1)
.attr("y", y - cellSize/2 + 1)
.attr("width", cellSize - 2)
.attr("height", cellSize - 2)
.attr("fill", color)
.attr("opacity", weight * 0.8 + 0.1)
.attr("rx", 2);
}
}
// Axis labels for first head only
if (headIdx === 0) {
headGroup.append("text")
.attr("x", -headSize/2 - 12)
.attr("y", 0)
.attr("text-anchor", "end")
.attr("dominant-baseline", "central")
.attr("fill", theme.nodeText)
.attr("font-size", "9px")
.attr("opacity", 0.7)
.text("query");
headGroup.append("text")
.attr("x", 0)
.attr("y", headSize/2 + 14)
.attr("text-anchor", "middle")
.attr("fill", theme.nodeText)
.attr("font-size", "9px")
.attr("opacity", 0.7)
.text("key");
}
});
// Causal mask note
vizGroup.append("text")
.attr("x", 0)
.attr("y", 75)
.attr("text-anchor", "middle")
.attr("fill", theme.nodeText)
.attr("font-size", "10px")
.attr("opacity", 0.7)
.text("Causal mask: can only attend to past tokens");
}
// Info panel at bottom
const infoY = 280;
svg.append("rect")
.attr("x", width/2 - 200)
.attr("y", infoY - 20)
.attr("width", 400)
.attr("height", 50)
.attr("fill", theme.bgSecondary)
.attr("rx", 8)
.attr("opacity", 0.8);
svg.append("text")
.attr("x", width / 2)
.attr("y", infoY)
.attr("text-anchor", "middle")
.attr("fill", theme.highlight)
.attr("font-size", "14px")
.attr("font-weight", "bold")
.text(`Shape: ${currentStage.shape}`);
svg.append("text")
.attr("x", width / 2)
.attr("y", infoY + 20)
.attr("text-anchor", "middle")
.attr("fill", theme.nodeText)
.attr("font-size", "12px")
.text(currentStage.desc);
return svg.node();
}
```
## Tensors Are Just Arrays
Before diving into PyTorch, let's build intuition with NumPy. If you understand NumPy arrays, you already understand 90% of what tensors are.
```{python}
import numpy as np
# A tensor is just a multi-dimensional array of numbers
scalar = np.array(5.0) # 0D: a single number
vector = np.array([1, 2, 3]) # 1D: a list of numbers
matrix = np.array([[1, 2], # 2D: a grid of numbers
[3, 4]])
tensor_3d = np.zeros((2, 3, 4)) # 3D: a stack of grids
print(f"Scalar: shape={scalar.shape}, ndim={scalar.ndim}")
print(f"Vector: shape={vector.shape}, ndim={vector.ndim}")
print(f"Matrix: shape={matrix.shape}, ndim={matrix.ndim}")
print(f"3D Tensor: shape={tensor_3d.shape}, ndim={tensor_3d.ndim}")
```
### Shape and Dtype
Every array has two fundamental properties:
- **Shape**: The size of each dimension `(rows, cols, ...)`
- **Dtype**: The data type of elements (`float32`, `int64`, etc.)
```{python}
# Shape tells you what the data represents
embeddings = np.random.randn(4, 8) # 4 tokens, each with 8-dim embedding
print(f"Shape: {embeddings.shape}")
print(f"Dtype: {embeddings.dtype}")
print(f"Total elements: {embeddings.size}")
print(f"Memory: {embeddings.nbytes} bytes")
```
### Indexing and Slicing
NumPy's powerful indexing carries over to PyTorch unchanged:
```{python}
# Create a batch of sequences
batch = np.arange(24).reshape(2, 3, 4) # (batch=2, seq=3, features=4)
print(f"Full shape: {batch.shape}")
print(f"Original:\n{batch}\n")
# First token of the first sequence (a length-4 feature vector)
print(f"batch[0, 0]: {batch[0, 0]}")
# Get all batches, first token only
print(f"batch[:, 0, :] shape: {batch[:, 0, :].shape}")
# Negative indexing: last element
print(f"batch[0, -1, :]: {batch[0, -1, :]}")
# Boolean indexing
mask = batch > 10
print(f"Elements > 10: {batch[mask]}")
```
## The Core Operations
Neural networks are built from a small set of fundamental operations. Let's understand them in NumPy first, then see the PyTorch equivalents.
### Element-wise Operations
Apply the same operation to every element:
```{python}
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
# NumPy: element-wise arithmetic
print(f"a + b = {a + b}")
print(f"a * b = {a * b}")
print(f"a ** 2 = {a ** 2}")
print(f"np.exp(a) = {np.exp(a)}")
```
PyTorch works identically:
```{python}
import torch
a_pt = torch.tensor([1.0, 2.0, 3.0])
b_pt = torch.tensor([4.0, 5.0, 6.0])
# PyTorch: same operations, same syntax
print(f"a + b = {a_pt + b_pt}")
print(f"a * b = {a_pt * b_pt}")
print(f"a ** 2 = {a_pt ** 2}")
print(f"torch.exp(a) = {torch.exp(a_pt)}")
```
### Matrix Multiplication
Matrix multiplication is the workhorse of neural networks. For an `(m, n)` matrix `A` and an `(n, p)` matrix `B`, the product `C = A @ B` has shape `(m, p)`.
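Entry by entry, each element of `C` is the dot product of a row of `A` with a column of `B`:

$$C_{ij} = \sum_{k=1}^{n} A_{ik}\, B_{kj}$$

In NumPy: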
```{python}
# NumPy matrix multiplication
A = np.array([[1, 2],
[3, 4]]) # (2, 2)
B = np.array([[5, 6],
[7, 8]]) # (2, 2)
# Three equivalent ways
result1 = np.matmul(A, B)
result2 = A @ B
result3 = np.dot(A, B) # same for 2D arrays
print(f"A @ B =\n{result1}")
print(f"Result shape: {result1.shape}")
```
The `@` operator also works for batched operations:
```{python}
# Batched matrix multiplication in NumPy
batch_A = np.random.randn(4, 3, 2) # 4 matrices of shape (3, 2)
batch_B = np.random.randn(4, 2, 5) # 4 matrices of shape (2, 5)
result = batch_A @ batch_B
print(f"Batch matmul: {batch_A.shape} @ {batch_B.shape} = {result.shape}")
```
PyTorch's `@` and `torch.matmul` behave the same:
```{python}
# PyTorch matrix multiplication
A_pt = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
B_pt = torch.tensor([[5.0, 6.0], [7.0, 8.0]])
result_pt = A_pt @ B_pt
print(f"PyTorch A @ B =\n{result_pt}")
# Batched
batch_A_pt = torch.randn(4, 3, 2)
batch_B_pt = torch.randn(4, 2, 5)
print(f"Batched: {(batch_A_pt @ batch_B_pt).shape}")
```
### Broadcasting
When shapes don't match, arrays are automatically expanded to a common shape (without copying any data). This is crucial for adding biases, scaling, and many other operations.
The rules are simple:
1. Align shapes from the **right**
2. Dimensions must be **equal** OR **one of them is 1**
3. Missing dimensions are treated as 1
```{python}
# NumPy broadcasting examples
x = np.ones((4, 3)) # (4, 3)
bias = np.array([1, 2, 3]) # (3,) - broadcasts to (4, 3)
result = x + bias
print(f"Shape {x.shape} + {bias.shape} = {result.shape}")
print(f"Result:\n{result}")
```
```{python}
# More broadcasting examples
batch = np.ones((2, 4, 3)) # (2, 4, 3)
scale = np.array([[[2]]]) # (1, 1, 1) - broadcasts to (2, 4, 3)
vector = np.array([1, 2, 3]) # (3,) - broadcasts to (2, 4, 3)
print(f"{batch.shape} * {scale.shape} = {(batch * scale).shape}")
print(f"{batch.shape} + {vector.shape} = {(batch + vector).shape}")
```
PyTorch broadcasting follows the exact same rules:
```{python}
# PyTorch broadcasting
embeddings = torch.randn(4, 32, 64) # (batch, seq, embed)
bias = torch.randn(64) # (embed,)
result = embeddings + bias # broadcasts!
print(f"PyTorch: {embeddings.shape} + {bias.shape} = {result.shape}")
```
### Key Insight: NumPy to PyTorch
PyTorch tensors are essentially NumPy arrays with superpowers. The API is nearly identical:
| NumPy | PyTorch | Notes |
|-------|---------|-------|
| `np.array([1,2,3])` | `torch.tensor([1,2,3])` | Creation |
| `arr.shape` | `tensor.shape` | Same attribute |
| `arr.dtype` | `tensor.dtype` | Same attribute |
| `np.matmul(a, b)` | `torch.matmul(a, b)` | Or use `@` |
| `np.exp(x)` | `torch.exp(x)` | Element-wise ops |
| `arr.reshape(2,3)` | `tensor.reshape(2,3)` | Reshaping |
| `arr.T` | `tensor.T` | Transpose |
Converting between them is trivial:
```{python}
# NumPy <-> PyTorch conversion
np_array = np.array([1.0, 2.0, 3.0])
pt_tensor = torch.from_numpy(np_array) # Shares memory!
back_to_np = pt_tensor.numpy() # Shares memory!
print(f"NumPy: {np_array}")
print(f"PyTorch: {pt_tensor}")
print(f"Back to NumPy: {back_to_np}")
```
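Because the conversion shares the underlying buffer (for CPU tensors), an in-place change on one side is visible on the other. A quick check:

```{python}
# from_numpy()/.numpy() share memory, so in-place edits propagate both ways
shared_np = np.array([1.0, 2.0, 3.0])
shared_pt = torch.from_numpy(shared_np)

shared_np[0] = 100.0
print(f"PyTorch tensor sees the NumPy change: {shared_pt}")

shared_pt[1] = -1.0
print(f"NumPy array sees the PyTorch change:  {shared_np}")
```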
## Why PyTorch?
If PyTorch is so similar to NumPy, why use it at all? Three reasons:
### 1. GPU Acceleration
NumPy runs on CPU only. PyTorch can move computations to GPU for massive speedups on large matrices:
```{python}
import time
# Create large matrices
size = 2000
np_a = np.random.randn(size, size).astype(np.float32)
np_b = np.random.randn(size, size).astype(np.float32)
# NumPy (CPU)
start = time.time()
np_result = np_a @ np_b
np_time = time.time() - start
print(f"NumPy (CPU): {np_time*1000:.1f} ms")
# PyTorch (CPU for comparison)
pt_a = torch.from_numpy(np_a)
pt_b = torch.from_numpy(np_b)
start = time.time()
pt_result = pt_a @ pt_b
# (no synchronization needed here: CPU matmul returns only when it is done)
pt_cpu_time = time.time() - start
print(f"PyTorch (CPU): {pt_cpu_time*1000:.1f} ms")
# PyTorch (GPU if available)
if torch.backends.mps.is_available() or torch.cuda.is_available():
device = "mps" if torch.backends.mps.is_available() else "cuda"
pt_a_gpu = pt_a.to(device)
pt_b_gpu = pt_b.to(device)
# Warm up: first GPU operation incurs overhead (kernel compilation, memory allocation)
_ = pt_a_gpu @ pt_b_gpu
# Synchronize: GPU operations are async, so we wait for completion before timing
torch.mps.synchronize() if device == "mps" else torch.cuda.synchronize()
start = time.time()
pt_result_gpu = pt_a_gpu @ pt_b_gpu
# Must synchronize again to ensure operation completes before stopping timer
torch.mps.synchronize() if device == "mps" else torch.cuda.synchronize()
pt_gpu_time = time.time() - start
print(f"PyTorch ({device.upper()}): {pt_gpu_time*1000:.1f} ms")
print(f"GPU speedup: {np_time/pt_gpu_time:.1f}x faster")
```
### 2. Automatic Differentiation
PyTorch tracks operations to compute gradients automatically. This is the magic that makes neural networks trainable:
```{python}
# NumPy: you'd have to compute gradients by hand
x_np = np.array([2.0])
y_np = x_np ** 2 + 3 * x_np
# dy/dx = 2x + 3 = 7 at x=2... but you have to derive and code this yourself!
# PyTorch: automatic!
x_pt = torch.tensor([2.0], requires_grad=True)
y_pt = x_pt ** 2 + 3 * x_pt
y_pt.backward() # Compute gradient automatically
print(f"x = {x_pt.item()}")
print(f"y = x^2 + 3x = {y_pt.item()}")
print(f"dy/dx (computed automatically) = {x_pt.grad.item()}")
```
This automatic differentiation scales to millions of parameters. We'll explore this deeply in [Module 02: Autograd](../m02_autograd/lesson.qmd).
### 3. Optimized Kernels
PyTorch dispatches to highly optimized backends (cuBLAS and cuDNN on NVIDIA GPUs, MPS on Apple silicon, oneDNN/MKL on CPU) that are far faster than naive implementations. Operations like convolutions, attention, and batch normalization all have specialized kernels.
```{python}
# PyTorch's optimized softmax vs manual
x = torch.randn(1000, 1000)
# Manual softmax (correct but not optimal)
def manual_softmax(x):
exp_x = torch.exp(x - x.max(dim=-1, keepdim=True).values)
return exp_x / exp_x.sum(dim=-1, keepdim=True)
# PyTorch's optimized version
import time
start = time.time()
for _ in range(100):
_ = manual_softmax(x)
manual_time = time.time() - start
start = time.time()
for _ in range(100):
_ = torch.softmax(x, dim=-1)
pytorch_time = time.time() - start
print(f"Manual softmax: {manual_time*1000:.1f} ms")
print(f"PyTorch softmax: {pytorch_time*1000:.1f} ms")
```
### The Key Insight
**PyTorch tensors are NumPy arrays with superpowers:**
- Same API, same intuition
- GPU acceleration when you need speed
- Automatic gradients when you need to train
- Optimized kernels under the hood
For learning, start with NumPy to understand the concepts. For building real models, use PyTorch for the superpowers.
## Code Walkthrough
Let's explore tensors interactively:
```{python}
import torch
print(f"PyTorch version: {torch.__version__}")
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Device: {device}")
```
### Creating Tensors
```{python}
# From a list
x = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(f"Shape: {x.shape}")
print(f"Dtype: {x.dtype}")
print(f"Device: {x.device}")
print(x)
```
```{python}
# Random tensors (common for initialization)
random_tensor = torch.randn(2, 3, 4) # Normal distribution (mean=0, std=1)
print(f"Random tensor shape: {random_tensor.shape}")
print(f"Mean: {random_tensor.mean():.4f}, Std: {random_tensor.std():.4f}")
```
### Data Types (dtypes)
Choosing the right dtype affects memory usage and numerical precision:
```{python}
# Default is float32 (32 bits = 4 bytes per number)
t32 = torch.randn(1000, 1000)
print(f"float32: {t32.element_size()} bytes per element, total: {t32.numel() * t32.element_size() / 1e6:.1f} MB")
# float16 uses half the memory but lower precision
t16 = torch.randn(1000, 1000, dtype=torch.float16)
print(f"float16: {t16.element_size()} bytes per element, total: {t16.numel() * t16.element_size() / 1e6:.1f} MB")
# bfloat16: same exponent bits as float32 (8 bits) for better dynamic range,
# but fewer mantissa bits than float16, trading precision for stability
tbf16 = torch.randn(1000, 1000, dtype=torch.bfloat16)
print(f"bfloat16: {tbf16.element_size()} bytes per element")
```
**When to use each:**
- **float32**: Default, good for learning and debugging
- **float16**: Inference on GPUs with Tensor Cores, half memory
- **bfloat16**: Training large models, better numerical stability than float16 (see the sketch below)
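A quick illustration of the trade-off, as a minimal sketch: float16's smaller exponent range overflows on values that bfloat16 still represents, while bfloat16 rounds more coarsely.

```{python}
# float16 overflows just above ~65504; bfloat16 keeps float32's exponent range
big = torch.tensor(70000.0)
print(big.to(torch.float16))   # inf: the value does not fit in float16
print(big.to(torch.bfloat16))  # finite, but coarsely rounded
```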
### Reshaping
Reshaping is critical for multi-head attention, where we split the embedding dimension across multiple heads:
```{python}
# Reshape for multi-head attention
batch, seq, embed = 4, 32, 64
num_heads = 8
head_dim = embed // num_heads
x = torch.randn(batch, seq, embed)
print(f"Original: {x.shape}")
# Split into heads
x_heads = x.view(batch, seq, num_heads, head_dim)
print(f"After view: {x_heads.shape}")
# Transpose for attention computation
x_heads = x_heads.transpose(1, 2) # (batch, heads, seq, head_dim)
print(f"After transpose: {x_heads.shape}")
```
### Memory Layout: view vs reshape vs contiguous
Understanding memory layout helps avoid subtle bugs. Tensors store data in a flat 1D array, and **strides** tell PyTorch how many elements to skip to move along each dimension. When operations like transpose change the logical order without moving data, the tensor becomes "non-contiguous" — the strides no longer match a simple row-major layout:
```{python}
# view() requires contiguous memory - it's a zero-copy operation
x = torch.randn(3, 4)
print(f"Original is contiguous: {x.is_contiguous()}")
# Transpose creates a non-contiguous view (same memory, different strides)
x_t = x.transpose(0, 1)
print(f"Transposed is contiguous: {x_t.is_contiguous()}")
# view() fails on non-contiguous tensors
try:
x_t.view(12) # This will fail
except RuntimeError as e:
print(f"Error: {e}")
# contiguous() makes a copy with proper memory layout
x_t_contig = x_t.contiguous()
print(f"After contiguous(): {x_t_contig.is_contiguous()}")
x_t_contig.view(12) # Now it works
print("view() works after contiguous()")
# reshape() handles this automatically (but may copy)
reshaped = x_t.reshape(12) # Always works
print(f"reshape() auto-handles non-contiguous: {reshaped.shape}")
```
**Rule of thumb**: Use `reshape()` unless you specifically need zero-copy behavior.
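To make the stride story concrete, here is a minimal check on a small tensor:

```{python}
# stride() reports how many elements to skip in the flat buffer to move
# one step along each dimension
x = torch.randn(3, 4)
print(x.stride())         # (4, 1): row-major, so the tensor is contiguous

x_t = x.transpose(0, 1)   # same buffer, strides swapped
print(x_t.stride())       # (1, 4): no longer row-major, hence non-contiguous
```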
### Matrix Multiplication
Matrix multiplication is the workhorse of neural networks. The `@` operator (or `torch.matmul`) handles batched operations automatically:
```{python}
# Simulating Q @ K^T in attention
Q = torch.randn(2, 8, 32, 8) # (batch, heads, seq, head_dim)
K = torch.randn(2, 8, 32, 8)
# Attention scores
scores = Q @ K.transpose(-2, -1) # (batch, heads, seq, seq)
print(f"Q shape: {Q.shape}")
print(f"K^T shape: {K.transpose(-2, -1).shape}")
print(f"Scores shape: {scores.shape}")
```
**Key insight**: The last two dimensions follow standard matrix multiplication rules `(m, k) @ (k, n) -> (m, n)`, while leading dimensions are broadcasted/batched.
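Broadcasting of the leading dimensions means a single weight matrix can be applied to every element of a batch without copying it. A minimal sketch:

```{python}
# Leading dimensions broadcast: one (2, 5) matrix multiplies all 4 batch entries
batched = torch.randn(4, 3, 2)   # 4 matrices of shape (3, 2)
weight = torch.randn(2, 5)       # a single matrix
print((batched @ weight).shape)  # (4, 3, 5)
```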
### Softmax: Converting Scores to Probabilities
Softmax takes a vector of arbitrary real numbers (**logits**) and converts them into a **probability distribution** — values between 0 and 1 that sum to 1.
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$
The exponential function amplifies differences — larger values get disproportionately larger weights:
```{python}
import torch
logits = torch.tensor([2.0, 1.0, 0.1])
probs = torch.softmax(logits, dim=-1)
print(f"Logits: {logits.tolist()}")
print(f"Probs: {[f'{p:.3f}' for p in probs.tolist()]}")
print(f"Sum: {probs.sum():.3f}")
```
The highest logit (2.0) gets ~65% of the probability mass. This is how attention weights and next-token predictions work.
**The `dim` parameter matters** — it specifies which dimension sums to 1:
```{python}
# Batch of scores: (batch=2, seq=3)
scores = torch.randn(2, 3)
weights = torch.softmax(scores, dim=-1) # Each row sums to 1
print(f"Row sums: {weights.sum(dim=-1)}") # [1.0, 1.0]
```
**Tip**: Always use `torch.softmax()` — it handles numerical stability automatically. For details on temperature scaling, see [Module 08: Generation](../m08_generation/lesson.qmd). For softmax in attention, see [Module 05: Attention](../m05_attention/lesson.qmd).
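To see why the built-in version matters, here is a small illustration: with large logits, a naive implementation overflows to `inf` and produces NaNs, while `torch.softmax` subtracts the max internally and stays finite.

```{python}
big_logits = torch.tensor([1000.0, 1001.0, 1002.0])

naive = torch.exp(big_logits) / torch.exp(big_logits).sum()
print(f"Naive:  {naive}")                              # nan everywhere
print(f"Stable: {torch.softmax(big_logits, dim=-1)}")  # same as softmax([0, 1, 2])
```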
### Common Operations in LLMs
These operations appear everywhere in transformer models:
```{python}
# Layer Normalization (normalizes features, not batch)
x = torch.randn(4, 32, 64) # (batch, seq, embed)
mean = x.mean(dim=-1, keepdim=True)
std = x.std(dim=-1, keepdim=True)
x_norm = (x - mean) / (std + 1e-5)
print(f"LayerNorm output shape: {x_norm.shape}")
print(f"Mean per token: {x_norm.mean(dim=-1)[0, :3]}") # Should be ~0
# Linear projection (the most common operation)
W = torch.randn(64, 256) # (in_features, out_features)
b = torch.randn(256) # (out_features,)
x = torch.randn(4, 32, 64) # (batch, seq, in_features)
out = x @ W + b # Broadcasting adds bias
print(f"Linear output shape: {out.shape}")
```
### Broadcasting in Action
```{python}
# Adding bias to all tokens in a batch
embeddings = torch.randn(4, 32, 64) # (batch, seq, embed)
bias = torch.randn(64) # (embed,)
result = embeddings + bias # Broadcasts!
print(f"Embeddings: {embeddings.shape}")
print(f"Bias: {bias.shape}")
print(f"Result: {result.shape}")
```
### Device Management (CPU vs GPU)
Moving tensors between devices is essential for GPU acceleration:
```{python}
# Check available devices
print(f"MPS available: {torch.backends.mps.is_available()}")
print(f"CUDA available: {torch.cuda.is_available()}")
# Create tensor on specific device
device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.randn(1000, 1000, device=device)
print(f"Tensor device: {x.device}")
# Move existing tensor to device
y = torch.randn(1000, 1000) # Created on CPU by default
y = y.to(device) # Move to GPU
print(f"After .to(): {y.device}")
```
**Common pitfall**: Operations require tensors on the same device:
```{python}
# This would fail if devices differ:
# z = x_cpu + x_gpu # RuntimeError!
# Always ensure tensors are on the same device
a = torch.randn(100, device=device)
b = torch.randn(100, device=device)
c = a + b # Works!
print(f"Both on {device}: operation succeeded")
```
## Interactive Exploration
Explore how broadcasting aligns tensor shapes. Enter two shapes and see how dimensions are matched and expanded.
```{ojs}
//| echo: false
// Dark mode detection - simple check (reactive via Quarto's built-in mechanism)
isDark = {
// Check if dark mode is active
// This runs at page load and OJS handles reactivity through its own mechanisms
return document.body.classList.contains('quarto-dark') ||
window.matchMedia('(prefers-color-scheme: dark)').matches;
}
// Theme colors
theme = isDark ? {
// Dark mode colors (matching Dark Reader style)
bgPrimary: '#181a1b',
bgSecondary: '#1f2122',
bgTertiary: '#2b2d2e',
textPrimary: '#e8e6e3',
textSecondary: '#c8c6c2',
textMuted: '#a8a6a2',
borderLight: '#3a3c3d',
borderMedium: '#4a4c4d',
// Semantic colors (muted for dark mode)
blue: '#6b8cae',
blueBg: '#1e3a5f',
green: '#5a9a7a',
greenBg: '#1a3d2e',
yellow: '#b89a5a',
yellowBg: '#3d3520',
red: '#a86b6b',
redBg: '#4a2020',
gray: '#4a4c4d',
grayBg: '#2b2d2e',
// Text on colored backgrounds
blueText: '#93c5fd',
greenText: '#86efac',
yellowText: '#fcd34d',
redText: '#fca5a5',
grayText: '#a8a6a2',
successBg: '#1a3d2e',
errorBg: '#4a2020',
successText: '#86efac',
errorText: '#fca5a5'
} : {
// Light mode colors
bgPrimary: '#ffffff',
bgSecondary: '#f9fafb',
bgTertiary: '#f3f4f6',
textPrimary: '#1e293b',
textSecondary: '#475569',
textMuted: '#6b7280',
borderLight: '#e5e7eb',
borderMedium: '#d1d5db',
// Semantic colors
blue: '#3b82f6',
blueBg: '#dbeafe',
green: '#10b981',
greenBg: '#dcfce7',
yellow: '#f59e0b',
yellowBg: '#fef3c7',
red: '#ef4444',
redBg: '#fecaca',
gray: '#9ca3af',
grayBg: '#e5e7eb',
// Text on colored backgrounds
blueText: '#1e40af',
greenText: '#166534',
yellowText: '#92400e',
redText: '#dc2626',
grayText: '#6b7280',
successBg: '#f0fdf4',
errorBg: '#fef2f2',
successText: '#166534',
errorText: '#dc2626'
}
// Parse shape string like "(3, 4, 1)" into array [3, 4, 1]
function parseShape(str) {
const cleaned = str.replace(/[\(\)\[\]]/g, '').trim();
if (!cleaned) return [];
const dims = cleaned.split(',').map(s => parseInt(s.trim())).filter(n => !isNaN(n) && n > 0);
return dims;
}
// Compute broadcast result
function broadcastShapes(shapeA, shapeB) {
const maxLen = Math.max(shapeA.length, shapeB.length);
// Pad shorter shape with 1s on the left
const paddedA = Array(maxLen - shapeA.length).fill(1).concat(shapeA);
const paddedB = Array(maxLen - shapeB.length).fill(1).concat(shapeB);
const result = [];
const dimInfoA = [];
const dimInfoB = [];
let compatible = true;
let errorDim = -1;
for (let i = 0; i < maxLen; i++) {
const a = paddedA[i];
const b = paddedB[i];
if (a === b) {
result.push(a);
dimInfoA.push({ value: a, broadcast: false, padded: i < maxLen - shapeA.length });
dimInfoB.push({ value: b, broadcast: false, padded: i < maxLen - shapeB.length });
} else if (a === 1) {
result.push(b);
dimInfoA.push({ value: a, broadcast: true, padded: i < maxLen - shapeA.length });
dimInfoB.push({ value: b, broadcast: false, padded: i < maxLen - shapeB.length });
} else if (b === 1) {
result.push(a);
dimInfoA.push({ value: a, broadcast: false, padded: i < maxLen - shapeA.length });
dimInfoB.push({ value: b, broadcast: true, padded: i < maxLen - shapeB.length });
} else {
// Incompatible
compatible = false;
errorDim = i;
result.push(null);
dimInfoA.push({ value: a, broadcast: false, error: true, padded: i < maxLen - shapeA.length });
dimInfoB.push({ value: b, broadcast: false, error: true, padded: i < maxLen - shapeB.length });
}
}
return { result, dimInfoA, dimInfoB, compatible, errorDim, paddedA, paddedB };
}
// Format shape for display
function formatShape(dims) {
if (dims.length === 0) return "()";
return "(" + dims.join(", ") + ")";
}
```
```{ojs}
//| echo: false
viewof shapeAInput = Inputs.text({
label: "Shape A",
value: "4, 1, 3",
placeholder: "e.g., 4, 1, 3",
width: 200
})
viewof shapeBInput = Inputs.text({
label: "Shape B",
value: "5, 3",
placeholder: "e.g., 5, 3",
width: 200
})
```
```{ojs}
//| echo: false
// Parse inputs
shapeA = parseShape(shapeAInput)
shapeB = parseShape(shapeBInput)
broadcast = broadcastShapes(shapeA, shapeB)
```
```{ojs}
//| echo: false
// Visualization
html`
<div style="font-family: system-ui; margin: 20px 0; color: ${theme.textPrimary};">
<table style="border-collapse: collapse; width: 100%; max-width: 600px; background: ${theme.bgPrimary};">
<thead>
<tr style="background: ${theme.bgTertiary};">
<th style="padding: 10px; text-align: left; border-bottom: 2px solid ${theme.borderMedium}; color: ${theme.textPrimary};">Tensor</th>
<th style="padding: 10px; text-align: left; border-bottom: 2px solid ${theme.borderMedium}; color: ${theme.textPrimary};">Original Shape</th>
<th style="padding: 10px; text-align: left; border-bottom: 2px solid ${theme.borderMedium}; color: ${theme.textPrimary};">Aligned (padded left)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="padding: 10px; font-weight: bold; color: ${theme.blueText};">A</td>
<td style="padding: 10px; font-family: monospace; color: ${theme.textPrimary};">${formatShape(shapeA)}</td>
<td style="padding: 10px; font-family: monospace;">
${broadcast.dimInfoA.map(d => html`<span style="
display: inline-block;
padding: 4px 8px;
margin: 2px;
border-radius: 4px;
background: ${d.error ? theme.redBg : d.broadcast ? theme.yellowBg : d.padded ? theme.grayBg : theme.blueBg};
border: 1px solid ${d.error ? theme.red : d.broadcast ? theme.yellow : d.padded ? theme.gray : theme.blue};
color: ${d.error ? theme.redText : d.broadcast ? theme.yellowText : d.padded ? theme.grayText : theme.blueText};
">${d.value}</span>`)}
</td>
</tr>
<tr>
<td style="padding: 10px; font-weight: bold; color: ${theme.greenText};">B</td>
<td style="padding: 10px; font-family: monospace; color: ${theme.textPrimary};">${formatShape(shapeB)}</td>
<td style="padding: 10px; font-family: monospace;">
${broadcast.dimInfoB.map(d => html`<span style="
display: inline-block;
padding: 4px 8px;
margin: 2px;
border-radius: 4px;
background: ${d.error ? theme.redBg : d.broadcast ? theme.yellowBg : d.padded ? theme.grayBg : theme.greenBg};
border: 1px solid ${d.error ? theme.red : d.broadcast ? theme.yellow : d.padded ? theme.gray : theme.green};
color: ${d.error ? theme.redText : d.broadcast ? theme.yellowText : d.padded ? theme.grayText : theme.greenText};
">${d.value}</span>`)}
</td>
</tr>
<tr style="background: ${broadcast.compatible ? theme.successBg : theme.errorBg};">
<td style="padding: 10px; font-weight: bold; color: ${theme.textPrimary};" colspan="2">Result</td>
<td style="padding: 10px; font-family: monospace; font-weight: bold;">
${broadcast.compatible
? html`<span style="color: ${theme.successText};">${formatShape(broadcast.result)}</span>`
: html`<span style="color: ${theme.errorText};">❌ Incompatible shapes</span>`
}
</td>
</tr>
</tbody>
</table>
<div style="margin-top: 15px; font-size: 13px;">
<span style="display: inline-block; padding: 2px 8px; background: ${theme.blueBg}; color: ${theme.blueText}; border-radius: 4px; margin-right: 8px;">Original</span>
<span style="display: inline-block; padding: 2px 8px; background: ${theme.grayBg}; color: ${theme.grayText}; border-radius: 4px; margin-right: 8px;">Padded (1)</span>
<span style="display: inline-block; padding: 2px 8px; background: ${theme.yellowBg}; color: ${theme.yellowText}; border-radius: 4px; margin-right: 8px;">Broadcast</span>
<span style="display: inline-block; padding: 2px 8px; background: ${theme.redBg}; color: ${theme.redText}; border-radius: 4px;">Error</span>
</div>
</div>
`
```
```{ojs}
//| echo: false
// Explanation
broadcast.compatible
? md`**Broadcasting succeeded!** The result shape ${formatShape(broadcast.result)} is computed by taking the maximum of each aligned dimension.`
: md`**Broadcasting failed!** Dimension ${broadcast.errorDim} has incompatible sizes: ${broadcast.paddedA[broadcast.errorDim]} vs ${broadcast.paddedB[broadcast.errorDim]}. For broadcasting to work, dimensions must be equal or one must be 1.`
```
::: {.callout-tip}
## Try This
1. **Simple broadcast**: Try `(4, 1)` and `(1, 3)`. Both have a 1, so they broadcast to `(4, 3)`.
2. **Scalar-like broadcast**: Try `(3, 4)` and `(1)`. A single-element tensor broadcasts to any shape.
3. **Same shapes**: Try `(2, 3)` and `(2, 3)`. No broadcasting needed - shapes are identical.
4. **Incompatible shapes**: Try `(3, 4)` and `(2, 4)`. The first dimension (3 vs 2) can't broadcast because neither is 1.
5. **Real-world example**: Try `(32, 10, 64)` (batch of sequences) and `(64)` (a bias vector). The bias broadcasts across batch and sequence dimensions.
:::
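Each of these can be verified directly in PyTorch. A quick check of examples 1 and 5 from the list above:

```{python}
# Example 1: (4, 1) + (1, 3) broadcasts to (4, 3)
print((torch.ones(4, 1) + torch.ones(1, 3)).shape)

# Example 5: a (64,) bias broadcasts across batch and sequence dimensions
print((torch.ones(32, 10, 64) + torch.ones(64)).shape)
```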
## Exercises
### Exercise 1: Create an Embedding Lookup
```{python}
# Create a vocabulary embedding table
vocab_size = 100
embed_dim = 32
embedding_table = torch.randn(vocab_size, embed_dim)
print(f"Embedding table: {embedding_table.shape}")
# Look up embeddings for token IDs
token_ids = torch.tensor([5, 23, 7, 42])
embeddings = embedding_table[token_ids]
print(f"Token IDs: {token_ids}")
print(f"Embeddings shape: {embeddings.shape}")
```
### Exercise 2: Simulate Simple Attention
```{python}
import matplotlib.pyplot as plt
seq_len = 6
embed_dim = 8
# Token embeddings
tokens = torch.randn(seq_len, embed_dim)
# Compute attention scores (dot product similarity)
scores = tokens @ tokens.T
print(f"Attention scores shape: {scores.shape}")
# Apply softmax to get weights
attention_weights = torch.softmax(scores, dim=-1)
# Visualize
plt.figure(figsize=(6, 5))
plt.imshow(attention_weights.detach().numpy(), cmap='Blues')
plt.colorbar(label='Attention Weight')
plt.title('Simple Attention Pattern')
plt.xlabel('Key Token')
plt.ylabel('Query Token')
plt.show()
```
### Exercise 3: Apply Attention
```{python}
# Weighted combination of values
output = attention_weights @ tokens
print(f"Input: {tokens.shape}")
print(f"Weights: {attention_weights.shape}")
print(f"Output: {output.shape}")
print("\nEach output token is a weighted average of ALL input tokens!")
```
## Common Pitfalls
Before moving on, be aware of these common mistakes:
1. **Shape mismatches**: Always print shapes when debugging. Most errors come from unexpected dimensions.
2. **Device mismatches**: Tensors must be on the same device for operations. Use `.to(device)` consistently.
3. **In-place operations**: Methods ending in `_` (like `add_()`) modify tensors in-place, which can break gradient computation (see the sketch after this list).
4. **Forgetting contiguous()**: After transpose/permute, call `.contiguous()` before `.view()`.
5. **dtype mismatches**: Operations between float32 and float16 may silently upcast or fail.
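As a small illustration of pitfall 3 (a minimal sketch; autograd itself is covered in the next module): `torch.exp` saves its output for the backward pass, so editing that output in place makes the gradient computation fail.

```{python}
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = torch.exp(x)   # exp saves its output for backward
y.add_(1.0)        # in-place edit invalidates the saved value
try:
    y.sum().backward()
except RuntimeError as e:
    print(f"Error: {e}")
```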
## Summary
Key takeaways:
1. **Tensors are multi-dimensional arrays** - their shape tells you what they represent
2. **Broadcasting** automatically expands smaller tensors to match larger ones
3. **Matrix multiplication** is the core operation - inner dimensions must match
4. **Reshaping** reorganizes dimensions without changing total elements
5. **Memory layout matters** - understand contiguous vs strided for efficient operations
6. **Device placement** - use GPU (MPS/CUDA) for 10-100x speedup on large tensors
7. **Data types** - float32 for learning, float16/bfloat16 for production
## What's Next
In [Module 02: Autograd](../m02_autograd/lesson.qmd), we'll see how PyTorch automatically computes gradients through all these operations - the magic that makes neural networks trainable.