Module 01: Tensors

Introduction

Tensors are the foundation of everything in deep learning. Before we can build a language model, we need to understand tensors - the data structure that holds all our numbers.

A tensor is a multi-dimensional array of numbers. If you’ve used NumPy arrays, you already know what tensors are - PyTorch tensors are nearly identical, but with GPU acceleration and automatic differentiation built in.

Why do we need them for LLMs?

  • Text becomes numbers: Every word/token gets converted to a list of numbers (a vector)
  • Batching: We process multiple sequences at once for efficiency
  • Matrix operations: Attention, embeddings, and neural network layers are all matrix multiplications

What You’ll Learn

By the end of this module, you will be able to:

  • Understand tensor shapes and what each dimension represents
  • Perform element-wise operations, matrix multiplication, and broadcasting
  • Convert between NumPy arrays and PyTorch tensors
  • Move tensors between CPU and GPU for acceleration
  • Recognize common LLM tensor shapes and their meanings

Tensor Dimensions

Think of tensors by their number of dimensions. The shape of a tensor tells you what it represents. In an LLM (a short sketch follows this list):

  • (vocab_size,) - A 1D tensor: scores for each word in vocabulary
  • (seq_len, embed_dim) - A 2D tensor: one embedding vector per token
  • (batch, seq_len, embed_dim) - A 3D tensor: multiple sequences at once
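
A quick sketch of these three shapes (the sizes here - a 50k-word vocabulary, 16-token sequences, 8-dimensional embeddings, a batch of 4 - are arbitrary):

import torch

vocab_scores = torch.randn(50_000)      # (vocab_size,): one score per vocabulary word
token_embeds = torch.randn(16, 8)       # (seq_len, embed_dim): one embedding per token
batch_embeds = torch.randn(4, 16, 8)    # (batch, seq_len, embed_dim): several sequences at once

print(vocab_scores.shape, token_embeds.shape, batch_embeds.shape)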


Tensors Are Just Arrays

Before diving into PyTorch, let’s build intuition with NumPy. If you understand NumPy arrays, you already understand 90% of what tensors are.

import numpy as np

# A tensor is just a multi-dimensional array of numbers
scalar = np.array(5.0)           # 0D: a single number
vector = np.array([1, 2, 3])     # 1D: a list of numbers
matrix = np.array([[1, 2],       # 2D: a grid of numbers
                   [3, 4]])
tensor_3d = np.zeros((2, 3, 4))  # 3D: a stack of grids

print(f"Scalar: shape={scalar.shape}, ndim={scalar.ndim}")
print(f"Vector: shape={vector.shape}, ndim={vector.ndim}")
print(f"Matrix: shape={matrix.shape}, ndim={matrix.ndim}")
print(f"3D Tensor: shape={tensor_3d.shape}, ndim={tensor_3d.ndim}")

Shape and Dtype

Every array has two fundamental properties:

  • Shape: The size of each dimension (rows, cols, ...)
  • Dtype: The data type of elements (float32, int64, etc.)

# Shape tells you what the data represents
embeddings = np.random.randn(4, 8)  # 4 tokens, each with 8-dim embedding
print(f"Shape: {embeddings.shape}")
print(f"Dtype: {embeddings.dtype}")
print(f"Total elements: {embeddings.size}")
print(f"Memory: {embeddings.nbytes} bytes")

Indexing and Slicing

NumPy’s powerful indexing works identically in PyTorch:

# Create a batch of sequences
batch = np.arange(24).reshape(2, 3, 4)  # (batch=2, seq=3, features=4)
print(f"Full shape: {batch.shape}")
print(f"Original:\n{batch}\n")

# Get the first token of the first sequence
print(f"batch[0, 0]: {batch[0, 0]}")

# First token of every sequence in the batch
print(f"batch[:, 0, :] shape: {batch[:, 0, :].shape}")

# Negative indexing: last element
print(f"batch[0, -1, :]: {batch[0, -1, :]}")

# Boolean indexing
mask = batch > 10
print(f"Elements > 10: {batch[mask]}")

The Core Operations

Neural networks are built from a small set of fundamental operations. Let’s understand them in NumPy first, then see the PyTorch equivalents.

Element-wise Operations

Apply the same operation to every element:

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# NumPy: element-wise arithmetic
print(f"a + b = {a + b}")
print(f"a * b = {a * b}")
print(f"a ** 2 = {a ** 2}")
print(f"np.exp(a) = {np.exp(a)}")

PyTorch works identically:

import torch

a_pt = torch.tensor([1.0, 2.0, 3.0])
b_pt = torch.tensor([4.0, 5.0, 6.0])

# PyTorch: same operations, same syntax
print(f"a + b = {a_pt + b_pt}")
print(f"a * b = {a_pt * b_pt}")
print(f"a ** 2 = {a_pt ** 2}")
print(f"torch.exp(a) = {torch.exp(a_pt)}")

Matrix Multiplication

The workhorse of neural networks. For matrices A (m x n) and B (n x p), the result C = A @ B is (m x p):

# NumPy matrix multiplication
A = np.array([[1, 2],
              [3, 4]])   # (2, 2)
B = np.array([[5, 6],
              [7, 8]])   # (2, 2)

# Three equivalent ways
result1 = np.matmul(A, B)
result2 = A @ B
result3 = np.dot(A, B)  # same for 2D arrays

print(f"A @ B =\n{result1}")
print(f"Result shape: {result1.shape}")

The @ operator also works for batched operations:

# Batched matrix multiplication in NumPy
batch_A = np.random.randn(4, 3, 2)  # 4 matrices of shape (3, 2)
batch_B = np.random.randn(4, 2, 5)  # 4 matrices of shape (2, 5)

result = batch_A @ batch_B
print(f"Batch matmul: {batch_A.shape} @ {batch_B.shape} = {result.shape}")

PyTorch’s @ and torch.matmul behave the same:

# PyTorch matrix multiplication
A_pt = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
B_pt = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

result_pt = A_pt @ B_pt
print(f"PyTorch A @ B =\n{result_pt}")

# Batched
batch_A_pt = torch.randn(4, 3, 2)
batch_B_pt = torch.randn(4, 2, 5)
print(f"Batched: {(batch_A_pt @ batch_B_pt).shape}")

Broadcasting

When shapes don’t match, arrays are automatically expanded. This is crucial for adding biases, scaling, and many other operations.

The rules are simple:

  1. Align shapes from the right
  2. Dimensions must be equal OR one of them is 1
  3. Missing dimensions are treated as 1

# NumPy broadcasting examples
x = np.ones((4, 3))       # (4, 3)
bias = np.array([1, 2, 3]) # (3,) - broadcasts to (4, 3)

result = x + bias
print(f"Shape {x.shape} + {bias.shape} = {result.shape}")
print(f"Result:\n{result}")
# More broadcasting examples
batch = np.ones((2, 4, 3))  # (2, 4, 3)
scale = np.array([[[2]]])   # (1, 1, 1) - broadcasts to (2, 4, 3)
vector = np.array([1, 2, 3]) # (3,) - broadcasts to (2, 4, 3)

print(f"{batch.shape} * {scale.shape} = {(batch * scale).shape}")
print(f"{batch.shape} + {vector.shape} = {(batch + vector).shape}")

PyTorch broadcasting follows the exact same rules:

# PyTorch broadcasting
embeddings = torch.randn(4, 32, 64)  # (batch, seq, embed)
bias = torch.randn(64)               # (embed,)

result = embeddings + bias  # broadcasts!
print(f"PyTorch: {embeddings.shape} + {bias.shape} = {result.shape}")

Key Insight: NumPy to PyTorch

PyTorch tensors are essentially NumPy arrays with superpowers. The API is nearly identical:

  NumPy                PyTorch                 Notes
  np.array([1,2,3])    torch.tensor([1,2,3])   Creation
  arr.shape            tensor.shape            Same attribute
  arr.dtype            tensor.dtype            Same attribute
  np.matmul(a, b)      torch.matmul(a, b)      Or use @
  np.exp(x)            torch.exp(x)            Element-wise ops
  arr.reshape(2,3)     tensor.reshape(2,3)     Reshaping
  arr.T                tensor.T                Transpose

Converting between them is trivial:

# NumPy <-> PyTorch conversion
np_array = np.array([1.0, 2.0, 3.0])
pt_tensor = torch.from_numpy(np_array)  # Shares memory!
back_to_np = pt_tensor.numpy()          # Shares memory!

print(f"NumPy: {np_array}")
print(f"PyTorch: {pt_tensor}")
print(f"Back to NumPy: {back_to_np}")

Why PyTorch?

If PyTorch is so similar to NumPy, why use it at all? Three reasons:

1. GPU Acceleration

NumPy runs on CPU only. PyTorch can move computations to GPU for massive speedups on large matrices:

import time

# Create large matrices
size = 2000
np_a = np.random.randn(size, size).astype(np.float32)
np_b = np.random.randn(size, size).astype(np.float32)

# NumPy (CPU)
start = time.time()
np_result = np_a @ np_b
np_time = time.time() - start
print(f"NumPy (CPU): {np_time*1000:.1f} ms")

# PyTorch (CPU for comparison)
pt_a = torch.from_numpy(np_a)
pt_b = torch.from_numpy(np_b)
start = time.time()
pt_result = pt_a @ pt_b
pt_cpu_time = time.time() - start
print(f"PyTorch (CPU): {pt_cpu_time*1000:.1f} ms")

# PyTorch (GPU if available)
if torch.backends.mps.is_available() or torch.cuda.is_available():
    device = "mps" if torch.backends.mps.is_available() else "cuda"
    pt_a_gpu = pt_a.to(device)
    pt_b_gpu = pt_b.to(device)

    # Warm up: the first GPU operation incurs extra overhead (kernel compilation, memory allocation)
    _ = pt_a_gpu @ pt_b_gpu
    # GPU operations run asynchronously, so wait for completion before starting the timer
    sync = torch.mps.synchronize if device == "mps" else torch.cuda.synchronize
    sync()

    start = time.time()
    pt_result_gpu = pt_a_gpu @ pt_b_gpu
    sync()  # Wait for the matmul to finish before stopping the timer
    pt_gpu_time = time.time() - start
    print(f"PyTorch ({device.upper()}): {pt_gpu_time*1000:.1f} ms")
    print(f"GPU speedup: {np_time/pt_gpu_time:.1f}x faster")

2. Automatic Differentiation

PyTorch tracks operations to compute gradients automatically. This is the magic that makes neural networks trainable:

# NumPy: you'd have to compute gradients by hand
x_np = np.array([2.0])
y_np = x_np ** 2 + 3 * x_np
# dy/dx = 2x + 3 = 7 at x=2... but you have to derive and code this yourself!

# PyTorch: automatic!
x_pt = torch.tensor([2.0], requires_grad=True)
y_pt = x_pt ** 2 + 3 * x_pt
y_pt.backward()  # Compute gradient automatically
print(f"x = {x_pt.item()}")
print(f"y = x^2 + 3x = {y_pt.item()}")
print(f"dy/dx (computed automatically) = {x_pt.grad.item()}")

This automatic differentiation scales to millions of parameters. We’ll explore this deeply in Module 02: Autograd.

3. Optimized Kernels

PyTorch uses highly optimized backends (cuBLAS, cuDNN, MPS) that are faster than naive implementations, even on CPU. Operations like convolutions, attention, and batch normalization have specialized implementations.

# PyTorch's optimized softmax vs manual
x = torch.randn(1000, 1000)

# Manual softmax (correct but not optimal)
def manual_softmax(x):
    exp_x = torch.exp(x - x.max(dim=-1, keepdim=True).values)
    return exp_x / exp_x.sum(dim=-1, keepdim=True)

# PyTorch's optimized version
import time

start = time.time()
for _ in range(100):
    _ = manual_softmax(x)
manual_time = time.time() - start

start = time.time()
for _ in range(100):
    _ = torch.softmax(x, dim=-1)
pytorch_time = time.time() - start

print(f"Manual softmax: {manual_time*1000:.1f} ms")
print(f"PyTorch softmax: {pytorch_time*1000:.1f} ms")

The Key Insight

PyTorch tensors are NumPy arrays with superpowers:

  • Same API, same intuition
  • GPU acceleration when you need speed
  • Automatic gradients when you need to train
  • Optimized kernels under the hood

For learning, start with NumPy to understand the concepts. For building real models, use PyTorch for the superpowers.

Code Walkthrough

Let’s explore tensors interactively:

import torch

print(f"PyTorch version: {torch.__version__}")
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Device: {device}")

Creating Tensors

# From a list
x = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(f"Shape: {x.shape}")
print(f"Dtype: {x.dtype}")
print(f"Device: {x.device}")
print(x)

# Random tensors (common for initialization)
random_tensor = torch.randn(2, 3, 4)  # Normal distribution (mean=0, std=1)
print(f"Random tensor shape: {random_tensor.shape}")
print(f"Mean: {random_tensor.mean():.4f}, Std: {random_tensor.std():.4f}")

Data Types (dtypes)

Choosing the right dtype affects memory usage and numerical precision:

# Default is float32 (32 bits = 4 bytes per number)
t32 = torch.randn(1000, 1000)
print(f"float32: {t32.element_size()} bytes per element, total: {t32.numel() * t32.element_size() / 1e6:.1f} MB")

# float16 uses half the memory but lower precision
t16 = torch.randn(1000, 1000, dtype=torch.float16)
print(f"float16: {t16.element_size()} bytes per element, total: {t16.numel() * t16.element_size() / 1e6:.1f} MB")

# bfloat16: same exponent bits as float32 (8 bits) for better dynamic range,
# but fewer mantissa bits than float16, trading precision for stability
tbf16 = torch.randn(1000, 1000, dtype=torch.bfloat16)
print(f"bfloat16: {tbf16.element_size()} bytes per element")

When to use each (a casting sketch follows this list):

  • float32: Default, good for learning and debugging
  • float16: Inference on GPUs with Tensor Cores, half memory
  • bfloat16: Training large models, better numerical stability than float16
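
Casting between dtypes is done with .to(); a minimal sketch, also showing that mixed-precision arithmetic promotes to the wider dtype:

x = torch.randn(4, 4)             # float32 by default
x_fp16 = x.to(torch.float16)
x_bf16 = x.to(torch.bfloat16)
print(x.dtype, x_fp16.dtype, x_bf16.dtype)

# Mixing dtypes silently promotes to the wider type (here float32)
y = x + x_fp16
print(f"float32 + float16 -> {y.dtype}")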

Reshaping

Reshaping is critical for multi-head attention, where we split the embedding dimension across multiple heads:

# Reshape for multi-head attention
batch, seq, embed = 4, 32, 64
num_heads = 8
head_dim = embed // num_heads

x = torch.randn(batch, seq, embed)
print(f"Original: {x.shape}")

# Split into heads
x_heads = x.view(batch, seq, num_heads, head_dim)
print(f"After view: {x_heads.shape}")

# Transpose for attention computation
x_heads = x_heads.transpose(1, 2)  # (batch, heads, seq, head_dim)
print(f"After transpose: {x_heads.shape}")

Memory Layout: view vs reshape vs contiguous

Understanding memory layout helps avoid subtle bugs. Tensors store data in a flat 1D array, and strides tell PyTorch how many elements to skip to move along each dimension. When operations like transpose change the logical order without moving data, the tensor becomes “non-contiguous” — the strides no longer match a simple row-major layout:

# view() requires contiguous memory - it's a zero-copy operation
x = torch.randn(3, 4)
print(f"Original is contiguous: {x.is_contiguous()}")

# Transpose creates a non-contiguous view (same memory, different strides)
x_t = x.transpose(0, 1)
print(f"Transposed is contiguous: {x_t.is_contiguous()}")

# view() fails on non-contiguous tensors
try:
    x_t.view(12)  # This will fail
except RuntimeError as e:
    print(f"Error: {e}")

# contiguous() makes a copy with proper memory layout
x_t_contig = x_t.contiguous()
print(f"After contiguous(): {x_t_contig.is_contiguous()}")
x_t_contig.view(12)  # Now it works
print("view() works after contiguous()")

# reshape() handles this automatically (but may copy)
reshaped = x_t.reshape(12)  # Always works
print(f"reshape() auto-handles non-contiguous: {reshaped.shape}")

Rule of thumb: Use reshape() unless you specifically need zero-copy behavior.

Matrix Multiplication

Matrix multiplication is the workhorse of neural networks. The @ operator (or torch.matmul) handles batched operations automatically:

# Simulating Q @ K^T in attention
Q = torch.randn(2, 8, 32, 8)  # (batch, heads, seq, head_dim)
K = torch.randn(2, 8, 32, 8)

# Attention scores
scores = Q @ K.transpose(-2, -1)  # (batch, heads, seq, seq)
print(f"Q shape: {Q.shape}")
print(f"K^T shape: {K.transpose(-2, -1).shape}")
print(f"Scores shape: {scores.shape}")

Key insight: The last two dimensions follow standard matrix multiplication rules (m, k) @ (k, n) -> (m, n), while the leading dimensions are broadcast/batched, as the sketch below shows.
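
The leading dimensions follow the broadcasting rules from earlier, so they don't have to match exactly; a small sketch with arbitrary sizes:

# A 2D weight matrix is broadcast across all leading (batch, heads) dimensions
x = torch.randn(2, 8, 32, 16)      # (batch, heads, seq, head_dim)
W = torch.randn(16, 16)
print((x @ W).shape)               # torch.Size([2, 8, 32, 16])

# Size-1 leading dimensions broadcast against each other
A = torch.randn(1, 8, 32, 16)
B = torch.randn(4, 1, 16, 32)
print((A @ B).shape)               # torch.Size([4, 8, 32, 32])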

Softmax: Converting Scores to Probabilities

Softmax takes a vector of arbitrary real numbers (logits) and converts them into a probability distribution — values between 0 and 1 that sum to 1.

\[\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}\]

The exponential function amplifies differences — larger values get disproportionately larger weights:

import torch

logits = torch.tensor([2.0, 1.0, 0.1])
probs = torch.softmax(logits, dim=-1)

print(f"Logits: {logits.tolist()}")
print(f"Probs:  {[f'{p:.3f}' for p in probs.tolist()]}")
print(f"Sum:    {probs.sum():.3f}")

The highest logit (2.0) gets ~65% of the probability mass. This is how attention weights and next-token predictions work.

The dim parameter matters — it specifies which dimension sums to 1:

# Batch of scores: (batch=2, seq=3)
scores = torch.randn(2, 3)
weights = torch.softmax(scores, dim=-1)  # Each row sums to 1

print(f"Row sums: {weights.sum(dim=-1)}")  # [1.0, 1.0]

Tip: Always use torch.softmax() — it handles numerical stability automatically. For details on temperature scaling, see Module 08: Generation. For softmax in attention, see Module 05: Attention.
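
To see the stability issue concretely: exponentiating large logits directly overflows to inf, while torch.softmax() subtracts the maximum first (the same trick as manual_softmax above). A quick sketch:

big_logits = torch.tensor([1000.0, 999.0, 998.0])

naive = torch.exp(big_logits) / torch.exp(big_logits).sum()  # exp(1000.) overflows to inf
print(f"Naive:  {naive}")                                    # tensor([nan, nan, nan])

stable = torch.softmax(big_logits, dim=-1)
print(f"Stable: {stable}")                                   # ~[0.665, 0.245, 0.090]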

Common Operations in LLMs

These operations appear everywhere in transformer models:

# Layer Normalization (normalizes features, not batch)
x = torch.randn(4, 32, 64)  # (batch, seq, embed)
mean = x.mean(dim=-1, keepdim=True)
std = x.std(dim=-1, keepdim=True)
x_norm = (x - mean) / (std + 1e-5)  # simplified; nn.LayerNorm divides by sqrt(var + eps)
print(f"LayerNorm output shape: {x_norm.shape}")
print(f"Mean per token: {x_norm.mean(dim=-1)[0, :3]}")  # Should be ~0

# Linear projection (the most common operation)
W = torch.randn(64, 256)  # (in_features, out_features)
b = torch.randn(256)       # (out_features,)
x = torch.randn(4, 32, 64) # (batch, seq, in_features)
out = x @ W + b            # Broadcasting adds bias
print(f"Linear output shape: {out.shape}")

Broadcasting in Action

# Adding bias to all tokens in a batch
embeddings = torch.randn(4, 32, 64)  # (batch, seq, embed)
bias = torch.randn(64)               # (embed,)

result = embeddings + bias  # Broadcasts!
print(f"Embeddings: {embeddings.shape}")
print(f"Bias: {bias.shape}")
print(f"Result: {result.shape}")

Device Management (CPU vs GPU)

Moving tensors between devices is essential for GPU acceleration:

# Check available devices
print(f"MPS available: {torch.backends.mps.is_available()}")
print(f"CUDA available: {torch.cuda.is_available()}")

# Create tensor on specific device
device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.randn(1000, 1000, device=device)
print(f"Tensor device: {x.device}")

# Move existing tensor to device
y = torch.randn(1000, 1000)  # Created on CPU by default
y = y.to(device)             # Move to GPU
print(f"After .to(): {y.device}")

Common pitfall: Operations require tensors on the same device:

# This would fail if devices differ:
# z = x_cpu + x_gpu  # RuntimeError!

# Always ensure tensors are on the same device
a = torch.randn(100, device=device)
b = torch.randn(100, device=device)
c = a + b  # Works!
print(f"Both on {device}: operation succeeded")

Interactive Exploration

Explore how broadcasting aligns tensor shapes: pick two shapes and work out how their dimensions are matched and expanded (a code check follows the list below).

Tip: Try This
  1. Simple broadcast: Try (4, 1) and (1, 3). Both have a 1, so they broadcast to (4, 3).

  2. Scalar broadcast: Try (3, 4) and (1). A scalar broadcasts to any shape.

  3. Same shapes: Try (2, 3) and (2, 3). No broadcasting needed - shapes are identical.

  4. Incompatible shapes: Try (3, 4) and (2, 4). The first dimension (3 vs 2) can’t broadcast because neither is 1.

  5. Real-world example: Try (32, 10, 64) (batch of sequences) and (64) (a bias vector). The bias broadcasts across batch and sequence dimensions.
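
If you want to check shapes in code rather than by hand, torch.broadcast_shapes() applies the same rules and raises an error for incompatible shapes:

# The cases from the list above, checked programmatically
print(torch.broadcast_shapes((4, 1), (1, 3)))       # torch.Size([4, 3])
print(torch.broadcast_shapes((3, 4), (1,)))         # torch.Size([3, 4])
print(torch.broadcast_shapes((32, 10, 64), (64,)))  # torch.Size([32, 10, 64])

try:
    torch.broadcast_shapes((3, 4), (2, 4))
except RuntimeError as e:
    print(f"Incompatible: {e}")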

Exercises

Exercise 1: Create an Embedding Lookup

# Create a vocabulary embedding table
vocab_size = 100
embed_dim = 32

embedding_table = torch.randn(vocab_size, embed_dim)
print(f"Embedding table: {embedding_table.shape}")

# Look up embeddings for token IDs
token_ids = torch.tensor([5, 23, 7, 42])
embeddings = embedding_table[token_ids]
print(f"Token IDs: {token_ids}")
print(f"Embeddings shape: {embeddings.shape}")

Exercise 2: Simulate Simple Attention

import matplotlib.pyplot as plt

seq_len = 6
embed_dim = 8

# Token embeddings
tokens = torch.randn(seq_len, embed_dim)

# Compute attention scores (dot product similarity)
scores = tokens @ tokens.T
print(f"Attention scores shape: {scores.shape}")

# Apply softmax to get weights
attention_weights = torch.softmax(scores, dim=-1)

# Visualize
plt.figure(figsize=(6, 5))
plt.imshow(attention_weights.detach().numpy(), cmap='Blues')
plt.colorbar(label='Attention Weight')
plt.title('Simple Attention Pattern')
plt.xlabel('Key Token')
plt.ylabel('Query Token')
plt.show()

Exercise 3: Apply Attention

# Weighted combination of values
output = attention_weights @ tokens
print(f"Input: {tokens.shape}")
print(f"Weights: {attention_weights.shape}")
print(f"Output: {output.shape}")
print("\nEach output token is a weighted average of ALL input tokens!")

Common Pitfalls

Before moving on, be aware of these common mistakes:

  1. Shape mismatches: Always print shapes when debugging. Most errors come from unexpected dimensions.
  2. Device mismatches: Tensors must be on the same device for operations. Use .to(device) consistently.
  3. In-place operations: Methods ending in _ (like add_()) modify tensors in-place, which can break gradient computation - see the sketch after this list.
  4. Forgetting contiguous(): After transpose/permute, call .contiguous() before .view().
  5. dtype mismatches: Operations between float32 and float16 may silently upcast or fail.
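
A minimal sketch of pitfall 3: torch.exp() saves its output for the backward pass, so modifying that output in place makes the subsequent backward() fail:

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = torch.exp(x)   # autograd saves y to compute the gradient of exp
y.add_(1.0)        # in-place modification of a tensor autograd still needs
try:
    y.sum().backward()
except RuntimeError as e:
    print(f"Autograd error: {e}")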

Summary

Key takeaways:

  1. Tensors are multi-dimensional arrays - their shape tells you what they represent
  2. Broadcasting automatically expands smaller tensors to match larger ones
  3. Matrix multiplication is the core operation - inner dimensions must match
  4. Reshaping reorganizes dimensions without changing total elements
  5. Memory layout matters - understand contiguous vs strided for efficient operations
  6. Device placement - use GPU (MPS/CUDA) for 10-100x speedup on large tensors
  7. Data types - float32 for learning, float16/bfloat16 for production

What’s Next

In Module 02: Autograd, we’ll see how PyTorch automatically computes gradients through all these operations - the magic that makes neural networks trainable.