🎯 Automatic Differentiation: PyTorch’s Autograd Core
The Magic Behind Learning Machines
Imagine you’re a detective trying to find hidden treasure. You have a map, but the map only tells you “you’re getting warmer” or “you’re getting colder.” How do you find the treasure? You take small steps in different directions and see which way makes you “warmer.”
That’s exactly what neural networks do! And the tool that tells them “warmer” or “colder”? That’s Autograd – PyTorch’s automatic differentiation engine.
🎨 Our Analogy: The River Flow
Throughout this guide, think of calculations as water flowing through rivers:
- Numbers flow downstream like water
- Gradients (learning signals) flow upstream like salmon swimming back home
- The river system itself is the computational graph
1️⃣ What is Automatic Differentiation?
The Big Idea
When you throw a ball, your brain automatically works out how hard to throw it and at what angle. You don’t solve physics equations – your brain just knows.
Automatic differentiation gives computers this same power. Instead of manually calculating how to adjust parameters, PyTorch figures it out automatically!
Simple Example
```python
import torch

# Create a number we want to learn from
x = torch.tensor(3.0, requires_grad=True)

# Do some math
y = x * x + 2 * x + 1  # y = x² + 2x + 1

# Ask: "How does y change when x changes?"
y.backward()

# The answer!
print(x.grad)  # Output: 8.0
```
What happened?
- When x = 3, the gradient is 2x + 2 = 8
- PyTorch calculated this automatically!
- No manual calculus needed 🎉
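If you want to sanity-check a gradient without touching x.grad, torch.autograd.grad returns it directly. A small sketch of the same calculation:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x * x + 2 * x + 1

# torch.autograd.grad computes the gradient without writing into x.grad
(dy_dx,) = torch.autograd.grad(y, x)
print(dy_dx)  # tensor(8.)
```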
2️⃣ Computational Graphs
Rivers Have a Map
Every calculation in PyTorch creates an invisible map called a computational graph. Think of it as drawing the rivers where your numbers flow.
```mermaid
graph TD
    A[x = 3] --> B[x × x = 9]
    A --> C[2 × x = 6]
    B --> D[9 + 6 = 15]
    C --> D
    D --> E[15 + 1 = 16]
    style A fill:#e1f5fe
    style E fill:#c8e6c9
```
Why Does This Matter?
The graph remembers every step of your calculation. When it’s time to learn (backpropagation), PyTorch follows the map backward to figure out how to improve.
Example: Building a Graph
```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

c = a + b  # Node 1: addition
d = a * b  # Node 2: multiplication
e = c * d  # Node 3: final result

# Graph built automatically!
e.backward()

print(a.grad)  # How 'a' affects 'e' -> tensor(21.)
print(b.grad)  # How 'b' affects 'e' -> tensor(16.)
```
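You can actually peek at the graph PyTorch built: every non-leaf tensor carries a grad_fn, the backward node that produced it. A quick sketch (the exact printed addresses will differ on your machine):

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = a + b
d = a * b
e = c * d

# Each operation records the backward node that produced its output
print(e.grad_fn)                 # <MulBackward0 object at 0x...>
print(c.grad_fn, d.grad_fn)      # <AddBackward0 ...> <MulBackward0 ...>
print(e.grad_fn.next_functions)  # the upstream nodes that e connects to
```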
3️⃣ requires_grad and Gradients
Telling PyTorch: “Watch This!”
Not every number needs gradients. Your input data? No gradients needed. Your learnable weights? Absolutely!
requires_grad=True is like putting a GPS tracker on a number – PyTorch will watch where it goes and remember how to trace back.
The Flag System
| Setting | Meaning |
|---|---|
| requires_grad=True | “Track this! I want to learn from it” |
| requires_grad=False | “Ignore this, just pass through” |
Example: Tracking vs Not Tracking
```python
import torch

# Tracked tensor (learnable)
weights = torch.tensor([1.0, 2.0], requires_grad=True)

# Untracked tensor (input data)
inputs = torch.tensor([3.0, 4.0])

# Only weights will receive gradients
output = (weights * inputs).sum()
output.backward()

print(weights.grad)  # [3.0, 4.0]
print(inputs.grad)   # None (not tracked!)
```
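Tracking is contagious: if any input to an operation is tracked, the result is tracked too. The torch.no_grad() context switches tracking off temporarily, which is handy during inference. A small sketch:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])

print((w * x).requires_grad)      # True – one input is tracked, so the result is too
with torch.no_grad():             # temporarily disable tracking
    print((w * x).requires_grad)  # False
```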
4️⃣ Leaf and Non-Leaf Tensors
The Family Tree of Numbers
In our river analogy:
- Leaf tensors = The mountain springs (sources)
- Non-leaf tensors = Points where rivers merge
What’s a Leaf Tensor?
A leaf tensor is one you create directly. It’s the starting point – the “leaves” of your computational family tree.
```python
import torch

# ✅ Leaf tensor (you created it directly)
x = torch.tensor(5.0, requires_grad=True)
print(x.is_leaf)  # True

# ❌ Non-leaf tensor (result of operation)
y = x * 2
print(y.is_leaf)  # False
```
Why Does This Matter?
Only leaf tensors keep their gradients by default!
Non-leaf gradients are thrown away to save memory. If you need them, use retain_grad():
```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x * 2        # Non-leaf
y.retain_grad()  # "Keep my gradient!"
z = y * 3
z.backward()

print(x.grad)  # 6.0 (always kept)
print(y.grad)  # 3.0 (kept because we asked)
```
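For contrast, if you skip retain_grad(), the intermediate gradient is simply discarded: reading it gives None, and PyTorch prints a warning to remind you.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x * 2          # Non-leaf, gradient not retained
z = y * 3
z.backward()

print(x.grad)      # tensor(6.) – leaf gradient is kept
print(y.grad)      # None – non-leaf gradient was discarded (PyTorch warns here)
```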
5️⃣ Backpropagation Mechanics
Salmon Swimming Upstream
Remember our river analogy? Backpropagation is like salmon swimming upstream – going backward through the river system to reach the mountain springs.
The Chain Rule Made Simple
Imagine a chain of dominoes:
- Push the last one → it knocks the one before it
- Each domino’s “knock strength” multiplies
That’s the chain rule! Gradients multiply as they flow backward.
```mermaid
graph LR
    A[Input x] -->|forward| B[Hidden h]
    B -->|forward| C[Output y]
    C -->|backward| B
    B -->|backward| A
    style C fill:#ffcdd2
    style A fill:#c8e6c9
```
Step-by-Step Example
```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Forward pass: y = (x + 1)²
h = x + 1  # h = 3
y = h * h  # y = 9

# Backward pass
y.backward()

# Gradient flows: dy/dh = 2h = 6
#                 dh/dx = 1
#                 dy/dx = 6 × 1 = 6
print(x.grad)  # 6.0 ✓
```
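You can watch the chain rule at work by keeping the intermediate gradient around with retain_grad() (introduced in the previous section):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
h = x + 1
h.retain_grad()   # keep dy/dh so we can inspect it
y = h * h
y.backward()

print(h.grad)     # tensor(6.)  -> dy/dh = 2h
print(x.grad)     # tensor(6.)  -> dy/dx = dy/dh × dh/dx = 6 × 1
```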
6️⃣ Zeroing Gradients
Cleaning the Whiteboard
Here’s a common trap for beginners: PyTorch accumulates gradients!
Each time you call .backward(), gradients ADD to existing ones. It’s like writing on a whiteboard without erasing – things get messy fast.
The Problem
```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# First calculation
y1 = x * 2
y1.backward()
print(x.grad)  # 2.0 ✓

# Second calculation (OOPS!)
y2 = x * 3
y2.backward()
print(x.grad)  # 5.0 (2 + 3, accumulated!)
```
The Solution: Zero Your Gradients!
```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# First calculation
y1 = x * 2
y1.backward()
print(x.grad)  # 2.0

# CLEAN THE WHITEBOARD
x.grad.zero_()

# Second calculation (correct!)
y2 = x * 3
y2.backward()
print(x.grad)  # 3.0 ✓
```
In Neural Network Training
```python
# Standard training loop pattern
optimizer.zero_grad()  # Erase old gradients
loss.backward()        # Calculate new gradients
optimizer.step()       # Update weights
```
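Here is a minimal runnable sketch of that pattern, using a made-up toy setup (a tiny linear model and one random batch) just to show where each call goes:

```python
import torch
import torch.nn as nn

# Toy setup (made up for illustration): tiny model, random data
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
inputs, targets = torch.randn(8, 4), torch.randn(8, 1)

for step in range(3):
    optimizer.zero_grad()                     # erase old gradients
    loss = criterion(model(inputs), targets)  # forward pass
    loss.backward()                           # fill .grad on every parameter
    optimizer.step()                          # update weights from .grad
    print(f"step {step}: loss = {loss.item():.4f}")
```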
7️⃣ Gradient Accumulation
Sometimes You WANT to Accumulate!
Wait – if accumulation is usually bad, why does PyTorch do it?
Answer: Sometimes you NEED it! Especially when your GPU can’t fit a big batch.
The Mini-Batch Trick
Instead of processing 32 images at once (too big for the GPU), process four mini-batches of 8 images and accumulate their gradients before updating!
```mermaid
graph TD
    A[Batch 1: 8 images] -->|grad| E[Accumulated Gradient]
    B[Batch 2: 8 images] -->|grad| E
    C[Batch 3: 8 images] -->|grad| E
    D[Batch 4: 8 images] -->|grad| E
    E -->|update| F[Update Weights Once]
    style E fill:#fff9c4
    style F fill:#c8e6c9
```
Example: Accumulating Over Mini-Batches
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy model and data (made up for illustration) so the loop runs end to end
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
dataloader = DataLoader(TensorDataset(torch.randn(32, 10), torch.randn(32, 1)),
                        batch_size=8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accumulation_steps = 4

optimizer.zero_grad()  # Start fresh
for i, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss = loss / accumulation_steps  # Scale so the sum matches one big batch

    # Gradients accumulate automatically!
    loss.backward()

    # Update only every 4 mini-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()  # Reset for the next cycle
```
🎯 Quick Summary
| Concept | What It Does | Key Point |
|---|---|---|
| Automatic Differentiation | Calculates gradients automatically | No manual calculus needed |
| Computational Graph | Records operations | Built during forward pass |
| requires_grad | Flags tensors for tracking | Only tracked tensors get gradients |
| Leaf Tensors | User-created tensors | Keep gradients by default |
| Backpropagation | Flows gradients backward | Uses chain rule |
| Zeroing Gradients | Clears accumulated gradients | Essential before each backward |
| Gradient Accumulation | Sums gradients across batches | Useful for large effective batches |
🏆 You Did It!
You now understand the heart of PyTorch’s learning engine!
Think of it this way:
- 🏔️ Your parameters are mountain springs (leaf tensors)
- 🌊 Forward pass = Water flowing downhill (computation)
- 🐟 Backward pass = Salmon swimming upstream (gradients)
- 🧹 Zero gradients = Cleaning up for the next journey
Every time you train a neural network, this beautiful dance happens automatically. PyTorch builds the graph, traces the path, and tells each parameter exactly how to improve.
Now go build something amazing! 🚀