🚀 Training Components: Optimization
The Art of Teaching Your Neural Network to Learn Better
🎯 The Big Picture: What is Optimization?
Imagine you’re teaching a puppy to find a hidden treat in a giant maze. The puppy sniffs around, takes wrong turns, backtracks, and slowly figures out the path. Optimization is exactly like this—it’s how we teach our neural network to find the best answers!
In PyTorch, optimizers are the “trainers” that tell our model how to improve step by step.
graph TD A["🧠Neural Network"] --> B["Makes Prediction"] B --> C["Compare with Answer"] C --> D["Calculate Error/Loss"] D --> E["🎯 Optimizer Steps In"] E --> F["Adjust Weights"] F --> A
🌟 Optimizers Overview
What’s an Optimizer?
Think of your neural network as a student learning math. The optimizer is the tutor who says:
- “You got this wrong, try smaller numbers”
- “Go faster here, slower there”
- “Remember what worked yesterday!”
In simple terms: An optimizer updates your model’s weights to reduce errors.
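To make that concrete, here is a minimal hand-rolled version of a single update on a toy tensor. This is plain gradient descent for illustration only, not the actual optim.SGD internals:
import torch

# One hand-written update step: nudge the weight against its gradient
w = torch.randn(3, requires_grad=True)   # a "weight"
loss = (w ** 2).sum()                    # a toy loss
loss.backward()                          # fills w.grad

lr = 0.1                                 # learning rate: how big a nudge
with torch.no_grad():
    w -= lr * w.grad                     # step downhill
w.grad.zero_()                           # forget the old gradient
Every optimizer in torch.optim is a smarter, more bookkeeping-heavy version of that one line: w -= lr * w.grad.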
PyTorch’s Optimizer Family
import torch.optim as optim
# The most common optimizers
optimizer = optim.SGD(model.parameters(), lr=0.01)
optimizer = optim.Adam(model.parameters(), lr=0.001)
optimizer = optim.AdamW(model.parameters(), lr=0.001)
The Basic Training Loop
for epoch in range(100):
    # 1. Forward pass
    predictions = model(inputs)
    loss = criterion(predictions, targets)

    # 2. Zero gradients (fresh start!)
    optimizer.zero_grad()

    # 3. Backward pass
    loss.backward()

    # 4. Update weights
    optimizer.step()
Why zero_grad()? PyTorch accumulates gradients by default: every backward() adds to whatever is already stored in .grad. Imagine writing on a whiteboard: if you don't erase first, the new writing mixes with the old. So we clear the old gradients before computing new ones.
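You can see this accumulation directly with a toy tensor (a small illustration, separate from the training loop above):
import torch

x = torch.ones(2, requires_grad=True)
x.sum().backward()
print(x.grad)      # tensor([1., 1.])
x.sum().backward()
print(x.grad)      # tensor([2., 2.])  <- old + new gradients mixed together!
x.grad.zero_()     # what optimizer.zero_grad() does for every parameter
print(x.grad)      # tensor([0., 0.])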
🎢 SGD and Momentum
SGD: The Simplest Optimizer
Stochastic Gradient Descent is like rolling a ball down a hill to find the lowest point.
optimizer = optim.SGD(
    model.parameters(),
    lr=0.01   # Learning rate: step size
)
The Problem: Imagine walking through a valley. You might zigzag back and forth, wasting time!
graph TD A["Start High on Hill"] --> B["Calculate Slope"] B --> C["Take Step Downhill"] C --> D{At Bottom?} D -->|No| B D -->|Yes| E["Found Minimum! 🎉"]
Momentum: Adding Speed
Momentum is like giving our ball some weight so it keeps rolling in a consistent direction.
optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9   # Remember 90% of previous direction
)
Real-World Analogy:
- Without momentum: Walking through sand, stopping and starting
- With momentum: Skating on ice, smooth and fast!
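Under the hood, momentum keeps a running "velocity" for each weight. A simplified sketch of the update (PyTorch's real optim.SGD adds extras like dampening and weight decay):
import torch

# Simplified momentum update for one toy weight tensor (illustration only)
w = torch.randn(3, requires_grad=True)
velocity = torch.zeros_like(w)
lr, momentum = 0.01, 0.9

for step in range(3):
    loss = (w ** 2).sum()
    loss.backward()
    with torch.no_grad():
        velocity = momentum * velocity + w.grad   # keep 90% of the old direction
        w -= lr * velocity                        # move along the smoothed direction
    w.grad.zero_()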
Nesterov Momentum: Look Before You Leap
optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    nesterov=True   # Peek ahead first!
)
Think of it like: Before taking a step, look where you’ll land and adjust accordingly.
🧙‍♂️ Adam and Variants
Adam: The Smart Optimizer
Adam = Adaptive Moment Estimation
Adam is like having a GPS that:
- Remembers which routes worked before
- Adjusts speed for different road conditions
- Gets smarter over time
optimizer = optim.Adam(
    model.parameters(),
    lr=0.001,
    betas=(0.9, 0.999),   # Memory factors
    eps=1e-8              # Prevents division by zero
)
Why Adam is Popular:
| Feature | Benefit |
|---|---|
| Adaptive learning rates | Each weight learns at its own pace |
| Momentum built-in | Smooth, consistent updates |
| Works out of the box | Less tuning needed |
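Roughly speaking, Adam keeps two running averages per weight and scales each step by them. Here is a simplified sketch of one update on a toy tensor (the real torch.optim.Adam handles extra details like amsgrad and fused kernels):
import torch

# Simplified Adam update for one toy weight tensor (illustration only)
w = torch.randn(3, requires_grad=True)
m = torch.zeros_like(w)      # 1st moment: "which direction, on average?"
v = torch.zeros_like(w)      # 2nd moment: "how big are the gradients, on average?"
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for t in range(1, 4):
    loss = (w ** 2).sum()
    loss.backward()
    with torch.no_grad():
        g = w.grad
        m = beta1 * m + (1 - beta1) * g          # momentum-like average
        v = beta2 * v + (1 - beta2) * g ** 2     # average of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (v_hat.sqrt() + eps)   # adaptive step per weight
    w.grad.zero_()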
AdamW: Adam with Weight Decay
Weight decay is like telling your model: “Don’t get too confident about any single weight!”
optimizer = optim.AdamW(
    model.parameters(),
    lr=0.001,
    weight_decay=0.01   # Regularization strength
)
Difference: AdamW decouples weight decay from the adaptive gradient update (plain Adam folds weight_decay into the gradient instead), which usually regularizes better in practice!
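Conceptually the difference looks like this (toy numbers, illustration only, not the library's actual code):
import torch

# Hypothetical single weight and gradient, just to show where the decay term goes
w = torch.tensor(2.0)
grad = torch.tensor(0.5)              # pretend this came from backward()
lr, weight_decay = 0.001, 0.01

# Adam-style L2: decay is folded into the gradient, so it also gets
# rescaled later by Adam's adaptive denominator
grad_with_l2 = grad + weight_decay * w

# AdamW-style decoupled decay: the weight is shrunk directly,
# independent of the adaptive part of the update
w_decayed = w - lr * weight_decay * w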
Other Adam Variants
# RAdam: Rectified Adam (more stable start)
optimizer = optim.RAdam(model.parameters(), lr=0.001)
# NAdam: Adam + Nesterov momentum
optimizer = optim.NAdam(model.parameters(), lr=0.001)
📦 Optimizer State Management
What’s Optimizer State?
The optimizer remembers things! Like a teacher keeping notes:
- Previous gradient directions
- How much each weight has changed
- Running averages
# View the optimizer's memory
print(optimizer.state_dict())
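For example, after at least one call to optimizer.step(), Adam stores per-parameter running averages. The key names below are what current PyTorch uses, but treat them as implementation details:
# Peek at what Adam remembers for each parameter (after at least one step)
for param, state in optimizer.state.items():
    print(param.shape, list(state.keys()))   # e.g. ['step', 'exp_avg', 'exp_avg_sq']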
Saving and Loading State
# Save optimizer state
torch.save({
    'model_state': model.state_dict(),
    'optimizer_state': optimizer.state_dict(),
    'epoch': epoch
}, 'checkpoint.pth')
# Load optimizer state
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state'])
optimizer.load_state_dict(checkpoint['optimizer_state'])
Why Save Optimizer State? When you resume training, the optimizer continues from where it left off—like bookmarking your place!
Resetting Optimizer State
# Clear all optimizer memory (state is a defaultdict internally)
from collections import defaultdict
optimizer.state = defaultdict(dict)
# Or create fresh optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
🎨 Parameter Groups
What Are Parameter Groups?
Imagine you’re training a team:
- Goalkeepers need different training than strikers
- Some players learn faster than others
Parameter groups let you apply different settings to different parts of your model!
optimizer = optim.Adam([
    {'params': model.backbone.parameters(), 'lr': 0.0001},  # Pre-trained: learn slowly
    {'params': model.head.parameters(), 'lr': 0.001}        # New layers: learn faster
])
Accessing and Modifying Groups
# See all parameter groups
for i, group in enumerate(optimizer.param_groups):
    print(f"Group {i}: lr = {group['lr']}")

# Change learning rate for specific group
optimizer.param_groups[0]['lr'] = 0.0005
Common Use Cases
graph TD A["Parameter Groups"] --> B["Transfer Learning"] A --> C["Layer-wise LR"] A --> D["Freeze/Unfreeze Layers"] B --> E["Pre-trained: Low LR"] B --> F["New Layers: High LR"]
⏰ Learning Rate Schedulers
Why Change Learning Rate?
Think of learning like approaching a parking spot:
- Start: Big turns to get close
- Middle: Smaller adjustments
- End: Tiny movements to park perfectly
Schedulers automatically adjust the learning rate during training!
StepLR: Staircase Descent
scheduler = optim.lr_scheduler.StepLR(
    optimizer,
    step_size=10,   # Every 10 epochs
    gamma=0.1       # Multiply LR by 0.1
)

# Training loop
for epoch in range(50):
    train_one_epoch()
    scheduler.step()   # Update LR
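If you print the learning rate each epoch you can watch the staircase. Here is a tiny self-contained check with a dummy parameter (just to observe the LR, no real training):
import torch
import torch.optim as optim

opt = optim.SGD([torch.zeros(1, requires_grad=True)], lr=0.01)
sched = optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.1)

for epoch in range(30):
    opt.step()                        # (normally: train_one_epoch())
    sched.step()
    if epoch % 10 == 9:
        print(epoch, sched.get_last_lr())
# epoch 9 -> ~0.001, epoch 19 -> ~0.0001, epoch 29 -> ~1e-05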
ExponentialLR: Smooth Decay
scheduler = optim.lr_scheduler.ExponentialLR(
    optimizer,
    gamma=0.95   # Multiply LR by 0.95 each epoch
)
ReduceLROnPlateau: Smart Scheduler
This one watches your progress and lowers LR when you’re stuck!
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',    # Lower loss is better
    factor=0.5,    # Cut LR in half
    patience=5     # Wait 5 epochs before cutting
)
# In training loop
scheduler.step(val_loss) # Pass the metric!
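In context, that usually looks something like this (validate_one_epoch is a hypothetical helper that returns the validation loss as a float):
for epoch in range(100):
    train_one_epoch()                  # your training code
    val_loss = validate_one_epoch()    # hypothetical helper returning a float
    scheduler.step(val_loss)           # the scheduler watches this number

    # ReduceLROnPlateau changes the optimizer's LR in place; read it directly
    print(epoch, optimizer.param_groups[0]['lr'])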
🌡️ Warmup and Cyclic Schedules
Warmup: Start Slow, Go Fast
When you wake up, you don’t sprint immediately—you stretch first! Warmup gradually increases the learning rate from near-zero.
# Linear warmup, then constant LR
def warmup_lambda(epoch):
    warmup_epochs = 5
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs   # ramp from 20% up to 100% of base LR
    return 1.0

scheduler = optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=warmup_lambda
)
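Another way to get warmup without writing a lambda is to chain two built-in schedulers with SequentialLR (available in recent PyTorch versions). A sketch assuming 5 warmup epochs followed by the cosine decay described in the next subsection:
# Warm up for 5 epochs, then cosine-decay for the remaining 45
warmup = optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.1,   # begin at 10% of the base LR
    total_iters=5
)
cosine = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=45)
scheduler = optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, cosine],
    milestones=[5]      # switch from warmup to cosine after epoch 5
)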
Cosine Annealing: Wave Pattern
Like breathing in and out, the learning rate follows a smooth wave.
scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=50,       # Half-cycle length
    eta_min=1e-6    # Minimum LR
)
OneCycleLR: The Super-Scheduler
One of the best schedulers! Warmup + high peak + cool down.
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.01,
    steps_per_epoch=len(train_loader),
    epochs=10
)

# Step AFTER EACH BATCH, not epoch!
for inputs, targets in train_loader:
    # ... training code ...
    scheduler.step()
CyclicLR: Bounce Up and Down
scheduler = optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=0.001,
    max_lr=0.01,
    step_size_up=2000,   # Steps to peak
    mode='triangular'
)
graph TD A["Start Low"] --> B["Warmup Phase"] B --> C["Peak Performance"] C --> D["Cool Down"] D --> E["Fine-tune at Low LR"]
🔍 Learning Rate Finding
The LR Range Test
How do you know what learning rate to use? Test it!
The idea:
- Start with tiny LR
- Gradually increase it
- Watch when loss starts getting worse
- Pick a LR just before things go bad
Simple Implementation
lrs = []
losses = []
lr = 1e-7
lr_mult = 1.1   # Increase LR by 10% each step

for batch in train_loader:
    # Set the learning rate for this step
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # One training step (assumed to do forward, backward, and optimizer.step)
    loss = train_step(batch)

    # Record
    lrs.append(lr)
    losses.append(loss.item())

    # Stop if loss explodes
    if loss.item() > 4 * min(losses):
        break

    # Increase LR for the next step
    lr *= lr_mult

# Plot and find the sweet spot!
import matplotlib.pyplot as plt
plt.plot(lrs, losses)
plt.xscale('log')
plt.xlabel('Learning Rate')
plt.ylabel('Loss')
plt.show()
Using torch-lr-finder Library
# pip install torch-lr-finder
from torch_lr_finder import LRFinder
lr_finder = LRFinder(model, optimizer, criterion)
lr_finder.range_test(train_loader, end_lr=10, num_iter=100)
lr_finder.plot() # Shows suggested LR
lr_finder.reset()
Reading the LR Plot
| What You See | What It Means |
|---|---|
| Loss drops sharply | Good LR range |
| Loss is flat | LR too small |
| Loss explodes upward | LR too big |
Pick: The LR where loss drops fastest (not the minimum!)
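A common rule of thumb, sketched below on the lrs/losses lists collected in the earlier snippet, is to pick the LR at the steepest downward slope of the (optionally smoothed) loss curve, then back off by roughly 10x as a safety margin:
import numpy as np

# Simple heuristic on the lrs/losses lists collected above
losses_arr = np.array(losses)
steepest = np.argmin(np.gradient(losses_arr))   # index of the most negative slope
suggested_lr = lrs[steepest]
print(f"Steepest drop at lr = {suggested_lr:.2e}")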
🎓 Putting It All Together
Here’s a complete example combining everything:
import torch
import torch.nn as nn
import torch.optim as optim

# Model
model = MyModel()

# Optimizer with parameter groups
optimizer = optim.AdamW([
    {'params': model.features.parameters(), 'lr': 1e-4},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], weight_decay=0.01)

# OneCycle scheduler
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=[1e-3, 1e-2],   # Per group!
    steps_per_epoch=len(train_loader),
    epochs=10
)

# Training loop
for epoch in range(10):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = compute_loss(model, batch)
        loss.backward()
        optimizer.step()
        scheduler.step()   # Step per batch!
    print(f"Epoch {epoch}, LR: {scheduler.get_last_lr()}")
🏆 Key Takeaways
| Concept | Remember This |
|---|---|
| Optimizer | The teacher that updates weights |
| SGD | Simple but needs momentum |
| Adam | Smart, adaptive, great default |
| Parameter Groups | Different LRs for different layers |
| Schedulers | Change LR during training |
| Warmup | Start slow, avoid explosions |
| LR Finding | Test before committing! |
🚀 You Did It!
You now understand how to:
- Choose the right optimizer for your task
- Use momentum to speed up training
- Set different learning rates for different layers
- Schedule learning rate changes automatically
- Find the best learning rate through testing
Remember: Optimization is like teaching. Be patient, adjust your approach, and watch your model learn! 🎉
