
🚀 Training Components: Optimization

The Art of Teaching Your Neural Network to Learn Better


🎯 The Big Picture: What is Optimization?

Imagine you’re teaching a puppy to find a hidden treat in a giant maze. The puppy sniffs around, takes wrong turns, backtracks, and slowly figures out the path. Optimization is exactly like this—it’s how we teach our neural network to find the best answers!

In PyTorch, optimizers are the “trainers” that tell our model how to improve step by step.

graph TD
    A["🧠 Neural Network"] --> B["Makes Prediction"]
    B --> C["Compare with Answer"]
    C --> D["Calculate Error/Loss"]
    D --> E["🎯 Optimizer Steps In"]
    E --> F["Adjust Weights"]
    F --> A

🌟 Optimizers Overview

What’s an Optimizer?

Think of your neural network as a student learning math. The optimizer is the tutor who says:

  • “You got this wrong, try smaller numbers”
  • “Go faster here, slower there”
  • “Remember what worked yesterday!”

In simple terms: An optimizer updates your model’s weights to reduce errors.
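
Stripped down to its core, that update is just gradient descent: nudge each weight a little in whatever direction reduces the loss. Here is a tiny hand-rolled sketch of that idea (a toy illustration, not PyTorch's actual optimizer code):

import torch

# Toy problem: find w so that w * x is close to y
w = torch.tensor(0.0, requires_grad=True)
x, y = torch.tensor(2.0), torch.tensor(6.0)
lr = 0.1

for _ in range(20):
    loss = (w * x - y) ** 2      # forward pass + loss
    loss.backward()              # compute d(loss)/d(w)
    with torch.no_grad():
        w -= lr * w.grad         # the core update: weight -= lr * gradient
    w.grad.zero_()               # clear the gradient for the next step

print(w.item())  # approaches 3.0

Everything that follows (momentum, Adam, schedulers) is a refinement of this one update rule.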

PyTorch’s Optimizer Family

import torch.optim as optim

# The most common optimizers (pick one)
optimizer = optim.SGD(model.parameters(), lr=0.01)     # classic gradient descent
optimizer = optim.Adam(model.parameters(), lr=0.001)   # adaptive, great default
optimizer = optim.AdamW(model.parameters(), lr=0.001)  # Adam with decoupled weight decay

The Basic Training Loop

for epoch in range(100):
    # 1. Forward pass
    predictions = model(inputs)
    loss = criterion(predictions, targets)

    # 2. Zero gradients (fresh start!)
    optimizer.zero_grad()

    # 3. Backward pass
    loss.backward()

    # 4. Update weights
    optimizer.step()

Why zero_grad()? Imagine writing on a whiteboard. If you don’t erase first, new writing mixes with old! We clear old gradients before calculating new ones.
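
You can see the accumulation yourself in a two-line experiment (plain PyTorch, nothing else assumed):

import torch

w = torch.tensor(1.0, requires_grad=True)

(w * 3).backward()
print(w.grad)        # tensor(3.)

(w * 3).backward()   # no zeroing in between
print(w.grad)        # tensor(6.) -- the old gradient is still there!

w.grad.zero_()       # roughly what optimizer.zero_grad() does per parameter
                     # (recent PyTorch versions set grads to None by default)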


🎢 SGD and Momentum

SGD: The Simplest Optimizer

Stochastic Gradient Descent is like rolling a ball down a hill to find the lowest point.

optimizer = optim.SGD(
    model.parameters(),
    lr=0.01  # Learning rate: step size
)

The Problem: Imagine walking through a valley. You might zigzag back and forth, wasting time!

graph TD
    A["Start High on Hill"] --> B["Calculate Slope"]
    B --> C["Take Step Downhill"]
    C --> D{At Bottom?}
    D -->|No| B
    D -->|Yes| E["Found Minimum! 🎉"]

Momentum: Adding Speed

Momentum is like giving our ball some weight so it keeps rolling in a consistent direction.

optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9  # Remember 90% of previous direction
)

Real-World Analogy:

  • Without momentum: Walking through sand, stopping and starting
  • With momentum: Skating on ice, smooth and fast!
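
In update-rule terms, here is a simplified sketch of what SGD with momentum computes each step (PyTorch's real implementation adds options such as dampening):

import torch

def momentum_step(param, velocity, lr=0.01, momentum=0.9):
    # velocity is a decaying sum of past gradients
    velocity.mul_(momentum).add_(param.grad)
    with torch.no_grad():
        # move along the smoothed direction instead of the raw gradient
        param.add_(velocity, alpha=-lr)

w = torch.tensor([2.0], requires_grad=True)
v = torch.zeros_like(w)
for _ in range(5):
    loss = (w ** 2).sum()   # a simple bowl-shaped loss
    loss.backward()
    momentum_step(w, v)
    w.grad.zero_()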

Nesterov Momentum: Look Before You Leap

optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    nesterov=True  # Peek ahead first!
)

Think of it like: Before taking a step, look where you’ll land and adjust accordingly.
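
In the same sketch style as the momentum step above, the Nesterov variant folds a look-ahead term into the update (again a simplification of what PyTorch computes):

import torch

def nesterov_step(param, velocity, lr=0.01, momentum=0.9):
    velocity.mul_(momentum).add_(param.grad)
    with torch.no_grad():
        # the step uses the gradient plus a look-ahead along the velocity
        param.add_(param.grad + momentum * velocity, alpha=-lr)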


🧙‍♂️ Adam and Variants

Adam: The Smart Optimizer

Adam = Adaptive Moment Estimation

Adam is like having a GPS that:

  • Remembers which routes worked before
  • Adjusts speed for different road conditions
  • Gets smarter over time

optimizer = optim.Adam(
    model.parameters(),
    lr=0.001,
    betas=(0.9, 0.999),  # Memory factors
    eps=1e-8             # Prevents division by zero
)

Why Adam is Popular:

Feature                 | Benefit
Adaptive learning rates | Each weight learns at its own pace
Momentum built-in       | Smooth, consistent updates
Works out of the box    | Less tuning needed
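
For the curious, here is roughly the math Adam performs for each weight, written as a simplified sketch (PyTorch's real implementation has more options and optimizations):

import torch

def adam_step(param, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    g = param.grad
    m.mul_(beta1).add_(g, alpha=1 - beta1)         # running average of gradients
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)  # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)                   # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    with torch.no_grad():
        # each weight gets its own effective step size: lr / (sqrt(v_hat) + eps)
        param.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)

The division by the square root of v_hat is what makes the learning rate adaptive: weights with consistently large gradients take smaller effective steps.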

AdamW: Adam with Weight Decay

Weight decay is like telling your model: “Don’t get too confident about any single weight!”

optimizer = optim.AdamW(
    model.parameters(),
    lr=0.001,
    weight_decay=0.01  # Regularization strength
)

The difference: AdamW applies weight decay directly to the weights, decoupled from the gradient-based update, which typically regularizes more reliably than Adam's L2-style weight_decay.
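
In code, the two approaches look roughly like this (a conceptual sketch, not PyTorch's actual source):

import torch

def l2_regularized_grad(param, weight_decay):
    # Adam-style "weight_decay": the penalty is folded into the gradient,
    # so it then gets rescaled by Adam's adaptive step size like everything else
    return param.grad + weight_decay * param.detach()

def decoupled_weight_decay(param, lr, weight_decay):
    # AdamW: the weights are shrunk directly, independent of the gradient step
    with torch.no_grad():
        param.mul_(1 - lr * weight_decay)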

Other Adam Variants

# RAdam: Rectified Adam (more stable start)
optimizer = optim.RAdam(model.parameters(), lr=0.001)

# NAdam: Adam + Nesterov momentum
optimizer = optim.NAdam(model.parameters(), lr=0.001)

📦 Optimizer State Management

What’s Optimizer State?

The optimizer remembers things! Like a teacher keeping notes:

  • Previous gradient directions
  • How much each weight has changed
  • Running averages

# View the optimizer's memory
print(optimizer.state_dict())

Saving and Loading State

# Save optimizer state
torch.save({
    'model_state': model.state_dict(),
    'optimizer_state': optimizer.state_dict(),
    'epoch': epoch
}, 'checkpoint.pth')

# Load optimizer state
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state'])
optimizer.load_state_dict(checkpoint['optimizer_state'])

Why Save Optimizer State? When you resume training, the optimizer continues from where it left off—like bookmarking your place!
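
Resuming also means restarting the epoch counter in the right place. A sketch, using the checkpoint loaded above (train_one_epoch stands in for your training code):

num_epochs = 100
start_epoch = checkpoint['epoch'] + 1   # 'checkpoint' loaded as shown above

for epoch in range(start_epoch, num_epochs):
    train_one_epoch()   # continue exactly where training stopped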

Resetting Optimizer State

# Clear all optimizer memory (state is a defaultdict of per-parameter dicts)
from collections import defaultdict
optimizer.state = defaultdict(dict)

# Or create fresh optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

🎨 Parameter Groups

What Are Parameter Groups?

Imagine you’re training a team:

  • Goalkeepers need different training than strikers
  • Some players learn faster than others

Parameter groups let you apply different settings to different parts of your model!

optimizer = optim.Adam([
    {'params': model.backbone.parameters(),
     'lr': 0.0001},      # Pre-trained: learn slowly
    {'params': model.head.parameters(),
     'lr': 0.001}        # New layers: learn faster
])

Accessing and Modifying Groups

# See all parameter groups
for i, group in enumerate(optimizer.param_groups):
    print(f"Group {i}: lr = {group['lr']}")

# Change learning rate for specific group
optimizer.param_groups[0]['lr'] = 0.0005

Common Use Cases

graph TD
    A["Parameter Groups"] --> B["Transfer Learning"]
    A --> C["Layer-wise LR"]
    A --> D["Freeze/Unfreeze Layers"]
    B --> E["Pre-trained: Low LR"]
    B --> F["New Layers: High LR"]
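
The freeze/unfreeze case from the diagram is worth spelling out. A common transfer-learning recipe (a sketch; model.backbone and model.head are assumed attribute names) trains only the new head first, then unfreezes the backbone with its own smaller learning rate:

# Phase 1: freeze the backbone, train only the new head
for p in model.backbone.parameters():
    p.requires_grad = False

optimizer = optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3
)

# Phase 2: unfreeze and give the backbone its own (smaller) learning rate
for p in model.backbone.parameters():
    p.requires_grad = True

optimizer = optim.Adam([
    {'params': model.backbone.parameters(), 'lr': 1e-5},
    {'params': model.head.parameters(), 'lr': 1e-3}
])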

⏰ Learning Rate Schedulers

Why Change Learning Rate?

Think of learning like approaching a parking spot:

  • Start: Big turns to get close
  • Middle: Smaller adjustments
  • End: Tiny movements to park perfectly

Schedulers automatically adjust the learning rate during training!

StepLR: Staircase Descent

scheduler = optim.lr_scheduler.StepLR(
    optimizer,
    step_size=10,  # Every 10 epochs
    gamma=0.1      # Multiply LR by 0.1
)

# Training loop
for epoch in range(50):
    train_one_epoch()
    scheduler.step()  # Update LR

ExponentialLR: Smooth Decay

scheduler = optim.lr_scheduler.ExponentialLR(
    optimizer,
    gamma=0.95  # Multiply LR by 0.95 each epoch
)

ReduceLROnPlateau: Smart Scheduler

This one watches your progress and lowers LR when you’re stuck!

scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',       # Lower loss is better
    factor=0.5,       # Cut LR in half
    patience=5,       # Wait 5 epochs before cutting
    verbose=True      # Print when LR changes
)

# In training loop
scheduler.step(val_loss)  # Pass the metric!

🌡️ Warmup and Cyclic Schedules

Warmup: Start Slow, Go Fast

When you wake up, you don’t sprint immediately—you stretch first! Warmup gradually increases the learning rate from near-zero.

# Linear warmup: the lambda returns a multiplier applied to the base LR
def warmup_lambda(epoch):
    warmup_epochs = 5
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs  # avoids a zero LR at epoch 0
    return 1.0

scheduler = optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=warmup_lambda
)

Cosine Annealing: Wave Pattern

Like breathing in and out, the learning rate follows a smooth wave.

scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=50,    # Half-cycle length
    eta_min=1e-6  # Minimum LR
)
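
On recent PyTorch versions (roughly 1.10+), you can also chain a warmup phase into cosine annealing using the built-in LinearLR and SequentialLR schedulers; a sketch:

warmup = optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,   # begin at 1% of the base LR
    total_iters=5        # reach the full LR after 5 epochs
)
cosine = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=45, eta_min=1e-6)

scheduler = optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, cosine],
    milestones=[5]       # switch from warmup to cosine at epoch 5
)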

OneCycleLR: The Super-Scheduler

One of the best schedulers! Warmup + high peak + cool down.

scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.01,
    steps_per_epoch=len(train_loader),
    epochs=10
)

# Step AFTER EACH BATCH, not epoch!
for inputs, targets in train_loader:
    # ... training code ...
    scheduler.step()

CyclicLR: Bounce Up and Down

scheduler = optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=0.001,
    max_lr=0.01,
    step_size_up=2000,  # Steps to peak
    mode='triangular'
)

graph TD
    A["Start Low"] --> B["Warmup Phase"]
    B --> C["Peak Performance"]
    C --> D["Cool Down"]
    D --> E["Fine-tune at Low LR"]

🔍 Learning Rate Finding

The LR Range Test

How do you know what learning rate to use? Test it!

The idea:

  1. Start with tiny LR
  2. Gradually increase it
  3. Watch when loss starts getting worse
  4. Pick a LR just before things go bad

Simple Implementation

lrs = []
losses = []
lr = 1e-7
lr_mult = 1.1  # Increase by 10% each step

for batch in train_loader:
    # Set learning rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # Train step
    loss = train_step(batch)

    # Record
    lrs.append(lr)
    losses.append(loss.item())

    # Stop if loss explodes
    if loss > 4 * min(losses):
        break

    # Increase LR
    lr *= lr_mult

# Plot and find the sweet spot!
import matplotlib.pyplot as plt
plt.plot(lrs, losses)
plt.xscale('log')
plt.xlabel('Learning Rate')
plt.ylabel('Loss')
plt.show()

Using torch-lr-finder Library

# pip install torch-lr-finder
from torch_lr_finder import LRFinder

lr_finder = LRFinder(model, optimizer, criterion)
lr_finder.range_test(train_loader,
                     end_lr=10,
                     num_iter=100)
lr_finder.plot()  # Shows suggested LR
lr_finder.reset()

Reading the LR Plot

What You See         | What It Means
Loss drops sharply   | Good LR range
Loss is flat         | LR too small
Loss explodes upward | LR too big

Pick: The LR where loss drops fastest (not the minimum!)
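
A rough way to automate that pick from the lrs and losses recorded above is to look for the steepest downward slope on the log-LR axis (a heuristic sketch, not a canonical rule):

import numpy as np

log_lrs = np.log10(lrs)
smoothed = np.convolve(losses, np.ones(5) / 5, mode='valid')  # light smoothing
slopes = np.gradient(smoothed, log_lrs[:len(smoothed)])
best_idx = int(np.argmin(slopes))        # most negative slope = fastest drop
print(f"Suggested LR: {lrs[best_idx]:.2e}")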


🎓 Putting It All Together

Here’s a complete example combining everything:

import torch
import torch.nn as nn
import torch.optim as optim

# Model
model = MyModel()

# Optimizer with parameter groups
optimizer = optim.AdamW([
    {'params': model.features.parameters(),
     'lr': 1e-4},
    {'params': model.classifier.parameters(),
     'lr': 1e-3}
], weight_decay=0.01)

# OneCycle scheduler
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=[1e-3, 1e-2],  # Per group!
    steps_per_epoch=len(train_loader),
    epochs=10
)

# Training loop
for epoch in range(10):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = compute_loss(model, batch)
        loss.backward()
        optimizer.step()
        scheduler.step()  # Step per batch!

    print(f"Epoch {epoch}, LR: {scheduler.get_last_lr()}")

🏆 Key Takeaways

Concept          | Remember This
Optimizer        | The teacher that updates weights
SGD              | Simple but needs momentum
Adam             | Smart, adaptive, great default
Parameter Groups | Different LRs for different layers
Schedulers       | Change LR during training
Warmup           | Start slow, avoid explosions
LR Finding       | Test before committing!

🚀 You Did It!

You now understand how to:

  • Choose the right optimizer for your task
  • Use momentum to speed up training
  • Set different learning rates for different layers
  • Schedule learning rate changes automatically
  • Find the best learning rate through testing

Remember: Optimization is like teaching. Be patient, adjust your approach, and watch your model learn! 🎉
