🚀 Training Components: Optimization
The Art of Teaching Your Neural Network to Learn Better
🎯 The Big Picture: What is Optimization?
Imagine you’re teaching a puppy to find a hidden treat in a giant maze. The puppy sniffs around, takes wrong turns, backtracks, and slowly figures out the path. Optimization is exactly like this—it’s how we teach our neural network to find the best answers!
In PyTorch, optimizers are the “trainers” that tell our model how to improve step by step.
graph TD A["🧠Neural Network"] --> B["Makes Prediction"] B --> C["Compare with Answer"] C --> D["Calculate Error/Loss"] D --> E["🎯 Optimizer Steps In"] E --> F["Adjust Weights"] F --> A
🌟 Optimizers Overview
What’s an Optimizer?
Think of your neural network as a student learning math. The optimizer is the tutor who says:
- “You got this wrong, try smaller numbers”
- “Go faster here, slower there”
- “Remember what worked yesterday!”
In simple terms: An optimizer updates your model’s weights to reduce errors.
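To make that concrete, here is a minimal hand-rolled version of a single update on a toy tensor. This is plain gradient descent for illustration only, not the actual optim.SGD internals:
import torch

# One hand-written update step: nudge the weight against its gradient
w = torch.randn(3, requires_grad=True)   # a "weight"
loss = (w ** 2).sum()                    # a toy loss
loss.backward()                          # fills w.grad

lr = 0.1                                 # learning rate: how big a nudge
with torch.no_grad():
    w -= lr * w.grad                     # step downhill
w.grad.zero_()                           # forget the old gradient
Every optimizer in torch.optim is a smarter, more bookkeeping-heavy version of that one line: w -= lr * w.grad.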
PyTorch’s Optimizer Family
import torch.optim as optim
# The most common optimizers
optimizer = optim.SGD(model.parameters(), lr=0.01)
optimizer = optim.Adam(model.parameters(), lr=0.001)
optimizer = optim.AdamW(model.parameters(), lr=0.001)
The Basic Training Loop
for epoch in range(100):
    # 1. Forward pass
    predictions = model(inputs)
    loss = criterion(predictions, targets)

    # 2. Zero gradients (fresh start!)
    optimizer.zero_grad()

    # 3. Backward pass
    loss.backward()

    # 4. Update weights
    optimizer.step()
Why zero_grad()? PyTorch accumulates gradients by default: every backward() adds to whatever is already stored in .grad. Imagine writing on a whiteboard: if you don't erase first, the new writing mixes with the old. So we clear the old gradients before computing new ones.
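You can see this accumulation directly with a toy tensor (a small illustration, separate from the training loop above):
import torch

x = torch.ones(2, requires_grad=True)
x.sum().backward()
print(x.grad)      # tensor([1., 1.])
x.sum().backward()
print(x.grad)      # tensor([2., 2.])  <- old + new gradients mixed together!
x.grad.zero_()     # what optimizer.zero_grad() does for every parameter
print(x.grad)      # tensor([0., 0.])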
🎢 SGD and Momentum
SGD: The Simplest Optimizer
Stochastic Gradient Descent is like rolling a ball down a hill to find the lowest point.
optimizer = optim.SGD(
    model.parameters(),
    lr=0.01   # Learning rate: step size
)
The Problem: Imagine walking through a valley. You might zigzag back and forth, wasting time!
graph TD A["Start High on Hill"] --> B["Calculate Slope"] B --> C["Take Step Downhill"] C --> D{At Bottom?} D -->|No| B D -->|Yes| E["Found Minimum! 🎉"]
Momentum: Adding Speed
Momentum is like giving our ball some weight so it keeps rolling in a consistent direction.
optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9   # Remember 90% of previous direction
)
Real-World Analogy:
- Without momentum: Walking through sand, stopping and starting
- With momentum: Skating on ice, smooth and fast!
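Under the hood, momentum keeps a running "velocity" for each weight. A simplified sketch of the update (PyTorch's real optim.SGD adds extras like dampening and weight decay):
import torch

# Simplified momentum update for one toy weight tensor (illustration only)
w = torch.randn(3, requires_grad=True)
velocity = torch.zeros_like(w)
lr, momentum = 0.01, 0.9

for step in range(3):
    loss = (w ** 2).sum()
    loss.backward()
    with torch.no_grad():
        velocity = momentum * velocity + w.grad   # keep 90% of the old direction
        w -= lr * velocity                        # move along the smoothed direction
    w.grad.zero_()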
Nesterov Momentum: Look Before You Leap
optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    nesterov=True   # Peek ahead first!
)
Think of it like: Before taking a step, look where you’ll land and adjust accordingly.
🧙‍♂️ Adam and Variants
Adam: The Smart Optimizer
Adam = Adaptive Moment Estimation
Adam is like having a GPS that:
- Remembers which routes worked before
- Adjusts speed for different road conditions
- Gets smarter over time
optimizer = optim.Adam(
    model.parameters(),
    lr=0.001,
    betas=(0.9, 0.999),   # Memory factors
    eps=1e-8              # Prevents division by zero
)
Why Adam is Popular:
| Feature | Benefit |
|---|---|
| Adaptive learning rates | Each weight learns at its own pace |
| Momentum built-in | Smooth, consistent updates |
| Works out of the box | Less tuning needed |
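Roughly speaking, Adam keeps two running averages per weight and scales each step by them. Here is a simplified sketch of one update on a toy tensor (the real torch.optim.Adam handles extra details like amsgrad and fused kernels):
import torch

# Simplified Adam update for one toy weight tensor (illustration only)
w = torch.randn(3, requires_grad=True)
m = torch.zeros_like(w)      # 1st moment: "which direction, on average?"
v = torch.zeros_like(w)      # 2nd moment: "how big are the gradients, on average?"
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for t in range(1, 4):
    loss = (w ** 2).sum()
    loss.backward()
    with torch.no_grad():
        g = w.grad
        m = beta1 * m + (1 - beta1) * g          # momentum-like average
        v = beta2 * v + (1 - beta2) * g ** 2     # average of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (v_hat.sqrt() + eps)   # adaptive step per weight
    w.grad.zero_()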
AdamW: Adam with Weight Decay
Weight decay is like telling your model: “Don’t get too confident about any single weight!”
optimizer = optim.AdamW(
    model.parameters(),
    lr=0.001,
    weight_decay=0.01   # Regularization strength
)
Difference: AdamW decouples weight decay from the adaptive gradient update (plain Adam folds weight_decay into the gradient instead), which usually regularizes better in practice!
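Conceptually the difference looks like this (toy numbers, illustration only, not the library's actual code):
import torch

# Hypothetical single weight and gradient, just to show where the decay term goes
w = torch.tensor(2.0)
grad = torch.tensor(0.5)              # pretend this came from backward()
lr, weight_decay = 0.001, 0.01

# Adam-style L2: decay is folded into the gradient, so it also gets
# rescaled later by Adam's adaptive denominator
grad_with_l2 = grad + weight_decay * w

# AdamW-style decoupled decay: the weight is shrunk directly,
# independent of the adaptive part of the update
w_decayed = w - lr * weight_decay * w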
Other Adam Variants
# RAdam: Rectified Adam (more stable start)
optimizer = optim.RAdam(model.parameters(), lr=0.001)
# NAdam: Adam + Nesterov momentum
optimizer = optim.NAdam(model.parameters(), lr=0.001)
📦 Optimizer State Management
What’s Optimizer State?
The optimizer remembers things! Like a teacher keeping notes:
- Previous gradient directions
- How much each weight has changed
- Running averages
# View the optimizer's memory
print(optimizer.state_dict())
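For example, after at least one call to optimizer.step(), Adam stores per-parameter running averages. The key names below are what current PyTorch uses, but treat them as implementation details:
# Peek at what Adam remembers for each parameter (after at least one step)
for param, state in optimizer.state.items():
    print(param.shape, list(state.keys()))   # e.g. ['step', 'exp_avg', 'exp_avg_sq']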
Saving and Loading State
# Save optimizer state
torch.save({
    'model_state': model.state_dict(),
    'optimizer_state': optimizer.state_dict(),
    'epoch': epoch
}, 'checkpoint.pth')
# Load optimizer state
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state'])
optimizer.load_state_dict(checkpoint['optimizer_state'])
Why Save Optimizer State? When you resume training, the optimizer continues from where it left off—like bookmarking your place!
Resetting Optimizer State
# Clear all optimizer memory (state is a defaultdict internally)
from collections import defaultdict
optimizer.state = defaultdict(dict)
# Or create fresh optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
🎨 Parameter Groups
What Are Parameter Groups?
Imagine you’re training a team:
- Goalkeepers need different training than strikers
- Some players learn faster than others
Parameter groups let you apply different settings to different parts of your model!
optimizer = optim.Adam([
    {'params': model.backbone.parameters(), 'lr': 0.0001},  # Pre-trained: learn slowly
    {'params': model.head.parameters(), 'lr': 0.001}        # New layers: learn faster
])
Accessing and Modifying Groups
# See all parameter groups
for i, group in enumerate(optimizer.param_groups):
    print(f"Group {i}: lr = {group['lr']}")

# Change learning rate for specific group
optimizer.param_groups[0]['lr'] = 0.0005
Common Use Cases
graph TD A["Parameter Groups"] --> B["Transfer Learning"] A --> C["Layer-wise LR"] A --> D["Freeze/Unfreeze Layers"] B --> E["Pre-trained: Low LR"] B --> F["New Layers: High LR"]
⏰ Learning Rate Schedulers
Why Change Learning Rate?
Think of learning like approaching a parking spot:
- Start: Big turns to get close
- Middle: Smaller adjustments
- End: Tiny movements to park perfectly
Schedulers automatically adjust the learning rate during training!
StepLR: Staircase Descent
scheduler = optim.lr_scheduler.StepLR(
    optimizer,
    step_size=10,   # Every 10 epochs
    gamma=0.1       # Multiply LR by 0.1
)

# Training loop
for epoch in range(50):
    train_one_epoch()
    scheduler.step()   # Update LR
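If you print the learning rate each epoch you can watch the staircase. Here is a tiny self-contained check with a dummy parameter (just to observe the LR, no real training):
import torch
import torch.optim as optim

opt = optim.SGD([torch.zeros(1, requires_grad=True)], lr=0.01)
sched = optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.1)

for epoch in range(30):
    opt.step()                        # (normally: train_one_epoch())
    sched.step()
    if epoch % 10 == 9:
        print(epoch, sched.get_last_lr())
# epoch 9 -> ~0.001, epoch 19 -> ~0.0001, epoch 29 -> ~1e-05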
ExponentialLR: Smooth Decay
scheduler = optim.lr_scheduler.ExponentialLR(
    optimizer,
    gamma=0.95   # Multiply LR by 0.95 each epoch
)
ReduceLROnPlateau: Smart Scheduler
This one watches your progress and lowers LR when you’re stuck!
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',    # Lower loss is better
    factor=0.5,    # Cut LR in half
    patience=5     # Wait 5 epochs before cutting
)
# In training loop
scheduler.step(val_loss) # Pass the metric!
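In context, that usually looks something like this (validate_one_epoch is a hypothetical helper that returns the validation loss as a float):
for epoch in range(100):
    train_one_epoch()                  # your training code
    val_loss = validate_one_epoch()    # hypothetical helper returning a float
    scheduler.step(val_loss)           # the scheduler watches this number

    # ReduceLROnPlateau changes the optimizer's LR in place; read it directly
    print(epoch, optimizer.param_groups[0]['lr'])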
🌡️ Warmup and Cyclic Schedules
Warmup: Start Slow, Go Fast
When you wake up, you don’t sprint immediately—you stretch first! Warmup gradually increases the learning rate from near-zero.
# Linear warmup, then constant LR
def warmup_lambda(epoch):
    warmup_epochs = 5
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs   # ramp from 20% up to 100% of base LR
    return 1.0

scheduler = optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=warmup_lambda
)
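Another way to get warmup without writing a lambda is to chain two built-in schedulers with SequentialLR (available in recent PyTorch versions). A sketch assuming 5 warmup epochs followed by the cosine decay described in the next subsection:
# Warm up for 5 epochs, then cosine-decay for the remaining 45
warmup = optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.1,   # begin at 10% of the base LR
    total_iters=5
)
cosine = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=45)
scheduler = optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, cosine],
    milestones=[5]      # switch from warmup to cosine after epoch 5
)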
Cosine Annealing: Wave Pattern
Like breathing in and out, the learning rate follows a smooth wave.
scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=50,       # Half-cycle length
    eta_min=1e-6    # Minimum LR
)
OneCycleLR: The Super-Scheduler
One of the best schedulers! Warmup + high peak + cool down.
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.01,
    steps_per_epoch=len(train_loader),
    epochs=10
)

# Step AFTER EACH BATCH, not epoch!
for inputs, targets in train_loader:
    # ... training code ...
    scheduler.step()
CyclicLR: Bounce Up and Down
scheduler = optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=0.001,
    max_lr=0.01,
    step_size_up=2000,   # Steps to peak
    mode='triangular'
)
graph TD A["Start Low"] --> B["Warmup Phase"] B --> C["Peak Performance"] C --> D["Cool Down"] D --> E["Fine-tune at Low LR"]
🔍 Learning Rate Finding
The LR Range Test
How do you know what learning rate to use? Test it!
The idea:
- Start with tiny LR
- Gradually increase it
- Watch when loss starts getting worse
- Pick a LR just before things go bad
Simple Implementation
lrs = []
losses = []
lr = 1e-7
lr_mult = 1.1   # Increase LR by 10% each step

for batch in train_loader:
    # Set the learning rate for this step
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # One training step (assumed to do forward, backward, and optimizer.step)
    loss = train_step(batch)

    # Record
    lrs.append(lr)
    losses.append(loss.item())

    # Stop if loss explodes
    if loss.item() > 4 * min(losses):
        break

    # Increase LR for the next step
    lr *= lr_mult

# Plot and find the sweet spot!
import matplotlib.pyplot as plt
plt.plot(lrs, losses)
plt.xscale('log')
plt.xlabel('Learning Rate')
plt.ylabel('Loss')
plt.show()
Using torch-lr-finder Library
# pip install torch-lr-finder
from torch_lr_finder import LRFinder
lr_finder = LRFinder(model, optimizer, criterion)
lr_finder.range_test(train_loader, end_lr=10, num_iter=100)
lr_finder.plot() # Shows suggested LR
lr_finder.reset()
Reading the LR Plot
| What You See | What It Means |
|---|---|
| Loss drops sharply | Good LR range |
| Loss is flat | LR too small |
| Loss explodes upward | LR too big |
Pick: The LR where loss drops fastest (not the minimum!)
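A common rule of thumb, sketched below on the lrs/losses lists collected in the earlier snippet, is to pick the LR at the steepest downward slope of the (optionally smoothed) loss curve, then back off by roughly 10x as a safety margin:
import numpy as np

# Simple heuristic on the lrs/losses lists collected above
losses_arr = np.array(losses)
steepest = np.argmin(np.gradient(losses_arr))   # index of the most negative slope
suggested_lr = lrs[steepest]
print(f"Steepest drop at lr = {suggested_lr:.2e}")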
🎓 Putting It All Together
Here’s a complete example combining everything:
import torch
import torch.nn as nn
import torch.optim as optim

# Model
model = MyModel()

# Optimizer with parameter groups
optimizer = optim.AdamW([
    {'params': model.features.parameters(), 'lr': 1e-4},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], weight_decay=0.01)

# OneCycle scheduler
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=[1e-3, 1e-2],   # Per group!
    steps_per_epoch=len(train_loader),
    epochs=10
)

# Training loop
for epoch in range(10):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = compute_loss(model, batch)
        loss.backward()
        optimizer.step()
        scheduler.step()   # Step per batch!
    print(f"Epoch {epoch}, LR: {scheduler.get_last_lr()}")
🏆 Key Takeaways
| Concept | Remember This |
|---|---|
| Optimizer | The teacher that updates weights |
| SGD | Simple but needs momentum |
| Adam | Smart, adaptive, great default |
| Parameter Groups | Different LRs for different layers |
| Schedulers | Change LR during training |
| Warmup | Start slow, avoid explosions |
| LR Finding | Test before committing! |
🚀 You Did It!
You now understand how to:
- Choose the right optimizer for your task
- Use momentum to speed up training
- Set different learning rates for different layers
- Schedule learning rate changes automatically
- Find the best learning rate through testing
Remember: Optimization is like teaching. Be patient, adjust your approach, and watch your model learn! 🎉
