
🚀 PyTorch Training Performance: Making Your AI Learn FASTER!


The Race Car Story 🏎️

Imagine you’re teaching a race car to drive around a track. The car (your neural network) needs to learn every turn, every bump, every shortcut. But here’s the problem: learning takes time and fuel (memory).

What if we could make our race car learn twice as fast while using half the fuel? That’s exactly what PyTorch training performance optimizations do!

Today, we’ll unlock 6 secret turbo boosters for your AI race car.


🎯 What We’ll Learn

graph LR A["Training Performance"] --> B["Mixed Precision"] A --> C["GradScaler"] A --> D["torch.compile"] A --> E["cuDNN Settings"] A --> F["Gradient Checkpointing"] A --> G["Memory-Efficient Training"]

1️⃣ Mixed Precision Training

The Two-Pencil Trick ✏️

Imagine drawing a picture. You have two pencils:

  • Big thick pencil (32-bit): Very precise, but slow and uses lots of paper
  • Small thin pencil (16-bit): Less precise, but FAST and saves paper

Mixed precision means using BOTH pencils smartly!

  • Use the thin pencil for most drawing (forward/backward passes)
  • Use the thick pencil only for important details (weight updates)

Why Does This Work?

Your GPU is like a super-fast artist. It can draw twice as many thin-pencil lines in the same time as thick-pencil lines!
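Whether your GPU actually gets the full thin-pencil speed-up depends on its hardware; here is a quick sketch to check (Tensor Cores arrived with compute capability 7.0, i.e. Volta and newer):

import torch

# Tensor Cores (compute capability 7.0 and up) are what make the
# "thin pencil" (FP16) math so much faster than FP32
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability: {major}.{minor}")
    print("Tensor Cores available!" if major >= 7 else "FP16 speed-up may be limited")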

Simple Example

import torch

# OLD WAY - Always using the big pencil
model = model.float()  # 32-bit everywhere

# NEW WAY - Smart pencil switching!
from torch.cuda.amp import autocast

with autocast():
    # GPU automatically picks the best
    # pencil for each operation
    output = model(input)
    loss = criterion(output, target)

What Happens Inside autocast?

Operation        | Precision Used  | Why?
Matrix multiply  | FP16 (fast!)    | Safe with half precision
Softmax          | FP32 (precise)  | Needs accuracy
Loss calculation | FP32 (precise)  | Small numbers matter
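
You can watch this pencil-switching yourself. A tiny sketch (assuming a CUDA GPU is available) that prints the dtype autocast picks for a matrix multiply:

import torch
from torch.cuda.amp import autocast

a = torch.randn(8, 8, device="cuda")   # created as FP32
b = torch.randn(8, 8, device="cuda")

with autocast():
    c = a @ b
    print(c.dtype)        # torch.float16: autocast picked the thin pencil

print((a @ b).dtype)      # torch.float32: outside autocast, back to the thick pencil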

Real-World Benefit

🏆 Result: Training becomes 1.5x to 2x FASTER with almost NO accuracy loss!


2️⃣ GradScaler: The Safety Net 🛡️

The Problem with Tiny Numbers

When we use the thin pencil (FP16), some numbers become SO tiny they turn into zero. This is called underflow.

Imagine trying to measure an ant with a ruler that only shows meters. The ant would be “0 meters” — we lost information!

The Solution: Make Everything BIGGER First!

GradScaler is like a magnifying glass:

  1. Before calculating: Multiply the loss by a BIG number (like 1000)
  2. Calculate gradients: Now tiny numbers are big enough to see!
  3. After calculating: Divide by 1000 to get the real values back

Simple Example

from torch.cuda.amp import GradScaler, autocast

# Create our magnifying glass
scaler = GradScaler()

for data, target in dataloader:
    optimizer.zero_grad()

    # Use mixed precision
    with autocast():
        output = model(data)
        loss = criterion(output, target)

    # Scale up, calculate, scale down
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

What GradScaler Does Step-by-Step

graph TD A["Loss = 0.001"] --> B["Scale Up x1000"] B --> C["Scaled Loss = 1.0"] C --> D["Calculate Gradients"] D --> E["Scale Down ÷1000"] E --> F["Real Gradients"]

Smart Scaling

GradScaler is clever! If numbers overflow (become infinity), it:

  1. Skips that training step
  2. Reduces the scale factor
  3. Tries again with safer numbers
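
You can peek at the magnifying glass as it adjusts; here is a minimal sketch using GradScaler's get_scale() (the numbers in the comments are PyTorch's defaults):

from torch.cuda.amp import GradScaler

scaler = GradScaler()          # starts at a large scale (65536.0 by default)
print(scaler.get_scale())      # 65536.0

# Each scaler.update() adjusts this number:
#   - if the gradients overflowed, the scale is halved and that optimizer step is skipped
#   - after 2000 clean steps in a row, the scale is doubled again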

🎯 Key Point: GradScaler + autocast = The perfect team for safe, fast training!


3️⃣ torch.compile: The Speed Translator 🔮

Speaking GPU Language

Imagine you’re giving instructions to a robot in English, but the robot speaks Robot-Language. Every time, a translator converts your words.

torch.compile is like teaching the robot English directly — no more translator needed!

Before torch.compile

You say: "Add A and B, then multiply by C"
         ↓
Translator converts each word
         ↓
Robot does the task
         ↓
(Slow because of translation)

After torch.compile

You say: "Add A and B, then multiply by C"
         ↓
torch.compile creates optimized Robot-Language ONCE
         ↓
Robot runs super fast!

Simple Example

import torch

# Your normal model
model = MyNeuralNetwork()

# Wave the magic wand!
model = torch.compile(model)

# Now training is FASTER
# (everything else stays the same!)
for data, target in dataloader:
    output = model(data)  # Runs optimized!
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

Compile Modes

Mode            | Speed  | Compilation Time | Best For
default         | Good   | Medium           | Most cases
reduce-overhead | Better | Longer           | Repeated runs
max-autotune    | Best   | Longest          | Production
# Choose your speed level
model = torch.compile(model, mode="max-autotune")

When to Use torch.compile?

✅ Use it when: Training takes hours/days
❌ Skip it when: Quick experiments (compilation takes time)
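
A rough way to see that one-time cost for yourself; this is just a sketch with a tiny made-up model, and exact timings depend on your GPU and PyTorch version:

import time
import torch
import torch.nn as nn

# A tiny model, only to show the one-time compilation cost
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).cuda()
model = torch.compile(model)
x = torch.randn(64, 512, device="cuda")

start = time.perf_counter()
model(x)
torch.cuda.synchronize()
print(f"First call (includes compilation): {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
model(x)
torch.cuda.synchronize()
print(f"Second call (already compiled):    {time.perf_counter() - start:.3f}s")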

🚀 Result: 10-30% faster training after compilation!


4️⃣ torch.backends.cudnn Settings ⚙️

The GPU’s Secret Settings Menu

Your GPU has hidden settings like a video game! These settings control HOW the GPU does math.

The Two Magic Settings

Setting 1: benchmark = True

torch.backends.cudnn.benchmark = True

What it does: cuDNN tries MANY algorithms for the same math (especially convolutions), then remembers the fastest one for each input shape.

Analogy: Like trying all routes to school once, then always taking the fastest one!

When to use: When your input sizes DON’T change (same image size, same batch size)

Setting 2: deterministic = True

torch.backends.cudnn.deterministic = True

What it does: GPU always does math the EXACT same way.

Analogy: Like always taking the same route, even if traffic varies.

When to use: When you need reproducible results (research papers, debugging)
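
A minimal reproducibility setup might look like the sketch below; note that seeding is an extra assumption on top of the cuDNN flag, and full determinism can require more settings depending on which ops you use:

import random
import numpy as np
import torch

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)            # also seeds CUDA generators on recent PyTorch
torch.cuda.manual_seed_all(seed)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False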

The Trade-off

graph LR A["benchmark=True"] --> B["🚀 FASTER"] A --> C["🎲 Slightly random"] D["deterministic=True"] --> E["🐢 Slower"] D --> F["✅ Same results always"]

Practical Setup

# For maximum speed (training)
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False

# For reproducibility (debugging)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

💡 Pro Tip: Use benchmark=True during training, deterministic=True when sharing results!


5️⃣ Gradient Checkpointing: Trading Time for Memory 💾

The Photo Album Problem

Imagine taking photos during a road trip. You have two choices:

  • Keep all photos: Easy to look back, but fills your phone! (Out of memory)
  • Keep some photos: Less space used, but need to retake some later

Gradient checkpointing chooses option 2!

How Normal Training Works

graph TD A["Layer 1"] --> B["Save Output"] B --> C["Layer 2"] C --> D["Save Output"] D --> E["Layer 3"] E --> F["Save Output"] F --> G["Memory Full! 😰"]

How Checkpointing Works

graph TD A["Layer 1"] --> B["💾 Checkpoint"] B --> C["Layer 2"] C --> D["Forget"] D --> E["Layer 3"] E --> F["💾 Checkpoint"] F --> G["Memory OK! 😊"]

During backward pass, we recalculate the forgotten outputs. A bit slower, but uses WAY less memory!

Simple Example

import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class BigModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = HugeLayer()
        self.layer2 = HugeLayer()
        self.layer3 = HugeLayer()

    def forward(self, x):
        # Checkpoint the expensive layers: their activations are
        # recomputed during backward instead of being stored
        x = checkpoint(self.layer1, x, use_reentrant=False)
        x = checkpoint(self.layer2, x, use_reentrant=False)
        x = self.layer3(x)
        return x

Memory vs Time Trade-off

Approach               | Memory Usage | Speed
Normal                 | 100%         | 100%
Checkpoint every layer | ~30%         | ~70%
Checkpoint some layers | ~50%         | ~85%
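
For the "checkpoint some layers" row, PyTorch also provides checkpoint_sequential, which splits an nn.Sequential model into segments and only saves the segment boundaries. A small sketch with a made-up stack of layers (the use_reentrant flag is available on recent PyTorch versions):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A made-up stack of 8 identical layers
model = nn.Sequential(*[nn.Linear(256, 256) for _ in range(8)])
x = torch.randn(32, 256, requires_grad=True)

# Split the stack into 2 segments: only the segment boundaries are saved,
# everything inside a segment is recomputed during the backward pass
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
out.sum().backward()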

🎯 Use When: Your model is too big for GPU memory!


6️⃣ Memory-Efficient Training: Every Byte Counts! 📦

The Suitcase Packing Problem

Going on vacation with a small suitcase? You need to pack smart!

Memory-efficient training is about packing your GPU memory smartly.

Technique 1: Gradient Accumulation

Problem: Batch size 64 doesn’t fit in memory.
Solution: Process 4 mini-batches of 16, then update once!

accumulation_steps = 4
optimizer.zero_grad()

for i, (data, target) in enumerate(dataloader):
    output = model(data)
    loss = criterion(output, target)

    # Divide loss (so gradients add up correctly)
    loss = loss / accumulation_steps
    loss.backward()

    # Update only every 4 steps
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Technique 2: Clear Cache Regularly

# Free up unused GPU memory
torch.cuda.empty_cache()

Technique 3: Delete Unused Variables

output = model(data)
loss = criterion(output, target)

# Don't need output anymore!
del output
loss.backward()

Technique 4: Use In-place Operations

# Normal (creates new tensor)
x = x + 1

# In-place (modifies existing tensor)
x += 1  # or x.add_(1)

Memory Monitoring

# Check memory usage
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

The Complete Memory-Saving Recipe

from torch.cuda.amp import GradScaler, autocast

# 1. Enable mixed precision
scaler = GradScaler()

# 2. Set cudnn for speed
torch.backends.cudnn.benchmark = True

# 3. Compile model
model = torch.compile(model)

# 4. Use gradient accumulation
accumulation_steps = 4
optimizer.zero_grad()

for epoch in range(num_epochs):
    for i, (data, target) in enumerate(dataloader):
        with autocast():
            output = model(data)
            loss = criterion(output, target) / accumulation_steps

        scaler.scale(loss).backward()

        if (i + 1) % accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            # Optional: free cached memory (calling this too often can slow training)
            torch.cuda.empty_cache()

🏁 Putting It All Together

Your Training Performance Checklist

graph TD A["Start Training"] --> B{GPU Memory OK?} B -->|No| C["Add Gradient Checkpointing"] B -->|Yes| D{Need Speed?} D -->|Yes| E["Enable Mixed Precision"] E --> F["Add GradScaler"] F --> G["torch.compile model"] G --> H["cudnn.benchmark = True"] H --> I["🚀 FAST Training!"] C --> D

Quick Reference

Technique             | Memory Saved | Speed Boost   | Difficulty
Mixed Precision       | 50%          | 1.5-2x        | Easy
GradScaler            | -            | Works with MP | Easy
torch.compile         | -            | 10-30%        | Easy
cudnn.benchmark       | -            | 5-10%         | Easy
Grad Checkpointing    | 60-70%       | -20%          | Medium
Gradient Accumulation | Lots!        | -             | Easy

🎉 You Did It!

You now know the 6 turbo boosters for PyTorch training:

  1. Mixed Precision - Use two number sizes smartly
  2. GradScaler - Keep tiny numbers safe
  3. torch.compile - Speak GPU language directly
  4. cuDNN settings - Unlock hidden GPU powers
  5. Gradient Checkpointing - Trade time for memory
  6. Memory-Efficient Training - Pack your GPU smartly

Your AI race car is now ready to zoom! 🏎️💨


“The best way to learn fast is to teach your computer to learn fast!”
