Scaling Training in PyTorch: Teaching Your AI to Work as a Team
Imagine you have ONE chef trying to cook dinner for 1000 people. That's impossible, right? But what if you had 100 chefs working together? That's what distributed training is all about!
The Big Picture: Why Scale?
Think of training a neural network like baking the worldโs biggest cake:
- Your oven (GPU) can only fit so much
- The recipe (model) might be HUGE
- You need the cake ready by tomorrow!
Solution? Use MANY ovens (GPUs) working together!
graph TD A["๐ Giant Model"] --> B["Split the Work"] B --> C["GPU 1"] B --> D["GPU 2"] B --> E["GPU 3"] B --> F["GPU N..."] C --> G["Combine Results"] D --> G E --> G F --> G G --> H["๐ Trained Model!"]
Distributed Training Setup
Before our chef team can cook, they need a kitchen setup. In PyTorch, this means:
What You Need:
- Multiple GPUs - Your cooking stations
- Backend - How chefs communicate (`nccl` for GPUs, `gloo` for CPUs)
- World Size - Total number of chefs (processes)
- Rank - Each chef's ID number (0, 1, 2, ...)
Simple Setup Example:
```python
import torch.distributed as dist

# Initialize the kitchen!
dist.init_process_group(
    backend='nccl',   # Fast GPU talk
    world_size=4,     # 4 GPUs total
    rank=0            # I am GPU #0
)
```
Real Life Analogy:
It's like setting up walkie-talkies for your chef team before cooking starts!
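In real runs you rarely hard-code `world_size` and `rank`. A launcher such as `torchrun` starts one process per GPU and passes those values through environment variables. Here is a minimal sketch of that pattern (the script name `train.py` below is just a placeholder):

```python
import os
import torch
import torch.distributed as dist

# torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process it launches,
# so init_process_group can read them instead of taking explicit arguments.
dist.init_process_group(backend='nccl')

# Pin this process to its own GPU
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

print(f"Hello from chef {dist.get_rank()} of {dist.get_world_size()}")
```

Launch it with `torchrun --nproc_per_node=4 train.py` and four chef processes appear, one per GPU.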
DataParallel (DP): The Lazy Approach
DataParallel is the easiest way to use multiple GPUs. Think of it like this:
One head chef (GPU 0) gives orders. Other chefs follow. The head chef does extra work collecting results.
How It Works:
graph TD A["Data Batch"] --> B["GPU 0 - Boss"] B -->|Copy Model| C["GPU 1"] B -->|Copy Model| D["GPU 2"] B -->|Copy Model| E["GPU 3"] C -->|Send Results| B D -->|Send Results| B E -->|Send Results| B B --> F["Update Weights"]
Code Example:
```python
import torch.nn as nn

model = MyModel()

# Wrap it! That's all!
model = nn.DataParallel(model)
model = model.cuda()

# Train normally
output = model(input)
```
The Problem with DataParallel:
- GPU 0 does ALL the collecting work
- GPU 0 becomes a bottleneck (traffic jam!)
- Not efficient for serious training
Like having ALL cars exit through ONE toll booth!
DistributedDataParallel (DDP): The Pro Way
DDP is like giving EACH chef their own walkie-talkie. No boss collecting everything. Everyone talks to everyone!
Why DDP is Better:
| DataParallel | DDP |
|---|---|
| One boss GPU | Everyone equal |
| Traffic jam | Smooth flow |
| Slow | FAST! |
graph TD A["Each GPU has full model"] --> B["GPU 0 trains on batch 0"] A --> C["GPU 1 trains on batch 1"] A --> D["GPU 2 trains on batch 2"] B <-->|All-Reduce| C C <-->|All-Reduce| D B <-->|All-Reduce| D B --> E["All have same weights!"] C --> E D --> E
Code Example:
```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group first
dist.init_process_group(backend='nccl')

# Each process drives one GPU (torchrun sets LOCAL_RANK)
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Create model on this GPU, then wrap with DDP
model = MyModel().to(local_rank)
model = DDP(model, device_ids=[local_rank])
```
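To make sure each GPU cooks a different part of the data, DDP is usually paired with `DistributedSampler`. A rough sketch of the training loop, where the dataset, loss function, and optimizer are placeholders:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(train_dataset)   # gives each rank its own slice
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)                  # reshuffle differently every epoch
    for x, y in loader:
        x, y = x.to(local_rank), y.to(local_rank)
        loss = loss_fn(model(x), y)
        loss.backward()                       # DDP averages gradients during backward
        optimizer.step()
        optimizer.zero_grad()
```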
The Magic: All-Reduce
Instead of sending everything to one GPU:
- Each GPU calculates its gradients
- They AVERAGE gradients together
- Everyone gets the same result!
Like all chefs tasting the soup together and agreeing it needs more salt!
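DDP runs this all-reduce for you during `backward()`, but the primitive is easy to see on its own. A tiny sketch of what "average the gradients" means, assuming the process group and `local_rank` from the code above are already set up:

```python
import torch
import torch.distributed as dist

# Pretend this is one rank's gradient (a toy value: its own rank number)
grad = torch.tensor([float(dist.get_rank())], device=local_rank)

# Sum across all ranks, then divide by the team size = the average
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= dist.get_world_size()

# Every rank now holds the exact same averaged tensor
print(f"Rank {dist.get_rank()} sees {grad.item()}")
```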
FSDP: Fully Sharded Data Parallel
When your model is SO BIG that even ONE copy won't fit on a GPU, you need FSDP.
The Problem:
Your elephant (model) is too big for any room (GPU)!
The Solution:
Cut the elephant into pieces. Each room holds ONE piece. When you need the whole elephant, everyone brings their pieces together!
graph TD A["Giant Model 100GB"] --> B["Shard 1: 25GB"] A --> C["Shard 2: 25GB"] A --> D["Shard 3: 25GB"] A --> E["Shard 4: 25GB"] B --> F["GPU 0"] C --> G["GPU 1"] D --> H["GPU 2"] E --> I["GPU 3"]
Code Example:
```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# Wrap your model with FSDP
model = FSDP(
    MyHugeModel(),
    # Shard everything!
    sharding_strategy=ShardingStrategy.FULL_SHARD,
)
```
FSDP Modes:
| Mode | What Gets Sharded | Memory Saved |
|---|---|---|
| FULL_SHARD | Everything | Maximum |
| SHARD_GRAD_OP | Gradients + Optimizer | Good |
| NO_SHARD | Nothing (like DDP) | None |
Think of FSDP like a puzzle. Each GPU holds some pieces. During forward pass, pieces come together. After backward, they split again!
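The modes in the table map onto the `ShardingStrategy` enum. Here is a hedged sketch of choosing a mode and letting FSDP wrap submodules automatically; the 1M-parameter threshold is just an illustrative value:

```python
import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

model = FSDP(
    MyHugeModel(),
    # FULL_SHARD saves the most memory; SHARD_GRAD_OP keeps full parameters but
    # shards gradients and optimizer state; NO_SHARD behaves like plain DDP.
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
    # Wrap submodules with more than ~1M parameters so sharding happens layer by layer
    auto_wrap_policy=functools.partial(
        size_based_auto_wrap_policy, min_num_params=1_000_000
    ),
)
```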
Process Groups: Building Your Teams
Sometimes you want GPUs to talk in smaller teams. That's what Process Groups are for!
Example Scenario:
You have 8 GPUs. You want:
- GPUs 0-3 to form Team A
- GPUs 4-7 to form Team B
```mermaid
graph TD
    subgraph Team A
        A0["GPU 0"]
        A1["GPU 1"]
        A2["GPU 2"]
        A3["GPU 3"]
    end
    subgraph Team B
        B0["GPU 4"]
        B1["GPU 5"]
        B2["GPU 6"]
        B3["GPU 7"]
    end
    A0 <--> A1
    A1 <--> A2
    A2 <--> A3
    B0 <--> B1
    B1 <--> B2
    B2 <--> B3
```
Code Example:
```python
import torch.distributed as dist

# Create Team A (GPUs 0-3)
team_a = dist.new_group(ranks=[0, 1, 2, 3])

# Create Team B (GPUs 4-7)
team_b = dist.new_group(ranks=[4, 5, 6, 7])

# Talk within your team only!
dist.all_reduce(tensor, group=team_a)
```
When to Use Process Groups:
- Model Parallelism - Different model parts on different teams
- Pipeline Parallelism - Stages handled by different teams
- Hybrid Strategies - Combine multiple approaches
Like basketball teams. Players on the same team pass to each other. Sometimes teams play together!
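One practical detail: `new_group` has to be called by every process, even the ones that will not join the group. Each rank then keeps the handle for the team it actually belongs to. A small sketch, assuming the 8-GPU setup above and that each rank has already pinned its own GPU with `torch.cuda.set_device`:

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()

# EVERY rank calls new_group for BOTH teams, even if it only plays on one
team_a = dist.new_group(ranks=[0, 1, 2, 3])
team_b = dist.new_group(ranks=[4, 5, 6, 7])

# Pick the team this rank belongs to
my_team = team_a if rank < 4 else team_b

# This all-reduce only sums within the 4-GPU team, not across all 8 GPUs
t = torch.ones(1, device='cuda') * rank
dist.all_reduce(t, op=dist.ReduceOp.SUM, group=my_team)
```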
Large Model Strategies
When your model has BILLIONS of parameters, you need special strategies!
Strategy 1: Tensor Parallelism
Split individual layers across GPUs.
graph LR A["Big Matrix"] --> B["Part 1 on GPU 0"] A --> C["Part 2 on GPU 1"] A --> D["Part 3 on GPU 2"]
Like cutting a pizza. Each person eats a slice, not the whole thing!
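The core trick fits in a few lines of plain PyTorch: split a weight matrix column-wise, let each GPU multiply its own slice, then stitch the outputs back together. A toy sketch on two GPUs, with no communication library involved, just to show the math:

```python
import torch

# One big linear layer: y = x @ W, with W of shape (1024, 4096)
W = torch.randn(1024, 4096)

# Column-wise split: each GPU owns half of the output features
W0 = W[:, :2048].to('cuda:0')
W1 = W[:, 2048:].to('cuda:1')

x = torch.randn(8, 1024)

# Each GPU computes its slice of the output independently
y0 = x.to('cuda:0') @ W0                      # shape (8, 2048)
y1 = x.to('cuda:1') @ W1                      # shape (8, 2048)

# Gather the slices back into the full result
y = torch.cat([y0.cpu(), y1.cpu()], dim=1)    # shape (8, 4096)
```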
Strategy 2: Pipeline Parallelism
Different model layers on different GPUs.
graph LR A["Input"] --> B["Layers 1-10<br/>GPU 0"] B --> C["Layers 11-20<br/>GPU 1"] C --> D["Layers 21-30<br/>GPU 2"] D --> E["Output"]
Like an assembly line. Each worker does one step, passes it on!
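In its simplest form, pipeline parallelism is just placing blocks of layers on different GPUs and handing activations from one to the next. Real pipelines also split each batch into micro-batches so no GPU sits idle; this toy sketch skips that part:

```python
import torch
import torch.nn as nn

# Two pipeline stages on two GPUs (layer sizes are arbitrary examples)
stage0 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to('cuda:0')
stage1 = nn.Sequential(nn.Linear(512, 10)).to('cuda:1')

x = torch.randn(32, 512, device='cuda:0')

# Forward pass: the activation hops from GPU 0 to GPU 1
h = stage0(x)
out = stage1(h.to('cuda:1'))
```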
Strategy 3: Combine Everything!
For really BIG models (like GPT-4), you combine:
- FSDP - Shard the parameters
- Tensor Parallel - Split layers horizontally
- Pipeline Parallel - Split layers vertically
```python
# Real-world setup for large models
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor.parallel import (
    parallelize_module, ColwiseParallel, RowwiseParallel,
)

# Step 1: Tensor Parallelism - the plan maps submodule names
# (here placeholders "w1"/"w2") to how each one gets split
model = parallelize_module(
    model, tp_mesh,
    {"w1": ColwiseParallel(), "w2": RowwiseParallel()},
)

# Step 2: FSDP wraps what's left, sharding over the data-parallel mesh
model = FSDP(model, device_mesh=dp_mesh)
```
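Where do `tp_mesh` and `dp_mesh` come from? In recent PyTorch versions you can build a 2D device mesh and slice it by dimension name. A sketch assuming 8 GPUs arranged as 2 data-parallel replicas of 4 tensor-parallel ranks:

```python
from torch.distributed.device_mesh import init_device_mesh

# 8 GPUs laid out as a 2 x 4 grid: "dp" for data parallel, "tp" for tensor parallel
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

dp_mesh = mesh_2d["dp"]   # handed to FSDP
tp_mesh = mesh_2d["tp"]   # handed to parallelize_module
```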
Memory Tricks:
| Technique | What It Does |
|---|---|
| Gradient Checkpointing | Trade compute for memory |
| Mixed Precision (FP16) | Half the memory |
| Activation Offloading | Move to CPU when not needed |
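Two of these tricks are only a few lines in PyTorch. A rough sketch combining mixed precision with gradient checkpointing; the model block, data loader, loss function, and optimizer are placeholders:

```python
import torch
from torch.utils.checkpoint import checkpoint

scaler = torch.cuda.amp.GradScaler()

for x, y in loader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        # Gradient checkpointing: skip storing activations for this block
        # and recompute them during backward to save memory
        h = checkpoint(expensive_block, x, use_reentrant=False)
        loss = loss_fn(h, y)
    scaler.scale(loss).backward()   # scale the loss so FP16 gradients don't underflow
    scaler.step(optimizer)
    scaler.update()
```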
Quick Decision Guide
Question: How do I scale my training?
graph TD A["Does model fit<br/>on 1 GPU?"] -->|Yes| B["Is speed enough?"] A -->|No| C["Use FSDP"] B -->|Yes| D["No scaling needed!"] B -->|No| E["Use DDP"] C --> F["Model still too big?"] F -->|Yes| G["Add Tensor/Pipeline<br/>Parallelism"] F -->|No| H["Done! Train away!"]
Key Takeaways
- DataParallel - Easy but slow. Like one boss, many workers.
- DDP - Fast and efficient. Everyone is equal!
- FSDP - For HUGE models. Shard everything!
- Process Groups - Make teams within your team.
- Large Models - Combine strategies: FSDP + Tensor + Pipeline
You Did It!
You now understand how to teach many GPUs to work together. Start with DDP for most cases. Move to FSDP when models get big. Add fancy strategies when you're training the next GPT!
Remember: The best distributed system is one where everyone does their fair share!
```python
# Your journey begins here!
dist.init_process_group(backend='nccl')

# Now go train something amazing!
```
