Distributed Training: Teaching Many Computers to Work Together
The Big Picture: A Pizza Party Analogy
Imagine you need to make 1000 pizzas for a huge party. You could:
- One chef, one oven → Takes forever!
- Many chefs, many ovens → Super fast!
Distributed training is like option 2. Instead of one computer doing all the work, we get MANY computers to share the job. Each computer learns a little piece, and together they learn the whole thing!
What is Distributed Training?
Think of training an AI model like teaching a classroom of students. If you have ONE teacher and ONE student, learning takes a long time. But if you have MANY teachers and MANY students working together, everyone learns faster!
The Core Idea
Single Computer:
🖥️ → processes 100 images/second
Four Computers Together:
🖥️🖥️🖥️🖥️ → processes 400 images/second!
Real Example: Training GPT-like models would take YEARS on one computer. With distributed training, it takes WEEKS.
Distribution Strategies: Different Ways to Share the Work
Think of these strategies like different ways to divide work at a pizza factory.
1. Mirrored Strategy (Data Parallelism)
Analogy: Every chef has the SAME recipe, but different ingredients to work with.
graph TD A["Same Model Copy"] --> B["GPU 1: Batch A"] A --> C["GPU 2: Batch B"] A --> D["GPU 3: Batch C"] B --> E["Combine Results"] C --> E D --> E E --> F["Update All Models"]
How it works:
- Each GPU gets a COPY of the model
- Each GPU processes DIFFERENT data
- All GPUs share what they learned
- Everyone stays in sync!
TensorFlow Code:
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = create_model()
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy'
    )
2. Multi-Worker Strategy
Analogy: Multiple kitchens in different buildings, all making the same dish!
graph TD A["Worker 1 Machine"] --> D["Share Updates"] B["Worker 2 Machine"] --> D C["Worker 3 Machine"] --> D D --> E["Synchronized Model"]
When to use: When you have multiple physical machines, not just multiple GPUs in one machine.
Configuration Example:
# Worker 0 config
import json
import os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": [
            "machine1:12345",
            "machine2:12345"
        ]
    },
    "task": {"type": "worker", "index": 0}
})
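Once TF_CONFIG is set on every machine, creating the strategy looks almost identical to the mirrored case. Here is a minimal sketch, assuming the TF_CONFIG block above has already run on each worker and that create_model() is your own model-building helper:

# Every worker runs this same script; the strategy reads TF_CONFIG
# to discover the other machines in the cluster.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = create_model()   # your model-building function
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy'
    )

# model.fit(...) now trains synchronously across all machines listed
# under "worker", not just the GPUs inside one box.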
3. Parameter Server Strategy
Analogy: One head chef (parameter server) keeps the master recipe. Other chefs (workers) do the cooking and report back!
graph TD
  PS["Parameter Server"] --> W1["Worker 1"]
  PS --> W2["Worker 2"]
  PS --> W3["Worker 3"]
  W1 --> PS
  W2 --> PS
  W3 --> PS
Best for: Very large models whose variables don't fit on a single GPU (the parameters live on the parameter servers) and for asynchronous training on large clusters of workers.
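A rough sketch of how this can look with TensorFlow's tf.distribute API, assuming TF_CONFIG now also lists "ps" (parameter server) and "chief" tasks alongside the workers; the exact class location varies a little between TensorFlow versions, so treat this as an outline rather than a drop-in script:

# Run on the coordinator (chief). The cluster layout comes from TF_CONFIG.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

with strategy.scope():
    # Variables are created on the parameter servers, not on the workers.
    model = create_model()   # your model-building function
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy'
    )

# Workers pull the latest variables, compute gradients on their own data,
# and send updates back asynchronously.
# model.fit(dataset, epochs=..., steps_per_epoch=...) drives the workers.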
Distributed Datasets: Feeding Many Hungry GPUs
The Problem
If you have 4 GPUs but only send data to 1, the other 3 are just sitting there doing nothing!
Bad:
GPU 1: 🍕🍕🍕🍕 (working hard)
GPU 2: 😴 (sleeping)
GPU 3: 😴 (sleeping)
GPU 4: 😴 (sleeping)
Good:
GPU 1: 🍕 (working)
GPU 2: 🍕 (working)
GPU 3: 🍕 (working)
GPU 4: 🍕 (working)
The Solution: Distributed Datasets
# Create a distributed dataset
def make_dataset():
    dataset = tf.data.Dataset.from_tensor_slices(
        (images, labels)
    )
    dataset = dataset.shuffle(1000)
    dataset = dataset.batch(GLOBAL_BATCH_SIZE)
    return dataset

# Distribute it across GPUs
dist_dataset = strategy.experimental_distribute_dataset(
    make_dataset()
)
Key Concept: Global vs Per-Replica Batch Size
| Type | Meaning | Example |
|---|---|---|
| Global Batch | Total across ALL GPUs | 256 |
| Per-Replica | Each GPU gets | 64 (if 4 GPUs) |
GLOBAL_BATCH_SIZE = 256
num_replicas = strategy.num_replicas_in_sync # 4
per_replica_batch = GLOBAL_BATCH_SIZE // num_replicas # 64
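This split also matters if you ever write your own training step: each replica should average its loss over the GLOBAL batch size, not over its own smaller slice, so that the gradients summed across GPUs come out right. A minimal sketch of that pattern, assuming a model and optimizer built inside strategy.scope() and the dist_dataset from above (the names are illustrative, not a fixed API):

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True,
    reduction=tf.keras.losses.Reduction.NONE)   # keep per-example losses

@tf.function
def train_step(dist_inputs):
    def step_fn(inputs):
        images, labels = inputs
        with tf.GradientTape() as tape:
            logits = model(images, training=True)
            per_example_loss = loss_fn(labels, logits)
            # Divide by the GLOBAL batch size, not the per-replica one.
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    per_replica_losses = strategy.run(step_fn, args=(dist_inputs,))
    # Summing the per-replica pieces recovers the global-batch average.
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

# for batch in dist_dataset:
#     loss = train_step(batch)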
GPU Configuration: Setting Up Your Kitchen
Step 1: Check What You Have
gpus = tf.config.list_physical_devices('GPU')
print(f"Found {len(gpus)} GPUs!")
# Output: Found 4 GPUs!
Step 2: Enable Memory Growth
Problem: By default, TensorFlow grabs ALL GPU memory. This is like one chef taking over the entire kitchen!
Solution: Let GPUs grow memory as needed.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(
        gpu, True
    )
Step 3: Limit Visible GPUs (Optional)
Sometimes you only want to use SOME GPUs:
# Only use GPU 0 and GPU 1
tf.config.set_visible_devices(
    gpus[0:2], 'GPU'
)
Common GPU Settings Summary
graph TD A["GPU Configuration"] --> B["Memory Growth"] A --> C["Visible Devices"] A --> D["Memory Limit"] B --> E["Grow as needed"] C --> F["Choose which GPUs"] D --> G["Set max memory per GPU"]
Mixed Precision Training: Working Smarter, Not Harder
The Clever Trick
Imagine writing a shopping list. You don't need perfect calligraphy for a shopping list - quick notes work fine!
Mixed Precision uses:
- FP16 (Half Precision): Quick calculations, less memory
- FP32 (Full Precision): When we need accuracy
Why It Matters
| Precision | Memory | Speed | Accuracy |
|---|---|---|---|
| FP32 | High | Slower | Perfect |
| FP16 | Low | 2-3x Faster | Good enough! |
| Mixed | Balanced | Fast | Best of both! |
How to Enable It
from tensorflow.keras import mixed_precision
# Set the policy globally
mixed_precision.set_global_policy('mixed_float16')
# Your model automatically uses mixed precision!
model = create_model()
The Magic: Loss Scaling
FP16 can represent only a much narrower range of values, so tiny gradients can underflow to ZERO (bad!). Loss scaling multiplies the loss by a large factor before the backward pass so the gradients stay in a safe FP16 range, then scales them back down before the weights are updated.
graph LR A["Small Gradient"] --> B["Scale Up x1000"] B --> C["Safe FP16 Range"] C --> D["Calculate"] D --> E["Scale Down /1000"] E --> F["Correct Result"]
TensorFlow handles this automatically when you use mixed_float16 policy!
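If you skip model.fit and write a custom training loop, you apply loss scaling yourself by wrapping the optimizer in a LossScaleOptimizer. A rough sketch using the tf.keras.mixed_precision interface (the exact wrapper API differs between Keras versions, and images, labels, loss_fn, and model are placeholders here):

optimizer = tf.keras.optimizers.Adam()
# The wrapper scales the loss up before backprop and scales the
# gradients back down afterwards, adjusting the factor dynamically.
optimizer = mixed_precision.LossScaleOptimizer(optimizer)

with tf.GradientTape() as tape:
    predictions = model(images, training=True)
    loss = loss_fn(labels, predictions)
    scaled_loss = optimizer.get_scaled_loss(loss)        # scale up

scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
grads = optimizer.get_unscaled_gradients(scaled_grads)   # scale back down
optimizer.apply_gradients(zip(grads, model.trainable_variables))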
Complete Mixed Precision Example
# Step 1: Enable mixed precision
mixed_precision.set_global_policy('mixed_float16')

# Step 2: Create model (automatic FP16 layers!)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(10)
])

# Step 3: Create the optimizer (compile wraps it with loss scaling
# automatically under the mixed_float16 policy)
optimizer = tf.keras.optimizers.Adam()

# Step 4: Train as usual
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
model.fit(dataset, epochs=10)
Putting It All Together
Here's a complete example combining everything:
import tensorflow as tf
from tensorflow.keras import mixed_precision

# 1. Enable mixed precision
mixed_precision.set_global_policy('mixed_float16')

# 2. Set up distributed strategy
strategy = tf.distribute.MirroredStrategy()
print(f'Number of GPUs: {strategy.num_replicas_in_sync}')

# 3. Configure batch sizes
GLOBAL_BATCH = 256
PER_REPLICA = GLOBAL_BATCH // strategy.num_replicas_in_sync

# 4. Create distributed dataset
def make_dataset():
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train / 255.0
    dataset = tf.data.Dataset.from_tensor_slices(
        (x_train, y_train)
    )
    return dataset.shuffle(10000).batch(GLOBAL_BATCH)

dist_dataset = strategy.experimental_distribute_dataset(
    make_dataset()
)

# 5. Create model within strategy scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        # Keep the output layer in float32 for numeric stability
        # when mixed precision is enabled.
        tf.keras.layers.Dense(10, dtype='float32')
    ])
    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True
        ),
        metrics=['accuracy']
    )

# 6. Train!
model.fit(dist_dataset, epochs=5)
Key Takeaways
- Distributed Training = Many computers working together
- MirroredStrategy = Same model, different data, sync results
- Distributed Datasets = Feed all GPUs equally
- GPU Configuration = Control memory and device usage
- Mixed Precision = Faster training with FP16/FP32 mix
You're Now Ready!
You've learned how to:
- Split training across multiple GPUs
- Choose the right distribution strategy
- Set up datasets for distributed training
- Configure GPUs properly
- Speed up training with mixed precision
Next step: Try it yourself! Start with MirroredStrategy on 2 GPUs and watch your training fly!
Before: Training... 10 hours remaining
After: Training... 2 hours remaining!
You've got this!
