Distributed Training: Teaching Many Computers to Work Together
The Big Picture: A Pizza Party Analogy
Imagine you need to make 1000 pizzas for a huge party. You could:
- One chef, one oven → Takes forever!
- Many chefs, many ovens → Super fast!
Distributed training is like option 2. Instead of one computer doing all the work, we get MANY computers to share the job. Each computer learns a little piece, and together they learn the whole thing!
What is Distributed Training?
Think of training an AI model like teaching a classroom of students. If you have ONE teacher and ONE student, learning takes a long time. But if you have MANY teachers and MANY students working together, everyone learns faster!
The Core Idea
Single Computer:
🖥️ → processes 100 images/second
Four Computers Together:
🖥️🖥️🖥️🖥️ → processes 400 images/second!
Real Example: Training GPT-like models would take YEARS on one computer. With distributed training, it takes WEEKS.
Distribution Strategies: Different Ways to Share the Work
Think of these strategies like different ways to divide work at a pizza factory.
1. Mirrored Strategy (Data Parallelism)
Analogy: Every chef has the SAME recipe, but different ingredients to work with.
graph TD A["Same Model Copy"] --> B["GPU 1: Batch A"] A --> C["GPU 2: Batch B"] A --> D["GPU 3: Batch C"] B --> E["Combine Results"] C --> E D --> E E --> F["Update All Models"]
How it works:
- Each GPU gets a COPY of the model
- Each GPU processes DIFFERENT data
- All GPUs share what they learned
- Everyone stays in sync!
TensorFlow Code:
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = create_model()
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy'
    )
2. Multi-Worker Strategy
Analogy: Multiple kitchens in different buildings, all making the same dish!
graph TD A["Worker 1 Machine"] --> D["Share Updates"] B["Worker 2 Machine"] --> D C["Worker 3 Machine"] --> D D --> E["Synchronized Model"]
When to use: When you have multiple physical machines, not just multiple GPUs in one machine.
Configuration Example:
# Worker 0 config
import json
import os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": [
            "machine1:12345",
            "machine2:12345"
        ]
    },
    "task": {"type": "worker", "index": 0}
})
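Once TF_CONFIG is set on every machine, creating the strategy looks almost identical to the mirrored case. Here is a minimal sketch, assuming the TF_CONFIG block above has already run on each worker and that create_model() is your own model-building helper:

# Every worker runs this same script; the strategy reads TF_CONFIG
# to discover the other machines in the cluster.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = create_model()   # your model-building function
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy'
    )

# model.fit(...) now trains synchronously across all machines listed
# under "worker", not just the GPUs inside one box.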
3. Parameter Server Strategy
Analogy: One head chef (parameter server) keeps the master recipe. Other chefs (workers) do the cooking and report back!
graph TD
  PS["Parameter Server"] --> W1["Worker 1"]
  PS --> W2["Worker 2"]
  PS --> W3["Worker 3"]
  W1 --> PS
  W2 --> PS
  W3 --> PS
Best for: Very large models whose variables don't fit on a single GPU (the parameters live on the parameter servers) and for asynchronous training on large clusters of workers.
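A rough sketch of how this can look with TensorFlow's tf.distribute API, assuming TF_CONFIG now also lists "ps" (parameter server) and "chief" tasks alongside the workers; the exact class location varies a little between TensorFlow versions, so treat this as an outline rather than a drop-in script:

# Run on the coordinator (chief). The cluster layout comes from TF_CONFIG.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

with strategy.scope():
    # Variables are created on the parameter servers, not on the workers.
    model = create_model()   # your model-building function
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy'
    )

# Workers pull the latest variables, compute gradients on their own data,
# and send updates back asynchronously.
# model.fit(dataset, epochs=..., steps_per_epoch=...) drives the workers.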
Distributed Datasets: Feeding Many Hungry GPUs
The Problem
If you have 4 GPUs but only send data to 1, the other 3 are just sitting there doing nothing!
Bad:
GPU 1: 🍕🍕🍕🍕 (working hard)
GPU 2: 😴 (sleeping)
GPU 3: 😴 (sleeping)
GPU 4: 😴 (sleeping)
Good:
GPU 1: 🍕 (working)
GPU 2: 🍕 (working)
GPU 3: 🍕 (working)
GPU 4: 🍕 (working)
The Solution: Distributed Datasets
# Create a distributed dataset
def make_dataset():
    dataset = tf.data.Dataset.from_tensor_slices(
        (images, labels)
    )
    dataset = dataset.shuffle(1000)
    dataset = dataset.batch(GLOBAL_BATCH_SIZE)
    return dataset

# Distribute it across GPUs
dist_dataset = strategy.experimental_distribute_dataset(
    make_dataset()
)
Key Concept: Global vs Per-Replica Batch Size
| Type | Meaning | Example |
|---|---|---|
| Global Batch | Total across ALL GPUs | 256 |
| Per-Replica | Each GPU gets | 64 (if 4 GPUs) |
GLOBAL_BATCH_SIZE = 256
num_replicas = strategy.num_replicas_in_sync # 4
per_replica_batch = GLOBAL_BATCH_SIZE // num_replicas # 64
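This split also matters if you ever write your own training step: each replica should average its loss over the GLOBAL batch size, not over its own smaller slice, so that the gradients summed across GPUs come out right. A minimal sketch of that pattern, assuming a model and optimizer built inside strategy.scope() and the dist_dataset from above (the names are illustrative, not a fixed API):

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True,
    reduction=tf.keras.losses.Reduction.NONE)   # keep per-example losses

@tf.function
def train_step(dist_inputs):
    def step_fn(inputs):
        images, labels = inputs
        with tf.GradientTape() as tape:
            logits = model(images, training=True)
            per_example_loss = loss_fn(labels, logits)
            # Divide by the GLOBAL batch size, not the per-replica one.
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    per_replica_losses = strategy.run(step_fn, args=(dist_inputs,))
    # Summing the per-replica pieces recovers the global-batch average.
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

# for batch in dist_dataset:
#     loss = train_step(batch)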
GPU Configuration: Setting Up Your Kitchen
Step 1: Check What You Have
gpus = tf.config.list_physical_devices('GPU')
print(f"Found {len(gpus)} GPUs!")
# Output: Found 4 GPUs!
Step 2: Enable Memory Growth
Problem: By default, TensorFlow grabs ALL GPU memory. This is like one chef taking over the entire kitchen!
Solution: Let GPUs grow memory as needed.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(
        gpu, True
    )
Step 3: Limit Visible GPUs (Optional)
Sometimes you only want to use SOME GPUs:
# Only use GPU 0 and GPU 1
tf.config.set_visible_devices(
    gpus[0:2], 'GPU'
)
Common GPU Settings Summary
graph TD A["GPU Configuration"] --> B["Memory Growth"] A --> C["Visible Devices"] A --> D["Memory Limit"] B --> E["Grow as needed"] C --> F["Choose which GPUs"] D --> G["Set max memory per GPU"]
Mixed Precision Training: Working Smarter, Not Harder
The Clever Trick
Imagine writing a shopping list. You don't need perfect calligraphy for a shopping list - quick notes work fine!
Mixed Precision uses:
- FP16 (Half Precision): Quick calculations, less memory
- FP32 (Full Precision): When we need accuracy
Why It Matters
| Precision | Memory | Speed | Accuracy |
|---|---|---|---|
| FP32 | High | Slower | Perfect |
| FP16 | Low | 2-3x Faster | Good enough! |
| Mixed | Balanced | Fast | Best of both! |
How to Enable It
from tensorflow.keras import mixed_precision
# Set the policy globally
mixed_precision.set_global_policy('mixed_float16')
# Your model automatically uses mixed precision!
model = create_model()
The Magic: Loss Scaling
FP16 can represent only a much narrower range of values, so tiny gradients can underflow to ZERO (bad!). Loss scaling multiplies the loss by a large factor before the backward pass so the gradients stay in a safe FP16 range, then scales them back down before the weights are updated.
graph LR A["Small Gradient"] --> B["Scale Up x1000"] B --> C["Safe FP16 Range"] C --> D["Calculate"] D --> E["Scale Down /1000"] E --> F["Correct Result"]
TensorFlow handles this automatically when you use mixed_float16 policy!
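If you skip model.fit and write a custom training loop, you apply loss scaling yourself by wrapping the optimizer in a LossScaleOptimizer. A rough sketch using the tf.keras.mixed_precision interface (the exact wrapper API differs between Keras versions, and images, labels, loss_fn, and model are placeholders here):

optimizer = tf.keras.optimizers.Adam()
# The wrapper scales the loss up before backprop and scales the
# gradients back down afterwards, adjusting the factor dynamically.
optimizer = mixed_precision.LossScaleOptimizer(optimizer)

with tf.GradientTape() as tape:
    predictions = model(images, training=True)
    loss = loss_fn(labels, predictions)
    scaled_loss = optimizer.get_scaled_loss(loss)        # scale up

scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
grads = optimizer.get_unscaled_gradients(scaled_grads)   # scale back down
optimizer.apply_gradients(zip(grads, model.trainable_variables))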
Complete Mixed Precision Example
# Step 1: Enable mixed precision
mixed_precision.set_global_policy('mixed_float16')

# Step 2: Create model (automatic FP16 layers!)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(10)
])

# Step 3: Create the optimizer (compile wraps it with loss scaling
# automatically under the mixed_float16 policy)
optimizer = tf.keras.optimizers.Adam()

# Step 4: Train as usual
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
model.fit(dataset, epochs=10)
Putting It All Together
Here's a complete example combining everything:
import tensorflow as tf
from tensorflow.keras import mixed_precision

# 1. Enable mixed precision
mixed_precision.set_global_policy('mixed_float16')

# 2. Set up distributed strategy
strategy = tf.distribute.MirroredStrategy()
print(f'Number of GPUs: {strategy.num_replicas_in_sync}')

# 3. Configure batch sizes
GLOBAL_BATCH = 256
PER_REPLICA = GLOBAL_BATCH // strategy.num_replicas_in_sync

# 4. Create distributed dataset
def make_dataset():
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train / 255.0
    dataset = tf.data.Dataset.from_tensor_slices(
        (x_train, y_train)
    )
    return dataset.shuffle(10000).batch(GLOBAL_BATCH)

dist_dataset = strategy.experimental_distribute_dataset(
    make_dataset()
)

# 5. Create model within strategy scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        # Keep the output layer in float32 for numeric stability
        # when mixed precision is enabled.
        tf.keras.layers.Dense(10, dtype='float32')
    ])
    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True
        ),
        metrics=['accuracy']
    )

# 6. Train!
model.fit(dist_dataset, epochs=5)
Key Takeaways
- Distributed Training = Many computers working together
- MirroredStrategy = Same model, different data, sync results
- Distributed Datasets = Feed all GPUs equally
- GPU Configuration = Control memory and device usage
- Mixed Precision = Faster training with FP16/FP32 mix
You're Now Ready!
You've learned how to:
- Split training across multiple GPUs
- Choose the right distribution strategy
- Set up datasets for distributed training
- Configure GPUs properly
- Speed up training with mixed precision
Next step: Try it yourself! Start with MirroredStrategy on 2 GPUs and watch your training fly!
Before: Training... 10 hours remaining
After: Training... 2 hours remaining!
You've got this!
