Distributed Training: Teaching Many Computers to Work Together

The Big Picture: A Pizza Party Analogy

Imagine you need to make 1000 pizzas for a huge party. You could:

  1. One chef, one oven → Takes forever!
  2. Many chefs, many ovens → Super fast!

Distributed training is like option 2. Instead of one computer doing all the work, we get MANY computers to share the job. Each computer learns a little piece, and together they learn the whole thing!


What is Distributed Training?

Think of training an AI model like teaching a classroom of students. If you have ONE teacher and ONE student, learning takes a long time. But if you have MANY teachers and MANY students working together, everyone learns faster!

The Core Idea

Single Computer:
🖥️ → processes 100 images/second

Four Computers Together:
🖥️🖥️🖥️🖥️ → processes 400 images/second!

Real Example: Training GPT-like models would take YEARS on one computer. With distributed training, it takes WEEKS.


Distribution Strategies: Different Ways to Share the Work

Think of these strategies like different ways to divide work at a pizza factory.

1. Mirrored Strategy (Data Parallelism)

Analogy: Every chef has the SAME recipe, but different ingredients to work with.

graph TD A["Same Model Copy"] --> B["GPU 1: Batch A"] A --> C["GPU 2: Batch B"] A --> D["GPU 3: Batch C"] B --> E["Combine Results"] C --> E D --> E E --> F["Update All Models"]

How it works:

  • Each GPU gets a COPY of the model
  • Each GPU processes DIFFERENT data
  • All GPUs share what they learned
  • Everyone stays in sync!

TensorFlow Code:

import tensorflow as tf

# One replica per GPU on this machine
strategy = tf.distribute.MirroredStrategy()

# Variables created inside the scope are mirrored (copied) to every GPU
with strategy.scope():
    model = create_model()
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy'
    )
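Training then looks exactly like the single-GPU case: Keras splits each batch across the replicas and combines the gradients for you. A minimal sketch, assuming a `train_dataset` batched with the global batch size (covered below):

model.fit(train_dataset, epochs=10)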

2. Multi-Worker Strategy

Analogy: Multiple kitchens in different buildings, all making the same dish!

graph TD A["Worker 1 Machine"] --> D["Share Updates"] B["Worker 2 Machine"] --> D C["Worker 3 Machine"] --> D D --> E["Synchronized Model"]

When to use: When you have multiple physical machines, not just multiple GPUs in one machine.

Configuration Example:

import json
import os

# Worker 0 config - every worker gets the same cluster spec,
# but its own task index
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": [
            "machine1:12345",
            "machine2:12345"
        ]
    },
    "task": {"type": "worker", "index": 0}
})
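With TF_CONFIG set, every machine runs the same script and builds its model inside the strategy's scope. A minimal sketch, assuming the cluster above and the same create_model() used earlier:

import tensorflow as tf

# Reads TF_CONFIG from the environment to discover the other workers
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = create_model()  # identical model definition on every worker
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy'
    )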

3. Parameter Server Strategy

Analogy: One head chef (parameter server) keeps the master recipe. Other chefs (workers) do the cooking and report back!

graph TD PS["Parameter Server"] --> W1["Worker 1"] PS --> W2["Worker 2"] PS --> W3["Worker 3"] W1 --> PS W2 --> PS W3 --> PS

Best for: Very large models whose parameters don't fit on a single worker (especially huge embedding tables), and asynchronous training across many machines.
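In TensorFlow 2 this is available as tf.distribute.ParameterServerStrategy. A minimal sketch of the coordinator's side, assuming the worker and parameter-server tasks are already running and described by TF_CONFIG (as in the multi-worker example above):

import tensorflow as tf

# The coordinator discovers the cluster layout from TF_CONFIG
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.ParameterServerStrategy(cluster_resolver)

with strategy.scope():
    model = create_model()  # variables are placed on the parameter servers
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy'
    )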


Distributed Datasets: Feeding Many Hungry GPUs

The Problem

If you have 4 GPUs but only send data to 1, the other 3 are just sitting there doing nothing!

Bad:

GPU 1: πŸ•πŸ•πŸ•πŸ• (working hard)
GPU 2: 😴 (sleeping)
GPU 3: 😴 (sleeping)
GPU 4: 😴 (sleeping)

Good:

GPU 1: πŸ• (working)
GPU 2: πŸ• (working)
GPU 3: πŸ• (working)
GPU 4: πŸ• (working)

The Solution: Distributed Datasets

# Create a distributed dataset
def make_dataset():
    dataset = tf.data.Dataset.from_tensor_slices(
        (images, labels)
    )
    dataset = dataset.shuffle(1000)
    dataset = dataset.batch(GLOBAL_BATCH_SIZE)
    return dataset

# Distribute it across GPUs
dist_dataset = strategy.experimental_distribute_dataset(
    make_dataset()
)

Key Concept: Global vs Per-Replica Batch Size

Type         | Meaning                    | Example
Global Batch | Total across ALL GPUs      | 256
Per-Replica  | What each single GPU gets  | 64 (if 4 GPUs)

GLOBAL_BATCH_SIZE = 256
num_replicas = strategy.num_replicas_in_sync  # 4
per_replica_batch = GLOBAL_BATCH_SIZE // num_replicas  # 64
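If you write your own training loop instead of calling model.fit, you also have to average the loss over the GLOBAL batch size, not the per-replica one, otherwise the gradients get multiplied by the number of GPUs. A minimal sketch, assuming the strategy, dist_dataset, GLOBAL_BATCH_SIZE, and a model built inside strategy.scope() as shown earlier:

with strategy.scope():
    # Keep per-example losses; we reduce manually over the global batch
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True,
        reduction=tf.keras.losses.Reduction.NONE
    )
    optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(dist_inputs):
    def step_fn(images, labels):
        with tf.GradientTape() as tape:
            logits = model(images, training=True)
            per_example_loss = loss_fn(labels, logits)
            # Average over the GLOBAL batch so results match single-GPU training
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE
            )
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    # Run the step on every replica, then sum the partial averages
    per_replica_loss = strategy.run(step_fn, args=dist_inputs)
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)

for batch in dist_dataset:
    loss = train_step(batch)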

GPU Configuration: Setting Up Your Kitchen

Step 1: Check What You Have

gpus = tf.config.list_physical_devices('GPU')
print(f"Found {len(gpus)} GPUs!")
# Output: Found 4 GPUs!

Step 2: Enable Memory Growth

Problem: By default, TensorFlow grabs ALL GPU memory. This is like one chef taking over the entire kitchen!

Solution: Let GPUs grow memory as needed.

gpus = tf.config.list_physical_devices('GPU')

for gpu in gpus:
    tf.config.experimental.set_memory_growth(
        gpu, True
    )

Step 3: Limit Visible GPUs (Optional)

Sometimes you only want to use SOME GPUs:

# Only use GPU 0 and GPU 1
tf.config.set_visible_devices(
    gpus[0:2], 'GPU'
)

Common GPU Settings Summary

graph TD A["GPU Configuration"] --> B["Memory Growth"] A --> C["Visible Devices"] A --> D["Memory Limit"] B --> E["Grow as needed"] C --> F["Choose which GPUs"] D --> G["Set max memory per GPU"]

Mixed Precision Training: Working Smarter, Not Harder

The Clever Trick

Imagine writing a shopping list. You don't need perfect calligraphy - quick notes work fine!

Mixed Precision uses:

  • FP16 (Half Precision): Quick calculations, less memory
  • FP32 (Full Precision): When we need accuracy

Why It Matters

Precision | Memory   | Speed        | Accuracy
FP32      | High     | Slower       | Full
FP16      | Low      | 2-3x faster  | Good enough!
Mixed     | Balanced | Fast         | Best of both!

How to Enable It

from tensorflow.keras import mixed_precision

# Set the policy globally
mixed_precision.set_global_policy('mixed_float16')

# Your model automatically uses mixed precision!
model = create_model()
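You can see the split for yourself: under mixed_float16, layers do their math in float16 but store their weights in float32. A quick check (the Dense layer here is just an example):

import tensorflow as tf
from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy('mixed_float16')

layer = tf.keras.layers.Dense(64)
print(layer.compute_dtype)   # float16 - used for the forward/backward math
print(layer.variable_dtype)  # float32 - used to store the weights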

The Magic: Loss Scaling

FP16 has a much smaller representable range, so tiny gradients can underflow to ZERO (bad!). Loss scaling multiplies the loss by a large factor before the backward pass, then divides the gradients back down afterwards.

graph LR A["Small Gradient"] --> B["Scale Up x1000"] B --> C["Safe FP16 Range"] C --> D["Calculate"] D --> E["Scale Down /1000"] E --> F["Correct Result"]

TensorFlow handles this automatically when you use the mixed_float16 policy with Keras and model.fit!
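If you write a custom training loop, you wrap the optimizer yourself. A minimal sketch, assuming model and loss_fn are already defined:

optimizer = tf.keras.optimizers.Adam()
optimizer = mixed_precision.LossScaleOptimizer(optimizer)

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
        scaled_loss = optimizer.get_scaled_loss(loss)        # scale up
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = optimizer.get_unscaled_gradients(scaled_grads)   # scale back down
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss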

Complete Mixed Precision Example

import tensorflow as tf
from tensorflow.keras import mixed_precision

# Step 1: Enable mixed precision
mixed_precision.set_global_policy('mixed_float16')

# Step 2: Create model (layers compute in FP16, weights stay in FP32)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='relu'),
    # Keep the final outputs in float32 for numerical stability
    tf.keras.layers.Dense(10, dtype='float32')
])

# Step 3: Create the optimizer - Keras wraps it with loss scaling
# automatically when compiling under the mixed_float16 policy
optimizer = tf.keras.optimizers.Adam()

# Step 4: Train as usual
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
)
model.fit(dataset, epochs=10)

Putting It All Together

Here's a complete example combining everything:

import tensorflow as tf
from tensorflow.keras import mixed_precision

# 1. Enable mixed precision
mixed_precision.set_global_policy('mixed_float16')

# 2. Set up distributed strategy
strategy = tf.distribute.MirroredStrategy()
print(f'Number of GPUs: {strategy.num_replicas_in_sync}')

# 3. Configure batch sizes
GLOBAL_BATCH = 256
PER_REPLICA = GLOBAL_BATCH // strategy.num_replicas_in_sync

# 4. Create distributed dataset
def make_dataset():
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.astype("float32") / 255.0  # scale pixels to [0, 1]
    dataset = tf.data.Dataset.from_tensor_slices(
        (x_train, y_train)
    )
    return dataset.shuffle(10000).batch(GLOBAL_BATCH)

dist_dataset = strategy.experimental_distribute_dataset(
    make_dataset()
)

# 5. Create model within strategy scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        # Keep outputs in float32 for stability under mixed precision
        tf.keras.layers.Dense(10, dtype='float32')
    ])
    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True
        ),
        metrics=['accuracy']
    )

# 6. Train!
model.fit(dist_dataset, epochs=5)

Key Takeaways

  1. Distributed Training = Many computers working together
  2. MirroredStrategy = Same model, different data, sync results
  3. Distributed Datasets = Feed all GPUs equally
  4. GPU Configuration = Control memory and device usage
  5. Mixed Precision = Faster training with FP16/FP32 mix

You're Now Ready!

You've learned how to:

  • Split training across multiple GPUs
  • Choose the right distribution strategy
  • Set up datasets for distributed training
  • Configure GPUs properly
  • Speed up training with mixed precision

Next step: Try it yourself! Start with MirroredStrategy on 2 GPUs and watch your training fly!

Before: 🐌 Training... 10 hours remaining
After:  🚀 Training... 2 hours remaining!

You've got this!
