Data Sources and Formats

🚂 TensorFlow Data Pipelines: Feeding Your AI Monster

Imagine you have a very hungry pet dragon. This dragon eats data instead of food. But you can't just throw random stuff at it: you need to prepare meals properly, serve them at the right speed, and make sure the dragon gets exactly what it needs to grow strong. That's what data pipelines do in TensorFlow!


🎯 The Big Picture: Your Data Kitchen

Think of TensorFlow as a fancy restaurant kitchen:

  • Raw ingredients = Your data files (images, text, numbers)
  • Food prep = Data loading and parsing
  • Cooking = Transformations and processing
  • Serving = Feeding batches to your model

Let's learn how to become a master chef for your AI!


📦 Advanced Dataset Operations

What Are These?

When you load data into TensorFlow, you get a special container called a Dataset (tf.data.Dataset). Think of it like a conveyor belt in a factory: data items roll by one at a time, and you can do cool things to them!

The Magic Tricks You Can Do

1. Batching - Group items together

# Like putting cookies in boxes
dataset = dataset.batch(32)
# Now 32 items travel together!

2. Shuffling - Mix things up

# Like shuffling a deck of cards
# (1000 is the shuffle buffer size)
dataset = dataset.shuffle(1000)
# Keeps your model from memorizing the order of the data
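
How well `shuffle(1000)` mixes the data depends on the buffer size. The sketch below imitates a buffer-based shuffle in plain Python. The function name `buffered_shuffle` is made up for illustration and this is not TensorFlow's actual implementation, only the same idea: keep a fixed-size buffer, emit a random element from it, and refill from the stream.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in buffer-shuffled order (illustrative, not TF's code)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Buffer is full: swap a random element to the end and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # End of stream: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer

small = list(buffered_shuffle(range(10), buffer_size=1))
full = list(buffered_shuffle(range(10), buffer_size=10))
print(small)  # a buffer of 1 cannot reorder anything
print(full)   # a buffer as large as the data gives a full shuffle
```

With a small buffer, an item can only move a short distance from where it started, which is why a buffer at least as large as the dataset is usually suggested when memory allows.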

3. Prefetching - Get ready ahead of time

# Like a waiter preparing the next dish
dataset = dataset.prefetch(
    tf.data.AUTOTUNE
)
# Your GPU never waits hungry!

4. Mapping - Transform each item

# Like adding sauce to every dish
dataset = dataset.map(
    lambda x: tf.cast(x, tf.float32) / 255.0
)
# Cast first, then normalize images to the 0-1 range

5. Caching - Remember for later

# Like making a shortcut
dataset = dataset.cache()
# After the first pass, data is served from memory!

The Perfect Pipeline Recipe

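graph TD
    A[📁 Raw Data] --> B[🔧 Map/Transform]
    B --> C[💾 Cache]
    C --> D[🔀 Shuffle]
    D --> E[📦 Batch]
    E --> F[⚡ Prefetch]
    F --> G[🧠 Model Training]

Cache comes before shuffle here: a cache placed after the shuffle would store one fixed order and replay it every epoch.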
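
A minimal sketch of such a pipeline on synthetic data. The arrays, the `normalize` helper, and all sizes are made up for illustration. One deliberate choice: `cache()` is placed before `shuffle()`, because caching a dataset after shuffling it would replay the same order every epoch.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data: 100 fake 8x8 grayscale "images" with labels.
images = np.random.randint(0, 256, size=(100, 8, 8), dtype=np.uint8)
labels = np.random.randint(0, 10, size=(100,), dtype=np.int64)

def normalize(image, label):
    # Cast before dividing so the result is float32 in [0, 1].
    return tf.cast(image, tf.float32) / 255.0, label

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .map(normalize, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()                      # keep the normalized items in memory
    .shuffle(buffer_size=100)     # reshuffled on every pass
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)   # overlap loading with training
)

batch_shapes = [tuple(x.shape) for x, _ in dataset]
print(batch_shapes)
```

The last batch is smaller than 32 because 100 is not a multiple of the batch size; pass `drop_remainder=True` to `batch()` if the model needs fixed-size batches.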

💿 TFRecord Format

What is TFRecord?

Imagine you have 1 million tiny photos scattered everywhere. Finding and loading each one takes forever! TFRecord is like putting ALL your photos into ONE big photo album that opens super fast.

Why Use TFRecord?

| Problem | TFRecord Solution |
| --- | --- |
| Slow disk reads | Sequential reading |
| Many small files | One big file |
| Network bottleneck | Efficient streaming |
| Random access | Optimized for batches |

The Secret Sauce

TFRecords typically store each record as a serialized Protocol Buffers (protobuf) message, a compact binary format that computers read and write really fast. It's like writing in shorthand instead of full sentences!
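
To make the protobuf part concrete, here is a small sketch that builds one `tf.train.Example` message, turns it into bytes, and reads it back. The feature names `label` and `name` and their values are invented for this illustration.

```python
import tensorflow as tf

example = tf.train.Example(
    features=tf.train.Features(
        feature={
            "label": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[7])
            ),
            "name": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[b"cat.jpg"])
            ),
        }
    )
)

# Serialize to compact binary bytes, then parse the bytes back.
serialized = example.SerializeToString()
restored = tf.train.Example.FromString(serialized)

print(len(serialized), "bytes")
print(restored.features.feature["label"].int64_list.value[0])
```

These serialized bytes are exactly what gets written into a TFRecord file, one message per record.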


🛠️ TFRecord API

Creating Your First TFRecord

Step 1: Define what goes inside

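def create_example(image, label):
    # `image` is assumed to be a uint8 tensor of shape
    # (height, width, channels). BytesList needs raw
    # bytes, so encode the image as JPEG first.
    image_bytes = tf.io.encode_jpeg(image).numpy()
    feature = {
        'image': tf.train.Feature(
            bytes_list=tf.train.BytesList(
                value=[image_bytes]
            )
        ),
        'label': tf.train.Feature(
            int64_list=tf.train.Int64List(
                # Int64List needs a plain Python int
                value=[int(label)]
            )
        )
    }
    return tf.train.Example(
        features=tf.train.Features(
            feature=feature
        )
    )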

Step 2: Write to file

with tf.io.TFRecordWriter(
    'my_data.tfrecord'
) as writer:
    for image, label in dataset:
        example = create_example(
            image, label
        )
        writer.write(
            example.SerializeToString()
        )

The Three Feature Types

| Type | For | Example |
| --- | --- | --- |
| BytesList | Images, strings | Photo data |
| Int64List | Integers | Labels, counts |
| FloatList | Decimals | Prices, scores |
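
A common way to work with the three types is one small helper per type. The helper names and the sample feature names below (`photo`, `count`, `price`) are made up for illustration, and the bytes value is a placeholder, not a real image.

```python
import tensorflow as tf

def bytes_feature(value):
    """Wrap a bytes object (encoded image, string) in a Feature."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def int64_feature(value):
    """Wrap an integer (label, count) in a Feature."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[int(value)]))

def float_feature(value):
    """Wrap a decimal number (price, score) in a Feature."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[float(value)]))

feature = {
    "photo": bytes_feature(b"placeholder image bytes"),
    "count": int64_feature(3),
    "price": float_feature(9.99),
}
example = tf.train.Example(features=tf.train.Features(feature=feature))
print(example)
```

FloatList stores 32-bit floats, so a value like 9.99 comes back only approximately equal to what was written.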

🔓 Parsing TFRecords

Reading Your Data Back

Just like you need a key to open a locked box, you need a feature description to read TFRecords!

feature_description = {
    'image': tf.io.FixedLenFeature(
        [], tf.string
    ),
    'label': tf.io.FixedLenFeature(
        [], tf.int64
    )
}

def parse_function(example):
    return tf.io.parse_single_example(
        example,
        feature_description
    )

Complete Reading Pipeline

# 1. Create the dataset
raw_dataset = tf.data.TFRecordDataset(
    'my_data.tfrecord'
)

# 2. Parse each record
parsed_dataset = raw_dataset.map(
    parse_function
)

# 3. Decode images if needed
def decode_image(features):
    image = tf.io.decode_jpeg(
        features['image']
    )
    return image, features['label']

final_dataset = parsed_dataset.map(
    decode_image
)

graph TD
    A[📁 TFRecord File] --> B[🔍 Read Raw Bytes]
    B --> C[🗝️ Parse with Description]
    C --> D[🖼️ Decode Images]
    D --> E[✨ Ready Dataset]
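
To check that writing and reading agree, the sketch below does a full round trip on three tiny random images in a temporary file. The data, the file name, and the `parse_and_decode` helper are assumptions for this example. It uses PNG instead of JPEG only because PNG is lossless, so the restored pixels can be compared exactly.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Three tiny fake RGB images (uint8) with integer labels.
images = np.random.randint(0, 256, size=(3, 4, 4, 3), dtype=np.uint8)
labels = [0, 1, 2]
path = os.path.join(tempfile.mkdtemp(), "tiny.tfrecord")

# Write: PNG-encode each image and store it as bytes.
with tf.io.TFRecordWriter(path) as writer:
    for image, label in zip(images, labels):
        example = tf.train.Example(features=tf.train.Features(feature={
            "image": tf.train.Feature(bytes_list=tf.train.BytesList(
                value=[tf.io.encode_png(image).numpy()])),
            "label": tf.train.Feature(int64_list=tf.train.Int64List(
                value=[label])),
        }))
        writer.write(example.SerializeToString())

# Read: the description must match what was written.
feature_description = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_and_decode(record):
    features = tf.io.parse_single_example(record, feature_description)
    image = tf.io.decode_png(features["image"], channels=3)
    return image, features["label"]

dataset = tf.data.TFRecordDataset(path).map(parse_and_decode)
restored = [(img.numpy(), int(lbl)) for img, lbl in dataset]
print([lbl for _, lbl in restored])
```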

📥 Data Loading Methods

Method 1: From Memory (Small Data)

# Perfect for tiny datasets
dataset = tf.data.Dataset.from_tensor_slices(
    (images_array, labels_array)
)

Like carrying groceries in your hands: good for a few items!

Method 2: From Files (Medium Data)

# For images in folders
dataset = tf.keras.utils.image_dataset_from_directory(
    'path/to/images/',
    batch_size=32,
    image_size=(224, 224)
)

Like having a shopping cart: handles more stuff!

Method 3: From TFRecords (Big Data)

# For massive datasets
dataset = tf.data.TFRecordDataset(
    ['data1.tfrecord', 'data2.tfrecord']
)

Like having a delivery truck: handles tons!
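
When a dataset is split across several TFRecord files, one option is to read the files side by side with `interleave`. The sketch below writes two tiny shards into a temporary folder so it can run on its own; the file names and record contents are invented for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Write two tiny shards so the example is self-contained.
shard_dir = tempfile.mkdtemp()
for shard in range(2):
    shard_path = os.path.join(shard_dir, f"data{shard}.tfrecord")
    with tf.io.TFRecordWriter(shard_path) as writer:
        for i in range(3):
            writer.write(f"shard{shard}-record{i}".encode())

# List the shard files, then pull records from both in turn.
files = tf.data.Dataset.list_files(
    os.path.join(shard_dir, "*.tfrecord"), shuffle=False
)
dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=2,                       # read 2 files at once
    num_parallel_calls=tf.data.AUTOTUNE,
)

records = [r.numpy().decode() for r in dataset]
print(records)
```

For training, `list_files` is usually left with its default `shuffle=True` so the shard order changes between epochs.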

Method 4: From Generator (Infinite Data)

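def data_generator():
    # generate_random_sample() is a placeholder: it must
    # return an (image, label) pair matching the
    # output_signature below.
    while True:
        yield generate_random_sample()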

dataset = tf.data.Dataset.from_generator(
    data_generator,
    output_signature=(
        tf.TensorSpec((28, 28), tf.float32),
        tf.TensorSpec((), tf.int32)
    )
)

Like a magic box that never empties!
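
A runnable version of the same idea, with the sample source filled in by random NumPy data (an assumption for illustration only). Because the generator never stops, the example takes a fixed number of batches instead of looping over the whole dataset.

```python
import numpy as np
import tensorflow as tf

def data_generator():
    # Hypothetical sample source: random 28x28 "images" with random labels.
    rng = np.random.default_rng(0)
    while True:
        image = rng.random((28, 28), dtype=np.float32)
        label = np.int32(rng.integers(0, 10))
        yield image, label

dataset = tf.data.Dataset.from_generator(
    data_generator,
    output_signature=(
        tf.TensorSpec(shape=(28, 28), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)

# The stream never ends, so take() a finite slice before iterating.
batches = list(dataset.batch(4).take(3))
print(len(batches), batches[0][0].shape)
```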

Choosing the Right Method

| Data Size | Best Method |
| --- | --- |
| < 1 GB | from_tensor_slices |
| 1-10 GB | File-based loading |
| > 10 GB | TFRecords |
| Infinite | Generators |

🏪 Using TF Hub

What is TF Hub?

Imagine a store where smart people have already trained amazing AI models and put them on shelves for free! TensorFlow Hub is that store.

Why Use Pre-trained Models?

  • Save time: Training from scratch takes days/weeks
  • Better results: Built by experts with huge data
  • Easy to use: Just download and plug in!

Example: Image Classification

import tensorflow as tf
import tensorflow_hub as hub

# Download a pre-trained model
model_url = "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/classification/5"

model = tf.keras.Sequential([
    hub.KerasLayer(model_url)
])

# Use it instantly!
# my_images: float32 batch of shape (N, 224, 224, 3),
# with pixel values scaled to the 0-1 range
predictions = model.predict(my_images)

Example: Text Embedding

# Turn text into numbers
embed_url = "https://tfhub.dev/google/universal-sentence-encoder/4"

embed = hub.load(embed_url)
embeddings = embed([
    "Hello world!",
    "How are you?"
])
# Each sentence → vector of numbers

Popular TF Hub Models

| Task | Model | What It Does |
| --- | --- | --- |
| Images | MobileNet | Classifies photos |
| Text | BERT | Understands language |
| Video | I3D | Recognizes actions |
| Audio | YAMNet | Identifies sounds |

📚 Using TF Datasets (TFDS)

What is TFDS?

Remember that store (TF Hub) with pre-trained models? TF Datasets is like a grocery store, but for data instead of models!

Built-in Datasets

import tensorflow_datasets as tfds

# See all available datasets
print(tfds.list_builders())
# Over 200 datasets ready to use!

Loading a Dataset

# Load MNIST (handwritten digits)
dataset, info = tfds.load(
    'mnist',
    with_info=True,
    as_supervised=True
)

train_ds = dataset['train']
test_ds = dataset['test']

# Check the info
print(info.features)
print(f"Training examples: {info.splits['train'].num_examples}")

Example: Load & Prepare

# Load cats vs dogs dataset
(train, test), info = tfds.load(
    'cats_vs_dogs',
    split=['train[:80%]', 'train[80%:]'],
    with_info=True,
    as_supervised=True
)

# Resize and normalize
def preprocess(image, label):
    image = tf.image.resize(image, (224, 224))
    image = image / 255.0
    return image, label

train = train.map(preprocess)
train = train.batch(32).prefetch(tf.data.AUTOTUNE)

Powerful Split API

# Get exactly what you need!
tfds.load('mnist', split='train[:1000]')  # First 1000
tfds.load('mnist', split='train[-1000:]')  # Last 1000
tfds.load('mnist', split='train[10%:20%]')  # 10-20%
tfds.load('mnist', split='train+test')  # Combine splits

graph TD
    A[🏪 TF Datasets] --> B{Choose Dataset}
    B --> C[📊 MNIST]
    B --> D[🐱 Cats vs Dogs]
    B --> E[📰 IMDB Reviews]
    B --> F[🎵 Speech Commands]
    C --> G[⬇️ Auto Download]
    D --> G
    E --> G
    F --> G
    G --> H[✅ Ready to Use!]

Popular TFDS Datasets

| Name | Type | Size |
| --- | --- | --- |
| mnist | Images | 70K digits |
| cifar10 | Images | 60K photos |
| imdb_reviews | Text | 50K reviews |
| squad | Q&A | 100K questions |

🎓 The Complete Pipeline

Here's how everything fits together:

graph TD
    A[🗄️ Data Source] --> B{Format?}
    B -->|Small| C[from_tensor_slices]
    B -->|Large| D[TFRecord]
    B -->|Pre-built| E[TF Datasets]
    B -->|Transfer| F[TF Hub]
    C --> G[📦 tf.data.Dataset]
    D --> G
    E --> G
    F --> G
    G --> H[🔀 Shuffle]
    H --> I[🔧 Map/Transform]
    I --> J[📦 Batch]
    J --> K[⚡ Prefetch]
    K --> L[🧠 Train Model]

💡 Pro Tips

1. Always Use Prefetch

dataset = dataset.prefetch(tf.data.AUTOTUNE)

Your GPU stays busy while the CPU prepares the next batch!

2. Cache Wisely

# Cache after expensive operations
dataset = dataset.map(expensive_fn).cache()
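
To see what the cache saves, the sketch below counts how often the "expensive" step really runs. The counter and the `slow_square` function are stand-ins invented for this example; `tf.py_function` is used only so the Python-side counter is updated for every element.

```python
import tensorflow as tf

calls = {"count": 0}

def slow_square(x):
    calls["count"] += 1   # count how often the Python code runs
    return x * x

def expensive_fn(x):
    # tf.py_function runs the Python function for every element.
    y = tf.py_function(slow_square, inp=[x], Tout=tf.int64)
    y.set_shape(())
    return y

dataset = tf.data.Dataset.range(5).map(expensive_fn).cache()

first_epoch = [int(v) for v in dataset]    # computes and fills the cache
second_epoch = [int(v) for v in dataset]   # served from the cache

print(first_epoch, second_epoch, calls["count"])
```

The second pass returns the same values without calling `slow_square` again. An in-memory cache only makes sense when the cached data fits in RAM; `cache()` also accepts a file path for larger data.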

3. Parallel Processing

# Use multiple CPU cores
dataset = dataset.map(
    process_fn,
    num_parallel_calls=tf.data.AUTOTUNE
)
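
A tiny runnable sketch of a parallel map, with a made-up `process_fn` that doubles each number. By default the output order still matches the input order even though elements are processed in parallel.

```python
import tensorflow as tf

def process_fn(x):
    # Stand-in for real preprocessing work.
    return x * 2

dataset = tf.data.Dataset.range(8).map(
    process_fn,
    num_parallel_calls=tf.data.AUTOTUNE,
)

result = [int(v) for v in dataset]
print(result)
```

If order does not matter, passing `deterministic=False` to `map()` lets TensorFlow hand out elements as soon as they are ready, which can help when some items take longer than others.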

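4. Debug Your Pipeline

# Run tf.data transformations eagerly so you can
# print or step through them; call this before
# building the dataset. To find bottlenecks, use
# the TensorFlow Profiler instead.
tf.data.experimental.enable_debug_mode()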

🚀 You Did It!

You now understand:

  • ✅ Advanced Dataset Operations (batch, shuffle, map, cache, prefetch)
  • ✅ TFRecord Format (fast, efficient storage)
  • ✅ TFRecord API (creating and writing)
  • ✅ Parsing TFRecords (reading back)
  • ✅ Data Loading Methods (memory, files, generators)
  • ✅ TF Hub (pre-trained model store)
  • ✅ TF Datasets (ready-to-use data store)

Your AI dragon will never go hungry again! 🐉✨
