🎨 PyTorch Computer Vision: Teaching Machines to See!
The Big Picture: What If Your Computer Could See Like You?
Imagine you’re a detective with a magnifying glass. You look at a picture and instantly know:
- “That’s a cat!” (Image Classification)
- “There’s a cat AND a dog, and here’s exactly where each one is!” (Object Detection)
- “Every single tiny piece of this picture belongs to something specific!” (Semantic Segmentation)
That’s exactly what we’re teaching computers to do with PyTorch! Let’s go on this adventure together. 🚀
🌟 Our Universal Analogy: The Art Gallery
Think of computer vision like running a magical art gallery:
- Image Classification = The gallery guide who looks at a painting and says “This is a landscape!”
- Object Detection = The security guard who spots every valuable item AND draws boxes around them
- Semantic Segmentation = The restoration expert who can tell you what color belongs to the sky, the grass, the trees—every single brushstroke!
📸 Part 1: Image Classification Pipeline
What Is It?
Image Classification answers one simple question: “What is this picture of?”
Show a computer a photo, and it tells you: “This is a dog” or “This is a car” or “This is pizza!” 🍕
Real Life Examples:
- Your phone sorting photos into “Pets,” “Food,” “Nature”
- Doctors checking X-rays for diseases
- Farmers identifying sick crops from healthy ones
The Pipeline: Step by Step
Think of it like making a sandwich—each step matters!
```mermaid
graph TD
    A["📷 Get Image"] --> B["🔧 Prepare Image"]
    B --> C["🧠 Feed to Model"]
    C --> D["📊 Get Predictions"]
    D --> E["🏷️ Show Answer"]
```
Step 1: Get Your Image
```python
from PIL import Image

# Open any image
img = Image.open("cat.jpg")
```
Just like picking up a photo to look at!
Step 2: Prepare the Image (Transform It)
Computers are picky eaters. They want images:
- Same size (usually 224×224 pixels)
- Numbers between 0 and 1
- In a special format called a tensor
```python
from torchvision import transforms

prepare = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

ready_image = prepare(img)
```
Why Normalize? Imagine everyone speaking at different volumes. Normalizing makes everyone speak at the same volume so the computer can understand better! (Those mean/std numbers are the ImageNet statistics the model was trained with, so your image matches what it saw during training.)
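Under the hood, Normalize is simple arithmetic. Here's a tiny sketch of what it does to a single red-channel value, using the ImageNet mean/std from above:

```python
# What Normalize does by hand: (value - mean) / std, per color channel
red_value = 0.5                      # a red-channel pixel after ToTensor
print((red_value - 0.485) / 0.229)   # ≈ 0.0655 -- centered near 0
```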
Step 3: Use a Pre-trained Model
Why train from scratch when smart people already did the work?
```python
import torch
from torchvision import models

# Load a model that already knows 1000 different things!
# (newer torchvision prefers weights=models.ResNet50_Weights.DEFAULT
#  over the deprecated pretrained=True)
model = models.resnet50(pretrained=True)
model.eval()  # switch to inference mode -- no more learning, just answering
```
Pre-trained = A model that already learned from millions of images. Like hiring an expert instead of training a newbie!
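Curious what else is available? Newer torchvision (0.14+) can list its built-in architectures by name, as a quick sketch:

```python
from torchvision import models

# Newer torchvision (0.14+): list every built-in architecture by name
print(models.list_models()[:5])
# e.g. ['alexnet', 'convnext_base', ...]
```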
Step 4: Make a Prediction
```python
# Add batch dimension: the model expects [batch, channels, height, width]
input_batch = ready_image.unsqueeze(0)

# Get prediction (no_grad skips gradient bookkeeping -- faster for inference)
with torch.no_grad():
    output = model(input_batch)

# Find the winner!
_, predicted = output.max(1)
print(f"Prediction: {predicted.item()}")
```
The model outputs a raw score (a logit) for each of its 1000 ImageNet categories. The highest score wins!
Step 5: See Human-Readable Results
# Top 5 predictions
probs = torch.nn.functional.softmax(
output[0], dim=0
)
top5_prob, top5_idx = torch.topk(probs, 5)
for i in range(5):
print(f"{labels[top5_idx[i]]}: "
f"{top5_prob[i].item()*100:.1f}%")
Output might look like:
```
tabby cat: 87.3%
tiger cat: 8.2%
Egyptian cat: 2.1%
lynx: 1.4%
Persian cat: 0.8%
```
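Where does that `labels` list come from? In newer torchvision releases (0.13+) the class names ship with the weights themselves; a minimal sketch:

```python
from torchvision.models import resnet50, ResNet50_Weights

# Newer torchvision (0.13+): the weights object carries metadata,
# including the list of 1000 ImageNet class names
weights = ResNet50_Weights.DEFAULT
labels = weights.meta["categories"]
model = resnet50(weights=weights)
```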
💡 Key Insight: How Classification Works
```mermaid
graph TD
    A["Image Pixels"] --> B["Find Edges"]
    B --> C["Find Shapes"]
    C --> D["Find Parts"]
    D --> E["Recognize Object"]
    style A fill:#e1f5fe
    style E fill:#c8e6c9
```
The model learns layers of understanding:
- Layer 1: “I see edges and colors”
- Layer 2: “I see circles and lines”
- Layer 3: “I see eyes and ears”
- Layer 4: “That’s a cat!”
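You can actually watch this hierarchy take shape with forward hooks. A minimal sketch (the `layer1`–`layer4` names are torchvision's ResNet-50 stage names; the exact shapes assume a 224×224 input):

```python
import torch
from torchvision import models

# Weights don't matter for inspecting shapes, so skip the download
model = models.resnet50()
model.eval()

# Print the output shape of each residual stage as an image flows through
def report_shape(name):
    def hook(module, inputs, output):
        print(f"{name}: {tuple(output.shape)}")
    return hook

for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(report_shape(name))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
# layer1: (1, 256, 56, 56)   -- high resolution, simple features
# layer2: (1, 512, 28, 28)
# layer3: (1, 1024, 14, 14)
# layer4: (1, 2048, 7, 7)    -- low resolution, object-level features
```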
🎯 Part 2: Object Detection Basics
What Is It?
Object Detection = Classification + Location
Not just “there’s a cat” but “there’s a cat RIGHT HERE” (with a box around it!)
Real Life Examples:
- Self-driving cars spotting pedestrians
- Security cameras detecting intruders
- Your camera focusing on faces
The Key Difference
| Classification | Object Detection |
|---|---|
| “This is a dog” | “There’s a dog at position (50, 100, 200, 300)” |
| One answer per image | Multiple objects possible |
| Single label | Labels + Bounding Boxes |
Bounding Boxes: Drawing Rectangles
A bounding box is just 4 numbers:
- x1, y1: Top-left corner
- x2, y2: Bottom-right corner
```
(x1, y1) ────────────┐
│                    │
│      🐕 DOG        │
│                    │
└──────────── (x2, y2)
```
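Corner coordinates are just one common box format; torchvision's `box_convert` can translate between them. A quick sketch:

```python
import torch
from torchvision.ops import box_convert

# One box in (x1, y1, x2, y2) corner format
box = torch.tensor([[50., 100., 200., 300.]])

# Convert to (x, y, width, height): top-left corner plus size
print(box_convert(box, in_fmt="xyxy", out_fmt="xywh"))
# tensor([[ 50., 100., 150., 200.]])
```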
Using a Pre-trained Detector
```python
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn
)

# Load pre-trained detector
model = fasterrcnn_resnet50_fpn(
    pretrained=True
)
model.eval()

# Prepare image (just ToTensor, no resize --
# the detector rescales and normalizes internally)
img_tensor = transforms.ToTensor()(img)

# Detect! The model takes a *list* of image tensors
with torch.no_grad():
    predictions = model([img_tensor])
```
Understanding the Output
```python
pred = predictions[0]

# What did we find?
boxes = pred['boxes']    # Where things are
labels = pred['labels']  # What things are
scores = pred['scores']  # How confident

# Example output:
# boxes:  [[50, 100, 200, 300],
#          [400, 50, 600, 250]]
# labels: [18, 1]  # 18=dog, 1=person (COCO IDs)
# scores: [0.95, 0.87]
```
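Those label IDs index into the COCO category list. In newer torchvision (0.13+), that list ships with the detector's weights; a minimal sketch that also defines the `LABELS` list used below:

```python
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights
)

# Newer torchvision (0.13+): class names ship with the weights
weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
LABELS = weights.meta["categories"]  # LABELS[1] == "person", LABELS[18] == "dog"
model = fasterrcnn_resnet50_fpn(weights=weights)
model.eval()
```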
Drawing Boxes on Images
```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots(1)
ax.imshow(img)

# LABELS is the COCO class-name list loaded above
for box, label, score in zip(boxes, labels, scores):
    if score > 0.5:  # Only confident ones
        x1, y1, x2, y2 = box.tolist()
        rect = patches.Rectangle(
            (x1, y1), x2 - x1, y2 - y1,
            linewidth=2,
            edgecolor='red',
            facecolor='none'
        )
        ax.add_patch(rect)
        ax.text(x1, y1,
                f'{LABELS[label]}: {score:.2f}')
plt.show()
```
Popular Detection Models
```mermaid
graph TD
    A["Object Detection Models"] --> B["Faster R-CNN"]
    A --> C["YOLO"]
    A --> D["SSD"]
    B --> E["Very Accurate"]
    C --> F["Super Fast"]
    D --> G["Good Balance"]
```
- Faster R-CNN: Best accuracy, slower
- YOLO: Real-time speed, good accuracy
- SSD: Middle ground
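YOLO lives outside torchvision (e.g., in the `ultralytics` package), but swapping between the built-in detectors is a one-line change, since they all return the same boxes/labels/scores format. A sketch:

```python
from torchvision.models.detection import (
    ssd300_vgg16, retinanet_resnet50_fpn
)

# Swap in a different built-in detector; boxes/labels/scores
# come back in the same format as Faster R-CNN's
model = ssd300_vgg16(pretrained=True)
model.eval()
```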
🎭 Part 3: Semantic Segmentation
What Is It?
Semantic Segmentation = Color every single pixel!
Not just “there’s a cat” but “these exact pixels are the cat, these are the sofa, these are the floor”
Think of it like a coloring book in reverse—the computer colors based on what things ARE.
The Visual Difference
```
Original Photo          →    Segmentation Mask
🏠   🌳   🚗   🧑             🟦   🟩   🟥   🟨
House Tree Car Person        House Tree Car Person
```
Every pixel gets a label. Every. Single. One.
Real Life Examples:
- Self-driving cars knowing road vs sidewalk
- Medical imaging outlining tumors precisely
- Photo editing apps selecting objects automatically
Using DeepLabV3
```python
from torchvision.models.segmentation import (
    deeplabv3_resnet101
)

# Load segmentation model
model = deeplabv3_resnet101(pretrained=True)
model.eval()

# Prepare image (same transform as classification)
input_tensor = prepare(img).unsqueeze(0)

# Segment!
with torch.no_grad():
    output = model(input_tensor)['out']

# Get the winning class for each pixel
predictions = output.argmax(1)
```
Understanding the Output
The output is a mask with the same height and width as the input tensor (224×224 after our transform).
Each pixel holds a number representing its class (the 21 PASCAL VOC classes):
- 0 = background
- 1 = airplane
- 2 = bicycle
- 15 = person
- etc.
```python
# predictions shape: [1, H, W]
# Each value is a class ID
mask = predictions[0].numpy()

# mask[100, 200] might be 15 (person)
# mask[300, 400] might be 0 (background)
```
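Because every pixel is labeled, you can measure things straight from the mask. For example, a quick sketch that reports how much of the image each class covers:

```python
import numpy as np

# How many pixels does each class occupy?
class_ids, counts = np.unique(mask, return_counts=True)
for class_id, count in zip(class_ids, counts):
    print(f"class {class_id}: {100 * count / mask.size:.1f}% of the image")
```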
Visualizing the Segmentation
```python
import numpy as np

# Create a colorful visualization
def decode_segmap(mask):
    # Define an RGB color for each class
    # (the full PASCAL VOC palette has 21 entries)
    colors = np.array([
        [0, 0, 0],        # background
        [128, 0, 0],      # airplane
        [0, 128, 0],      # bicycle
        # ... more colors
        [192, 128, 128],  # person
    ])
    # uint8 channels so matplotlib reads values as 0-255 RGB
    r = np.zeros_like(mask, dtype=np.uint8)
    g = np.zeros_like(mask, dtype=np.uint8)
    b = np.zeros_like(mask, dtype=np.uint8)
    for class_id in range(len(colors)):
        idx = mask == class_id
        r[idx] = colors[class_id, 0]
        g[idx] = colors[class_id, 1]
        b[idx] = colors[class_id, 2]
    return np.stack([r, g, b], axis=2)

colored_mask = decode_segmap(mask)
plt.imshow(colored_mask)
plt.show()
```
The Three Tasks Compared
```mermaid
graph LR
    A["📷 Same Image"] --> B["Classification"]
    A --> C["Detection"]
    A --> D["Segmentation"]
    B --> E["🏷️ Label: Dog"]
    C --> F["📦 Box + Label"]
    D --> G["🎨 Every Pixel Labeled"]
```
| Task | Question | Output |
|---|---|---|
| Classification | What is it? | Single label |
| Detection | What & where? | Boxes + labels |
| Segmentation | What is everything? | Pixel-by-pixel labels |
🎓 Bringing It All Together
The Complete PyTorch CV Toolkit
```python
from torchvision import models

# Classification
classifier = models.resnet50(pretrained=True)

# Detection
detector = models.detection.fasterrcnn_resnet50_fpn(
    pretrained=True
)

# Segmentation
segmenter = models.segmentation.deeplabv3_resnet101(
    pretrained=True
)
```
All three share the same basic workflow (sketched below):
1. Load a pre-trained model
2. Prepare your image
3. Run inference
4. Interpret the output
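As a rough sketch (the `run_inference` helper is hypothetical, just to show the shared shape of the workflow):

```python
import torch

# Hypothetical helper: the same three steps work for every model
def run_inference(model, prepared_input):
    model.eval()                      # 1. model loaded, switched to inference
    with torch.no_grad():             # 3. run inference without gradients
        return model(prepared_input)  # 4. caller interprets the output

# 2. "prepare your image" differs per task:
# classification: run_inference(classifier, ready_image.unsqueeze(0))
# detection:      run_inference(detector, [img_tensor])
# segmentation:   run_inference(segmenter, input_tensor)['out']
```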
When to Use What?
| Use Case | Best Task |
|---|---|
| “Is this a hot dog or not?” | Classification |
| “Find all faces in this photo” | Detection |
| “Remove the background precisely” | Segmentation |
| “Count cars in parking lot” | Detection |
| “Measure tumor size exactly” | Segmentation |
🚀 You Did It!
You now understand the three pillars of computer vision in PyTorch:
- Classification - “What is this?”
- Detection - “What is this and where?”
- Segmentation - “What is every single pixel?”
These are the building blocks for:
- Self-driving cars
- Medical diagnosis
- Augmented reality
- Photo editing
- Security systems
- And so much more!
The computer can now see because you taught it how. How cool is that? 🎉
“The real voyage of discovery consists not in seeking new landscapes, but in having new eyes.” — Marcel Proust
Now your computer has those new eyes. Go build something amazing!
