Computer Vision Tasks


🎨 PyTorch Computer Vision: Teaching Machines to See!

The Big Picture: What If Your Computer Could See Like You?

Imagine you’re a detective with a magnifying glass. You look at a picture and instantly know:

  • “That’s a cat!” (Image Classification)
  • “There’s a cat AND a dog, and here’s exactly where each one is!” (Object Detection)
  • “Every single tiny piece of this picture belongs to something specific!” (Semantic Segmentation)

That’s exactly what we’re teaching computers to do with PyTorch! Let’s go on this adventure together. 🚀


🌟 Our Universal Analogy: The Art Gallery

Think of computer vision like running a magical art gallery:

  • Image Classification = The gallery guide who looks at a painting and says “This is a landscape!”
  • Object Detection = The security guard who spots every valuable item AND draws boxes around them
  • Semantic Segmentation = The restoration expert who can tell you what color belongs to the sky, the grass, the trees—every single brushstroke!

📸 Part 1: Image Classification Pipeline

What Is It?

Image Classification answers one simple question: “What is this picture of?”

Show a computer a photo, and it tells you: “This is a dog” or “This is a car” or “This is pizza!” 🍕

Real Life Examples:

  • Your phone sorting photos into “Pets,” “Food,” “Nature”
  • Doctors checking X-rays for diseases
  • Farmers identifying sick crops from healthy ones

The Pipeline: Step by Step

Think of it like making a sandwich—each step matters!

graph TD
    A["📷 Get Image"] --> B["🔧 Prepare Image"]
    B --> C["🧠 Feed to Model"]
    C --> D["📊 Get Predictions"]
    D --> E["🏷️ Show Answer"]

Step 1: Get Your Image

from PIL import Image

# Open any image
img = Image.open("cat.jpg")

Just like picking up a photo to look at!


Step 2: Prepare the Image (Transform It)

Computers are picky eaters. They want images:

  • Same size (usually 224×224 pixels)
  • Numbers between 0 and 1
  • In a special format called a tensor

from torchvision import transforms

prepare = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

ready_image = prepare(img)

Why Normalize? Imagine everyone speaking at different volumes. Normalizing makes everyone speak at the same volume so the computer can understand better!
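
In plain numbers, normalizing shifts and scales each channel: value = (pixel − mean) / std. A quick sketch of what happens to one red-channel pixel:

# Normalize one pixel value by hand
pixel = 0.8
mean, std = 0.485, 0.229
print((pixel - mean) / std)  # ≈ 1.38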


Step 3: Use a Pre-trained Model

Why train from scratch when smart people already did the work?

import torch
from torchvision import models

# Load a model that already knows
# 1000 different things!
# (newer torchvision versions prefer
#  weights=... over pretrained=True)
model = models.resnet50(pretrained=True)
model.eval()  # Tell it we're testing

Pre-trained = A model that already learned from millions of images. Like hiring an expert instead of training a newbie!


Step 4: Make a Prediction

# Add batch dimension
input_batch = ready_image.unsqueeze(0)

# Get prediction
with torch.no_grad():
    output = model(input_batch)

# Find the winner!
_, predicted = output.max(1)
print(f"Prediction: {predicted.item()}")

The model outputs confidence scores for each of 1000 categories. The highest score wins!
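
You can check this yourself: the raw output is one score (logit) per ImageNet class.

# One row of 1000 raw scores
print(output.shape)  # torch.Size([1, 1000])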


Step 5: See Human-Readable Results

# Top 5 predictions
probs = torch.nn.functional.softmax(
    output[0], dim=0
)
top5_prob, top5_idx = torch.topk(probs, 5)

# 'labels' maps class index -> name
# (see the loading sketch below)
for i in range(5):
    print(f"{labels[top5_idx[i]]}: "
          f"{top5_prob[i].item()*100:.1f}%")

Output might look like:

tabby cat: 87.3%
tiger cat: 8.2%
Egyptian cat: 2.1%
lynx: 1.4%
Persian cat: 0.8%
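
The labels list used above maps class indices to names. One common way to get it, assuming you've downloaded imagenet_classes.txt (available in the pytorch/hub repository), is:

# Load the 1000 ImageNet class names,
# one name per line
with open("imagenet_classes.txt") as f:
    labels = [line.strip() for line in f]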

💡 Key Insight: How Classification Works

graph TD
    A["Image Pixels"] --> B["Find Edges"]
    B --> C["Find Shapes"]
    C --> D["Find Parts"]
    D --> E["Recognize Object"]
    style A fill:#e1f5fe
    style E fill:#c8e6c9

The model learns layers of understanding:

  1. Layer 1: “I see edges and colors”
  2. Layer 2: “I see circles and lines”
  3. Layer 3: “I see eyes and ears”
  4. Layer 4: “That’s a cat!”
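
If you're curious, you can peek at these layers yourself with a forward hook. A minimal sketch, using the model and input_batch from the steps above (the layer name layer1 is ResNet-specific):

# Capture an intermediate layer's output
features = {}

def save_output(module, inp, out):
    features['layer1'] = out

hook = model.layer1.register_forward_hook(save_output)
with torch.no_grad():
    model(input_batch)
hook.remove()

# Early layers keep high spatial detail
print(features['layer1'].shape)
# e.g. torch.Size([1, 256, 56, 56])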

🎯 Part 2: Object Detection Basics

What Is It?

Object Detection = Classification + Location

Not just “there’s a cat” but “there’s a cat RIGHT HERE” (with a box around it!)

Real Life Examples:

  • Self-driving cars spotting pedestrians
  • Security cameras detecting intruders
  • Your camera focusing on faces

The Key Difference

| Classification | Object Detection |
|----------------|------------------|
| “This is a dog” | “There’s a dog at position (50, 100, 200, 300)” |
| One answer per image | Multiple objects possible |
| Single label | Labels + bounding boxes |

Bounding Boxes: Drawing Rectangles

A bounding box is just 4 numbers:

  • x1, y1: Top-left corner
  • x2, y2: Bottom-right corner

(x1, y1) ─────────┐
    │             │
    │   🐕 DOG    │
    │             │
    └───────────(x2, y2)
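
Those four numbers are all you need to compute useful things, like how much two boxes overlap. That overlap measure, Intersection over Union (IoU), is used constantly by detectors. A small sketch:

# Intersection-over-Union of two boxes,
# each given as (x1, y1, x2, y2)
def iou(a, b):
    # Corners of the overlap rectangle
    x1 = max(a[0], b[0])
    y1 = max(a[1], b[1])
    x2 = min(a[2], b[2])
    y2 = min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((50, 100, 200, 300),
          (100, 150, 250, 350)))  # ≈ 0.33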

Using a Pre-trained Detector

from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn
)

# Load pre-trained detector
model = fasterrcnn_resnet50_fpn(
    pretrained=True
)
model.eval()

# Prepare image (just tensor, no resize!)
img_tensor = transforms.ToTensor()(img)

# Detect!
with torch.no_grad():
    predictions = model([img_tensor])

Understanding the Output

pred = predictions[0]

# What did we find?
boxes = pred['boxes']   # Where things are
labels = pred['labels'] # What things are
scores = pred['scores'] # How confident

# Example output:
# boxes: [[50, 100, 200, 300],
#         [400, 50, 600, 250]]
# labels: [18, 1]  # 18=dog, 1=person
# scores: [0.95, 0.87]
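
The label IDs follow the COCO dataset's numbering. The drawing code below assumes a LABELS list mapping IDs to names; a minimal, truncated sketch (the full list, often copied from the torchvision detection tutorial, has 91 entries):

# COCO class names, indexed by label ID
# (truncated -- a real run needs the
#  full 91-entry list)
LABELS = [
    '__background__', 'person', 'bicycle',
    'car', 'motorcycle', 'airplane', 'bus',
    # ...
]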

Drawing Boxes on Images

import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots(1)
ax.imshow(img)

for box, label, score in zip(
    boxes, labels, scores
):
    if score > 0.5:  # Only confident ones
        # Convert tensor coords to floats
        x1, y1, x2, y2 = box.tolist()
        rect = patches.Rectangle(
            (x1, y1), x2 - x1, y2 - y1,
            linewidth=2,
            edgecolor='red',
            facecolor='none'
        )
        ax.add_patch(rect)
        # LABELS maps IDs to names
        # (see the sketch above)
        ax.text(x1, y1,
            f'{LABELS[label.item()]}: {score:.2f}')

plt.show()

Popular Detection Models

graph TD
    A["Object Detection Models"] --> B["Faster R-CNN"]
    A --> C["YOLO"]
    A --> D["SSD"]
    B --> E["Very Accurate"]
    C --> F["Super Fast"]
    D --> G["Good Balance"]

  • Faster R-CNN: Best accuracy, slower
  • YOLO: Real-time speed, good accuracy
  • SSD: Middle ground
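
torchvision ships several of these ready to use, so swapping architectures is a one-line change. For example, loading an SSD detector instead (assuming a torchvision version recent enough to include it):

# Same workflow, different architecture
from torchvision.models.detection import (
    ssd300_vgg16
)

model = ssd300_vgg16(pretrained=True)
model.eval()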

🎭 Part 3: Semantic Segmentation

What Is It?

Semantic Segmentation = Color every single pixel!

Not just “there’s a cat” but “these exact pixels are the cat, these are the sofa, these are the floor”

Think of it like a coloring book in reverse—the computer colors based on what things ARE.


The Visual Difference

Original Photo      →    Segmentation Mask

🏠🌳🚗🧑              🟦🟩🟥🟨
House Tree Car Person    Sky Grass Car Person

Every pixel gets a label. Every. Single. One.


Real Life Examples:

  • Self-driving cars knowing road vs sidewalk
  • Medical imaging outlining tumors precisely
  • Photo editing apps selecting objects automatically

Using DeepLabV3

from torchvision.models.segmentation import (
    deeplabv3_resnet101
)

# Load segmentation model
model = deeplabv3_resnet101(pretrained=True)
model.eval()

# Prepare image
input_tensor = prepare(img).unsqueeze(0)

# Segment!
with torch.no_grad():
    output = model(input_tensor)['out']

# Get class for each pixel
predictions = output.argmax(1)

Understanding the Output

The output is a mask with the same height and width as the transformed input image (here 224×224, after our transforms).

Each pixel has a number representing its class:

  • 0 = background
  • 1 = airplane
  • 2 = bicycle
  • 15 = person
  • etc.

# predictions shape: [1, H, W]
# Each value is a class ID

mask = predictions[0].numpy()
# mask[100, 200] might = 15 (person)
# mask[300, 400] might = 0 (background)
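
Since the mask is just an array of class IDs, counting the pixels in each class is a one-liner, which makes for a handy sanity check:

# Count how many pixels belong to each class
import numpy as np

ids, counts = np.unique(mask, return_counts=True)
print(dict(zip(ids.tolist(), counts.tolist())))
# e.g. {0: 38000, 15: 12176}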

Visualizing the Segmentation

import numpy as np

# Create colorful visualization
def decode_segmap(mask):
    # Pascal VOC palette -- a real run needs
    # all 21 colors, one per class
    colors = np.array([
        [0, 0, 0],       # background
        [128, 0, 0],     # airplane
        [0, 128, 0],     # bicycle
        # ... more colors
        [192, 128, 128], # person
    ])

    r = np.zeros_like(mask)
    g = np.zeros_like(mask)
    b = np.zeros_like(mask)

    for class_id in range(21):
        idx = mask == class_id
        r[idx] = colors[class_id, 0]
        g[idx] = colors[class_id, 1]
        b[idx] = colors[class_id, 2]

    # uint8 so matplotlib reads it as RGB
    rgb = np.stack([r, g, b], axis=2)
    return rgb.astype(np.uint8)

colored_mask = decode_segmap(mask)
plt.imshow(colored_mask)
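
To judge the result, it helps to overlay the mask on the photo. A quick sketch using matplotlib's alpha blending (resizing the image to match the mask's 224×224 shape):

# Overlay: image + translucent mask
img_small = img.resize(
    (mask.shape[1], mask.shape[0])
)
plt.imshow(img_small)
plt.imshow(colored_mask, alpha=0.5)
plt.axis('off')
plt.show()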

The Three Tasks Compared

graph LR
    A["📷 Same Image"] --> B["Classification"]
    A --> C["Detection"]
    A --> D["Segmentation"]
    B --> E["🏷️ Label: Dog"]
    C --> F["📦 Box + Label"]
    D --> G["🎨 Every Pixel Labeled"]

| Task | Question | Output |
|------|----------|--------|
| Classification | What is it? | Single label |
| Detection | What & where? | Boxes + labels |
| Segmentation | What is everything? | Pixel-by-pixel labels |

🎓 Bringing It All Together

The Complete PyTorch CV Toolkit

from torchvision import models

# Classification
classifier = models.resnet50(pretrained=True)

# Detection
detector = models.detection.fasterrcnn_resnet50_fpn(
    pretrained=True
)

# Segmentation
segmenter = models.segmentation.deeplabv3_resnet101(
    pretrained=True
)

All three share the same basic workflow:

  1. Load a pre-trained model
  2. Prepare your image
  3. Run inference
  4. Interpret the output
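
Because the workflow is shared, you can wrap all three behind one tiny helper. A sketch (prepare, img, and the models come from the sections above):

# One workflow, three tasks
def run_inference(model, model_input):
    model.eval()
    with torch.no_grad():
        return model(model_input)

# Classification: batched, normalized tensor
logits = run_inference(
    classifier, prepare(img).unsqueeze(0)
)

# Detection: list of un-normalized tensors
detections = run_inference(
    detector, [transforms.ToTensor()(img)]
)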

When to Use What?

| Use Case | Best Task |
|----------|-----------|
| “Is this a hot dog or not?” | Classification |
| “Find all faces in this photo” | Detection |
| “Remove the background precisely” | Segmentation |
| “Count cars in parking lot” | Detection |
| “Measure tumor size exactly” | Segmentation |

🚀 You Did It!

You now understand the three pillars of computer vision in PyTorch:

  1. Classification - “What is this?”
  2. Detection - “What is this and where?”
  3. Segmentation - “What is every single pixel?”

These are the building blocks for:

  • Self-driving cars
  • Medical diagnosis
  • Augmented reality
  • Photo editing
  • Security systems
  • And so much more!

The computer can now see because you taught it how. How cool is that? 🎉


“The real voyage of discovery consists not in seeking new landscapes, but in having new eyes.” — Marcel Proust

Now your computer has those new eyes. Go build something amazing!
