🎨 PyTorch Computer Vision: Teaching Machines to See!
The Big Picture: What If Your Computer Could See Like You?
Imagine you’re a detective with a magnifying glass. You look at a picture and instantly know:
- “That’s a cat!” (Image Classification)
- “There’s a cat AND a dog, and here’s exactly where each one is!” (Object Detection)
- “Every single tiny piece of this picture belongs to something specific!” (Semantic Segmentation)
That’s exactly what we’re teaching computers to do with PyTorch! Let’s go on this adventure together. 🚀
🌟 Our Universal Analogy: The Art Gallery
Think of computer vision like running a magical art gallery:
- Image Classification = The gallery guide who looks at a painting and says “This is a landscape!”
- Object Detection = The security guard who spots every valuable item AND draws boxes around them
- Semantic Segmentation = The restoration expert who can tell you what color belongs to the sky, the grass, the trees—every single brushstroke!
📸 Part 1: Image Classification Pipeline
What Is It?
Image Classification answers one simple question: “What is this picture of?”
Show a computer a photo, and it tells you: “This is a dog” or “This is a car” or “This is pizza!” 🍕
Real Life Examples:
- Your phone sorting photos into “Pets,” “Food,” “Nature”
- Doctors checking X-rays for diseases
- Farmers identifying sick crops from healthy ones
The Pipeline: Step by Step
Think of it like making a sandwich—each step matters!
```mermaid
graph TD
    A["📷 Get Image"] --> B["🔧 Prepare Image"]
    B --> C["🧠 Feed to Model"]
    C --> D["📊 Get Predictions"]
    D --> E["🏷️ Show Answer"]
```
Step 1: Get Your Image
```python
from PIL import Image

# Open any image
img = Image.open("cat.jpg")
```
Just like picking up a photo to look at!
Step 2: Prepare the Image (Transform It)
Computers are picky eaters. They want images:
- Same size (usually 224×224 pixels)
- Numbers between 0 and 1
- In a special format called a tensor
```python
from torchvision import transforms

prepare = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

ready_image = prepare(img)
```
Why Normalize? Imagine everyone speaking at different volumes. Normalizing makes everyone speak at the same volume so the computer can understand better! (Those mean/std numbers are the ImageNet statistics the model was trained with, so your image matches what it saw during training.)
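Under the hood, Normalize is simple arithmetic. Here's a tiny sketch of what it does to a single red-channel value, using the ImageNet mean/std from above:

```python
# What Normalize does by hand: (value - mean) / std, per color channel
red_value = 0.5                      # a red-channel pixel after ToTensor
print((red_value - 0.485) / 0.229)   # ≈ 0.0655 -- centered near 0
```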
Step 3: Use a Pre-trained Model
Why train from scratch when smart people already did the work?
```python
import torch
from torchvision import models

# Load a model that already knows 1000 different things!
# (newer torchvision prefers weights=models.ResNet50_Weights.DEFAULT
#  over the deprecated pretrained=True)
model = models.resnet50(pretrained=True)
model.eval()  # switch to inference mode -- no more learning, just answering
```
Pre-trained = A model that already learned from millions of images. Like hiring an expert instead of training a newbie!
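Curious what else is available? Newer torchvision (0.14+) can list its built-in architectures by name, as a quick sketch:

```python
from torchvision import models

# Newer torchvision (0.14+): list every built-in architecture by name
print(models.list_models()[:5])
# e.g. ['alexnet', 'convnext_base', ...]
```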
Step 4: Make a Prediction
```python
# Add batch dimension: the model expects [batch, channels, height, width]
input_batch = ready_image.unsqueeze(0)

# Get prediction (no_grad skips gradient bookkeeping -- faster for inference)
with torch.no_grad():
    output = model(input_batch)

# Find the winner!
_, predicted = output.max(1)
print(f"Prediction: {predicted.item()}")
```
The model outputs a raw score (a logit) for each of its 1000 ImageNet categories. The highest score wins!
Step 5: See Human-Readable Results
# Top 5 predictions
probs = torch.nn.functional.softmax(
output[0], dim=0
)
top5_prob, top5_idx = torch.topk(probs, 5)
for i in range(5):
print(f"{labels[top5_idx[i]]}: "
f"{top5_prob[i].item()*100:.1f}%")
Output might look like:
```
tabby cat: 87.3%
tiger cat: 8.2%
Egyptian cat: 2.1%
lynx: 1.4%
Persian cat: 0.8%
```
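Where does that `labels` list come from? In newer torchvision releases (0.13+) the class names ship with the weights themselves; a minimal sketch:

```python
from torchvision.models import resnet50, ResNet50_Weights

# Newer torchvision (0.13+): the weights object carries metadata,
# including the list of 1000 ImageNet class names
weights = ResNet50_Weights.DEFAULT
labels = weights.meta["categories"]
model = resnet50(weights=weights)
```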
💡 Key Insight: How Classification Works
```mermaid
graph TD
    A["Image Pixels"] --> B["Find Edges"]
    B --> C["Find Shapes"]
    C --> D["Find Parts"]
    D --> E["Recognize Object"]
    style A fill:#e1f5fe
    style E fill:#c8e6c9
```
The model learns layers of understanding:
- Layer 1: “I see edges and colors”
- Layer 2: “I see circles and lines”
- Layer 3: “I see eyes and ears”
- Layer 4: “That’s a cat!”
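You can actually watch this hierarchy take shape with forward hooks. A minimal sketch (the `layer1`–`layer4` names are torchvision's ResNet-50 stage names; the exact shapes assume a 224×224 input):

```python
import torch
from torchvision import models

# Weights don't matter for inspecting shapes, so skip the download
model = models.resnet50()
model.eval()

# Print the output shape of each residual stage as an image flows through
def report_shape(name):
    def hook(module, inputs, output):
        print(f"{name}: {tuple(output.shape)}")
    return hook

for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(report_shape(name))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
# layer1: (1, 256, 56, 56)   -- high resolution, simple features
# layer2: (1, 512, 28, 28)
# layer3: (1, 1024, 14, 14)
# layer4: (1, 2048, 7, 7)    -- low resolution, object-level features
```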
🎯 Part 2: Object Detection Basics
What Is It?
Object Detection = Classification + Location
Not just “there’s a cat” but “there’s a cat RIGHT HERE” (with a box around it!)
Real Life Examples:
- Self-driving cars spotting pedestrians
- Security cameras detecting intruders
- Your camera focusing on faces
The Key Difference
| Classification | Object Detection |
|---|---|
| “This is a dog” | “There’s a dog at position (50, 100, 200, 300)” |
| One answer per image | Multiple objects possible |
| Single label | Labels + Bounding Boxes |
Bounding Boxes: Drawing Rectangles
A bounding box is just 4 numbers:
- x1, y1: Top-left corner
- x2, y2: Bottom-right corner
```
(x1, y1) ────────────┐
│                    │
│      🐕 DOG        │
│                    │
└──────────── (x2, y2)
```
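Corner coordinates are just one common box format; torchvision's `box_convert` can translate between them. A quick sketch:

```python
import torch
from torchvision.ops import box_convert

# One box in (x1, y1, x2, y2) corner format
box = torch.tensor([[50., 100., 200., 300.]])

# Convert to (x, y, width, height): top-left corner plus size
print(box_convert(box, in_fmt="xyxy", out_fmt="xywh"))
# tensor([[ 50., 100., 150., 200.]])
```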
Using a Pre-trained Detector
```python
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn
)

# Load pre-trained detector
model = fasterrcnn_resnet50_fpn(
    pretrained=True
)
model.eval()

# Prepare image (just ToTensor, no resize --
# the detector rescales and normalizes internally)
img_tensor = transforms.ToTensor()(img)

# Detect! The model takes a *list* of image tensors
with torch.no_grad():
    predictions = model([img_tensor])
```
Understanding the Output
```python
pred = predictions[0]

# What did we find?
boxes = pred['boxes']    # Where things are
labels = pred['labels']  # What things are
scores = pred['scores']  # How confident

# Example output:
# boxes:  [[50, 100, 200, 300],
#          [400, 50, 600, 250]]
# labels: [18, 1]  # 18=dog, 1=person (COCO IDs)
# scores: [0.95, 0.87]
```
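Those label IDs index into the COCO category list. In newer torchvision (0.13+), that list ships with the detector's weights; a minimal sketch that also defines the `LABELS` list used below:

```python
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights
)

# Newer torchvision (0.13+): class names ship with the weights
weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
LABELS = weights.meta["categories"]  # LABELS[1] == "person", LABELS[18] == "dog"
model = fasterrcnn_resnet50_fpn(weights=weights)
model.eval()
```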
Drawing Boxes on Images
```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots(1)
ax.imshow(img)

# LABELS is the COCO class-name list loaded above
for box, label, score in zip(boxes, labels, scores):
    if score > 0.5:  # Only confident ones
        x1, y1, x2, y2 = box.tolist()
        rect = patches.Rectangle(
            (x1, y1), x2 - x1, y2 - y1,
            linewidth=2,
            edgecolor='red',
            facecolor='none'
        )
        ax.add_patch(rect)
        ax.text(x1, y1,
                f'{LABELS[label]}: {score:.2f}')
plt.show()
```
Popular Detection Models
```mermaid
graph TD
    A["Object Detection Models"] --> B["Faster R-CNN"]
    A --> C["YOLO"]
    A --> D["SSD"]
    B --> E["Very Accurate"]
    C --> F["Super Fast"]
    D --> G["Good Balance"]
```
- Faster R-CNN: Best accuracy, slower
- YOLO: Real-time speed, good accuracy
- SSD: Middle ground
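YOLO lives outside torchvision (e.g., in the `ultralytics` package), but swapping between the built-in detectors is a one-line change, since they all return the same boxes/labels/scores format. A sketch:

```python
from torchvision.models.detection import (
    ssd300_vgg16, retinanet_resnet50_fpn
)

# Swap in a different built-in detector; boxes/labels/scores
# come back in the same format as Faster R-CNN's
model = ssd300_vgg16(pretrained=True)
model.eval()
```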
🎭 Part 3: Semantic Segmentation
What Is It?
Semantic Segmentation = Color every single pixel!
Not just “there’s a cat” but “these exact pixels are the cat, these are the sofa, these are the floor”
Think of it like a coloring book in reverse—the computer colors based on what things ARE.
The Visual Difference
```
Original Photo          →    Segmentation Mask
🏠   🌳   🚗   🧑             🟦   🟩   🟥   🟨
House Tree Car Person        House Tree Car Person
```
Every pixel gets a label. Every. Single. One.
Real Life Examples:
- Self-driving cars knowing road vs sidewalk
- Medical imaging outlining tumors precisely
- Photo editing apps selecting objects automatically
Using DeepLabV3
```python
from torchvision.models.segmentation import (
    deeplabv3_resnet101
)

# Load segmentation model
model = deeplabv3_resnet101(pretrained=True)
model.eval()

# Prepare image (same transform as classification)
input_tensor = prepare(img).unsqueeze(0)

# Segment!
with torch.no_grad():
    output = model(input_tensor)['out']

# Get the winning class for each pixel
predictions = output.argmax(1)
```
Understanding the Output
The output is a mask with the same height and width as the input tensor (224×224 after our transform).
Each pixel holds a number representing its class (the 21 PASCAL VOC classes):
- 0 = background
- 1 = airplane
- 2 = bicycle
- 15 = person
- etc.
```python
# predictions shape: [1, H, W]
# Each value is a class ID
mask = predictions[0].numpy()

# mask[100, 200] might be 15 (person)
# mask[300, 400] might be 0 (background)
```
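Because every pixel is labeled, you can measure things straight from the mask. For example, a quick sketch that reports how much of the image each class covers:

```python
import numpy as np

# How many pixels does each class occupy?
class_ids, counts = np.unique(mask, return_counts=True)
for class_id, count in zip(class_ids, counts):
    print(f"class {class_id}: {100 * count / mask.size:.1f}% of the image")
```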
Visualizing the Segmentation
```python
import numpy as np

# Create a colorful visualization
def decode_segmap(mask):
    # Define an RGB color for each class
    # (the full PASCAL VOC palette has 21 entries)
    colors = np.array([
        [0, 0, 0],        # background
        [128, 0, 0],      # airplane
        [0, 128, 0],      # bicycle
        # ... more colors
        [192, 128, 128],  # person
    ])
    # uint8 channels so matplotlib reads values as 0-255 RGB
    r = np.zeros_like(mask, dtype=np.uint8)
    g = np.zeros_like(mask, dtype=np.uint8)
    b = np.zeros_like(mask, dtype=np.uint8)
    for class_id in range(len(colors)):
        idx = mask == class_id
        r[idx] = colors[class_id, 0]
        g[idx] = colors[class_id, 1]
        b[idx] = colors[class_id, 2]
    return np.stack([r, g, b], axis=2)

colored_mask = decode_segmap(mask)
plt.imshow(colored_mask)
plt.show()
```
The Three Tasks Compared
```mermaid
graph LR
    A["📷 Same Image"] --> B["Classification"]
    A --> C["Detection"]
    A --> D["Segmentation"]
    B --> E["🏷️ Label: Dog"]
    C --> F["📦 Box + Label"]
    D --> G["🎨 Every Pixel Labeled"]
```
| Task | Question | Output |
|---|---|---|
| Classification | What is it? | Single label |
| Detection | What & where? | Boxes + labels |
| Segmentation | What is everything? | Pixel-by-pixel labels |
🎓 Bringing It All Together
The Complete PyTorch CV Toolkit
```python
from torchvision import models

# Classification
classifier = models.resnet50(pretrained=True)

# Detection
detector = models.detection.fasterrcnn_resnet50_fpn(
    pretrained=True
)

# Segmentation
segmenter = models.segmentation.deeplabv3_resnet101(
    pretrained=True
)
```
All three share the same basic workflow (sketched below):
1. Load a pre-trained model
2. Prepare your image
3. Run inference
4. Interpret the output
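As a rough sketch (the `run_inference` helper is hypothetical, just to show the shared shape of the workflow):

```python
import torch

# Hypothetical helper: the same three steps work for every model
def run_inference(model, prepared_input):
    model.eval()                      # 1. model loaded, switched to inference
    with torch.no_grad():             # 3. run inference without gradients
        return model(prepared_input)  # 4. caller interprets the output

# 2. "prepare your image" differs per task:
# classification: run_inference(classifier, ready_image.unsqueeze(0))
# detection:      run_inference(detector, [img_tensor])
# segmentation:   run_inference(segmenter, input_tensor)['out']
```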
When to Use What?
| Use Case | Best Task |
|---|---|
| “Is this a hot dog or not?” | Classification |
| “Find all faces in this photo” | Detection |
| “Remove the background precisely” | Segmentation |
| “Count cars in parking lot” | Detection |
| “Measure tumor size exactly” | Segmentation |
🚀 You Did It!
You now understand the three pillars of computer vision in PyTorch:
- Classification - “What is this?”
- Detection - “What is this and where?”
- Segmentation - “What is every single pixel?”
These are the building blocks for:
- Self-driving cars
- Medical diagnosis
- Augmented reality
- Photo editing
- Security systems
- And so much more!
The computer can now see because you taught it how. How cool is that? 🎉
“The real voyage of discovery consists not in seeking new landscapes, but in having new eyes.” — Marcel Proust
Now your computer has those new eyes. Go build something amazing!
