# 🏗️ CNN & Vision Models: Teaching Computers to See Like You!
## The Big Picture: What Are We Learning?
Imagine you’re a detective with a magical magnifying glass. This glass doesn’t just zoom in—it can spot patterns, recognize faces, and even tell if a cat is hiding in a pile of laundry! That’s exactly what Convolutional Neural Networks (CNNs) do for computers.
Today, we’ll explore:
- 🔍 How CNNs look at pictures (like detectives!)
- 🧱 How to build your own CNN brick by brick
- 🪜 The magic of skip connections (ResNet’s secret)
- 🚀 Modern CNN superstars
- 🎯 Vision Transformers (the new kid on the block)
## 🔍 CNN Architectures Overview
### What is a CNN? (The Detective Analogy)
Think of a CNN like a team of detectives examining a photograph:
- First Detective looks for simple things: edges, lines, corners
- Second Detective combines those clues: shapes, textures
- Third Detective sees bigger patterns: eyes, wheels, leaves
- Final Detective makes the decision: “It’s a cat!”
```python
# A CNN is just layers stacked together
import torch.nn as nn

# Each layer is like one detective
conv1 = nn.Conv2d(3, 16, 3)   # First detective: edges and lines
conv2 = nn.Conv2d(16, 32, 3)  # Second detective: shapes and textures
conv3 = nn.Conv2d(32, 64, 3)  # Third detective: bigger patterns
```
### The Three Key Ingredients
Every CNN has three main parts:
| Part | What It Does | Real-Life Example |
|---|---|---|
| Conv Layer | Finds patterns | Looking through a magnifying glass |
| Pooling | Shrinks the image | Making a thumbnail |
| Fully Connected | Makes decisions | The final verdict |
```mermaid
graph TD
    A["📷 Input Image"] --> B["🔍 Conv Layer 1"]
    B --> C["📉 Pooling"]
    C --> D["🔍 Conv Layer 2"]
    D --> E["📉 Pooling"]
    E --> F["🧠 Fully Connected"]
    F --> G["✅ Prediction"]
```
## 🧱 Building CNNs: Your First Neural Network
### Step 1: The Convolution Operation
Imagine sliding a small window across a photo. At each spot, you multiply and add numbers. That’s convolution!
**Simple Example:**
- Your image is a 5×5 grid of numbers (pixels)
- Your filter (kernel) is a 3×3 pattern
- You slide it across, getting a smaller 3×3 output (see the check below)
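Here is a quick check of that arithmetic using `torch.nn.functional.conv2d`: with no padding, a 5×5 input and a 3×3 kernel produce a 3×3 output, because the window fits in (5 − 3) + 1 = 3 positions per side.

```python
import torch
import torch.nn.functional as F

image = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)  # [batch, channel, 5, 5]
kernel = torch.ones(1, 1, 3, 3)  # a 3x3 "pattern detector" that just sums its window

out = F.conv2d(image, kernel)    # no padding, stride 1
print(out.shape)  # torch.Size([1, 1, 3, 3]) -- (5 - 3) + 1 = 3 per side
```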
```python
import torch
import torch.nn as nn

# Create a simple CNN (expects 32x32 RGB inputs, e.g. CIFAR-10)
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # First conv: 3 colors -> 16 features
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        # Second conv: 16 -> 32 features
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        # Final decision maker: 32 channels x 8x8 pixels after two poolings (32 -> 16 -> 8)
        self.fc = nn.Linear(32 * 8 * 8, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)  # flatten to [batch, 32*8*8]
        return self.fc(x)
```
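A quick smoke test (assuming 32×32 inputs, as the `32 * 8 * 8` in `fc` implies):

```python
model = SimpleCNN()
batch = torch.randn(4, 3, 32, 32)  # 4 fake RGB images, 32x32
logits = model(batch)
print(logits.shape)  # torch.Size([4, 10]) -- one score per class
```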
### Step 2: Understanding Each Layer
**Conv2d Parameters Explained:**
```python
nn.Conv2d(
    in_channels=3,    # RGB = 3 colors
    out_channels=16,  # 16 different patterns to find
    kernel_size=3,    # 3x3 magnifying glass
    padding=1         # add a border to keep the output size
)
```
**Like a Recipe:**
- `in_channels`: What you're cooking with (ingredients)
- `out_channels`: How many dishes you're making
- `kernel_size`: Size of your cooking pot
- `padding`: Extra space around the edges (see the size formula below)
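One useful rule of thumb: the spatial output size of a convolution is floor((n + 2·padding − kernel) / stride) + 1. A tiny helper (illustrative only, not part of PyTorch) makes the arithmetic explicit:

```python
def conv_out_size(n, kernel=3, padding=1, stride=1):
    """Spatial output size of a conv layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

print(conv_out_size(32))                      # 32 -- padding=1 keeps 3x3 convs size-preserving
print(conv_out_size(5, kernel=3, padding=0))  # 3 -- matches the 5x5 example above
```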
### Step 3: Activation & Pooling
```python
# ReLU: keep positive values, zero out negatives
x = torch.relu(x)  # makes learning faster!

# MaxPool: keep the strongest signal in each 2x2 window
pool = nn.MaxPool2d(2, 2)  # shrink height and width by half
x = pool(x)
```
**Why Pooling?**
- Makes computation faster
- Helps the network focus on the strongest features
- Like summarizing a book into key points
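A quick shape check makes the "shrink by half" concrete:

```python
x = torch.randn(1, 16, 32, 32)   # [batch, channels, height, width]
pooled = nn.MaxPool2d(2, 2)(x)
print(pooled.shape)  # torch.Size([1, 16, 16, 16]) -- channels kept, spatial size halved
```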
## 🪜 ResNet and Skip Connections
### The Problem: Vanishing Learning
Imagine playing telephone with 100 people. The message gets garbled! The same happens in deep networks—information gets lost.
### The Solution: Skip Connections (Shortcuts!)
**ResNet's Brilliant Idea:** Instead of just passing information forward, also create a shortcut that skips layers!
```mermaid
graph TD
    A["Input X"] --> B["Conv Layer"]
    A --> D["➕ Add"]
    B --> C["Conv Layer"]
    C --> D
    D --> E["Output"]
    style A fill:#e8f5e9
    style D fill:#fff3e0
    style E fill:#e3f2fd
```
### Building a ResNet Block
```python
class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        shortcut = x  # save the input for later!
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += shortcut  # the magic skip!
        return torch.relu(out)
```
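The block keeps the spatial shape, so you can stack as many as you like:

```python
block = ResBlock(64)
x = torch.randn(2, 64, 16, 16)
print(block(x).shape)  # torch.Size([2, 64, 16, 16]) -- same shape in, same shape out
```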
### Why Skip Connections Work
**The Elevator Analogy:**
- Without skip: Taking stairs (gets tiring!)
- With skip: Taking the elevator + walking when needed
**Benefits:**
- ✅ Train networks 100+ layers deep
- ✅ Information flows better
- ✅ Gradients don't vanish
- ✅ Easier to learn identity mappings (see the quick check below)
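That last point is easy to demonstrate (a minimal sketch using the `ResBlock` defined above): if the second conv's weights are zeroed out, the whole block collapses to `relu(x)`, so a residual block starts out close to the identity instead of scrambling its input.

```python
block = ResBlock(16)
block.eval()  # use fixed BatchNorm statistics for a deterministic check
nn.init.zeros_(block.conv2.weight)
nn.init.zeros_(block.conv2.bias)

x = torch.randn(2, 16, 8, 8)
print(torch.allclose(block(x), torch.relu(x)))  # True -- the block is (nearly) an identity
```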
## 🚀 Modern CNN Architectures
### The Evolution of CNNs
```mermaid
graph LR
    A["LeNet 1998"] --> B["AlexNet 2012"]
    B --> C["VGG 2014"]
    C --> D["ResNet 2015"]
    D --> E["EfficientNet 2019"]
    style A fill:#ffebee
    style E fill:#e8f5e9
```
### Key Modern Architectures
#### 1. VGGNet - Keep It Simple
Uses only 3×3 filters, stacked deep.
```python
# VGG-style block: two 3x3 convs, then pool
vgg_block = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2, 2)
)
```
#### 2. Inception/GoogLeNet - Multiple Paths
Looks at images with different "magnifying glasses" at once!
```python
# Inception module concept: parallel paths with different kernel sizes
class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.path1 = nn.Conv2d(in_ch, 64, 1)             # 1x1: cheap channel mixing
        self.path2 = nn.Conv2d(in_ch, 64, 3, padding=1)  # 3x3: medium view
        self.path3 = nn.Conv2d(in_ch, 64, 5, padding=2)  # 5x5: wide view

    def forward(self, x):
        p1 = self.path1(x)
        p2 = self.path2(x)
        p3 = self.path3(x)
        # Stack all three views along the channel dimension
        return torch.cat([p1, p2, p3], dim=1)
```
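Since each of the three paths produces 64 channels, the concatenated output has 64 × 3 = 192 channels:

```python
block = InceptionBlock(32)
x = torch.randn(1, 32, 28, 28)
print(block(x).shape)  # torch.Size([1, 192, 28, 28]) -- 64 channels from each path
```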
#### 3. EfficientNet - Smart Scaling
Scales width, depth, and resolution together.
| Dimension | What It Means |
|---|---|
| Width | More filters per layer |
| Depth | More layers |
| Resolution | Larger images |
```python
# Using a pretrained EfficientNet
from torchvision.models import efficientnet_b0

model = efficientnet_b0(pretrained=True)  # newer torchvision versions use weights="IMAGENET1K_V1"

# Modify the final layer for your task
num_classes = 10
model.classifier[1] = nn.Linear(1280, num_classes)
```
#### 4. MobileNet - For Phones!
Uses depthwise separable convolutions to be fast and light.
```python
# Depthwise separable convolution: split a normal conv into 2 steps
depthwise = nn.Conv2d(32, 32, 3, padding=1, groups=32)  # one 3x3 filter per channel
pointwise = nn.Conv2d(32, 64, 1)                        # 1x1 conv mixes channels
# Much fewer calculations than one 3x3 conv from 32 to 64 channels!
```
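Counting parameters shows the saving for this configuration (`standard` below is the hypothetical conv being replaced):

```python
standard = nn.Conv2d(32, 64, 3, padding=1)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(standard))                         # 18,496
print(n_params(depthwise) + n_params(pointwise))  # 320 + 2,112 = 2,432 -- about 7.6x fewer
```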
## 🎯 Vision Transformer (ViT)
### A New Way to See Images
**The Big Idea:** Instead of sliding filters, chop the image into patches and treat them like words in a sentence!
```mermaid
graph TD
    A["🖼️ Image 224x224"] --> B["✂️ Cut into Patches"]
    B --> C["16x16 patches = 196 patches"]
    C --> D["📝 Flatten Each Patch"]
    D --> E["➕ Add Position Info"]
    E --> F["🤖 Transformer Encoder"]
    F --> G["✅ Classification"]
```
### How ViT Works
**Step 1: Patch the Image**
```python
# Split a 224x224 image into 16x16 patches
patch_size = 16
# 224 / 16 = 14 per side, so 14 x 14 = 196 patches
```
**Step 2: Embed Each Patch**
```python
class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, embed_dim=768):
        super().__init__()
        # A conv with stride == kernel size embeds each patch in one shot
        self.proj = nn.Conv2d(3, embed_dim, patch_size, stride=patch_size)

    def forward(self, x):
        # [B, 3, 224, 224] -> [B, 768, 14, 14]
        x = self.proj(x)
        # [B, 768, 14, 14] -> [B, 768, 196] -> [B, 196, 768]
        return x.flatten(2).transpose(1, 2)
```
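Checking the shapes:

```python
embed = PatchEmbed()
imgs = torch.randn(2, 3, 224, 224)
print(embed(imgs).shape)  # torch.Size([2, 196, 768]) -- 196 "words", 768 dims each
```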
**Step 3: Add Position Information**
```python
# Tell the model WHERE each patch came from
pos_embed = nn.Parameter(torch.zeros(1, 197, 768))
# 197 = 196 patches + 1 special [CLS] token
```
**Step 4: Transform!**
```python
# Use Transformer layers (just like in GPT!)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),  # batch_first: inputs are [batch, seq, dim]
    num_layers=12
)
```
### Simple ViT Implementation
```python
class SimpleViT(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.patch_embed = PatchEmbed()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, 768))
        self.pos_embed = nn.Parameter(torch.zeros(1, 197, 768))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(768, 12, batch_first=True),
            num_layers=12
        )
        self.head = nn.Linear(768, num_classes)

    def forward(self, x):
        B = x.shape[0]
        x = self.patch_embed(x)                 # [B, 196, 768]
        cls = self.cls_token.expand(B, -1, -1)  # one [CLS] token per image
        x = torch.cat([cls, x], dim=1)          # [B, 197, 768]
        x = x + self.pos_embed                  # add position information
        x = self.encoder(x)
        return self.head(x[:, 0])               # classify from the [CLS] token
```
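And a quick forward pass to confirm it runs end to end:

```python
vit = SimpleViT(num_classes=10)
imgs = torch.randn(2, 3, 224, 224)
print(vit(imgs).shape)  # torch.Size([2, 10])
```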
### CNN vs ViT: When to Use What?
| Aspect | CNN | ViT |
|---|---|---|
| Small Data | ✅ Better | ❌ Needs more data |
| Big Data | ⚡ Good | ✅ Excellent |
| Local Patterns | ✅ Built-in | ⚡ Learns them |
| Global Context | ⚡ Limited | ✅ Natural |
| Speed | ✅ Faster | ⚡ Slower |
## 🎮 Putting It All Together
### Loading Pretrained Models in PyTorch
```python
import torchvision.models as models

# Load pretrained models (newer torchvision versions prefer the weights= argument)
resnet = models.resnet50(pretrained=True)
vit = models.vit_b_16(pretrained=True)
efficientnet = models.efficientnet_b0(pretrained=True)

# Modify the final layer for your task (e.g., 10 classes)
resnet.fc = nn.Linear(2048, 10)
vit.heads.head = nn.Linear(768, 10)
efficientnet.classifier[1] = nn.Linear(1280, 10)
```
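A common fine-tuning recipe (one reasonable approach, not the only one) is to freeze the pretrained backbone and train only the new head:

```python
# Freeze everything...
for param in resnet.parameters():
    param.requires_grad = False

# ...then replace the head; freshly created layers default to requires_grad=True
resnet.fc = nn.Linear(2048, 10)

# Only the head's parameters will be updated by the optimizer
trainable = [p for p in resnet.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))  # just the new fc layer: 2048*10 + 10 = 20,490
```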
### Quick Summary
| Architecture | Key Innovation | Best For |
|---|---|---|
| CNN | Convolution + Pooling | General images |
| ResNet | Skip connections | Deep networks |
| EfficientNet | Compound scaling | Efficiency |
| MobileNet | Depthwise convs | Mobile devices |
| ViT | Image patches + Transformers | Large-scale |
## 🌟 Key Takeaways
- **CNNs are detectives** - Each layer finds different patterns
- **Skip connections save the day** - ResNet's secret weapon
- **Modern CNNs are specialized** - EfficientNet for efficiency, MobileNet for phones
- **ViT thinks differently** - Patches instead of sliding windows
- **Use pretrained models** - Start with knowledge, fine-tune for your task
You’re now ready to build vision AI! 🚀
Remember: The best architecture depends on your data, compute, and problem. Start simple, then experiment!
