# 🏗️ CNN & Vision Models: Teaching Computers to See Like You!
## The Big Picture: What Are We Learning?
Imagine you’re a detective with a magical magnifying glass. This glass doesn’t just zoom in—it can spot patterns, recognize faces, and even tell if a cat is hiding in a pile of laundry! That’s exactly what Convolutional Neural Networks (CNNs) do for computers.
Today, we’ll explore:
- 🔍 How CNNs look at pictures (like detectives!)
- 🧱 How to build your own CNN brick by brick
- 🪜 The magic of skip connections (ResNet’s secret)
- 🚀 Modern CNN superstars
- 🎯 Vision Transformers (the new kid on the block)
## 🔍 CNN Architectures Overview
### What is a CNN? (The Detective Analogy)
Think of a CNN like a team of detectives examining a photograph:
- First Detective looks for simple things: edges, lines, corners
- Second Detective combines those clues: shapes, textures
- Third Detective sees bigger patterns: eyes, wheels, leaves
- Final Detective makes the decision: “It’s a cat!”
```python
# A CNN is just layers stacked together
import torch.nn as nn

# Each layer is like one detective
conv1 = nn.Conv2d(3, 16, 3)   # First detective: edges and lines
conv2 = nn.Conv2d(16, 32, 3)  # Second detective: shapes and textures
conv3 = nn.Conv2d(32, 64, 3)  # Third detective: bigger patterns
```
### The Three Key Ingredients
Every CNN has three main parts:
| Part | What It Does | Real-Life Example |
|---|---|---|
| Conv Layer | Finds patterns | Looking through a magnifying glass |
| Pooling | Shrinks the image | Making a thumbnail |
| Fully Connected | Makes decisions | The final verdict |
```mermaid
graph TD
    A["📷 Input Image"] --> B["🔍 Conv Layer 1"]
    B --> C["📉 Pooling"]
    C --> D["🔍 Conv Layer 2"]
    D --> E["📉 Pooling"]
    E --> F["🧠 Fully Connected"]
    F --> G["✅ Prediction"]
```
## 🧱 Building CNNs: Your First Neural Network
### Step 1: The Convolution Operation
Imagine sliding a small window across a photo. At each spot, you multiply and add numbers. That’s convolution!
**Simple Example:**
- Your image is a 5×5 grid of numbers (pixels)
- Your filter (kernel) is a 3×3 pattern
- You slide it across, getting a smaller 3×3 output (see the check below)
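Here is a quick check of that arithmetic using `torch.nn.functional.conv2d`: with no padding, a 5×5 input and a 3×3 kernel produce a 3×3 output, because the window fits in (5 − 3) + 1 = 3 positions per side.

```python
import torch
import torch.nn.functional as F

image = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)  # [batch, channel, 5, 5]
kernel = torch.ones(1, 1, 3, 3)  # a 3x3 "pattern detector" that just sums its window

out = F.conv2d(image, kernel)    # no padding, stride 1
print(out.shape)  # torch.Size([1, 1, 3, 3]) -- (5 - 3) + 1 = 3 per side
```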
```python
import torch
import torch.nn as nn

# Create a simple CNN (expects 32x32 RGB inputs, e.g. CIFAR-10)
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # First conv: 3 colors -> 16 features
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        # Second conv: 16 -> 32 features
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        # Final decision maker: 32 channels x 8x8 pixels after two poolings (32 -> 16 -> 8)
        self.fc = nn.Linear(32 * 8 * 8, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)  # flatten to [batch, 32*8*8]
        return self.fc(x)
```
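A quick smoke test (assuming 32×32 inputs, as the `32 * 8 * 8` in `fc` implies):

```python
model = SimpleCNN()
batch = torch.randn(4, 3, 32, 32)  # 4 fake RGB images, 32x32
logits = model(batch)
print(logits.shape)  # torch.Size([4, 10]) -- one score per class
```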
### Step 2: Understanding Each Layer
**Conv2d Parameters Explained:**
```python
nn.Conv2d(
    in_channels=3,    # RGB = 3 colors
    out_channels=16,  # 16 different patterns to find
    kernel_size=3,    # 3x3 magnifying glass
    padding=1         # add a border to keep the output size
)
```
**Like a Recipe:**
- `in_channels`: What you're cooking with (ingredients)
- `out_channels`: How many dishes you're making
- `kernel_size`: Size of your cooking pot
- `padding`: Extra space around the edges (see the size formula below)
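One useful rule of thumb: the spatial output size of a convolution is floor((n + 2·padding − kernel) / stride) + 1. A tiny helper (illustrative only, not part of PyTorch) makes the arithmetic explicit:

```python
def conv_out_size(n, kernel=3, padding=1, stride=1):
    """Spatial output size of a conv layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

print(conv_out_size(32))                      # 32 -- padding=1 keeps 3x3 convs size-preserving
print(conv_out_size(5, kernel=3, padding=0))  # 3 -- matches the 5x5 example above
```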
### Step 3: Activation & Pooling
```python
# ReLU: keep positive values, zero out negatives
x = torch.relu(x)  # makes learning faster!

# MaxPool: keep the strongest signal in each 2x2 window
pool = nn.MaxPool2d(2, 2)  # shrink height and width by half
x = pool(x)
```
**Why Pooling?**
- Makes computation faster
- Helps the network focus on the strongest features
- Like summarizing a book into key points
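A quick shape check makes the "shrink by half" concrete:

```python
x = torch.randn(1, 16, 32, 32)   # [batch, channels, height, width]
pooled = nn.MaxPool2d(2, 2)(x)
print(pooled.shape)  # torch.Size([1, 16, 16, 16]) -- channels kept, spatial size halved
```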
## 🪜 ResNet and Skip Connections
### The Problem: Vanishing Learning
Imagine playing telephone with 100 people. The message gets garbled! The same happens in deep networks—information gets lost.
### The Solution: Skip Connections (Shortcuts!)
**ResNet's Brilliant Idea:** Instead of just passing information forward, also create a shortcut that skips layers!
```mermaid
graph TD
    A["Input X"] --> B["Conv Layer"]
    A --> D["➕ Add"]
    B --> C["Conv Layer"]
    C --> D
    D --> E["Output"]
    style A fill:#e8f5e9
    style D fill:#fff3e0
    style E fill:#e3f2fd
```
### Building a ResNet Block
```python
class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        shortcut = x  # save the input for later!
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += shortcut  # the magic skip!
        return torch.relu(out)
```
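The block keeps the spatial shape, so you can stack as many as you like:

```python
block = ResBlock(64)
x = torch.randn(2, 64, 16, 16)
print(block(x).shape)  # torch.Size([2, 64, 16, 16]) -- same shape in, same shape out
```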
### Why Skip Connections Work
**The Elevator Analogy:**
- Without skip: Taking stairs (gets tiring!)
- With skip: Taking the elevator + walking when needed
**Benefits:**
- ✅ Train networks 100+ layers deep
- ✅ Information flows better
- ✅ Gradients don't vanish
- ✅ Easier to learn identity mappings (see the quick check below)
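That last point is easy to demonstrate (a minimal sketch using the `ResBlock` defined above): if the second conv's weights are zeroed out, the whole block collapses to `relu(x)`, so a residual block starts out close to the identity instead of scrambling its input.

```python
block = ResBlock(16)
block.eval()  # use fixed BatchNorm statistics for a deterministic check
nn.init.zeros_(block.conv2.weight)
nn.init.zeros_(block.conv2.bias)

x = torch.randn(2, 16, 8, 8)
print(torch.allclose(block(x), torch.relu(x)))  # True -- the block is (nearly) an identity
```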
## 🚀 Modern CNN Architectures
### The Evolution of CNNs
```mermaid
graph LR
    A["LeNet 1998"] --> B["AlexNet 2012"]
    B --> C["VGG 2014"]
    C --> D["ResNet 2015"]
    D --> E["EfficientNet 2019"]
    style A fill:#ffebee
    style E fill:#e8f5e9
```
### Key Modern Architectures
#### 1. VGGNet - Keep It Simple
Uses only 3×3 filters, stacked deep.
```python
# VGG-style block: two 3x3 convs, then pool
vgg_block = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2, 2)
)
```
#### 2. Inception/GoogLeNet - Multiple Paths
Looks at images with different "magnifying glasses" at once!
```python
# Inception module concept: parallel paths with different kernel sizes
class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.path1 = nn.Conv2d(in_ch, 64, 1)             # 1x1: cheap channel mixing
        self.path2 = nn.Conv2d(in_ch, 64, 3, padding=1)  # 3x3: medium view
        self.path3 = nn.Conv2d(in_ch, 64, 5, padding=2)  # 5x5: wide view

    def forward(self, x):
        p1 = self.path1(x)
        p2 = self.path2(x)
        p3 = self.path3(x)
        # Stack all three views along the channel dimension
        return torch.cat([p1, p2, p3], dim=1)
```
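Since each of the three paths produces 64 channels, the concatenated output has 64 × 3 = 192 channels:

```python
block = InceptionBlock(32)
x = torch.randn(1, 32, 28, 28)
print(block(x).shape)  # torch.Size([1, 192, 28, 28]) -- 64 channels from each path
```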
#### 3. EfficientNet - Smart Scaling
Scales width, depth, and resolution together.
| Dimension | What It Means |
|---|---|
| Width | More filters per layer |
| Depth | More layers |
| Resolution | Larger images |
```python
# Using a pretrained EfficientNet
from torchvision.models import efficientnet_b0

model = efficientnet_b0(pretrained=True)  # newer torchvision versions use weights="IMAGENET1K_V1"

# Modify the final layer for your task
num_classes = 10
model.classifier[1] = nn.Linear(1280, num_classes)
```
#### 4. MobileNet - For Phones!
Uses depthwise separable convolutions to be fast and light.
```python
# Depthwise separable convolution: split a normal conv into 2 steps
depthwise = nn.Conv2d(32, 32, 3, padding=1, groups=32)  # one 3x3 filter per channel
pointwise = nn.Conv2d(32, 64, 1)                        # 1x1 conv mixes channels
# Much fewer calculations than one 3x3 conv from 32 to 64 channels!
```
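Counting parameters shows the saving for this configuration (`standard` below is the hypothetical conv being replaced):

```python
standard = nn.Conv2d(32, 64, 3, padding=1)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(standard))                         # 18,496
print(n_params(depthwise) + n_params(pointwise))  # 320 + 2,112 = 2,432 -- about 7.6x fewer
```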
## 🎯 Vision Transformer (ViT)
### A New Way to See Images
**The Big Idea:** Instead of sliding filters, chop the image into patches and treat them like words in a sentence!
```mermaid
graph TD
    A["🖼️ Image 224x224"] --> B["✂️ Cut into Patches"]
    B --> C["16x16 patches = 196 patches"]
    C --> D["📝 Flatten Each Patch"]
    D --> E["➕ Add Position Info"]
    E --> F["🤖 Transformer Encoder"]
    F --> G["✅ Classification"]
```
### How ViT Works
**Step 1: Patch the Image**
```python
# Split a 224x224 image into 16x16 patches
patch_size = 16
# 224 / 16 = 14 per side, so 14 x 14 = 196 patches
```
**Step 2: Embed Each Patch**
```python
class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, embed_dim=768):
        super().__init__()
        # A conv with stride == kernel size embeds each patch in one shot
        self.proj = nn.Conv2d(3, embed_dim, patch_size, stride=patch_size)

    def forward(self, x):
        # [B, 3, 224, 224] -> [B, 768, 14, 14]
        x = self.proj(x)
        # [B, 768, 14, 14] -> [B, 768, 196] -> [B, 196, 768]
        return x.flatten(2).transpose(1, 2)
```
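Checking the shapes:

```python
embed = PatchEmbed()
imgs = torch.randn(2, 3, 224, 224)
print(embed(imgs).shape)  # torch.Size([2, 196, 768]) -- 196 "words", 768 dims each
```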
**Step 3: Add Position Information**
```python
# Tell the model WHERE each patch came from
pos_embed = nn.Parameter(torch.zeros(1, 197, 768))
# 197 = 196 patches + 1 special [CLS] token
```
**Step 4: Transform!**
```python
# Use Transformer layers (just like in GPT!)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),  # batch_first: inputs are [batch, seq, dim]
    num_layers=12
)
```
### Simple ViT Implementation
```python
class SimpleViT(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.patch_embed = PatchEmbed()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, 768))
        self.pos_embed = nn.Parameter(torch.zeros(1, 197, 768))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(768, 12, batch_first=True),
            num_layers=12
        )
        self.head = nn.Linear(768, num_classes)

    def forward(self, x):
        B = x.shape[0]
        x = self.patch_embed(x)                 # [B, 196, 768]
        cls = self.cls_token.expand(B, -1, -1)  # one [CLS] token per image
        x = torch.cat([cls, x], dim=1)          # [B, 197, 768]
        x = x + self.pos_embed                  # add position information
        x = self.encoder(x)
        return self.head(x[:, 0])               # classify from the [CLS] token
```
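And a quick forward pass to confirm it runs end to end:

```python
vit = SimpleViT(num_classes=10)
imgs = torch.randn(2, 3, 224, 224)
print(vit(imgs).shape)  # torch.Size([2, 10])
```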
### CNN vs ViT: When to Use What?
| Aspect | CNN | ViT |
|---|---|---|
| Small Data | ✅ Better | ❌ Needs more data |
| Big Data | ⚡ Good | ✅ Excellent |
| Local Patterns | ✅ Built-in | ⚡ Learns them |
| Global Context | ⚡ Limited | ✅ Natural |
| Speed | ✅ Faster | ⚡ Slower |
## 🎮 Putting It All Together
### Loading Pretrained Models in PyTorch
```python
import torchvision.models as models

# Load pretrained models (newer torchvision versions prefer the weights= argument)
resnet = models.resnet50(pretrained=True)
vit = models.vit_b_16(pretrained=True)
efficientnet = models.efficientnet_b0(pretrained=True)

# Modify the final layer for your task (e.g., 10 classes)
resnet.fc = nn.Linear(2048, 10)
vit.heads.head = nn.Linear(768, 10)
efficientnet.classifier[1] = nn.Linear(1280, 10)
```
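A common fine-tuning recipe (one reasonable approach, not the only one) is to freeze the pretrained backbone and train only the new head:

```python
# Freeze everything...
for param in resnet.parameters():
    param.requires_grad = False

# ...then replace the head; freshly created layers default to requires_grad=True
resnet.fc = nn.Linear(2048, 10)

# Only the head's parameters will be updated by the optimizer
trainable = [p for p in resnet.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))  # just the new fc layer: 2048*10 + 10 = 20,490
```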
### Quick Summary
| Architecture | Key Innovation | Best For |
|---|---|---|
| CNN | Convolution + Pooling | General images |
| ResNet | Skip connections | Deep networks |
| EfficientNet | Compound scaling | Efficiency |
| MobileNet | Depthwise convs | Mobile devices |
| ViT | Image patches + Transformers | Large-scale |
## 🌟 Key Takeaways
- **CNNs are detectives** - Each layer finds different patterns
- **Skip connections save the day** - ResNet's secret weapon
- **Modern CNNs are specialized** - EfficientNet for efficiency, MobileNet for phones
- **ViT thinks differently** - Patches instead of sliding windows
- **Use pretrained models** - Start with knowledge, fine-tune for your task
You’re now ready to build vision AI! 🚀
Remember: The best architecture depends on your data, compute, and problem. Start simple, then experiment!
