
🏗️ CNN & Vision Models: Teaching Computers to See Like You!

The Big Picture: What Are We Learning?

Imagine you’re a detective with a magical magnifying glass. This glass doesn’t just zoom in—it can spot patterns, recognize faces, and even tell if a cat is hiding in a pile of laundry! That’s exactly what Convolutional Neural Networks (CNNs) do for computers.

Today, we’ll explore:

  • 🔍 How CNNs look at pictures (like detectives!)
  • 🧱 How to build your own CNN brick by brick
  • 🪜 The magic of skip connections (ResNet’s secret)
  • 🚀 Modern CNN superstars
  • 🎯 Vision Transformers (the new kid on the block)

🔍 CNN Architectures Overview

What is a CNN? (The Detective Analogy)

Think of a CNN like a team of detectives examining a photograph:

  • First Detective looks for simple things: edges, lines, corners
  • Second Detective combines those clues: shapes, textures
  • Third Detective sees bigger patterns: eyes, wheels, leaves
  • Final Detective makes the decision: “It’s a cat!”
# A CNN is just layers stacked together
import torch.nn as nn

# Each layer is like one detective
conv1 = nn.Conv2d(3, 16, 3)   # First detective
conv2 = nn.Conv2d(16, 32, 3)  # Second detective
conv3 = nn.Conv2d(32, 64, 3)  # Third detective

The Three Key Ingredients

Every CNN has three main parts:

| Part | What It Does | Real-Life Example |
|---|---|---|
| Conv Layer | Finds patterns | Looking through a magnifying glass |
| Pooling | Shrinks the image | Making a thumbnail |
| Fully Connected | Makes decisions | The final verdict |

graph TD
    A["📷 Input Image"] --> B["🔍 Conv Layer 1"]
    B --> C["📉 Pooling"]
    C --> D["🔍 Conv Layer 2"]
    D --> E["📉 Pooling"]
    E --> F["🧠 Fully Connected"]
    F --> G["✅ Prediction"]

🧱 Building CNNs: Your First Neural Network

Step 1: The Convolution Operation

Imagine sliding a small window across a photo. At each spot, you multiply and add numbers. That’s convolution!

Simple Example:

  • Your image is a 5×5 grid of numbers (pixels)
  • Your filter (kernel) is a 3×3 pattern
  • You slide it across, getting a smaller output (see the quick sketch below)
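
Here is a minimal sketch of that 5×5 / 3×3 example using torch.nn.functional.conv2d, with random numbers standing in for real pixels:

import torch
import torch.nn.functional as F

image = torch.rand(1, 1, 5, 5)   # one 5x5 single-channel image
kernel = torch.rand(1, 1, 3, 3)  # one 3x3 filter
out = F.conv2d(image, kernel)    # slide, multiply, add
print(out.shape)                 # torch.Size([1, 1, 3, 3]) -- smaller output

Stacking such filters into layers gives a full CNN: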
import torch
import torch.nn as nn

# Create a simple CNN
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # First conv: 3 colors -> 16 features
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        # Second conv: 16 -> 32 features
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        # Final decision maker (assumes 32x32 inputs: two 2x2 pools -> 8x8)
        self.fc = nn.Linear(32 * 8 * 8, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = x.view(-1, 32 * 8 * 8)
        return self.fc(x)
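
A quick shape check, assuming 32×32 RGB inputs (CIFAR-10 sized), which is what the 32 * 8 * 8 in the linear layer expects:

model = SimpleCNN()
x = torch.rand(4, 3, 32, 32)  # batch of 4 RGB 32x32 images
print(model(x).shape)         # torch.Size([4, 10]) -- one score per class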

Step 2: Understanding Each Layer

Conv2d Parameters Explained:

nn.Conv2d(
    in_channels=3,   # RGB = 3 colors
    out_channels=16, # 16 different patterns to find
    kernel_size=3,   # 3x3 magnifying glass
    padding=1        # Add border to keep size
)

Like a Recipe:

  • in_channels: What you’re cooking with (ingredients)
  • out_channels: How many dishes you’re making
  • kernel_size: Size of your cooking pot
  • padding: Extra space around the edges
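
The output size follows (in + 2·padding − kernel) / stride + 1, so padding=1 with a 3×3 kernel keeps the image the same size. A quick check:

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
x = torch.rand(1, 3, 32, 32)
print(conv(x).shape)  # torch.Size([1, 16, 32, 32]) -- same 32x32, now 16 feature maps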

Step 3: Activation & Pooling

# ReLU: Keep positive, zero out negative
x = torch.relu(x)  # Makes learning faster!

# MaxPool: Keep the strongest signal
pool = nn.MaxPool2d(2, 2)  # Shrink by half
x = pool(x)

Why Pooling?

  • Makes computation faster
  • Helps the network focus on important stuff
  • Like summarizing a book into key points
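
To see the shrinking concretely:

pool = nn.MaxPool2d(2, 2)
x = torch.rand(1, 16, 32, 32)
print(pool(x).shape)  # torch.Size([1, 16, 16, 16]) -- height and width halved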

🪜 ResNet and Skip Connections

The Problem: Vanishing Gradients

Imagine playing telephone with 100 people. The message gets garbled! The same happens in deep networks—information gets lost.

The Solution: Skip Connections (Shortcuts!)

ResNet’s Brilliant Idea: Instead of just passing information forward, also create a shortcut that skips layers!

graph TD A["Input X"] --> B["Conv Layer"] A --> D["âž• Add"] B --> C["Conv Layer"] C --> D D --> E["Output"] style A fill:#e8f5e9 style D fill:#fff3e0 style E fill:#e3f2fd

Building a ResNet Block

class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        shortcut = x  # Save for later!

        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))

        out += shortcut  # The magic skip!
        return torch.relu(out)
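
Note that ResBlock keeps the channel count fixed, so the addition just works. When a block changes channels or downsamples, ResNet matches the shortcut with a 1×1 convolution. A minimal sketch of that variant (DownBlock is our name for it):

import torch
import torch.nn as nn

class DownBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # 1x1 conv so the shortcut matches the new shape
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)  # projected skip
        return torch.relu(out)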

Why Skip Connections Work

The Elevator Analogy:

  • Without skip: Taking stairs (gets tiring!)
  • With skip: Taking the elevator + walking when needed

Benefits:

  • ✅ Train networks 100+ layers deep
  • ✅ Information flows better
  • ✅ Gradients don’t vanish (see the math below)
  • ✅ Easier to learn identity mappings
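
In one line of math: the block computes y = F(x) + x, so by the chain rule

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(\frac{\partial F}{\partial x} + I\right)$$

Even if the learned part ∂F/∂x shrinks toward zero, the identity term I keeps the gradient flowing backward.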

🚀 Modern CNN Architectures

The Evolution of CNNs

graph LR A["LeNet 1998"] --> B["AlexNet 2012"] B --> C["VGG 2014"] C --> D["ResNet 2015"] D --> E["EfficientNet 2019"] style A fill:#ffebee style E fill:#e8f5e9

Key Modern Architectures

1. VGGNet - Keep It Simple

Uses only 3×3 filters, stacked deep.

# VGG-style block
vgg_block = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2, 2)
)

2. Inception/GoogLeNet - Multiple Paths

Looks at images with different “magnifying glasses” at once!

# Inception module concept
class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.path1 = nn.Conv2d(in_ch, 64, 1)
        self.path2 = nn.Conv2d(in_ch, 64, 3, padding=1)
        self.path3 = nn.Conv2d(in_ch, 64, 5, padding=2)

    def forward(self, x):
        p1 = self.path1(x)
        p2 = self.path2(x)
        p3 = self.path3(x)
        return torch.cat([p1, p2, p3], dim=1)
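
Concatenating the three 64-channel paths gives 192 output channels:

block = InceptionBlock(in_ch=32)
x = torch.rand(1, 32, 28, 28)
print(block(x).shape)  # torch.Size([1, 192, 28, 28]) -- 64 + 64 + 64 channels

(The real GoogLeNet module also adds a pooling path and 1×1 "bottleneck" convs before the 3×3 and 5×5 to cut computation; this sketch keeps only the multi-path idea.)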

3. EfficientNet - Smart Scaling

Scales width, depth, and resolution together.

| Dimension | What It Means |
|---|---|
| Width | More filters per layer |
| Depth | More layers |
| Resolution | Larger images |
# Using pretrained EfficientNet
from torchvision.models import efficientnet_b0

model = efficientnet_b0(pretrained=True)  # newer torchvision: weights="IMAGENET1K_V1"
# Modify for your task
model.classifier[1] = nn.Linear(1280, num_classes)

4. MobileNet - For Phones!

Uses depthwise separable convolutions to be fast and light.

# Depthwise separable: split into 2 steps
depthwise = nn.Conv2d(32, 32, 3, padding=1, groups=32)  # one filter per input channel
pointwise = nn.Conv2d(32, 64, 1)                        # 1x1 conv mixes channels
# Far fewer calculations!
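
Counting weights (including biases) shows the savings for this 32→64 example:

import torch.nn as nn

standard = nn.Conv2d(32, 64, 3, padding=1)
depthwise = nn.Conv2d(32, 32, 3, padding=1, groups=32)
pointwise = nn.Conv2d(32, 64, 1)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))                      # 18496
print(count(depthwise) + count(pointwise))  # 2432 -- roughly 7-8x fewer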

🎯 Vision Transformer (ViT)

A New Way to See Images

The Big Idea: Instead of sliding filters, chop the image into patches and treat them like words in a sentence!

graph TD A["🖼️ Image 224x224"] --> B["✂️ Cut into Patches"] B --> C["16x16 patches = 196 patches"] C --> D["📝 Flatten Each Patch"] D --> E["➕ Add Position Info"] E --> F["🤖 Transformer Encoder"] F --> G["✅ Classification"]

How ViT Works

Step 1: Patch the Image

# Split image into 16x16 patches
patch_size = 16
# 224/16 = 14, so 14×14 = 196 patches

Step 2: Embed Each Patch

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, embed_dim, patch_size, stride=patch_size)

    def forward(self, x):
        # [B, 3, 224, 224] -> [B, 768, 14, 14]
        x = self.proj(x)
        # -> [B, 768, 196] -> [B, 196, 768]
        return x.flatten(2).transpose(1, 2)

Step 3: Add Position Information

# Tell the model WHERE each patch came from
pos_embed = nn.Parameter(torch.zeros(1, 197, 768))
# 196 patches + 1 special [CLS] token

Step 4: Transform!

# Use Transformer layers (just like in GPT!)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12
)

Simple ViT Implementation

class SimpleViT(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.patch_embed = PatchEmbed()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, 768))
        self.pos_embed = nn.Parameter(torch.zeros(1, 197, 768))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(768, 12, batch_first=True), 12
        )
        self.head = nn.Linear(768, num_classes)

    def forward(self, x):
        B = x.shape[0]
        x = self.patch_embed(x)
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1)
        x = x + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])
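
A quick smoke test, assuming 224×224 RGB inputs:

model = SimpleViT(num_classes=1000)
x = torch.rand(2, 3, 224, 224)
print(model(x).shape)  # torch.Size([2, 1000]) -- prediction from the [CLS] token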

CNN vs ViT: When to Use What?

| Aspect | CNN | ViT |
|---|---|---|
| Small Data | ✅ Better | ❌ Needs more data |
| Big Data | ⚡ Good | ✅ Excellent |
| Local Patterns | ✅ Built-in | ⚡ Learns them |
| Global Context | ⚡ Limited | ✅ Natural |
| Speed | ✅ Faster | ⚡ Slower |

🎮 Putting It All Together

Loading Pretrained Models in PyTorch

import torchvision.models as models

# Load pretrained models
resnet = models.resnet50(pretrained=True)
vit = models.vit_b_16(pretrained=True)
efficientnet = models.efficientnet_b0(pretrained=True)

# Modify for your task (e.g., 10 classes)
resnet.fc = nn.Linear(2048, 10)
vit.heads = nn.Linear(768, 10)
efficientnet.classifier[1] = nn.Linear(1280, 10)
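
A common next step is to freeze the pretrained backbone and train only the new head; a minimal sketch with the ResNet above:

# Freeze everything first, then replace the head (new layers train by default)
for param in resnet.parameters():
    param.requires_grad = False
resnet.fc = nn.Linear(2048, 10)
optimizer = torch.optim.Adam(resnet.fc.parameters(), lr=1e-3)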

Quick Summary

| Architecture | Key Innovation | Best For |
|---|---|---|
| CNN | Convolution + Pooling | General images |
| ResNet | Skip connections | Deep networks |
| EfficientNet | Compound scaling | Efficiency |
| MobileNet | Depthwise convs | Mobile devices |
| ViT | Image patches + Transformers | Large-scale |

🌟 Key Takeaways

  1. CNNs are detectives - Each layer finds different patterns
  2. Skip connections save the day - ResNet’s secret weapon
  3. Modern CNNs are specialized - EfficientNet for efficiency, MobileNet for phones
  4. ViT thinks differently - Patches instead of sliding windows
  5. Use pretrained models - Start with knowledge, fine-tune for your task

You’re now ready to build vision AI! 🚀


Remember: The best architecture depends on your data, compute, and problem. Start simple, then experiment!
