Transformer Layers: The Magic of Attention
The Big Picture: An Orchestra of Words
Imagine you’re conducting an orchestra. Every musician (word) needs to know what every other musician is playing. They all listen to each other at the same time. That’s exactly what a Transformer does with words!
Before Transformers, computers read sentences like a snail crawling through a book—one word at a time, slowly. Transformers changed everything. Now, the computer can see ALL words at once, like looking at a photograph instead of watching a slow video.
1. Transformer Architecture: The Master Blueprint
What Is It?
Think of building a LEGO castle. You have specific blocks that go in a specific order. A Transformer is like that—a stack of special building blocks.
The Simple Version:
- Words go IN at the bottom
- Magic happens in the middle (layers!)
- Understanding comes OUT at the top
graph TD
    A["Input Words"] --> B["Embedding Layer"]
    B --> C["Positional Encoding"]
    C --> D["Encoder Blocks x6"]
    D --> E["Decoder Blocks x6"]
    E --> F["Output Words"]
The Two Main Parts
| Part | Job | Real-Life Analogy |
|---|---|---|
| Encoder | Reads and understands | Like reading a book carefully |
| Decoder | Creates new output | Like writing a summary of that book |
PyTorch Example
import torch.nn as nn
# Create a Transformer
transformer = nn.Transformer(
    d_model=512,           # Size of word vectors
    nhead=8,               # Number of attention heads
    num_encoder_layers=6,
    num_decoder_layers=6
)
What do these numbers mean?
- d_model=512: Each word becomes a list of 512 numbers
- nhead=8: 8 different "eyes" looking at relationships
- num_encoder_layers=6: 6 reading layers stacked up
- num_decoder_layers=6: 6 writing layers stacked up
2. Multi-Head Attention: Eight Eyes Are Better Than One
The Story
Imagine you’re looking at a family photo. With ONE eye, you see who’s standing where. But what if you had EIGHT eyes, each looking for different things?
- Eye 1: Who’s smiling?
- Eye 2: Who’s standing together?
- Eye 3: What colors are they wearing?
- Eye 4: Who’s the tallest?
- …and so on!
That’s Multi-Head Attention. Each “head” looks for different patterns in the sentence.
Why Multiple Heads?
| Single Head | Multi-Head |
|---|---|
| Sees one type of relationship | Sees MANY relationships |
| “Cat sits on mat” = location only | Also sees: subject-verb, size, color… |
PyTorch Example
import torch.nn as nn
# Create multi-head attention
multihead_attn = nn.MultiheadAttention(
    embed_dim=512,  # Word vector size
    num_heads=8     # Number of "eyes"
)
# Each head gets: 512 ÷ 8 = 64 dimensions
# All 8 heads work together!
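A quick sanity check with toy tensors (the sizes here are made up): in self-attention, the same tensor plays Query, Key, and Value.
import torch
x = torch.randn(10, 2, 512)  # (seq_len, batch, embed_dim): the default layout for nn.MultiheadAttention
attn_output, attn_weights = multihead_attn(x, x, x)  # Q = K = V = x
print(attn_output.shape)   # torch.Size([10, 2, 512])
print(attn_weights.shape)  # torch.Size([2, 10, 10]): weights averaged over the 8 heads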
How It Works (Simple Version)
graph TD
    A["Input"] --> B["Head 1: Grammar"]
    A --> C["Head 2: Meaning"]
    A --> D["Head 3: Position"]
    A --> E["Head 4-8: Other patterns"]
    B --> F["Combine All"]
    C --> F
    D --> F
    E --> F
    F --> G["Rich Understanding"]
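Under the hood, the "eight eyes" are simply the 512 numbers sliced into 8 chunks of 64: attention runs on each chunk and the results are glued back together. A minimal sketch of that bookkeeping (the real layer also gives each head its own learned projection, which is left out here):
import torch
x = torch.randn(10, 512)           # 10 words, 512 numbers each
heads = x.view(10, 8, 64)          # split into 8 heads of 64 numbers
print(heads[:, 0, :].shape)        # head 1 works on its own 64-number slice: torch.Size([10, 64])
combined = heads.reshape(10, 512)  # after attention, the heads are concatenated back
print(torch.equal(combined, x))    # True: nothing is lost in the split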
3. Scaled Dot-Product Attention: The Heart of Everything
The Core Idea
Every word asks a question: “Who should I pay attention to?”
Three magic ingredients:
- Query (Q): “What am I looking for?”
- Key (K): “What do I have to offer?”
- Value (V): “Here’s my actual information”
The Restaurant Analogy
You walk into a restaurant:
- Query: “I want something spicy” (your request)
- Key: Menu items have tags like “spicy”, “sweet”, “sour”
- Value: The actual food you get
The match between your Query and each Key tells you what Value to pick!
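To make the restaurant analogy concrete, here is a tiny toy sketch with made-up 2-number "taste" vectors; the softmax of the Query-Key matches decides how much of each Value ends up on your plate.
import torch
import torch.nn.functional as F
# Hypothetical taste vectors: [spiciness, sweetness]
query = torch.tensor([1.0, 0.0])               # "I want something spicy"
keys = torch.tensor([[0.9, 0.1],               # curry (spicy)
                     [0.1, 0.9],               # cake (sweet)
                     [0.5, 0.5]])              # noodles (in between)
values = torch.tensor([[10.0], [1.0], [5.0]])  # made-up information for each dish
scores = keys @ query                # how well each Key matches the Query
weights = F.softmax(scores, dim=0)   # turn matches into percentages
meal = (weights.unsqueeze(1) * values).sum(dim=0)
print(weights)  # the curry gets the most weight
print(meal)     # so the curry's Value dominates the mix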
The Magic Formula
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
Breaking it down for a 5-year-old:
- Q asks K: “Are we similar?” (dot product)
- Divide by √d_k (the size of the key vectors) so numbers don’t get too big
- Softmax turns scores into percentages (must add to 100%)
- Multiply by V to get the final answer
Why “Scaled”?
Without scaling, the dot products grow with the size of the vectors, and big scores cause problems:
- softmax over scores like [100, 90, 80] puts almost 100% of the attention on one word (ignores everything else!)
- dividing by √d_k first shrinks the scores, giving nicely balanced attention
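Here is a quick sketch of that effect with made-up scores (d_k = 64 is an assumption, chosen to match 512 ÷ 8):
import torch
import torch.nn.functional as F
big_scores = torch.tensor([100.0, 90.0, 80.0])
print(F.softmax(big_scores, dim=0))  # ~[1.0, 0.0, 0.0]: winner takes all
scaled = big_scores / 8.0            # divide by sqrt(d_k), with d_k = 64
print(F.softmax(scaled, dim=0))      # noticeably more balanced weights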
PyTorch Example
import torch
import torch.nn.functional as F
import math
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)  # dimension of keys
    # Step 1: Q asks K how similar they are
    scores = torch.matmul(Q, K.transpose(-2, -1))
    # Step 2: Scale down (divide by √d_k)
    scores = scores / math.sqrt(d_k)
    # Step 3: Convert to percentages
    attention_weights = F.softmax(scores, dim=-1)
    # Step 4: Get weighted values
    output = torch.matmul(attention_weights, V)
    return output
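A quick check of the function with small random tensors (toy sizes): the output keeps the input shape, one blended vector per word.
Q = torch.randn(2, 5, 64)  # batch of 2, 5 words, 64 numbers per head (made-up sizes)
K = torch.randn(2, 5, 64)
V = torch.randn(2, 5, 64)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([2, 5, 64])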
4. Positional Encoding: Teaching Order to the Transformer
The Problem
Transformers see all words at once. But “Dog bites man” and “Man bites dog” have the same words with VERY different meanings!
We need to tell the Transformer: “This word is FIRST, this word is SECOND…”
The Solution: Wave Patterns!
We add special number patterns to each word based on its position. These patterns use sine and cosine waves—like music!
graph LR
    A["Word: Cat"] --> B["Cat + Position 1"]
    C["Word: Sat"] --> D["Sat + Position 2"]
    E["Word: Mat"] --> F["Mat + Position 3"]
Why Waves?
| Position | Pattern Property |
|---|---|
| 1 | Low frequency wave |
| 2 | Slightly different wave |
| 100 | Still unique! |
Waves are perfect because:
- Every position gets a unique code
- The model can learn “Position 5 is close to Position 6”
- Works for sentences of any length!
PyTorch Example
import torch
import math
def positional_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1)
    # Create wave frequencies
    div_term = torch.exp(
        torch.arange(0, d_model, 2) *
        (-math.log(10000.0) / d_model)
    )
    # Even dimensions: sine wave
    pe[:, 0::2] = torch.sin(position * div_term)
    # Odd dimensions: cosine wave
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe
# For 100 positions, 512 dimensions
pos_encoding = positional_encoding(100, 512)
Visual Intuition
Position 0: [0.0, 1.0, 0.0, 1.0, ...]
Position 1: [0.84, 0.54, 0.02, 0.99, ...]
Position 2: [0.91, -0.42, 0.04, 0.99, ...]
Each row is unique—like a fingerprint for that position!
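A small sketch that checks the “Position 5 is close to Position 6” claim, using the positional_encoding function above and cosine similarity:
import torch.nn.functional as F
pe = positional_encoding(100, 512)
print(F.cosine_similarity(pe[5], pe[6], dim=0))   # high: neighbouring positions look alike
print(F.cosine_similarity(pe[5], pe[50], dim=0))  # lower: far-apart positions look different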
5. Attention Masks: The “Do Not Look” Sign
Why We Need Masks
Sometimes, we want to HIDE certain words:
- Padding: Short sentences get extra “[PAD]” tokens—ignore them!
- Future words: When generating text, you can’t peek ahead!
The Theater Analogy
Imagine watching a mystery movie, but someone covers your eyes during spoilers. That’s masking—preventing you from seeing things you shouldn’t.
Two Types of Masks
| Mask Type | Purpose | Shape |
|---|---|---|
| Padding Mask | Hide fake [PAD] words | Different per sentence |
| Causal Mask | Hide future words | Triangle shape |
Causal Mask Example
For the sentence “I love cats”:
| | I | love | cats |
|---|---|---|---|
| I | [SEE] | [HIDE] | [HIDE] |
| love | [SEE] | [SEE] | [HIDE] |
| cats | [SEE] | [SEE] | [SEE] |
Each word can only see itself and words BEFORE it!
PyTorch Example
import torch
# Causal mask (can't see future)
def create_causal_mask(size):
    mask = torch.triu(
        torch.ones(size, size),
        diagonal=1
    ).bool()
    return mask  # True = HIDE
# For 5 words:
mask = create_causal_mask(5)
# Result: Upper triangle is True (hidden)
# Padding mask
def create_padding_mask(lengths, max_len):
    batch_size = len(lengths)
    mask = torch.zeros(batch_size, max_len).bool()
    for i, length in enumerate(lengths):
        mask[i, length:] = True  # Hide padding
    return mask
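Quick usage of the padding mask for a batch of two sentences with real lengths 3 and 5, both padded to length 5:
mask = create_padding_mask([3, 5], max_len=5)
print(mask)
# tensor([[False, False, False,  True,  True],    <- last two tokens are [PAD], so hide them
#         [False, False, False, False, False]])   <- nothing to hide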
How Masks Work
# In attention: add -infinity to masked positions
masked_scores = scores.masked_fill(mask, float('-inf'))
# After softmax: -inf becomes 0 (no attention!)
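Putting the pieces together, here is a sketch that pushes made-up scores for 5 words through the causal mask from above; after softmax, every word ignores the future.
import torch
import torch.nn.functional as F
scores = torch.randn(5, 5)    # made-up attention scores for 5 words
mask = create_causal_mask(5)  # True above the diagonal
masked_scores = scores.masked_fill(mask, float('-inf'))
weights = F.softmax(masked_scores, dim=-1)
print(weights)  # the upper triangle is exactly 0: no peeking at future words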
6. Encoder and Decoder Blocks: The Dynamic Duo
The Encoder Block: Understanding Expert
Each encoder block has:
- Multi-Head Self-Attention: Words look at each other
- Feed-Forward Network: Think deeply about each word
- Layer Normalization: Keep numbers stable
- Residual Connections: Don’t forget the original input!
graph TD
    A["Input"] --> B["Multi-Head Attention"]
    A --> C["Add: Residual"]
    B --> C
    C --> D["Layer Norm"]
    D --> E["Feed-Forward"]
    D --> F["Add: Residual"]
    E --> F
    F --> G["Layer Norm"]
    G --> H["Output"]
PyTorch Encoder Block
import torch.nn as nn
class EncoderBlock(nn.Module):
    def __init__(self, d_model, nhead, d_ff):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            d_model, nhead
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention with residual
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward with residual
        ff_out = self.ff(x)
        x = self.norm2(x + ff_out)
        return x
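A quick shape check for the block above, with toy sizes (10 words, batch of 32):
import torch
block = EncoderBlock(d_model=512, nhead=8, d_ff=2048)
x = torch.randn(10, 32, 512)  # (seq_len, batch, d_model): default nn.MultiheadAttention layout
print(block(x).shape)         # torch.Size([10, 32, 512]): same shape in, same shape out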
The Decoder Block: Creation Expert
The decoder has everything the encoder has, PLUS:
- Cross-Attention: Looks at encoder output!
graph TD
    A["Target Input"] --> B["Masked Self-Attention"]
    A --> C["Add + Norm"]
    B --> C
    C --> D["Cross-Attention"]
    E["Encoder Output"] --> D
    C --> F["Add + Norm"]
    D --> F
    F --> G["Feed-Forward"]
    F --> H["Add + Norm"]
    G --> H
    H --> I["Output"]
PyTorch Decoder Block
class DecoderBlock(nn.Module):
    def __init__(self, d_model, nhead, d_ff):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            d_model, nhead
        )
        self.cross_attn = nn.MultiheadAttention(
            d_model, nhead
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )

    def forward(self, x, enc_out, mask=None):
        # Masked self-attention
        attn1, _ = self.self_attn(
            x, x, x, attn_mask=mask
        )
        x = self.norm1(x + attn1)
        # Cross-attention to encoder
        attn2, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + attn2)
        # Feed-forward
        ff_out = self.ff(x)
        x = self.norm3(x + ff_out)
        return x
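And the decoder block in action, reusing create_causal_mask from earlier so the target can't peek ahead (toy sizes again):
import torch
decoder = DecoderBlock(d_model=512, nhead=8, d_ff=2048)
tgt = torch.randn(20, 32, 512)      # 20 target words
enc_out = torch.randn(10, 32, 512)  # pretend this came from the encoder
out = decoder(tgt, enc_out, mask=create_causal_mask(20))
print(out.shape)                    # torch.Size([20, 32, 512])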
The Complete Picture
| Component | Encoder | Decoder |
|---|---|---|
| Self-Attention | ✅ (see all words) | ✅ (see only past) |
| Cross-Attention | ❌ | ✅ (see encoder) |
| Feed-Forward | ✅ | ✅ |
| Layer Norm | ✅ | ✅ |
| Residuals | ✅ | ✅ |
Putting It All Together
import torch
import torch.nn as nn

# The full Transformer in one line!
transformer = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048
)
# Usage
src = torch.randn(10, 32, 512) # 10 words, batch 32
tgt = torch.randn(20, 32, 512) # 20 target words
output = transformer(src, tgt)
# output shape: (20, 32, 512)
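In practice you would also hand the decoder a causal mask for the target side; PyTorch ships a helper that builds one (a float mask with -inf above the diagonal):
tgt_mask = transformer.generate_square_subsequent_mask(20)
output = transformer(src, tgt, tgt_mask=tgt_mask)  # the decoder can't look at future target words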
Key Takeaways
- Transformers see all words at once—no more slow reading!
- Multi-Head Attention uses multiple “eyes” to catch different patterns
- Scaled Dot-Product Attention is the heart: Q asks, K answers, V delivers
- Positional Encoding tells words their position using wave patterns
- Attention Masks hide things we shouldn’t see (padding, future)
- Encoder understands input; Decoder creates output with cross-attention
You Did It!
You now understand the Transformer—the architecture behind ChatGPT, Google Translate, and so much more. These simple ideas (attention, positions, masks) combine to create the most powerful language models ever built.
Every time you use AI that understands language, there’s a Transformer inside, paying attention to every word, just like you learned today!
