
Transformer Layers: The Magic of Attention

The Big Picture: An Orchestra of Words

Imagine you’re conducting an orchestra. Every musician (word) needs to know what every other musician is playing. They all listen to each other at the same time. That’s exactly what a Transformer does with words!

Before Transformers, computers read sentences like a snail crawling through a book—one word at a time, slowly. Transformers changed everything. Now, the computer can see ALL words at once, like looking at a photograph instead of watching a slow video.


1. Transformer Architecture: The Master Blueprint

What Is It?

Think of building a LEGO castle. You have specific blocks that go in a specific order. A Transformer is like that—a stack of special building blocks.

The Simple Version:

  • Words go IN at the bottom
  • Magic happens in the middle (layers!)
  • Understanding comes OUT at the top
graph TD A["Input Words"] --> B["Embedding Layer"] B --> C["Positional Encoding"] C --> D["Encoder Blocks x6"] D --> E["Decoder Blocks x6"] E --> F["Output Words"]

The Two Main Parts

| Part | Job | Real-Life Analogy |
| --- | --- | --- |
| Encoder | Reads and understands | Like reading a book carefully |
| Decoder | Creates new output | Like writing a summary of that book |

PyTorch Example

import torch.nn as nn

# Create a Transformer
transformer = nn.Transformer(
    d_model=512,      # Size of word vectors
    nhead=8,          # Number of attention heads
    num_encoder_layers=6,
    num_decoder_layers=6
)

What do these numbers mean?

  • d_model=512: Each word becomes a list of 512 numbers
  • nhead=8: 8 different “eyes” looking at relationships
  • num_encoder_layers=6: 6 reading layers stacked up
  • num_decoder_layers=6: 6 writing layers stacked up

2. Multi-Head Attention: Eight Eyes Are Better Than One

The Story

Imagine you’re looking at a family photo. With ONE eye, you see who’s standing where. But what if you had EIGHT eyes, each looking for different things?

  • Eye 1: Who’s smiling?
  • Eye 2: Who’s standing together?
  • Eye 3: What colors are they wearing?
  • Eye 4: Who’s the tallest?
  • …and so on!

That’s Multi-Head Attention. Each “head” looks for different patterns in the sentence.

Why Multiple Heads?

| Single Head | Multi-Head |
| --- | --- |
| Sees one type of relationship | Sees MANY relationships |
| “Cat sits on mat” = location only | Also sees: subject-verb, size, color… |

PyTorch Example

import torch.nn as nn

# Create multi-head attention
multihead_attn = nn.MultiheadAttention(
    embed_dim=512,    # Word vector size
    num_heads=8       # Number of "eyes"
)

# Each head gets: 512 ÷ 8 = 64 dimensions
# All 8 heads work together!
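
A quick usage sketch, reusing the multihead_attn module created above. The input shapes here are illustrative; by default nn.MultiheadAttention expects tensors shaped (sequence length, batch, embedding).

import torch

# Hypothetical input: 10 words, batch of 32 sentences, 512-dim embeddings
x = torch.randn(10, 32, 512)

# Self-attention: the sentence attends to itself (query = key = value = x)
attn_output, attn_weights = multihead_attn(x, x, x)

print(attn_output.shape)   # torch.Size([10, 32, 512])
print(attn_weights.shape)  # torch.Size([32, 10, 10]), averaged over the 8 heads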

How It Works (Simple Version)

graph TD A["Input"] --> B["Head 1: Grammar"] A --> C["Head 2: Meaning"] A --> D["Head 3: Position"] A --> E["Head 4-8: Other patterns"] B --> F["Combine All"] C --> F D --> F E --> F F --> G["Rich Understanding"]

3. Scaled Dot-Product Attention: The Heart of Everything

The Core Idea

Every word asks a question: “Who should I pay attention to?”

Three magic ingredients:

  • Query (Q): “What am I looking for?”
  • Key (K): “What do I have to offer?”
  • Value (V): “Here’s my actual information”

The Restaurant Analogy

You walk into a restaurant:

  1. Query: “I want something spicy” (your request)
  2. Key: Menu items have tags like “spicy”, “sweet”, “sour”
  3. Value: The actual food you get

The match between your Query and each Key tells you what Value to pick!

The Magic Formula

Attention(Q, K, V) = softmax(QK^T / √d_k) × V

Breaking it down for a 5-year-old:

  1. Q asks K: “Are we similar?” (dot product)
  2. Divide by √d_k (the size of each key vector) so numbers don’t get too big
  3. Softmax turns scores into percentages (must add to 100%)
  4. Multiply by V to get the final answer

Why “Scaled”?

Without scaling, the raw dot products grow with the vector size, and one large score can dominate:

  • If one score is much bigger than the rest (say 100 vs. 1), softmax hands it nearly 100% of the attention and ignores everything else
  • Dividing by √d_k keeps the scores in a reasonable range, so attention stays spread out and gradients stay healthy (see the tiny sketch below)
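
Here is a tiny sketch of that effect. The numbers are made up purely for illustration, pretending √d_k = 8 (i.e. d_k = 64):

import torch
import torch.nn.functional as F

raw_scores = torch.tensor([20.0, 10.0, 5.0])
scaled_scores = raw_scores / 8  # pretend sqrt(d_k) = 8

print(F.softmax(raw_scores, dim=0))     # ≈ [1.00, 0.00, 0.00] -- the big score takes everything
print(F.softmax(scaled_scores, dim=0))  # ≈ [0.69, 0.20, 0.11] -- balanced attention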

PyTorch Example

import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)  # dimension of keys

    # Step 1: Q asks K how similar they are
    scores = torch.matmul(Q, K.transpose(-2, -1))

    # Step 2: Scale down (divide by √d)
    scores = scores / math.sqrt(d_k)

    # Step 3: Convert to percentages
    attention_weights = F.softmax(scores, dim=-1)

    # Step 4: Get weighted values
    output = torch.matmul(attention_weights, V)

    return output
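
A quick sketch of calling it with random tensors. The shapes are illustrative: a batch of 2 sentences, 5 words each, 64-dimensional queries/keys/values.

import torch

# Reuses scaled_dot_product_attention defined above
Q = torch.randn(2, 5, 64)
K = torch.randn(2, 5, 64)
V = torch.randn(2, 5, 64)

output = scaled_dot_product_attention(Q, K, V)
print(output.shape)  # torch.Size([2, 5, 64])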

4. Positional Encoding: Teaching Order to the Transformer

The Problem

Transformers see all words at once. But “Dog bites man” and “Man bites dog” have the same words with VERY different meanings!

We need to tell the Transformer: “This word is FIRST, this word is SECOND…”

The Solution: Wave Patterns!

We add special number patterns to each word based on its position. These patterns use sine and cosine waves—like music!

graph LR
  A["Word: Cat"] --> B["Cat + Position 1"]
  C["Word: Sat"] --> D["Sat + Position 2"]
  E["Word: Mat"] --> F["Mat + Position 3"]

Why Waves?

| Position | Pattern Property |
| --- | --- |
| 1 | Low frequency wave |
| 2 | Slightly different wave |
| 100 | Still unique! |

Waves are perfect because:

  • Every position gets a unique code
  • The model can learn “Position 5 is close to Position 6”
  • Works for sentences of any length!

PyTorch Example

import torch
import math

def positional_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1)

    # Create wave frequencies
    div_term = torch.exp(
        torch.arange(0, d_model, 2) *
        (-math.log(10000.0) / d_model)
    )

    # Even dimensions get the sine wave
    pe[:, 0::2] = torch.sin(position * div_term)
    # Odd dimensions get the cosine wave
    pe[:, 1::2] = torch.cos(position * div_term)

    return pe

# For 100 positions, 512 dimensions
pos_encoding = positional_encoding(100, 512)

Visual Intuition

Position 0: [0.0, 1.0, 0.0, 1.0, ...]
Position 1: [0.84, 0.54, 0.02, 0.99, ...]
Position 2: [0.91, -0.42, 0.04, 0.99, ...]

Each row is unique—like a fingerprint for that position!
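
In practice these codes are simply added to the word embeddings before the first encoder block. A minimal sketch, reusing positional_encoding from above (the embeddings here are random, just to show the shapes):

import torch

# Pretend embeddings for a 100-word sequence, each word a 512-number vector
word_embeddings = torch.randn(100, 512)

# Every word gets the fingerprint of its position added on top
pos_encoding = positional_encoding(100, 512)
encoder_input = word_embeddings + pos_encoding

print(encoder_input.shape)  # torch.Size([100, 512])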


5. Attention Masks: The “Do Not Look” Sign

Why We Need Masks

Sometimes, we want to HIDE certain words:

  1. Padding: Short sentences get extra “[PAD]” tokens—ignore them!
  2. Future words: When generating text, you can’t peek ahead!

The Theater Analogy

Imagine sitting in a theater watching a mystery, but someone covers your eyes during the spoilers. That’s masking—preventing you from seeing things you shouldn’t.

Two Types of Masks

| Mask Type | Purpose | Shape |
| --- | --- | --- |
| Padding Mask | Hide fake [PAD] words | Different per sentence |
| Causal Mask | Hide future words | Triangle shape |

Causal Mask Example

For the sentence “I love cats”:

         I    love   cats
I      [SEE]  [HIDE] [HIDE]
love   [SEE]  [SEE]  [HIDE]
cats   [SEE]  [SEE]  [SEE]

Each word can only see itself and words BEFORE it!

PyTorch Example

import torch

# Causal mask (can't see future)
def create_causal_mask(size):
    mask = torch.triu(
        torch.ones(size, size),
        diagonal=1
    ).bool()
    return mask  # True = HIDE

# For 5 words:
mask = create_causal_mask(5)
# Result: Upper triangle is True (hidden)

# Padding mask
def create_padding_mask(lengths, max_len):
    batch_size = len(lengths)
    mask = torch.zeros(batch_size, max_len).bool()
    for i, length in enumerate(lengths):
        mask[i, length:] = True  # Hide padding
    return mask

How Masks Work

# In attention: add -infinity to masked positions
masked_scores = scores.masked_fill(mask, float('-inf'))
# After softmax: -inf becomes 0 (no attention!)
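
Here is a small end-to-end sketch that applies the causal mask from create_causal_mask above to a random score matrix (the scores are arbitrary):

import torch
import torch.nn.functional as F

# Random attention scores for 5 words attending to 5 words
scores = torch.randn(5, 5)

# True marks the future positions we must hide
mask = create_causal_mask(5)

# Hidden positions get -infinity, so softmax sends their weight to exactly 0
masked_scores = scores.masked_fill(mask, float('-inf'))
attention_weights = F.softmax(masked_scores, dim=-1)

print(attention_weights)
# Row i has non-zero weights only for columns 0..i -- no peeking at the future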

6. Encoder and Decoder Blocks: The Dynamic Duo

The Encoder Block: Understanding Expert

Each encoder block has:

  1. Multi-Head Self-Attention: Words look at each other
  2. Feed-Forward Network: Think deeply about each word
  3. Layer Normalization: Keep numbers stable
  4. Residual Connections: Don’t forget the original input!
graph TD A["Input"] --> B["Multi-Head Attention"] A --> C["Add: Residual"] B --> C C --> D["Layer Norm"] D --> E["Feed-Forward"] D --> F["Add: Residual"] E --> F F --> G["Layer Norm"] G --> H["Output"]

PyTorch Encoder Block

import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model, nhead, d_ff):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            d_model, nhead
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention with residual
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)

        # Feed-forward with residual
        ff_out = self.ff(x)
        x = self.norm2(x + ff_out)

        return x
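
A quick usage sketch (the sizes are illustrative; like nn.MultiheadAttention, this block expects (sequence length, batch, embedding) input):

import torch

# Hypothetical sizes: 10 words, batch of 32, 512-dim model, 2048-dim feed-forward
block = EncoderBlock(d_model=512, nhead=8, d_ff=2048)
x = torch.randn(10, 32, 512)

out = block(x)
print(out.shape)  # torch.Size([10, 32, 512]) -- same shape in, same shape out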

The Decoder Block: Creation Expert

The decoder has everything the encoder has, PLUS:

  • Cross-Attention: Looks at encoder output!
graph TD A["Target Input"] --> B["Masked Self-Attention"] A --> C["Add + Norm"] B --> C C --> D["Cross-Attention"] E["Encoder Output"] --> D C --> F["Add + Norm"] D --> F F --> G["Feed-Forward"] F --> H["Add + Norm"] G --> H H --> I["Output"]

PyTorch Decoder Block

class DecoderBlock(nn.Module):
    def __init__(self, d_model, nhead, d_ff):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            d_model, nhead
        )
        self.cross_attn = nn.MultiheadAttention(
            d_model, nhead
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )

    def forward(self, x, enc_out, mask=None):
        # Masked self-attention
        attn1, _ = self.self_attn(
            x, x, x, attn_mask=mask
        )
        x = self.norm1(x + attn1)

        # Cross-attention to encoder
        attn2, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + attn2)

        # Feed-forward
        ff_out = self.ff(x)
        x = self.norm3(x + ff_out)

        return x
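
And a matching sketch for the decoder block, feeding it a causal mask plus a (random) stand-in for the encoder output:

import torch

decoder = DecoderBlock(d_model=512, nhead=8, d_ff=2048)

tgt = torch.randn(20, 32, 512)      # 20 target words, batch of 32
enc_out = torch.randn(10, 32, 512)  # pretend encoder output for 10 source words

# Upper-triangular causal mask: True = "don't look at this future word"
causal_mask = torch.triu(torch.ones(20, 20), diagonal=1).bool()

out = decoder(tgt, enc_out, mask=causal_mask)
print(out.shape)  # torch.Size([20, 32, 512])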

The Complete Picture

| Component | Encoder | Decoder |
| --- | --- | --- |
| Self-Attention | ✅ (sees all words) | ✅ (sees only past words) |
| Cross-Attention | ❌ | ✅ (sees encoder output) |
| Feed-Forward | ✅ | ✅ |
| Layer Norm | ✅ | ✅ |
| Residuals | ✅ | ✅ |

Putting It All Together

import torch
import torch.nn as nn

# The full Transformer in one line!
transformer = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048
)

# Usage
src = torch.randn(10, 32, 512)  # 10 words, batch 32
tgt = torch.randn(20, 32, 512)  # 20 target words

output = transformer(src, tgt)
# output shape: (20, 32, 512)

Key Takeaways

  1. Transformers see all words at once—no more slow reading!
  2. Multi-Head Attention uses multiple “eyes” to catch different patterns
  3. Scaled Dot-Product Attention is the heart: Q asks, K answers, V delivers
  4. Positional Encoding tells words their position using wave patterns
  5. Attention Masks hide things we shouldn’t see (padding, future)
  6. Encoder understands input; Decoder creates output with cross-attention

You Did It!

You now understand the Transformer—the architecture behind ChatGPT, Google Translate, and so much more. These simple ideas (attention, positions, masks) combine to create the most powerful language models ever built.

Every time you use AI that understands language, there’s a Transformer inside, paying attention to every word, just like you learned today!
