Transformer Layers: The Magic of Attention
The Big Picture: An Orchestra of Words
Imagine you’re conducting an orchestra. Every musician (word) needs to know what every other musician is playing. They all listen to each other at the same time. That’s exactly what a Transformer does with words!
Before Transformers, computers read sentences like a snail crawling through a book—one word at a time, slowly. Transformers changed everything. Now, the computer can see ALL words at once, like looking at a photograph instead of watching a slow video.
1. Transformer Architecture: The Master Blueprint
What Is It?
Think of building a LEGO castle. You have specific blocks that go in a specific order. A Transformer is like that—a stack of special building blocks.
The Simple Version:
- Words go IN at the bottom
- Magic happens in the middle (layers!)
- Understanding comes OUT at the top
graph TD
    A["Input Words"] --> B["Embedding Layer"]
    B --> C["Positional Encoding"]
    C --> D["Encoder Blocks x6"]
    D --> E["Decoder Blocks x6"]
    E --> F["Output Words"]
The Two Main Parts
| Part | Job | Real-Life Analogy |
|---|---|---|
| Encoder | Reads and understands | Like reading a book carefully |
| Decoder | Creates new output | Like writing a summary of that book |
PyTorch Example
import torch.nn as nn
# Create a Transformer
transformer = nn.Transformer(
    d_model=512,           # Size of word vectors
    nhead=8,               # Number of attention heads
    num_encoder_layers=6,
    num_decoder_layers=6
)
What do these numbers mean?
- d_model=512: Each word becomes a list of 512 numbers
- nhead=8: 8 different "eyes" looking at relationships
- num_encoder_layers=6: 6 reading layers stacked up
- num_decoder_layers=6: 6 writing layers stacked up
2. Multi-Head Attention: Eight Eyes Are Better Than One
The Story
Imagine you’re looking at a family photo. With ONE eye, you see who’s standing where. But what if you had EIGHT eyes, each looking for different things?
- Eye 1: Who’s smiling?
- Eye 2: Who’s standing together?
- Eye 3: What colors are they wearing?
- Eye 4: Who’s the tallest?
- …and so on!
That’s Multi-Head Attention. Each “head” looks for different patterns in the sentence.
Why Multiple Heads?
| Single Head | Multi-Head |
|---|---|
| Sees one type of relationship | Sees MANY relationships |
| “Cat sits on mat” = location only | Also sees: subject-verb, size, color… |
PyTorch Example
import torch.nn as nn
# Create multi-head attention
multihead_attn = nn.MultiheadAttention(
    embed_dim=512,  # Word vector size
    num_heads=8     # Number of "eyes"
)
# Each head gets: 512 ÷ 8 = 64 dimensions
# All 8 heads work together!
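A quick sanity check with toy tensors (the sizes here are made up): in self-attention, the same tensor plays Query, Key, and Value.
import torch
x = torch.randn(10, 2, 512)  # (seq_len, batch, embed_dim): the default layout for nn.MultiheadAttention
attn_output, attn_weights = multihead_attn(x, x, x)  # Q = K = V = x
print(attn_output.shape)   # torch.Size([10, 2, 512])
print(attn_weights.shape)  # torch.Size([2, 10, 10]): weights averaged over the 8 heads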
How It Works (Simple Version)
graph TD
    A["Input"] --> B["Head 1: Grammar"]
    A --> C["Head 2: Meaning"]
    A --> D["Head 3: Position"]
    A --> E["Head 4-8: Other patterns"]
    B --> F["Combine All"]
    C --> F
    D --> F
    E --> F
    F --> G["Rich Understanding"]
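Under the hood, the "eight eyes" are simply the 512 numbers sliced into 8 chunks of 64: attention runs on each chunk and the results are glued back together. A minimal sketch of that bookkeeping (the real layer also gives each head its own learned projection, which is left out here):
import torch
x = torch.randn(10, 512)           # 10 words, 512 numbers each
heads = x.view(10, 8, 64)          # split into 8 heads of 64 numbers
print(heads[:, 0, :].shape)        # head 1 works on its own 64-number slice: torch.Size([10, 64])
combined = heads.reshape(10, 512)  # after attention, the heads are concatenated back
print(torch.equal(combined, x))    # True: nothing is lost in the split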
3. Scaled Dot-Product Attention: The Heart of Everything
The Core Idea
Every word asks a question: “Who should I pay attention to?”
Three magic ingredients:
- Query (Q): “What am I looking for?”
- Key (K): “What do I have to offer?”
- Value (V): “Here’s my actual information”
The Restaurant Analogy
You walk into a restaurant:
- Query: “I want something spicy” (your request)
- Key: Menu items have tags like “spicy”, “sweet”, “sour”
- Value: The actual food you get
The match between your Query and each Key tells you what Value to pick!
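To make the restaurant analogy concrete, here is a tiny toy sketch with made-up 2-number "taste" vectors; the softmax of the Query-Key matches decides how much of each Value ends up on your plate.
import torch
import torch.nn.functional as F
# Hypothetical taste vectors: [spiciness, sweetness]
query = torch.tensor([1.0, 0.0])               # "I want something spicy"
keys = torch.tensor([[0.9, 0.1],               # curry (spicy)
                     [0.1, 0.9],               # cake (sweet)
                     [0.5, 0.5]])              # noodles (in between)
values = torch.tensor([[10.0], [1.0], [5.0]])  # made-up information for each dish
scores = keys @ query                # how well each Key matches the Query
weights = F.softmax(scores, dim=0)   # turn matches into percentages
meal = (weights.unsqueeze(1) * values).sum(dim=0)
print(weights)  # the curry gets the most weight
print(meal)     # so the curry's Value dominates the mix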
The Magic Formula
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
Breaking it down for a 5-year-old:
- Q asks K: “Are we similar?” (dot product)
- Divide by √d_k (the size of the key vectors) so numbers don’t get too big
- Softmax turns scores into percentages (must add to 100%)
- Multiply by V to get the final answer
Why “Scaled”?
Without scaling, the dot products grow with the size of the vectors, and big scores cause problems:
- softmax over scores like [100, 90, 80] puts almost 100% of the attention on one word (ignores everything else!)
- dividing by √d_k first shrinks the scores, giving nicely balanced attention
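Here is a quick sketch of that effect with made-up scores (d_k = 64 is an assumption, chosen to match 512 ÷ 8):
import torch
import torch.nn.functional as F
big_scores = torch.tensor([100.0, 90.0, 80.0])
print(F.softmax(big_scores, dim=0))  # ~[1.0, 0.0, 0.0]: winner takes all
scaled = big_scores / 8.0            # divide by sqrt(d_k), with d_k = 64
print(F.softmax(scaled, dim=0))      # noticeably more balanced weights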
PyTorch Example
import torch
import torch.nn.functional as F
import math
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)  # dimension of keys
    # Step 1: Q asks K how similar they are
    scores = torch.matmul(Q, K.transpose(-2, -1))
    # Step 2: Scale down (divide by √d_k)
    scores = scores / math.sqrt(d_k)
    # Step 3: Convert to percentages
    attention_weights = F.softmax(scores, dim=-1)
    # Step 4: Get weighted values
    output = torch.matmul(attention_weights, V)
    return output
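A quick check of the function with small random tensors (toy sizes): the output keeps the input shape, one blended vector per word.
Q = torch.randn(2, 5, 64)  # batch of 2, 5 words, 64 numbers per head (made-up sizes)
K = torch.randn(2, 5, 64)
V = torch.randn(2, 5, 64)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([2, 5, 64])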
4. Positional Encoding: Teaching Order to the Transformer
The Problem
Transformers see all words at once. But “Dog bites man” and “Man bites dog” have the same words with VERY different meanings!
We need to tell the Transformer: “This word is FIRST, this word is SECOND…”
The Solution: Wave Patterns!
We add special number patterns to each word based on its position. These patterns use sine and cosine waves—like music!
graph LR
    A["Word: Cat"] --> B["Cat + Position 1"]
    C["Word: Sat"] --> D["Sat + Position 2"]
    E["Word: Mat"] --> F["Mat + Position 3"]
Why Waves?
| Position | Pattern Property |
|---|---|
| 1 | Low frequency wave |
| 2 | Slightly different wave |
| 100 | Still unique! |
Waves are perfect because:
- Every position gets a unique code
- The model can learn “Position 5 is close to Position 6”
- Works for sentences of any length!
PyTorch Example
import torch
import math
def positional_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1)
    # Create wave frequencies
    div_term = torch.exp(
        torch.arange(0, d_model, 2) *
        (-math.log(10000.0) / d_model)
    )
    # Even dimensions: sine wave
    pe[:, 0::2] = torch.sin(position * div_term)
    # Odd dimensions: cosine wave
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe
# For 100 positions, 512 dimensions
pos_encoding = positional_encoding(100, 512)
Visual Intuition
Position 0: [0.0, 1.0, 0.0, 1.0, ...]
Position 1: [0.84, 0.54, 0.02, 0.99, ...]
Position 2: [0.91, -0.42, 0.04, 0.99, ...]
Each row is unique—like a fingerprint for that position!
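A small sketch that checks the “Position 5 is close to Position 6” claim, using the positional_encoding function above and cosine similarity:
import torch.nn.functional as F
pe = positional_encoding(100, 512)
print(F.cosine_similarity(pe[5], pe[6], dim=0))   # high: neighbouring positions look alike
print(F.cosine_similarity(pe[5], pe[50], dim=0))  # lower: far-apart positions look different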
5. Attention Masks: The “Do Not Look” Sign
Why We Need Masks
Sometimes, we want to HIDE certain words:
- Padding: Short sentences get extra “[PAD]” tokens—ignore them!
- Future words: When generating text, you can’t peek ahead!
The Theater Analogy
Imagine watching a mystery movie, but someone covers your eyes during spoilers. That’s masking—preventing you from seeing things you shouldn’t.
Two Types of Masks
| Mask Type | Purpose | Shape |
|---|---|---|
| Padding Mask | Hide fake [PAD] words | Different per sentence |
| Causal Mask | Hide future words | Triangle shape |
Causal Mask Example
For the sentence “I love cats”:
| | I | love | cats |
|---|---|---|---|
| I | [SEE] | [HIDE] | [HIDE] |
| love | [SEE] | [SEE] | [HIDE] |
| cats | [SEE] | [SEE] | [SEE] |
Each word can only see itself and words BEFORE it!
PyTorch Example
import torch
# Causal mask (can't see future)
def create_causal_mask(size):
    mask = torch.triu(
        torch.ones(size, size),
        diagonal=1
    ).bool()
    return mask  # True = HIDE
# For 5 words:
mask = create_causal_mask(5)
# Result: Upper triangle is True (hidden)
# Padding mask
def create_padding_mask(lengths, max_len):
    batch_size = len(lengths)
    mask = torch.zeros(batch_size, max_len).bool()
    for i, length in enumerate(lengths):
        mask[i, length:] = True  # Hide padding
    return mask
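Quick usage of the padding mask for a batch of two sentences with real lengths 3 and 5, both padded to length 5:
mask = create_padding_mask([3, 5], max_len=5)
print(mask)
# tensor([[False, False, False,  True,  True],    <- last two tokens are [PAD], so hide them
#         [False, False, False, False, False]])   <- nothing to hide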
How Masks Work
# In attention: add -infinity to masked positions
masked_scores = scores.masked_fill(mask, float('-inf'))
# After softmax: -inf becomes 0 (no attention!)
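Putting the pieces together, here is a sketch that pushes made-up scores for 5 words through the causal mask from above; after softmax, every word ignores the future.
import torch
import torch.nn.functional as F
scores = torch.randn(5, 5)    # made-up attention scores for 5 words
mask = create_causal_mask(5)  # True above the diagonal
masked_scores = scores.masked_fill(mask, float('-inf'))
weights = F.softmax(masked_scores, dim=-1)
print(weights)  # the upper triangle is exactly 0: no peeking at future words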
6. Encoder and Decoder Blocks: The Dynamic Duo
The Encoder Block: Understanding Expert
Each encoder block has:
- Multi-Head Self-Attention: Words look at each other
- Feed-Forward Network: Think deeply about each word
- Layer Normalization: Keep numbers stable
- Residual Connections: Don’t forget the original input!
graph TD
    A["Input"] --> B["Multi-Head Attention"]
    A --> C["Add: Residual"]
    B --> C
    C --> D["Layer Norm"]
    D --> E["Feed-Forward"]
    D --> F["Add: Residual"]
    E --> F
    F --> G["Layer Norm"]
    G --> H["Output"]
PyTorch Encoder Block
import torch.nn as nn
class EncoderBlock(nn.Module):
    def __init__(self, d_model, nhead, d_ff):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            d_model, nhead
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention with residual
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward with residual
        ff_out = self.ff(x)
        x = self.norm2(x + ff_out)
        return x
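A quick shape check for the block above, with toy sizes (10 words, batch of 32):
import torch
block = EncoderBlock(d_model=512, nhead=8, d_ff=2048)
x = torch.randn(10, 32, 512)  # (seq_len, batch, d_model): default nn.MultiheadAttention layout
print(block(x).shape)         # torch.Size([10, 32, 512]): same shape in, same shape out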
The Decoder Block: Creation Expert
The decoder has everything the encoder has, PLUS:
- Cross-Attention: Looks at encoder output!
graph TD
    A["Target Input"] --> B["Masked Self-Attention"]
    A --> C["Add + Norm"]
    B --> C
    C --> D["Cross-Attention"]
    E["Encoder Output"] --> D
    C --> F["Add + Norm"]
    D --> F
    F --> G["Feed-Forward"]
    F --> H["Add + Norm"]
    G --> H
    H --> I["Output"]
PyTorch Decoder Block
class DecoderBlock(nn.Module):
    def __init__(self, d_model, nhead, d_ff):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            d_model, nhead
        )
        self.cross_attn = nn.MultiheadAttention(
            d_model, nhead
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )

    def forward(self, x, enc_out, mask=None):
        # Masked self-attention
        attn1, _ = self.self_attn(
            x, x, x, attn_mask=mask
        )
        x = self.norm1(x + attn1)
        # Cross-attention to encoder
        attn2, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + attn2)
        # Feed-forward
        ff_out = self.ff(x)
        x = self.norm3(x + ff_out)
        return x
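And the decoder block in action, reusing create_causal_mask from earlier so the target can't peek ahead (toy sizes again):
import torch
decoder = DecoderBlock(d_model=512, nhead=8, d_ff=2048)
tgt = torch.randn(20, 32, 512)      # 20 target words
enc_out = torch.randn(10, 32, 512)  # pretend this came from the encoder
out = decoder(tgt, enc_out, mask=create_causal_mask(20))
print(out.shape)                    # torch.Size([20, 32, 512])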
The Complete Picture
| Component | Encoder | Decoder |
|---|---|---|
| Self-Attention | ✅ (see all words) | ✅ (see only past) |
| Cross-Attention | ❌ | ✅ (see encoder) |
| Feed-Forward | ✅ | ✅ |
| Layer Norm | ✅ | ✅ |
| Residuals | ✅ | ✅ |
Putting It All Together
import torch
import torch.nn as nn

# The full Transformer in one line!
transformer = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048
)
# Usage
src = torch.randn(10, 32, 512) # 10 words, batch 32
tgt = torch.randn(20, 32, 512) # 20 target words
output = transformer(src, tgt)
# output shape: (20, 32, 512)
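In practice you would also hand the decoder a causal mask for the target side; PyTorch ships a helper that builds one (a float mask with -inf above the diagonal):
tgt_mask = transformer.generate_square_subsequent_mask(20)
output = transformer(src, tgt, tgt_mask=tgt_mask)  # the decoder can't look at future target words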
Key Takeaways
- Transformers see all words at once—no more slow reading!
- Multi-Head Attention uses multiple “eyes” to catch different patterns
- Scaled Dot-Product Attention is the heart: Q asks, K answers, V delivers
- Positional Encoding tells words their position using wave patterns
- Attention Masks hide things we shouldn’t see (padding, future)
- Encoder understands input; Decoder creates output with cross-attention
You Did It!
You now understand the Transformer—the architecture behind ChatGPT, Google Translate, and so much more. These simple ideas (attention, positions, masks) combine to create the most powerful language models ever built.
Every time you use AI that understands language, there’s a Transformer inside, paying attention to every word, just like you learned today!
