🧠 Building Sequence Models in PyTorch
A Journey Through Time: Teaching Machines to Remember
🎬 The Story Begins: Why Sequences Matter
Imagine you’re reading a sentence: “The cat sat on the ___”
Your brain doesn’t just see the last word. It remembers everything that came before. That’s what makes you smart at language!
Regular neural networks? They have amnesia. Show them one word, they forget the last.
Sequence models? They’re like you — they remember!
🎯 Our Mission: Build neural networks that can read, remember, and understand sequences — just like you do.
📖 Chapter 1: Building RNNs (Recurrent Neural Networks)
The Memory Loop
Think of an RNN like a student passing notes in class.
Word 1 → [Brain Box] → passes note →
Word 2 → [Brain Box] → passes note →
Word 3 → [Brain Box] → final answer!
Each “Brain Box” gets:
- The new word (current input)
- The note from before (hidden state)
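Under the hood, one "Brain Box" step is just a small tensor update. Here's a minimal sketch of a single step; the weights `W_xh`, `W_hh`, and `b_h` are illustrative stand-ins for what `nn.RNN` learns for you:

```python
import torch

# Illustrative sizes
input_size, hidden_size = 4, 8

# Hypothetical weights for one "Brain Box" step
W_xh = torch.randn(input_size, hidden_size)   # new word -> hidden
W_hh = torch.randn(hidden_size, hidden_size)  # old note -> hidden
b_h = torch.zeros(hidden_size)

x_t = torch.randn(1, input_size)      # the new word (current input)
h_prev = torch.zeros(1, hidden_size)  # the note from before (hidden state)

# One step: mix the new word with the old note, squash with tanh
h_t = torch.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)
print(h_t.shape)  # torch.Size([1, 8]) -- the new note to pass along
```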
🔨 Building Your First RNN in PyTorch
```python
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(
            input_size,
            hidden_size,
            batch_first=True
        )
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x shape: (batch, seq_len, input_size)
        out, hidden = self.rnn(x)
        # Take the output at the last time step
        out = self.fc(out[:, -1, :])
        return out
```
🎯 What Each Part Does
| Part | Job |
|---|---|
| `nn.RNN` | The memory loop |
| `hidden_size` | How much to remember |
| `out[:, -1, :]` | Final answer |
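Want to see it run? Here's a quick sanity check with made-up sizes (10 features per step, 5 steps, 3 output classes — all illustrative):

```python
model = SimpleRNN(input_size=10, hidden_size=32, output_size=3)

x = torch.randn(2, 5, 10)  # (batch=2, seq_len=5, input_size=10)
logits = model(x)
print(logits.shape)  # torch.Size([2, 3]) -- one prediction per sequence
```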
😅 The RNN Problem
RNNs have short-term memory. Like a goldfish trying to remember a whole book!
Long sequences? RNNs forget the beginning by the end.
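Why? During training, the gradient flows backward through every step and gets multiplied by a similar factor each time. A toy calculation (the 0.9 is just for illustration) shows how fast the signal fades:

```python
signal = 1.0
for step in range(100):
    signal *= 0.9   # each backward step shrinks the gradient a bit
print(signal)       # ~0.0000266 -- the start of the sequence barely registers
```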
📖 Chapter 2: The Attention Mechanism
A Spotlight in the Dark
Imagine you’re at a concert with 1000 people. Everyone’s talking. But when your friend calls your name — you hear it instantly!
That’s attention. Instead of processing everything equally, you focus on what matters.
graph TD A["All Words"] --> B{Attention} B --> C["Important: 90%"] B --> D["Less Important: 8%"] B --> E["Ignore: 2%"] C --> F["Final Understanding"]
🔨 Building Attention in PyTorch
```python
class Attention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.attn = nn.Linear(hidden_size * 2, hidden_size)
        self.v = nn.Linear(hidden_size, 1)

    def forward(self, hidden, encoder_outputs):
        # hidden: current state, (batch, hidden_size)
        # encoder_outputs: all past states, (batch, seq_len, hidden_size)
        seq_len = encoder_outputs.size(1)

        # Repeat hidden for each position
        hidden = hidden.unsqueeze(1)
        hidden = hidden.repeat(1, seq_len, 1)

        # Calculate attention scores
        energy = torch.tanh(
            self.attn(torch.cat([hidden, encoder_outputs], dim=2))
        )
        scores = self.v(energy).squeeze(2)

        # Softmax = pick the important ones
        weights = torch.softmax(scores, dim=1)

        # Weighted sum of the encoder outputs
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs)
        return context, weights
```
🎯 The Magic Formula
Attention Score = How relevant is THIS word to what I need?
High score → Pay attention! Low score → Ignore it.
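A quick shape check with random tensors (the sizes here are made up, just to see what comes out):

```python
attn = Attention(hidden_size=32)

hidden = torch.randn(2, 32)              # current state (batch=2)
encoder_outputs = torch.randn(2, 7, 32)  # 7 past states to look over

context, weights = attn(hidden, encoder_outputs)
print(context.shape)       # torch.Size([2, 1, 32]) -- weighted summary
print(weights.shape)       # torch.Size([2, 7])     -- one weight per position
print(weights.sum(dim=1))  # each row sums to 1.0 (softmax)
```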
📖 Chapter 3: Self-Attention Implementation
Talking to Yourself (But Productively!)
Regular attention: “How does word A relate to word B?”
Self-attention: “How does EACH word relate to EVERY other word in the same sentence?”
It’s like everyone at a party introducing themselves to everyone else!
graph TD A["The"] --> B["cat"] A --> C["sat"] A --> D["mat"] B --> A B --> C B --> D C --> A C --> B C --> D D --> A D --> B D --> C
🔨 Self-Attention from Scratch
```python
class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        # Three transformations
        self.queries = nn.Linear(embed_size, embed_size)
        self.keys = nn.Linear(embed_size, embed_size)
        self.values = nn.Linear(embed_size, embed_size)
        self.fc_out = nn.Linear(embed_size, embed_size)

    def forward(self, x):
        N, seq_len, _ = x.shape

        # Create Q, K, V
        Q = self.queries(x)
        K = self.keys(x)
        V = self.values(x)

        # Split into multiple heads
        Q = Q.view(N, seq_len, self.heads, self.head_dim)
        K = K.view(N, seq_len, self.heads, self.head_dim)
        V = V.view(N, seq_len, self.heads, self.head_dim)

        # Attention = softmax(Q·K / √d) · V
        energy = torch.einsum("nqhd,nkhd->nhqk", Q, K)
        scaling = self.head_dim ** 0.5
        attention = torch.softmax(energy / scaling, dim=3)

        out = torch.einsum("nhql,nlhd->nqhd", attention, V)
        out = out.reshape(N, seq_len, self.embed_size)
        return self.fc_out(out)
```
🎯 Q, K, V Explained Simply
| Name | What It Does | Analogy |
|---|---|---|
| Query (Q) | What am I looking for? | Your question |
| Key (K) | What do I contain? | Labels on boxes |
| Value (V) | What’s inside? | Actual content |
The Formula: Match your question (Q) to labels (K), then grab the matching content (V)!
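Quick sanity check with illustrative sizes — note that embed_size must divide evenly by heads:

```python
self_attn = SelfAttention(embed_size=64, heads=4)  # 64 / 4 = 16 per head

x = torch.randn(2, 10, 64)  # (batch=2, seq_len=10, embed_size=64)
out = self_attn(x)
print(out.shape)  # torch.Size([2, 10, 64]) -- same shape, richer meaning
```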
📖 Chapter 4: Building Transformers
The Attention Olympics
Transformers don’t just use attention once. They use it over and over, making understanding deeper each time!
Think of it like reading a book:
- First pass: See the words
- Second pass: Understand sentences
- Third pass: Get the deeper meaning
graph TD A["Input Words"] --> B["Self-Attention 1"] B --> C["Feed Forward 1"] C --> D["Self-Attention 2"] D --> E["Feed Forward 2"] E --> F["Self-Attention 3"] F --> G["Feed Forward 3"] G --> H["Rich Understanding"]
🔨 Transformer Block in PyTorch
```python
class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, ff_dim):
        super().__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_size)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention + skip connection
        attended = self.attention(x)
        x = self.norm1(attended + x)
        x = self.dropout(x)

        # Feed forward + skip connection
        forwarded = self.feed_forward(x)
        x = self.norm2(forwarded + x)
        x = self.dropout(x)
        return x
```
🎯 Key Transformer Ideas
| Concept | Why It Matters |
|---|---|
| Skip Connections | Don’t forget the original! |
| Layer Norm | Keep numbers stable |
| Multiple Heads | Different perspectives |
| Stacking Blocks | Deeper understanding |
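To try the block out, here's a tiny sketch with made-up sizes. (Re-using one block three times is just for illustration; a real model stacks separate blocks with their own weights, as the Encoder in the next chapter does.)

```python
block = TransformerBlock(embed_size=64, heads=4, dropout=0.1, ff_dim=256)

x = torch.randn(2, 10, 64)  # (batch, seq_len, embed_size)
for _ in range(3):          # "read the book" three times
    x = block(x)
print(x.shape)  # torch.Size([2, 10, 64]) -- shape preserved, meaning deepened
```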
📖 Chapter 5: Encoder-Decoder Models
The Translator’s Brain
Imagine a translator who:
- Listens to the whole French sentence (Encoder)
- Thinks about its meaning
- Speaks the English version (Decoder)
graph LR A["French Words"] --> B["ENCODER"] B --> C["Understanding"] C --> D["DECODER"] D --> E["English Words"]
🔨 Encoder-Decoder in PyTorch
```python
class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, num_layers,
                 heads, ff_dim, dropout):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.layers = nn.ModuleList([
            TransformerBlock(embed_size, heads, dropout, ff_dim)
            for _ in range(num_layers)
        ])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.dropout(self.embed(x))
        for layer in self.layers:
            x = layer(x)
        return x


class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, num_layers,
                 heads, ff_dim, dropout):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.layers = nn.ModuleList([
            TransformerBlock(embed_size, heads, dropout, ff_dim)
            for _ in range(num_layers)
        ])
        self.fc_out = nn.Linear(embed_size, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out):
        x = self.dropout(self.embed(x))
        for layer in self.layers:
            x = layer(x)
        # Simplified: a full Transformer decoder would also
        # cross-attend to enc_out and apply a causal mask here
        return self.fc_out(x)
```
🎯 Encoder vs Decoder
| Part | Job | Sees |
|---|---|---|
| Encoder | Understand input | All words at once |
| Decoder | Generate output | Only past words |
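One honest note: the simplified Decoder above doesn't actually enforce "only past words". In a full Transformer that rule comes from a causal mask that blocks attention to future positions. A minimal sketch of such a mask:

```python
seq_len = 5
# True above the diagonal = "future position, not allowed to look"
causal_mask = torch.triu(
    torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
)
print(causal_mask)
# Inside attention you would do something like:
#   energy = energy.masked_fill(causal_mask, float("-inf"))
# before the softmax, so future words get zero weight.
```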
📖 Chapter 6: Sequence-to-Sequence (Seq2Seq)
The Complete Pipeline
Seq2Seq is the whole system working together!
Input sequence → Encoder → Context → Decoder → Output sequence
🔨 Complete Seq2Seq Model
```python
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, trg):
        # Encode the source
        enc_out = self.encoder(src)
        # Decode to target
        output = self.decoder(trg, enc_out)
        return output


# Training example
def train_step(model, src, trg, criterion, optimizer):
    optimizer.zero_grad()

    # Forward pass
    # trg[:, :-1] = input to the decoder
    # trg[:, 1:]  = expected output (shifted by one)
    output = model(src, trg[:, :-1])

    # Calculate loss
    output = output.reshape(-1, output.shape[-1])
    trg = trg[:, 1:].reshape(-1)
    loss = criterion(output, trg)

    loss.backward()
    optimizer.step()
    return loss.item()
```
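Here's how the pieces might fit together, using small made-up hyperparameters and random token IDs just to run one training step:

```python
import torch.optim as optim

SRC_VOCAB, TRG_VOCAB = 100, 120

enc = Encoder(SRC_VOCAB, embed_size=64, num_layers=2,
              heads=4, ff_dim=256, dropout=0.1)
dec = Decoder(TRG_VOCAB, embed_size=64, num_layers=2,
              heads=4, ff_dim=256, dropout=0.1)
model = Seq2Seq(enc, dec)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=3e-4)

src = torch.randint(0, SRC_VOCAB, (2, 8))  # (batch=2, src_len=8)
trg = torch.randint(0, TRG_VOCAB, (2, 9))  # (batch=2, trg_len=9)

loss = train_step(model, src, trg, criterion, optimizer)
print(f"loss: {loss:.4f}")
```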
🎯 Real-World Seq2Seq Uses
| Task | Input | Output |
|---|---|---|
| Translation | French text | English text |
| Summarization | Long article | Short summary |
| Chatbot | Your message | Bot’s reply |
| Code Generation | Description | Code |
🎁 The Grand Summary
graph TD A["Sequence Models"] --> B["RNNs"] A --> C["Attention"] A --> D["Transformers"] B --> E["Basic Memory"] C --> F["Focus on Important"] D --> G["Deep Understanding"] D --> H["Encoder-Decoder"] H --> I["Seq2Seq Tasks"]
🚀 What You Learned
| Concept | One-Line Summary |
|---|---|
| RNNs | Memory through loops |
| Attention | Focus on what matters |
| Self-Attention | Words talking to each other |
| Transformers | Stacked attention blocks |
| Encoder-Decoder | Understand → Generate |
| Seq2Seq | Complete translation system |
💪 You Did It!
You now understand how machines learn to:
- Read sequences (like sentences)
- Remember important parts
- Generate new sequences
This is the foundation of ChatGPT, Google Translate, and every modern language AI!
🎯 Next Step: Build your own Transformer and watch it learn to translate!
Remember: Every expert was once a beginner. Keep building, keep learning! 🚀
