
🧠 Building Sequence Models in PyTorch

A Journey Through Time: Teaching Machines to Remember


🎬 The Story Begins: Why Sequences Matter

Imagine you’re reading a sentence: “The cat sat on the ___”

Your brain doesn’t just see the last word. It remembers everything that came before. That’s what makes you smart at language!

Regular neural networks? They have amnesia. Show them one word, and they've already forgotten the one before.

Sequence models? They’re like you — they remember!

🎯 Our Mission: Build neural networks that can read, remember, and understand sequences — just like you do.


📖 Chapter 1: Building RNNs (Recurrent Neural Networks)

The Memory Loop

Think of an RNN like a student passing notes in class.

Word 1 → [Brain Box] → passes note →
Word 2 → [Brain Box] → passes note →
Word 3 → [Brain Box] → final answer!

Each “Brain Box” gets:

  1. The new word (current input)
  2. The note from before (hidden state)

🔨 Building Your First RNN in PyTorch

import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, input_size,
                 hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(
            input_size,
            hidden_size,
            batch_first=True
        )
        self.fc = nn.Linear(
            hidden_size, output_size
        )

    def forward(self, x):
        # x shape: (batch, seq_len, input_size)
        out, hidden = self.rnn(x)
        # Take last output
        out = self.fc(out[:, -1, :])
        return out

🎯 What Each Part Does

| Part | Job |
| --- | --- |
| nn.RNN | The memory loop |
| hidden_size | How much to remember |
| out[:, -1, :] | Final answer |
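
Here is how you might try it out, with made-up sizes just to sanity-check the shapes (a quick sketch, not a trained model):

model = SimpleRNN(input_size=8, hidden_size=16, output_size=3)

# A fake batch: 4 sequences, 10 time steps each, 8 features per step
x = torch.randn(4, 10, 8)

out = model(x)
print(out.shape)  # torch.Size([4, 3]) -- one prediction per sequence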

😅 The RNN Problem

RNNs have short-term memory. Like a goldfish trying to remember a whole book!

Long sequences? RNNs forget the beginning by the end.


📖 Chapter 2: The Attention Mechanism

A Spotlight in the Dark

Imagine you’re at a concert with 1000 people. Everyone’s talking. But when your friend calls your name — you hear it instantly!

That’s attention. Instead of processing everything equally, you focus on what matters.

graph TD A["All Words"] --> B{Attention} B --> C["Important: 90%"] B --> D["Less Important: 8%"] B --> E["Ignore: 2%"] C --> F["Final Understanding"]

🔨 Building Attention in PyTorch

class Attention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.attn = nn.Linear(
            hidden_size * 2, hidden_size
        )
        self.v = nn.Linear(hidden_size, 1)

    def forward(self, hidden,
                encoder_outputs):
        # hidden: (batch, hidden_size) -- the current state
        # encoder_outputs: (batch, seq_len, hidden_size) -- all past states

        seq_len = encoder_outputs.size(1)

        # Repeat hidden for each position
        hidden = hidden.unsqueeze(1)
        hidden = hidden.repeat(1, seq_len, 1)

        # Calculate attention scores
        energy = torch.tanh(
            self.attn(torch.cat(
                [hidden, encoder_outputs],
                dim=2
            ))
        )
        scores = self.v(energy).squeeze(2)

        # Softmax = pick the important ones
        weights = torch.softmax(scores, dim=1)

        # Weighted sum
        context = torch.bmm(
            weights.unsqueeze(1),
            encoder_outputs
        )
        return context, weights

🎯 The Magic Formula

Attention Score = How relevant is THIS word to what I need?

High score → Pay attention! Low score → Ignore it.
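
To see it in action, here is a shape check with invented sizes (a batch of 2, a sequence of 5, and hidden size 16 are just for illustration):

batch, seq_len, hidden_size = 2, 5, 16

attn = Attention(hidden_size)
hidden = torch.randn(batch, hidden_size)                    # current state
encoder_outputs = torch.randn(batch, seq_len, hidden_size)  # all past states

context, weights = attn(hidden, encoder_outputs)
print(context.shape)        # torch.Size([2, 1, 16])
print(weights.shape)        # torch.Size([2, 5])
print(weights.sum(dim=1))   # each row sums to 1 -- a probability over positions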


📖 Chapter 3: Self-Attention Implementation

Talking to Yourself (But Productively!)

Regular attention: “How does word A relate to word B?”

Self-attention: “How does EACH word relate to EVERY other word in the same sentence?”

It’s like everyone at a party introducing themselves to everyone else!

graph TD A["The"] --> B["cat"] A --> C["sat"] A --> D["mat"] B --> A B --> C B --> D C --> A C --> B C --> D D --> A D --> B D --> C

🔨 Self-Attention from Scratch

class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        # Three transformations
        self.queries = nn.Linear(
            embed_size, embed_size
        )
        self.keys = nn.Linear(
            embed_size, embed_size
        )
        self.values = nn.Linear(
            embed_size, embed_size
        )
        self.fc_out = nn.Linear(
            embed_size, embed_size
        )

    def forward(self, x):
        N, seq_len, _ = x.shape

        # Create Q, K, V
        Q = self.queries(x)
        K = self.keys(x)
        V = self.values(x)

        # Split into multiple heads
        Q = Q.view(N, seq_len,
                   self.heads, self.head_dim)
        K = K.view(N, seq_len,
                   self.heads, self.head_dim)
        V = V.view(N, seq_len,
                   self.heads, self.head_dim)

        # Attention = softmax(Q·K / √d) · V
        energy = torch.einsum(
            "nqhd,nkhd->nhqk", Q, K
        )
        scaling = self.head_dim ** 0.5
        attention = torch.softmax(
            energy / scaling, dim=3
        )

        out = torch.einsum(
            "nhql,nlhd->nqhd",
            attention, V
        )
        out = out.reshape(N, seq_len,
                          self.embed_size)

        return self.fc_out(out)

🎯 Q, K, V Explained Simply

| Name | What It Does | Analogy |
| --- | --- | --- |
| Query (Q) | What am I looking for? | Your question |
| Key (K) | What do I contain? | Labels on boxes |
| Value (V) | What's inside? | Actual content |

The Formula: Match your question (Q) to labels (K), then grab the matching content (V)!
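
A minimal shape check with illustrative sizes (note that embed_size must divide evenly by heads):

embed_size, heads = 32, 4
sa = SelfAttention(embed_size, heads)

x = torch.randn(2, 6, embed_size)   # (batch, seq_len, embed_size)
out = sa(x)
print(out.shape)  # torch.Size([2, 6, 32]) -- same shape in, same shape out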


📖 Chapter 4: Building Transformers

The Attention Olympics

Transformers don’t just use attention once. They use it over and over, making understanding deeper each time!

Think of it like reading a book:

  • First pass: See the words
  • Second pass: Understand sentences
  • Third pass: Get the deeper meaning
graph TD A["Input Words"] --> B["Self-Attention 1"] B --> C["Feed Forward 1"] C --> D["Self-Attention 2"] D --> E["Feed Forward 2"] E --> F["Self-Attention 3"] F --> G["Feed Forward 3"] G --> H["Rich Understanding"]

🔨 Transformer Block in PyTorch

class TransformerBlock(nn.Module):
    def __init__(self, embed_size,
                 heads, dropout, ff_dim):
        super().__init__()
        self.attention = SelfAttention(
            embed_size, heads
        )
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_size)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention + skip connection
        attended = self.attention(x)
        x = self.norm1(attended + x)
        x = self.dropout(x)

        # Feed forward + skip connection
        forwarded = self.feed_forward(x)
        x = self.norm2(forwarded + x)
        x = self.dropout(x)

        return x

🎯 Key Transformer Ideas

| Concept | Why It Matters |
| --- | --- |
| Skip Connections | Don't forget the original! |
| Layer Norm | Keep numbers stable |
| Multiple Heads | Different perspectives |
| Stacking Blocks | Deeper understanding |
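
Here is a quick sketch of "stacking" with made-up hyperparameters. For brevity it reuses one block three times; a real model stacks separate blocks with their own weights (exactly what the Encoder below does with nn.ModuleList):

block = TransformerBlock(embed_size=32, heads=4,
                         dropout=0.1, ff_dim=64)

x = torch.randn(2, 6, 32)   # (batch, seq_len, embed_size)
for _ in range(3):          # three "passes" over the sequence
    x = block(x)
print(x.shape)              # torch.Size([2, 6, 32])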

📖 Chapter 5: Encoder-Decoder Models

The Translator’s Brain

Imagine a translator who:

  1. Listens to the whole French sentence (Encoder)
  2. Thinks about its meaning
  3. Speaks the English version (Decoder)
graph LR A["French Words"] --> B["ENCODER"] B --> C["Understanding"] C --> D["DECODER"] D --> E["English Words"]

🔨 Encoder-Decoder in PyTorch

class Encoder(nn.Module):
    def __init__(self, vocab_size,
                 embed_size, num_layers,
                 heads, ff_dim, dropout):
        super().__init__()
        self.embed = nn.Embedding(
            vocab_size, embed_size
        )
        self.layers = nn.ModuleList([
            TransformerBlock(
                embed_size, heads,
                dropout, ff_dim
            )
            for _ in range(num_layers)
        ])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.dropout(self.embed(x))
        for layer in self.layers:
            x = layer(x)
        return x


class Decoder(nn.Module):
    def __init__(self, vocab_size,
                 embed_size, num_layers,
                 heads, ff_dim, dropout):
        super().__init__()
        self.embed = nn.Embedding(
            vocab_size, embed_size
        )
        self.layers = nn.ModuleList([
            TransformerBlock(
                embed_size, heads,
                dropout, ff_dim
            )
            for _ in range(num_layers)
        ])
        self.fc_out = nn.Linear(
            embed_size, vocab_size
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out):
        x = self.dropout(self.embed(x))
        for layer in self.layers:
            # Simplified: a full Transformer decoder layer would
            # also cross-attend to enc_out and apply a causal mask
            # so each position only sees earlier words
            x = layer(x)
        return self.fc_out(x)

🎯 Encoder vs Decoder

| Part | Job | Sees |
| --- | --- | --- |
| Encoder | Understand input | All words at once |
| Decoder | Generate output | Only past words |
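
A minimal shape check with an invented vocabulary and small sizes (all numbers below are illustrative, not tuned):

VOCAB = 1000

encoder = Encoder(vocab_size=VOCAB, embed_size=32, num_layers=2,
                  heads=4, ff_dim=64, dropout=0.1)
decoder = Decoder(vocab_size=VOCAB, embed_size=32, num_layers=2,
                  heads=4, ff_dim=64, dropout=0.1)

src = torch.randint(0, VOCAB, (2, 7))   # (batch, src_len) token ids
trg = torch.randint(0, VOCAB, (2, 5))   # (batch, trg_len) token ids

enc_out = encoder(src)          # torch.Size([2, 7, 32])
logits = decoder(trg, enc_out)  # torch.Size([2, 5, 1000]) -- a score per vocab word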

📖 Chapter 6: Sequence-to-Sequence (Seq2Seq)

The Complete Pipeline

Seq2Seq is the whole system working together!

Input sequence → Encoder → Context → Decoder → Output sequence

🔨 Complete Seq2Seq Model

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, trg):
        # Encode the source
        enc_out = self.encoder(src)

        # Decode to target
        output = self.decoder(trg, enc_out)

        return output


# Training example
def train_step(model, src, trg,
               criterion, optimizer):
    optimizer.zero_grad()

    # Forward pass
    # trg[:, :-1] = input to decoder (all but the last token)
    # trg[:, 1:]  = expected output (all but the first token)
    output = model(src, trg[:, :-1])

    # Calculate loss
    output = output.reshape(-1,
                            output.shape[-1])
    trg = trg[:, 1:].reshape(-1)

    loss = criterion(output, trg)
    loss.backward()
    optimizer.step()

    return loss.item()
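
And a sketch of one training step on random data, reusing the toy encoder, decoder, and VOCAB from the previous example (a real setup would also handle padding and masking, e.g. nn.CrossEntropyLoss(ignore_index=...) with your pad token id):

model = Seq2Seq(encoder, decoder)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

src = torch.randint(0, VOCAB, (2, 7))
trg = torch.randint(0, VOCAB, (2, 5))

loss = train_step(model, src, trg, criterion, optimizer)
print(f"loss: {loss:.3f}")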

🎯 Real-World Seq2Seq Uses

| Task | Input | Output |
| --- | --- | --- |
| Translation | French text | English text |
| Summarization | Long article | Short summary |
| Chatbot | Your message | Bot's reply |
| Code Generation | Description | Code |

🎁 The Grand Summary

graph TD A["Sequence Models"] --> B["RNNs"] A --> C["Attention"] A --> D["Transformers"] B --> E["Basic Memory"] C --> F["Focus on Important"] D --> G["Deep Understanding"] D --> H["Encoder-Decoder"] H --> I["Seq2Seq Tasks"]

🚀 What You Learned

| Concept | One-Line Summary |
| --- | --- |
| RNNs | Memory through loops |
| Attention | Focus on what matters |
| Self-Attention | Words talking to each other |
| Transformers | Stacked attention blocks |
| Encoder-Decoder | Understand → Generate |
| Seq2Seq | Complete translation system |

💪 You Did It!

You now understand how machines learn to:

  • Read sequences (like sentences)
  • Remember important parts
  • Generate new sequences

This is the foundation of ChatGPT, Google Translate, and every modern language AI!

🎯 Next Step: Build your own Transformer and watch it learn to translate!


Remember: Every expert was once a beginner. Keep building, keep learning! 🚀
