🎭 The Story of Teaching Computers to Read
Imagine you have a magical robot friend who wants to read your favorite storybooks. But there’s a problem—robots don’t understand words the way we do! Let’s discover how we teach them to read, one step at a time.
🌟 Our Journey Map
graph TD A["📚 Raw Text"] --> B["🧹 Clean It Up"] B --> C["🔢 Turn to Numbers"] C --> D["📏 Make Same Length"] D --> E["✂️ Smart Word Pieces"] E --> F["🎯 Word Meanings"] F --> G["🏷️ Classify Text"] style A fill:#ff6b6b style G fill:#4ecdc4
📚 Text Data Fundamentals
What IS Text Data?
Think of text like LEGO bricks. When you read “I love pizza”, you see three words. But a computer? It sees a long string of letters, spaces, and symbols—like a messy pile of bricks!
Text data is simply:
- Words, sentences, paragraphs
- Emails, tweets, reviews, books
- Anything humans write!
The Big Problem: Computers only understand numbers (0s and 1s). So we need to transform our words into numbers they can work with.
# What we see
text = "Hello World!"
# What computer sees initially
# Just characters: H-e-l-l-o- -W-o-r-l-d-!
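You can peek behind the curtain yourself; Python will happily show you the characters and the number codes hiding underneath:
# Peek behind the curtain: every character is really just a number
text = "Hello World!"
print(list(text))              # ['H', 'e', 'l', 'l', 'o', ' ', 'W', ...]
print([ord(c) for c in text])  # [72, 101, 108, 108, 111, 32, 87, ...]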
Real Examples:
| Text Type | Example |
|---|---|
| Tweet | “Love this movie! 🎬” |
| Review | “Best pizza in town.” |
| Email | “Meeting at 3pm today” |
🧹 Text Preprocessing
Cleaning Up the Mess
Imagine your room is super messy with toys everywhere. Before you can play properly, you need to clean up! Text preprocessing is cleaning up messy text.
Why Clean?
- “HELLO” and “hello” mean the same thing
- “running” and “run” are related
- Extra spaces and weird symbols confuse computers
The Cleaning Checklist
1. Lowercase Everything
text = "I LOVE Dogs!"
clean = text.lower()
# Result: "i love dogs!"
2. Remove Punctuation
import re
text = "Hello!!! How are you???"
clean = re.sub(r'[^\w\s]', '', text)
# Result: "Hello How are you"
3. Remove Extra Spaces
text = "Too many spaces"
clean = ' '.join(text.split())
# Result: "Too many spaces"
4. Remove Stop Words Stop words are common words like “the”, “is”, “at” that don’t add much meaning.
# Before: "The cat is on the mat"
# After: "cat mat"
5. Stemming & Lemmatization Reducing words to their root form.
| Original | Stemmed | Lemmatized |
|---|---|---|
| running | runn | run |
| better | better | good |
| cats | cat | cat |
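Here's a minimal sketch with NLTK; the exact output depends on which stemmer and lemmatizer you choose, and it assumes the WordNet data has been downloaded:
# Minimal sketch with NLTK (run nltk.download('wordnet') once first)
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("studies"))                   # 'studi'  (a crude chop, not a real word)
print(lemmatizer.lemmatize("studies", pos="v"))  # 'study'  (a real dictionary word)
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
print(lemmatizer.lemmatize("cats"))              # 'cat'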
🔢 Text Encoding
Turning Words into Numbers
Here’s where the magic happens! We need to give each word a number ID, like giving every student in class a unique number.
Method 1: Simple Counting (Bag of Words)
Imagine dumping all words into a bag and counting them:
# Sentence: "I love cats. I love dogs."
# Word counts:
# I: 2, love: 2, cats: 1, dogs: 1
graph TD A["I love cats"] --> B["Word Bag"] C["I love dogs"] --> B B --> D["I:2 love:2 cats:1 dogs:1"]
Method 2: One-Hot Encoding
Give each word its own column with 1 or 0:
Vocabulary: [cat, dog, bird]
"cat" → [1, 0, 0]
"dog" → [0, 1, 0]
"bird" → [0, 0, 1]
Method 3: TF-IDF
Term Frequency - Inverse Document Frequency
This is smarter! It asks:
- How often does this word appear? (TF)
- Is it rare or common across all texts? (IDF)
Rare words = More important!
from sklearn.feature_extraction.text import TfidfVectorizer
texts = ["I love pizza", "Pizza is great"]
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)
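Two more lines let you peek at what it learned (get_feature_names_out needs a reasonably recent scikit-learn):
# Peek at the learned vocabulary and scores
print(vectorizer.get_feature_names_out())  # ['great' 'is' 'love' 'pizza']
print(vectors.toarray())                   # one row of TF-IDF scores per sentence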
📏 Sequence Padding
Making Everything the Same Size
Imagine you have boxes for storing toys, but all boxes must be the same size. Some sentences are short, some are long—we need to make them equal!
The Problem:
"Hi" → 1 word
"Hello world" → 2 words
"I love pizza" → 3 words
Computers need fixed-size inputs!
The Solution: Padding
Add zeros (or special tokens) to make all sequences the same length:
from tensorflow.keras.preprocessing.sequence import pad_sequences
sequences = [[1, 2], [3, 4, 5], [6]]
# Pad to length 4
padded = pad_sequences(sequences, maxlen=4)
# Result:
# [[0, 0, 1, 2],
# [0, 3, 4, 5],
# [0, 0, 0, 6]]
Two Types of Padding:
| Type | Example | Use When |
|---|---|---|
| Pre-padding | [0, 0, 1, 2] | Default, works great! |
| Post-padding | [1, 2, 0, 0] | Sometimes for specific models |
# Post-padding
pad_sequences(sequences, maxlen=4, padding='post')
Truncating Long Sequences: What if a sentence is TOO long?
# If maxlen=3 and sequence is [1,2,3,4,5]
# Pre-truncate: [3, 4, 5]
# Post-truncate: [1, 2, 3]
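The same pad_sequences call handles the truncating for you:
from tensorflow.keras.preprocessing.sequence import pad_sequences
seq = [[1, 2, 3, 4, 5]]
print(pad_sequences(seq, maxlen=3))                     # [[3 4 5]]  (pre-truncate is the default)
print(pad_sequences(seq, maxlen=3, truncating='post'))  # [[1 2 3]]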
✂️ Subword Tokenization
The Smart Way to Split Words
Here’s a clever trick: What if instead of whole words, we split them into smaller pieces?
Why Subwords?
- Handle words we’ve never seen before!
- “unhappiness” → “un” + “happi” + “ness”
- Smaller vocabulary needed
The Magic of BPE (Byte Pair Encoding)
graph TD A["lowest"] --> B["low + est"] C["lower"] --> D["low + er"] E["low"] --> F["low"] style A fill:#ffeaa7 style C fill:#ffeaa7 style E fill:#ffeaa7
How it Works:
- Start with all characters
- Find most common pairs
- Merge them into new tokens
- Repeat!
# Example breakdown:
"unhappiness" → ["un", "happi", "ness"]
"tensorflow" → ["tensor", "flow"]
"playing" → ["play", "ing"]
Popular Subword Methods
| Method | Used By | Key Feature |
|---|---|---|
| BPE | GPT-2 | Merge frequent pairs |
| WordPiece | BERT | Probability-based |
| SentencePiece | T5 | Language-agnostic |
import tensorflow as tf
# TensorFlow's TextVectorization layer handles the text → integer-IDs step.
# Heads up: by default it splits on whitespace (whole words, not subwords);
# for true subword tokens, see the sketch below.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000,
    output_mode='int'
)
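If you want true subword tokens, one popular route is the Hugging Face tokenizers library. Here's a rough BPE sketch; the tiny training corpus is made up, so the exact splits you get will vary:
# Rough BPE sketch with the Hugging Face `tokenizers` library (pip install tokenizers)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
corpus = ["low lower lowest", "play playing played", "happy unhappiness"]
trainer = BpeTrainer(vocab_size=50, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)
# An unseen word still gets split into known pieces
print(tokenizer.encode("slowest").tokens)  # e.g. ['s', 'low', 'est'] (exact split depends on the learned merges)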
🎯 Word Embeddings
Giving Words Their Personality
This is the coolest part! Instead of giving words simple numbers, we give them a whole list of numbers that capture their meaning.
Simple ID vs Embedding:
Simple: "king" = 42
Embedding: "king" = [0.2, 0.8, -0.3, 0.5, ...]
The Magic of Embeddings
Words with similar meanings have similar numbers!
graph LR A["King"] --- B["Queen"] A --- C["Prince"] D["Dog"] --- E["Cat"] D --- F["Puppy"] style A fill:#ff6b6b style B fill:#ff6b6b style D fill:#4ecdc4 style E fill:#4ecdc4
Famous Discovery:
King - Man + Woman = Queen
The math actually works with embeddings!
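Here's a toy version of that arithmetic with made-up 2-D vectors; real embeddings have hundreds of learned dimensions, but the idea is the same:
import numpy as np
# Made-up 2-D vectors, just to show the arithmetic (real embeddings are learned)
vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.7]),
    "woman": np.array([0.1, 0.1]),
}
target = vectors["king"] - vectors["man"] + vectors["woman"]
# Find the word whose vector is closest to the result
closest = min(vectors, key=lambda w: np.linalg.norm(vectors[w] - target))
print(closest)  # queen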
Using Embeddings in TensorFlow
import tensorflow as tf
# Create embedding layer
embedding = tf.keras.layers.Embedding(
input_dim=10000, # vocabulary size
output_dim=128 # embedding size
)
# Input: word IDs
# Output: dense vectors
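Continuing the snippet above, passing a batch of word IDs through the layer turns each ID into its own 128-number vector:
# One sentence of three word IDs in, three 128-number vectors out
ids = tf.constant([[4, 25, 103]])  # shape: (1, 3)
vectors = embedding(ids)
print(vectors.shape)               # (1, 3, 128)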
Pre-trained Embeddings
Why train from scratch? Use embeddings others have trained!
| Name | Trained On | Dimensions |
|---|---|---|
| Word2Vec | Google News | 300 |
| GloVe | Wikipedia | 50-300 |
| FastText | Common Crawl | 300 |
# Load pre-trained GloVe vectors into a dictionary
import numpy as np
embedding_index = {}
with open('glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs
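To plug those vectors into a model, one common next step is to copy them into an embedding matrix. This sketch assumes word_index (word → integer ID) and vocab_size come from your own tokenizer:
# Sketch: copy the GloVe vectors into a matrix your model can use
# (`word_index` and `vocab_size` are placeholders from your own tokenizer)
import numpy as np
import tensorflow as tf
embedding_dim = 100  # must match the GloVe file (glove.6B.100d.txt)
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_index.items():
    vector = embedding_index.get(word)
    if vector is not None and i < vocab_size:
        embedding_matrix[i] = vector
embedding_layer = tf.keras.layers.Embedding(
    vocab_size, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False  # keep the pre-trained vectors frozen (optional)
)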
🏷️ Text Classification Tasks
Teaching Computers to Sort Text
Now for the finale! We’ve prepared our text—time to teach the computer to understand and categorize it!
What is Text Classification? Putting text into categories, like sorting mail into folders.
Common Classification Tasks
graph TD A["Text Classification"] --> B["Sentiment"] A --> C["Spam Detection"] A --> D["Topic Labeling"] A --> E["Intent Detection"] style A fill:#667eea
1. Sentiment Analysis Is this review positive or negative?
"Great movie!" → Positive ✅
"Terrible food" → Negative ❌
2. Spam Detection
"You won $1000000!" → Spam 🚫
"Meeting tomorrow" → Not Spam ✉️
3. Topic Classification
"Stock prices rose" → Finance 📈
"Team wins championship" → Sports ⚽
Building a Text Classifier
import tensorflow as tf
vocab_size = 10000  # e.g. how many words your tokenizer knows
model = tf.keras.Sequential([
# Convert words to embeddings
tf.keras.layers.Embedding(
vocab_size, 64),
# Process the sequence
tf.keras.layers.GlobalAveragePooling1D(),
# Hidden layer
tf.keras.layers.Dense(64, activation='relu'),
# Output: 2 classes (pos/neg)
tf.keras.layers.Dense(2, activation='softmax')
])
model.compile(
optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
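Training is then a single fit call; padded_train and train_labels below are placeholders for your own prepared data:
# Train on your prepared data (padded_train and train_labels are placeholders)
history = model.fit(
    padded_train, train_labels,
    validation_split=0.2,  # hold out 20% to watch for overfitting
    epochs=10,
    batch_size=32
)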
The Complete Pipeline
graph TD A["Raw Text"] --> B["Preprocess"] B --> C["Tokenize"] C --> D["Pad Sequences"] D --> E["Embed Words"] E --> F["Neural Network"] F --> G["Prediction"] style A fill:#ff6b6b style G fill:#4ecdc4
Real Example:
# 1. Raw text
review = "This movie was absolutely amazing!"
# 2. Preprocess
clean = preprocess(review) # "movie absolutely amazing"
# 3. Tokenize
tokens = tokenizer.texts_to_sequences([clean])
# [[45, 892, 234]]
# 4. Pad
padded = pad_sequences(tokens, maxlen=100)
# [[0, 0, ..., 45, 892, 234]]
# 5. Predict
prediction = model.predict(padded)
# e.g. [[0.05, 0.95]] → "Positive" with 95% confidence!
🎉 You Did It!
You’ve just learned how to:
- ✅ Understand what text data is
- ✅ Clean and preprocess text
- ✅ Convert words to numbers
- ✅ Handle different length texts
- ✅ Use smart subword tokenization
- ✅ Give words meaningful embeddings
- ✅ Build text classifiers
The Journey: Raw messy text → Clean text → Numbers → Same-size sequences → Meaningful embeddings → Smart predictions!
Now you know how computers learn to read! 🚀📚
Remember: Every time your phone autocompletes your text or Netflix recommends a movie, these techniques are working behind the scenes!
