Text Processing


🎭 The Story of Teaching Computers to Read

Imagine you have a magical robot friend who wants to read your favorite storybooks. But there’s a problem—robots don’t understand words the way we do! Let’s discover how we teach them to read, one step at a time.


🌟 Our Journey Map

graph TD
  A["📚 Raw Text"] --> B["🧹 Clean It Up"]
  B --> C["🔢 Turn to Numbers"]
  C --> D["📏 Make Same Length"]
  D --> E["✂️ Smart Word Pieces"]
  E --> F["🎯 Word Meanings"]
  F --> G["🏷️ Classify Text"]
  style A fill:#ff6b6b
  style G fill:#4ecdc4

📚 Text Data Fundamentals

What IS Text Data?

Think of text like LEGO bricks. When you read “I love pizza”, you see three words. But a computer? It sees a long string of letters, spaces, and symbols—like a messy pile of bricks!

Text data is simply:

  • Words, sentences, paragraphs
  • Emails, tweets, reviews, books
  • Anything humans write!

The Big Problem: Computers only understand numbers (0s and 1s). So we need to transform our words into numbers they can work with.

# What we see
text = "Hello World!"

# What computer sees initially
# Just characters: H-e-l-l-o- -W-o-r-l-d-!

Real Examples:

| Text Type | Example                 |
|-----------|-------------------------|
| Tweet     | “Love this movie! 🎬”   |
| Review    | “Best pizza in town.”   |
| Email     | “Meeting at 3pm today”  |

🧹 Text Preprocessing

Cleaning Up the Mess

Imagine your room is super messy with toys everywhere. Before you can play properly, you need to clean up! Text preprocessing is cleaning up messy text.

Why Clean?

  • “HELLO” and “hello” mean the same thing
  • “running” and “run” are related
  • Extra spaces and weird symbols confuse computers

The Cleaning Checklist

1. Lowercase Everything

text = "I LOVE Dogs!"
clean = text.lower()
# Result: "i love dogs!"

2. Remove Punctuation

import re
text = "Hello!!! How are you???"
clean = re.sub(r'[^\w\s]', '', text)
# Result: "Hello How are you"

3. Remove Extra Spaces

text = "Too   many    spaces"
clean = ' '.join(text.split())
# Result: "Too many spaces"

4. Remove Stop Words Stop words are common words like “the”, “is”, “at” that don’t add much meaning.

# Before: "The cat is on the mat"
# After:  "cat mat"
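
A minimal sketch of stop-word removal, using a small hand-picked stop-word list (libraries like NLTK ship much longer ones):

# A tiny, hand-picked stop-word list (real libraries provide longer lists)
stop_words = {"the", "is", "on", "at", "a", "an"}

text = "The cat is on the mat"
filtered = [word for word in text.lower().split() if word not in stop_words]

print(' '.join(filtered))
# Result: "cat mat"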

5. Stemming & Lemmatization Reducing words to their root form.

| Original | Stemmed | Lemmatized |
|----------|---------|------------|
| running  | run     | run        |
| better   | better  | good       |
| cats     | cat     | cat        |
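
A minimal sketch with NLTK (assuming nltk and its WordNet data are installed):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # needs the 'wordnet' corpus downloaded

print(stemmer.stem("running"))              # "run"
print(lemmatizer.lemmatize("better", "a"))  # "good" (treated as an adjective)
print(lemmatizer.lemmatize("cats"))         # "cat"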

🔢 Text Encoding

Turning Words into Numbers

Here’s where the magic happens! We need to give each word a number ID, like giving every student in class a unique number.

Method 1: Simple Counting (Bag of Words)

Imagine dumping all words into a bag and counting them:

# Sentence: "I love cats. I love dogs."
# Word counts:
# I: 2, love: 2, cats: 1, dogs: 1

graph TD
  A["I love cats"] --> B["Word Bag"]
  C["I love dogs"] --> B
  B --> D["I:2 love:2 cats:1 dogs:1"]
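
A quick sketch with scikit-learn's CountVectorizer (assuming scikit-learn is installed; the token pattern is widened so one-letter words like "I" aren't dropped):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love cats", "I love dogs"]

# Keep one-letter tokens like "I" (the default pattern drops them)
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # ['cats' 'dogs' 'i' 'love']
print(counts.toarray())
# [[1 0 1 1]
#  [0 1 1 1]]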

Method 2: One-Hot Encoding

Give each word its own column with 1 or 0:

Vocabulary: [cat, dog, bird]

"cat"  → [1, 0, 0]
"dog"  → [0, 1, 0]
"bird" → [0, 0, 1]

Method 3: TF-IDF

Term Frequency - Inverse Document Frequency

This is smarter! It asks:

  • How often does this word appear? (TF)
  • Is it rare or common across all texts? (IDF)

Rare words = More important!

from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["I love pizza", "Pizza is great"]
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)
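
To peek at the result (these calls are standard in recent scikit-learn versions):

print(vectorizer.get_feature_names_out())
# ['great' 'is' 'love' 'pizza']

print(vectors.toarray().round(2))
# Each row is one text; words shared across texts, like "pizza", get lower weights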

📏 Sequence Padding

Making Everything the Same Size

Imagine you have boxes for storing toys, but all boxes must be the same size. Some sentences are short, some are long—we need to make them equal!

The Problem:

"Hi"           → 1 word
"Hello world"  → 2 words
"I love pizza" → 3 words

Neural networks usually need fixed-size inputs to process text in batches!

The Solution: Padding

Add zeros (or special tokens) to make all sequences the same length:

from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[1, 2], [3, 4, 5], [6]]

# Pad to length 4
padded = pad_sequences(sequences, maxlen=4)

# Result:
# [[0, 0, 1, 2],
#  [0, 3, 4, 5],
#  [0, 0, 0, 6]]

Two Types of Padding:

| Type         | Example      | Use When                       |
|--------------|--------------|--------------------------------|
| Pre-padding  | [0, 0, 1, 2] | Default, works great!          |
| Post-padding | [1, 2, 0, 0] | Sometimes for specific models  |

# Post-padding
pad_sequences(sequences, maxlen=4, padding='post')

Truncating Long Sequences: What if a sentence is TOO long?

# If maxlen=3 and sequence is [1,2,3,4,5]
# Pre-truncate: [3, 4, 5]
# Post-truncate: [1, 2, 3]
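
The same pad_sequences call handles this via the truncating argument (a minimal sketch):

padded = pad_sequences([[1, 2, 3, 4, 5]], maxlen=3, truncating='pre')
# [[3, 4, 5]]

padded = pad_sequences([[1, 2, 3, 4, 5]], maxlen=3, truncating='post')
# [[1, 2, 3]]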

✂️ Subword Tokenization

The Smart Way to Split Words

Here’s a clever trick: What if instead of whole words, we split them into smaller pieces?

Why Subwords?

  • Handle words we’ve never seen before!
  • “unhappiness” → “un” + “happi” + “ness”
  • Smaller vocabulary needed

The Magic of BPE (Byte Pair Encoding)

graph TD
  A["lowest"] --> B["low + est"]
  C["lower"] --> D["low + er"]
  E["low"] --> F["low"]
  style A fill:#ffeaa7
  style C fill:#ffeaa7
  style E fill:#ffeaa7

How it Works:

  1. Start with all characters
  2. Find most common pairs
  3. Merge them into new tokens
  4. Repeat!

# Example breakdowns:
# "unhappiness" → ["un", "happi", "ness"]
# "tensorflow"  → ["tensor", "flow"]
# "playing"     → ["play", "ing"]

Popular Subword Methods

| Method        | Used By | Key Feature          |
|---------------|---------|----------------------|
| BPE           | GPT-2   | Merge frequent pairs |
| WordPiece     | BERT    | Probability-based    |
| SentencePiece | T5      | Language-agnostic    |

import tensorflow as tf

# Using TensorFlow's TextVectorization
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000,
    output_mode='int'
)
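
The layer has to see some text first via adapt(); after that it maps new sentences to integer IDs (a small sketch):

# Learn a vocabulary from sample texts
vectorizer.adapt(["I love pizza", "Pizza is great"])

# Map new text to integer IDs (0 = padding, 1 = unknown words)
print(vectorizer(["pizza is amazing"]))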

🎯 Word Embeddings

Giving Words Their Personality

This is the coolest part! Instead of giving words simple numbers, we give them a whole list of numbers that capture their meaning.

Simple ID vs Embedding:

Simple: "king" = 42
Embedding: "king" = [0.2, 0.8, -0.3, 0.5, ...]

The Magic of Embeddings

Words with similar meanings have similar numbers!

graph LR
  A["King"] --- B["Queen"]
  A --- C["Prince"]
  D["Dog"] --- E["Cat"]
  D --- F["Puppy"]
  style A fill:#ff6b6b
  style B fill:#ff6b6b
  style D fill:#4ecdc4
  style E fill:#4ecdc4

Famous Discovery:

King - Man + Woman = Queen

The math actually works with embeddings!
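
Here is a sketch of that arithmetic with made-up 3-number toy vectors, just to show the mechanics (real embeddings have hundreds of dimensions):

import numpy as np

# Toy, made-up vectors for illustration only
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

result = vectors["king"] - vectors["man"] + vectors["woman"]

def cosine(a, b):
    # Cosine similarity: how close two vectors point in the same direction
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The word closest to the result should be "queen"
print(max(vectors, key=lambda w: cosine(result, vectors[w])))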

Using Embeddings in TensorFlow

import tensorflow as tf

# Create embedding layer
embedding = tf.keras.layers.Embedding(
    input_dim=10000,  # vocabulary size
    output_dim=128    # embedding size
)

# Input: word IDs
# Output: dense vectors
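
Feeding the layer a batch of word IDs gives back one vector per word (a quick sketch):

# A batch of 1 sentence containing 3 word IDs
ids = tf.constant([[4, 20, 7]])

vectors = embedding(ids)
print(vectors.shape)  # (1, 3, 128): one 128-number vector per word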

Pre-trained Embeddings

Why train from scratch? Use embeddings others have trained!

| Name     | Trained On   | Dimensions |
|----------|--------------|------------|
| Word2Vec | Google News  | 300        |
| GloVe    | Wikipedia    | 50-300     |
| FastText | Common Crawl | 300        |

# Load pre-trained GloVe
import numpy as np

embedding_index = {}
with open('glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')  # the rest of the line is the vector
        embedding_index[word] = coefs
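
To actually use these vectors, you typically build a matrix (one row per word in your vocabulary) and hand it to the Embedding layer. A sketch, assuming a word_index dictionary mapping words to integer IDs (e.g. from a Keras Tokenizer):

import numpy as np
import tensorflow as tf

embedding_dim = 100
# word_index: {word: integer ID}, assumed to come from your tokenizer
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))

for word, i in word_index.items():
    vector = embedding_index.get(word)
    if vector is not None:          # words missing from GloVe stay all-zero
        embedding_matrix[i] = vector

embedding = tf.keras.layers.Embedding(
    input_dim=len(word_index) + 1,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False                 # keep the pre-trained vectors frozen
)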

🏷️ Text Classification Tasks

Teaching Computers to Sort Text

Now for the finale! We’ve prepared our text—time to teach the computer to understand and categorize it!

What is Text Classification? Putting text into categories, like sorting mail into folders.

Common Classification Tasks

graph TD
  A["Text Classification"] --> B["Sentiment"]
  A --> C["Spam Detection"]
  A --> D["Topic Labeling"]
  A --> E["Intent Detection"]
  style A fill:#667eea

1. Sentiment Analysis Is this review positive or negative?

"Great movie!" → Positive ✅
"Terrible food" → Negative ❌

2. Spam Detection

"You won $1000000!" → Spam 🚫
"Meeting tomorrow" → Not Spam ✉️

3. Topic Classification

"Stock prices rose" → Finance 📈
"Team wins championship" → Sports ⚽

Building a Text Classifier

import tensorflow as tf

model = tf.keras.Sequential([
    # Convert words to embeddings
    tf.keras.layers.Embedding(
        vocab_size, 64),

    # Process the sequence
    tf.keras.layers.GlobalAveragePooling1D(),

    # Hidden layer
    tf.keras.layers.Dense(64, activation='relu'),

    # Output: 2 classes (pos/neg)
    tf.keras.layers.Dense(2, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
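
Training then looks like any other Keras model; padded_train and train_labels below are hypothetical arrays of padded word-ID sequences and 0/1 labels:

# padded_train: padded word-ID sequences, train_labels: 0 (negative) or 1 (positive)
model.fit(
    padded_train,
    train_labels,
    epochs=10,
    validation_split=0.2
)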

The Complete Pipeline

graph TD
  A["Raw Text"] --> B["Preprocess"]
  B --> C["Tokenize"]
  C --> D["Pad Sequences"]
  D --> E["Embed Words"]
  E --> F["Neural Network"]
  F --> G["Prediction"]
  style A fill:#ff6b6b
  style G fill:#4ecdc4

Real Example:

# 1. Raw text
review = "This movie was absolutely amazing!"

# 2. Preprocess
clean = preprocess(review)  # "movie absolutely amazing"

# 3. Tokenize
tokens = tokenizer.texts_to_sequences([clean])
# [[45, 892, 234]]

# 4. Pad
padded = pad_sequences(tokens, maxlen=100)
# [[0, 0, ..., 45, 892, 234]]

# 5. Predict
prediction = model.predict(padded)
# Output: "Positive" with 95% confidence!
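
The preprocess and tokenizer used above are stand-ins; here is a minimal sketch of what they could look like, built from the pieces covered earlier (train_texts is a hypothetical list of your training sentences):

import re
from tensorflow.keras.preprocessing.text import Tokenizer

stop_words = {"this", "was", "the", "a", "is"}   # tiny example list

def preprocess(text):
    # lowercase, strip punctuation, drop stop words
    text = re.sub(r'[^\w\s]', '', text.lower())
    return ' '.join(w for w in text.split() if w not in stop_words)

# The tokenizer must be fit on your training texts before texts_to_sequences works
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(train_texts)   # train_texts: your training sentences (assumed)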

🎉 You Did It!

You’ve just learned how to:

  • ✅ Understand what text data is
  • ✅ Clean and preprocess text
  • ✅ Convert words to numbers
  • ✅ Handle different length texts
  • ✅ Use smart subword tokenization
  • ✅ Give words meaningful embeddings
  • ✅ Build text classifiers

The Journey: Raw messy text → Clean text → Numbers → Same-size sequences → Meaningful embeddings → Smart predictions!

Now you know how computers learn to read! 🚀📚


Remember: Every time your phone autocompletes your text or Netflix recommends a movie, these techniques are working behind the scenes!
