Text Processing


🎭 The Story of Teaching Computers to Read

Imagine you have a magical robot friend who wants to read your favorite storybooks. But there’s a problem—robots don’t understand words the way we do! Let’s discover how we teach them to read, one step at a time.


🌟 Our Journey Map

graph TD
  A["📚 Raw Text"] --> B["🧹 Clean It Up"]
  B --> C["🔢 Turn to Numbers"]
  C --> D["📏 Make Same Length"]
  D --> E["✂️ Smart Word Pieces"]
  E --> F["🎯 Word Meanings"]
  F --> G["🏷️ Classify Text"]
  style A fill:#ff6b6b
  style G fill:#4ecdc4

📚 Text Data Fundamentals

What IS Text Data?

Think of text like LEGO bricks. When you read “I love pizza”, you see three words. But a computer? It sees a long string of letters, spaces, and symbols—like a messy pile of bricks!

Text data is simply:

  • Words, sentences, paragraphs
  • Emails, tweets, reviews, books
  • Anything humans write!

The Big Problem: Computers only understand numbers (0s and 1s). So we need to transform our words into numbers they can work with.

# What we see
text = "Hello World!"

# What computer sees initially
# Just characters: H-e-l-l-o- -W-o-r-l-d-!

Real Examples:

| Text Type | Example                 |
|-----------|-------------------------|
| Tweet     | “Love this movie! 🎬”   |
| Review    | “Best pizza in town.”   |
| Email     | “Meeting at 3pm today”  |

🧹 Text Preprocessing

Cleaning Up the Mess

Imagine your room is super messy with toys everywhere. Before you can play properly, you need to clean up! Text preprocessing is cleaning up messy text.

Why Clean?

  • “HELLO” and “hello” mean the same thing
  • “running” and “run” are related
  • Extra spaces and weird symbols confuse computers

The Cleaning Checklist

1. Lowercase Everything

text = "I LOVE Dogs!"
clean = text.lower()
# Result: "i love dogs!"

2. Remove Punctuation

import re
text = "Hello!!! How are you???"
clean = re.sub(r'[^\w\s]', '', text)
# Result: "Hello How are you"

3. Remove Extra Spaces

text = "Too   many    spaces"
clean = ' '.join(text.split())
# Result: "Too many spaces"

4. Remove Stop Words Stop words are common words like “the”, “is”, “at” that don’t add much meaning.

# Before: "The cat is on the mat"
# After:  "cat mat"
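
A minimal sketch of stop-word removal, using a small hand-picked stop-word list (libraries like NLTK ship much longer ones):

# A tiny, hand-picked stop-word list (real libraries provide longer lists)
stop_words = {"the", "is", "on", "at", "a", "an"}

text = "The cat is on the mat"
filtered = [word for word in text.lower().split() if word not in stop_words]

print(' '.join(filtered))
# Result: "cat mat"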

5. Stemming & Lemmatization Reducing words to their root form.

| Original | Stemmed | Lemmatized |
|----------|---------|------------|
| running  | run     | run        |
| better   | better  | good       |
| cats     | cat     | cat        |
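
A minimal sketch with NLTK (assuming nltk and its WordNet data are installed):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # needs the 'wordnet' corpus downloaded

print(stemmer.stem("running"))              # "run"
print(lemmatizer.lemmatize("better", "a"))  # "good" (treated as an adjective)
print(lemmatizer.lemmatize("cats"))         # "cat"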

🔢 Text Encoding

Turning Words into Numbers

Here’s where the magic happens! We need to give each word a number ID, like giving every student in class a unique number.

Method 1: Simple Counting (Bag of Words)

Imagine dumping all words into a bag and counting them:

# Sentence: "I love cats. I love dogs."
# Word counts:
# I: 2, love: 2, cats: 1, dogs: 1

graph TD
  A["I love cats"] --> B["Word Bag"]
  C["I love dogs"] --> B
  B --> D["I:2 love:2 cats:1 dogs:1"]
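
A quick sketch with scikit-learn's CountVectorizer (assuming scikit-learn is installed; the token pattern is widened so one-letter words like "I" aren't dropped):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love cats", "I love dogs"]

# Keep one-letter tokens like "I" (the default pattern drops them)
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # ['cats' 'dogs' 'i' 'love']
print(counts.toarray())
# [[1 0 1 1]
#  [0 1 1 1]]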

Method 2: One-Hot Encoding

Give each word its own column with 1 or 0:

Vocabulary: [cat, dog, bird]

"cat"  → [1, 0, 0]
"dog"  → [0, 1, 0]
"bird" → [0, 0, 1]

Method 3: TF-IDF

Term Frequency - Inverse Document Frequency

This is smarter! It asks:

  • How often does this word appear? (TF)
  • Is it rare or common across all texts? (IDF)

Rare words = More important!

from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["I love pizza", "Pizza is great"]
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)
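
To peek at the result (these calls are standard in recent scikit-learn versions):

print(vectorizer.get_feature_names_out())
# ['great' 'is' 'love' 'pizza']

print(vectors.toarray().round(2))
# Each row is one text; words shared across texts, like "pizza", get lower weights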

📏 Sequence Padding

Making Everything the Same Size

Imagine you have boxes for storing toys, but all boxes must be the same size. Some sentences are short, some are long—we need to make them equal!

The Problem:

"Hi"           → 1 word
"Hello world"  → 2 words
"I love pizza" → 3 words

Neural networks usually need fixed-size inputs to process text in batches!

The Solution: Padding

Add zeros (or special tokens) to make all sequences the same length:

from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[1, 2], [3, 4, 5], [6]]

# Pad to length 4
padded = pad_sequences(sequences, maxlen=4)

# Result:
# [[0, 0, 1, 2],
#  [0, 3, 4, 5],
#  [0, 0, 0, 6]]

Two Types of Padding:

| Type         | Example      | Use When                       |
|--------------|--------------|--------------------------------|
| Pre-padding  | [0, 0, 1, 2] | Default, works great!          |
| Post-padding | [1, 2, 0, 0] | Sometimes for specific models  |

# Post-padding
pad_sequences(sequences, maxlen=4, padding='post')

Truncating Long Sequences: What if a sentence is TOO long?

# If maxlen=3 and sequence is [1,2,3,4,5]
# Pre-truncate: [3, 4, 5]
# Post-truncate: [1, 2, 3]
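
The same pad_sequences call handles this via the truncating argument (a minimal sketch):

padded = pad_sequences([[1, 2, 3, 4, 5]], maxlen=3, truncating='pre')
# [[3, 4, 5]]

padded = pad_sequences([[1, 2, 3, 4, 5]], maxlen=3, truncating='post')
# [[1, 2, 3]]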

✂️ Subword Tokenization

The Smart Way to Split Words

Here’s a clever trick: What if instead of whole words, we split them into smaller pieces?

Why Subwords?

  • Handle words we’ve never seen before!
  • “unhappiness” → “un” + “happi” + “ness”
  • Smaller vocabulary needed

The Magic of BPE (Byte Pair Encoding)

graph TD
  A["lowest"] --> B["low + est"]
  C["lower"] --> D["low + er"]
  E["low"] --> F["low"]
  style A fill:#ffeaa7
  style C fill:#ffeaa7
  style E fill:#ffeaa7

How it Works:

  1. Start with all characters
  2. Find most common pairs
  3. Merge them into new tokens
  4. Repeat!

# Example breakdowns:
# "unhappiness" → ["un", "happi", "ness"]
# "tensorflow"  → ["tensor", "flow"]
# "playing"     → ["play", "ing"]

Popular Subword Methods

| Method        | Used By | Key Feature          |
|---------------|---------|----------------------|
| BPE           | GPT-2   | Merge frequent pairs |
| WordPiece     | BERT    | Probability-based    |
| SentencePiece | T5      | Language-agnostic    |

import tensorflow as tf

# Using TensorFlow's TextVectorization
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000,
    output_mode='int'
)
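
The layer has to see some text first via adapt(); after that it maps new sentences to integer IDs (a small sketch):

# Learn a vocabulary from sample texts
vectorizer.adapt(["I love pizza", "Pizza is great"])

# Map new text to integer IDs (0 = padding, 1 = unknown words)
print(vectorizer(["pizza is amazing"]))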

🎯 Word Embeddings

Giving Words Their Personality

This is the coolest part! Instead of giving words simple numbers, we give them a whole list of numbers that capture their meaning.

Simple ID vs Embedding:

Simple: "king" = 42
Embedding: "king" = [0.2, 0.8, -0.3, 0.5, ...]

The Magic of Embeddings

Words with similar meanings have similar numbers!

graph LR
  A["King"] --- B["Queen"]
  A --- C["Prince"]
  D["Dog"] --- E["Cat"]
  D --- F["Puppy"]
  style A fill:#ff6b6b
  style B fill:#ff6b6b
  style D fill:#4ecdc4
  style E fill:#4ecdc4

Famous Discovery:

King - Man + Woman = Queen

The math actually works with embeddings!
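
Here is a sketch of that arithmetic with made-up 3-number toy vectors, just to show the mechanics (real embeddings have hundreds of dimensions):

import numpy as np

# Toy, made-up vectors for illustration only
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

result = vectors["king"] - vectors["man"] + vectors["woman"]

def cosine(a, b):
    # Cosine similarity: how close two vectors point in the same direction
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The word closest to the result should be "queen"
print(max(vectors, key=lambda w: cosine(result, vectors[w])))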

Using Embeddings in TensorFlow

import tensorflow as tf

# Create embedding layer
embedding = tf.keras.layers.Embedding(
    input_dim=10000,  # vocabulary size
    output_dim=128    # embedding size
)

# Input: word IDs
# Output: dense vectors
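
Feeding the layer a batch of word IDs gives back one vector per word (a quick sketch):

# A batch of 1 sentence containing 3 word IDs
ids = tf.constant([[4, 20, 7]])

vectors = embedding(ids)
print(vectors.shape)  # (1, 3, 128): one 128-number vector per word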

Pre-trained Embeddings

Why train from scratch? Use embeddings others have trained!

| Name     | Trained On   | Dimensions |
|----------|--------------|------------|
| Word2Vec | Google News  | 300        |
| GloVe    | Wikipedia    | 50-300     |
| FastText | Common Crawl | 300        |

# Load pre-trained GloVe
import numpy as np

embedding_index = {}
with open('glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')  # the rest of the line is the vector
        embedding_index[word] = coefs
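
To actually use these vectors, you typically build a matrix (one row per word in your vocabulary) and hand it to the Embedding layer. A sketch, assuming a word_index dictionary mapping words to integer IDs (e.g. from a Keras Tokenizer):

import numpy as np
import tensorflow as tf

embedding_dim = 100
# word_index: {word: integer ID}, assumed to come from your tokenizer
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))

for word, i in word_index.items():
    vector = embedding_index.get(word)
    if vector is not None:          # words missing from GloVe stay all-zero
        embedding_matrix[i] = vector

embedding = tf.keras.layers.Embedding(
    input_dim=len(word_index) + 1,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False                 # keep the pre-trained vectors frozen
)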

🏷️ Text Classification Tasks

Teaching Computers to Sort Text

Now for the finale! We’ve prepared our text—time to teach the computer to understand and categorize it!

What is Text Classification? Putting text into categories, like sorting mail into folders.

Common Classification Tasks

graph TD
  A["Text Classification"] --> B["Sentiment"]
  A --> C["Spam Detection"]
  A --> D["Topic Labeling"]
  A --> E["Intent Detection"]
  style A fill:#667eea

1. Sentiment Analysis Is this review positive or negative?

"Great movie!" → Positive ✅
"Terrible food" → Negative ❌

2. Spam Detection

"You won $1000000!" → Spam 🚫
"Meeting tomorrow" → Not Spam ✉️

3. Topic Classification

"Stock prices rose" → Finance 📈
"Team wins championship" → Sports ⚽

Building a Text Classifier

import tensorflow as tf

model = tf.keras.Sequential([
    # Convert words to embeddings
    tf.keras.layers.Embedding(
        vocab_size, 64),

    # Process the sequence
    tf.keras.layers.GlobalAveragePooling1D(),

    # Hidden layer
    tf.keras.layers.Dense(64, activation='relu'),

    # Output: 2 classes (pos/neg)
    tf.keras.layers.Dense(2, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
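
Training then looks like any other Keras model; padded_train and train_labels below are hypothetical arrays of padded word-ID sequences and 0/1 labels:

# padded_train: padded word-ID sequences, train_labels: 0 (negative) or 1 (positive)
model.fit(
    padded_train,
    train_labels,
    epochs=10,
    validation_split=0.2
)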

The Complete Pipeline

graph TD
  A["Raw Text"] --> B["Preprocess"]
  B --> C["Tokenize"]
  C --> D["Pad Sequences"]
  D --> E["Embed Words"]
  E --> F["Neural Network"]
  F --> G["Prediction"]
  style A fill:#ff6b6b
  style G fill:#4ecdc4

Real Example:

# 1. Raw text
review = "This movie was absolutely amazing!"

# 2. Preprocess
clean = preprocess(review)  # "movie absolutely amazing"

# 3. Tokenize
tokens = tokenizer.texts_to_sequences([clean])
# [[45, 892, 234]]

# 4. Pad
padded = pad_sequences(tokens, maxlen=100)
# [[0, 0, ..., 45, 892, 234]]

# 5. Predict
prediction = model.predict(padded)
# Output: "Positive" with 95% confidence!
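
The preprocess and tokenizer used above are stand-ins; here is a minimal sketch of what they could look like, built from the pieces covered earlier (train_texts is a hypothetical list of your training sentences):

import re
from tensorflow.keras.preprocessing.text import Tokenizer

stop_words = {"this", "was", "the", "a", "is"}   # tiny example list

def preprocess(text):
    # lowercase, strip punctuation, drop stop words
    text = re.sub(r'[^\w\s]', '', text.lower())
    return ' '.join(w for w in text.split() if w not in stop_words)

# The tokenizer must be fit on your training texts before texts_to_sequences works
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(train_texts)   # train_texts: your training sentences (assumed)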

🎉 You Did It!

You’ve just learned how to:

  • ✅ Understand what text data is
  • ✅ Clean and preprocess text
  • ✅ Convert words to numbers
  • ✅ Handle different length texts
  • ✅ Use smart subword tokenization
  • ✅ Give words meaningful embeddings
  • ✅ Build text classifiers

The Journey: Raw messy text → Clean text → Numbers → Same-size sequences → Meaningful embeddings → Smart predictions!

Now you know how computers learn to read! 🚀📚


Remember: Every time your phone autocompletes your text or Netflix recommends a movie, these techniques are working behind the scenes!
