NLP Text Preprocessing: Teaching Computers to Read Like Humans
The Big Picture: A Kitchen Prep Analogy 🍳
Imagine you’re a chef preparing ingredients for a delicious meal. Before cooking, you:
- Wash the vegetables (remove dirt)
- Chop them into pieces
- Remove the parts you don’t need (stems, seeds)
- Organize everything neatly
Text preprocessing works exactly the same way!
Before a computer can understand text, we must clean, chop, and organize words. This is called Text Preprocessing — the essential first step in Natural Language Processing (NLP).
1. Text Cleaning Steps 🧹
What is it?
Text cleaning is like washing your vegetables. Raw text from the internet is messy — it has weird symbols, extra spaces, and things computers don’t need.
Why do we need it?
Computers get confused by:
- Hello!!! vs Hello
- APPLE vs apple
- café vs cafe
The Cleaning Checklist:
| Step | Before | After |
|---|---|---|
| Lowercase | HELLO World | hello world |
| Remove punctuation | Hi! How are you? | Hi How are you |
| Remove numbers | I have 5 cats | I have cats |
| Remove extra spaces | too   many    spaces | too many spaces |
| Remove special chars | email@test.com | emailtestcom |
Simple Example:
Original: "OMG!!! I LOVE pizza 🍕 sooo much!!!"
Cleaned: "omg i love pizza so much"
Think of it like this: A messy room vs a clean room. Which one can you find your toys in faster?
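Want to try it yourself? Here is a minimal cleaning sketch in plain Python using only the standard library (the exact rules, like collapsing repeated letters, are just assumptions for this toy example):

```python
import re
import unicodedata

def clean_text(text):
    """A toy cleaning pass: lowercase, strip accents and emoji, drop punctuation and numbers."""
    text = text.lower()                                    # HELLO World -> hello world
    text = unicodedata.normalize("NFKD", text)             # café -> cafe (split accents off)
    text = text.encode("ascii", "ignore").decode("ascii")  # drop emoji and leftover accent marks
    text = re.sub(r"(.)\1{2,}", r"\1", text)               # sooo -> so (collapse 3+ repeats)
    text = re.sub(r"[^a-z\s]", " ", text)                  # remove punctuation and digits
    return re.sub(r"\s+", " ", text).strip()               # collapse extra spaces

print(clean_text("OMG!!! I LOVE pizza 🍕 sooo much!!!"))   # omg i love pizza so much
```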
2. Tokenization ✂️
What is it?
Tokenization means cutting text into smaller pieces called tokens. Just like cutting a pizza into slices!
Types of Tokenization:
Word Tokenization:
"I love cats" → ["I", "love", "cats"]
Sentence Tokenization:
"Hello. How are you?" → ["Hello.", "How are you?"]
Character Tokenization:
"cat" → ["c", "a", "t"]
Why do we need it?
Computers can’t read sentences. They need individual pieces to understand text — like reading one word at a time.
Real-World Example:
When you type in Google Search:
Your search: "best pizza near me"
Tokens: ["best", "pizza", "near", "me"]
Google looks for pages containing each token!
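Here is a small sketch using the NLTK library (one option among many; the exact resource to download can vary between NLTK versions):

```python
import nltk
nltk.download("punkt", quiet=True)  # tokenizer models (newer versions may also need "punkt_tab")
from nltk.tokenize import word_tokenize, sent_tokenize

print(word_tokenize("best pizza near me"))   # ['best', 'pizza', 'near', 'me']
print(sent_tokenize("Hello. How are you?"))  # ['Hello.', 'How are you?']
print(list("cat"))                           # ['c', 'a', 't']  (character tokens)
```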
3. Stemming and Lemmatization 🌱
The Problem:
These words mean the same thing:
- running, runs, ran → all about run
- better, best → all about good
How do we teach computers this?
Stemming (The Quick Chop)
Stemming cuts off word endings roughly. It’s fast but sometimes messy.
running → runn
happiness → happi
cats → cat
Like cutting vegetables quickly — not perfect, but done!
Lemmatization (The Careful Chop)
Lemmatization uses a dictionary to find the true root word.
running → run (correct!)
better → good (smart!)
was → be (knows grammar!)
Like a professional chef — takes more time, but perfect results.
Quick Comparison:
| Word | Stemming | Lemmatization |
|---|---|---|
| running | runn | run |
| studies | studi | study |
| better | better | good |
| was | wa | be |
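You can try both with NLTK's PorterStemmer and WordNetLemmatizer (assumed here; note the Porter stemmer's output can differ slightly from the rough cuts shown above):

```python
import nltk
nltk.download("wordnet", quiet=True)  # dictionary for the lemmatizer (some setups also need "omw-1.4")
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                   # studi  (quick chop)
print(stemmer.stem("happiness"))                 # happi
print(lemmatizer.lemmatize("running", pos="v"))  # run    (careful chop)
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("was", pos="v"))      # be
```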
4. Stop Words Removal 🚫
What are Stop Words?
Stop words are common words that don’t add meaning:
the, is, at, which, on, a, an
Why Remove Them?
Imagine searching for “the best pizza in the world”:
- With stop words: the, best, pizza, in, the, world
- Without stop words: best, pizza, world
The important words stand out!
Example:
Before: "The cat is sitting on the mat"
After: "cat sitting mat"
Think of it like: Reading a story vs reading only the important words. You still understand!
Common Stop Words:
a, an, the, is, are, was, were, in, on, at, to, for, of, and, or, but, if, then
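A quick sketch with NLTK's built-in English stop word list (an assumption; a plain Python set of your own words works just as well):

```python
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = "the cat is sitting on the mat".split()
print([w for w in tokens if w not in stop_words])  # ['cat', 'sitting', 'mat']
```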
5. N-Grams and Bag of Words 🎒
Bag of Words (BoW)
Imagine throwing all words into a bag and counting them. Order doesn’t matter!
Sentence: "I love pizza and I love pasta"
Bag of Words:
- I: 2
- love: 2
- pizza: 1
- and: 1
- pasta: 1
Like counting candies in a jar — you know what you have, but not the order.
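Here is a minimal Bag of Words sketch with scikit-learn's CountVectorizer (an assumption; the custom token_pattern just keeps one-letter words like "I", which the default pattern drops):

```python
from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer(token_pattern=r"(?u)\b\w+\b")  # keep 1-letter tokens like "i"
counts = bow.fit_transform(["I love pizza and I love pasta"])

print(dict(zip(bow.get_feature_names_out(), counts.toarray()[0].tolist())))
# {'and': 1, 'i': 2, 'love': 2, 'pasta': 1, 'pizza': 1}
```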
N-Grams (Word Groups)
N-grams capture groups of consecutive words:
Unigrams (1 word):
"I love cats" → ["I", "love", "cats"]
Bigrams (2 words):
"I love cats" → ["I love", "love cats"]
Trigrams (3 words):
"I love cats" → ["I love cats"]
Why Use N-Grams?
Single words lose context:
- “not good” → Bag of Words sees not and good separately (loses meaning!)
- Bigram captures “not good” together (keeps meaning!)
graph TD A["Text: I love cats"] --> B["Unigrams"] A --> C["Bigrams"] A --> D["Trigrams"] B --> E["I, love, cats"] C --> F["I love, love cats"] D --> G["I love cats"]
6. TF-IDF Representation 📊
The Problem with Counting Words
If “the” appears 100 times and “pizza” appears 2 times, is “the” more important?
No! Common words aren’t special.
TF-IDF to the Rescue!
TF = Term Frequency (how often a word appears in one document)
IDF = Inverse Document Frequency (how rare a word is across all documents)
TF-IDF = TF × IDF
Simple Example:
Document 1: “I love pizza”
Document 2: “I love pasta”
Document 3: “I hate homework”
| Word | TF (Doc 1) | IDF | TF-IDF |
|---|---|---|---|
| I | High | Low (common) | Low |
| love | High | Medium | Medium |
| pizza | High | High (unique) | HIGH |
Pizza gets the highest score because it’s unique to Document 1!
Real-World Use:
Google uses TF-IDF to find relevant pages. If you search “pizza recipes”, pages with unique pizza-related words rank higher!
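Here is a sketch with scikit-learn's TfidfVectorizer on the three toy documents above (the exact numbers depend on scikit-learn's smoothing and normalization, so treat them as approximate):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love pizza", "I love pasta", "I hate homework"]

vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")  # keep 1-letter words like "i"
scores = vec.fit_transform(docs)

doc1 = dict(zip(vec.get_feature_names_out(), scores.toarray()[0]))
print({word: round(float(score), 2) for word, score in doc1.items() if score > 0})
# roughly {'i': 0.43, 'love': 0.55, 'pizza': 0.72} -> "pizza" scores highest
```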
7. Subword Tokenization 🧬
The Problem with Regular Tokenization
What happens with:
- New words: COVID-19, unfriendly
- Rare words: antidisestablishmentarianism
- Typos: pizzza
Regular tokenization says: “I don’t know this word!” 😕
The Solution: Break Words into Pieces!
Subword tokenization splits unknown words into known parts:
"unfriendly" → ["un", "friend", "ly"]
"playing" → ["play", "ing"]
"unhappiness" → ["un", "happi", "ness"]
Popular Methods:
BPE (Byte Pair Encoding):
- Starts with characters
- Merges common pairs: l + o = lo, then lo + w = low
WordPiece:
- Used by BERT and Google
- playing → play + ##ing
- The ## means “attached to the previous piece”
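If the Hugging Face transformers library is installed (an assumption), you can peek at BERT's WordPiece splits directly. The exact pieces depend on the model's learned vocabulary, so common words may stay whole:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # downloads the vocab on first use
for word in ["playing", "unhappiness", "pizzza"]:
    print(word, "->", tokenizer.tokenize(word))  # continuation pieces are marked with "##"
```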
Why It’s Amazing:
| Method | Unknown Word | Result |
|---|---|---|
| Regular | unhappily | ??? |
| Subword | unhappily | un + happy + ly |
Now the computer understands new words by recognizing their parts!
graph TD A["unhappiness"] --> B["un"] A --> C["happi"] A --> D["ness"] B --> E["means: not"] C --> F["means: happy"] D --> G["means: state of"]
The Complete Pipeline 🚀
Here’s how all steps work together:
graph TD A["Raw Text"] --> B["Text Cleaning"] B --> C["Tokenization"] C --> D["Stop Words Removal"] D --> E["Stemming/Lemmatization"] E --> F["Create Features"] F --> G["Bag of Words"] F --> H["TF-IDF"] F --> I["Subword Tokens"]
Full Example:
Input: "The DOGS are running!!! 🐕🐕🐕"
- Clean: "the dogs are running"
- Tokenize: ["the", "dogs", "are", "running"]
- Remove Stop Words: ["dogs", "running"]
- Lemmatize: ["dog", "run"]
- Create TF-IDF: {dog: 0.7, run: 0.7}
Now the computer understands!
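Putting it all together, here is one way to sketch the whole pipeline with NLTK and scikit-learn (assumed libraries; treating every token as a verb during lemmatization is a simplification, so the exact output may vary):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

for pkg in ("stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())                      # 1. clean
    tokens = text.split()                                              # 2. tokenize
    tokens = [t for t in tokens if t not in stop_words]                # 3. remove stop words
    return " ".join(lemmatizer.lemmatize(t, pos="v") for t in tokens)  # 4. lemmatize

docs = ["The DOGS are running!!! 🐕🐕🐕", "The cats are sleeping."]
cleaned = [preprocess(d) for d in docs]
print(cleaned)                                       # e.g. ['dog run', 'cat sleep']

vec = TfidfVectorizer()                              # 5. create TF-IDF features
tfidf = vec.fit_transform(cleaned)
print(dict(zip(vec.get_feature_names_out(), tfidf.toarray()[0].round(2).tolist())))
# roughly {'cat': 0.0, 'dog': 0.71, 'run': 0.71, 'sleep': 0.0}
```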
Key Takeaways 🎯
- Text Cleaning — Wash your data (remove junk)
- Tokenization — Cut text into pieces
- Stemming/Lemmatization — Find root words
- Stop Words — Remove “filler” words
- N-Grams — Capture word groups
- TF-IDF — Score word importance
- Subword — Handle unknown words
Remember: Just like a chef preps ingredients before cooking, we prep text before AI can understand it!
You’ve Got This! 💪
Text preprocessing might seem like many steps, but each one is simple:
- Clean the mess
- Cut into pieces
- Remove extras
- Organize smartly
Now you understand how computers learn to read. That’s amazing! 🌟
