
NLP Text Preprocessing: Teaching Computers to Read Like Humans

The Big Picture: A Kitchen Prep Analogy 🍳

Imagine you’re a chef preparing ingredients for a delicious meal. Before cooking, you:

  • Wash the vegetables (remove dirt)
  • Chop them into pieces
  • Remove the parts you don’t need (stems, seeds)
  • Organize everything neatly

Text preprocessing works exactly the same way!

Before a computer can understand text, we must clean, chop, and organize words. This is called Text Preprocessing — the essential first step in Natural Language Processing (NLP).


1. Text Cleaning Steps 🧹

What is it?

Text cleaning is like washing your vegetables. Raw text from the internet is messy — it has weird symbols, extra spaces, and things computers don’t need.

Why do we need it?

Computers get confused by:

  • Hello!!! vs Hello
  • APPLE vs apple
  • café vs cafe

The Cleaning Checklist:

| Step | Before | After |
|---|---|---|
| Lowercase | HELLO World | hello world |
| Remove punctuation | Hi! How are you? | Hi How are you |
| Remove numbers | I have 5 cats | I have cats |
| Remove extra spaces | too   many   spaces | too many spaces |
| Remove special chars | email@test.com | emailtestcom |

Simple Example:

Original: "OMG!!! I LOVE pizza 🍕 sooo much!!!"
Cleaned:  "omg i love pizza so much"

Think of it like this: A messy room vs a clean room. Which one can you find your toys in faster?
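
Want to see it in code? Here is a minimal cleaning sketch using only Python's standard library. The exact rules (and the regular expressions below) are assumptions for illustration; real pipelines keep or skip steps depending on the task.

```python
# A minimal text-cleaning sketch (the specific rules are illustrative choices).
import re

def clean_text(text: str) -> str:
    text = text.lower()                        # lowercase everything
    text = re.sub(r"[^a-z\s]", " ", text)      # drop punctuation, digits, emoji
    return re.sub(r"\s+", " ", text).strip()   # collapse extra whitespace

print(clean_text("OMG!!! I LOVE pizza 🍕 sooo much!!!"))
# -> 'omg i love pizza sooo much'
# (turning "sooo" into "so" would need an extra normalization step)
```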


2. Tokenization ✂️

What is it?

Tokenization means cutting text into smaller pieces called tokens. Just like cutting a pizza into slices!

Types of Tokenization:

Word Tokenization:

"I love cats" → ["I", "love", "cats"]

Sentence Tokenization:

"Hello. How are you?" → ["Hello.", "How are you?"]

Character Tokenization:

"cat" → ["c", "a", "t"]

Why do we need it?

Computers can’t read sentences. They need individual pieces to understand text — like reading one word at a time.

Real-World Example:

When you type in Google Search:

Your search: "best pizza near me"
Tokens: ["best", "pizza", "near", "me"]

Google looks for pages containing each token!
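
In code, tokenization is usually one library call away. The sketch below uses NLTK as an assumed tool choice; any tokenizer that splits words and sentences works the same way.

```python
# Word, sentence, and character tokenization with NLTK (an assumed library
# choice). Depending on your NLTK version you may also need the "punkt_tab"
# resource.
import nltk
nltk.download("punkt", quiet=True)           # tokenizer models, needed once
from nltk.tokenize import word_tokenize, sent_tokenize

print(word_tokenize("I love cats"))          # ['I', 'love', 'cats']
print(sent_tokenize("Hello. How are you?"))  # ['Hello.', 'How are you?']
print(list("cat"))                           # ['c', 'a', 't']  (characters)
```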


3. Stemming and Lemmatization 🌱

The Problem:

These words mean the same thing:

  • running, runs, ran → all about run
  • better, best → all about good

How do we teach computers this?

Stemming (The Quick Chop)

Stemming cuts off word endings roughly. It’s fast but sometimes messy.

running → runn
happiness → happi
cats → cat

Like cutting vegetables quickly — not perfect, but done!

Lemmatization (The Careful Chop)

Lemmatization uses a dictionary to find the true root word.

running → run (correct!)
better → good (smart!)
was → be (knows grammar!)

Like a professional chef — takes more time, but perfect results.

Quick Comparison:

| Word | Stemming | Lemmatization |
|---|---|---|
| running | runn | run |
| studies | studi | study |
| better | better | good |
| was | wa | be |
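
You can compare both tools side by side. The sketch below uses NLTK's PorterStemmer and WordNet lemmatizer as assumed choices; note that a real stemmer can behave a little better than the simplified table suggests (Porter, for example, does turn "running" into "run").

```python
# Stemming vs lemmatization with NLTK (an assumed tool choice; other stemmers
# such as Lancaster or Snowball give slightly different results).
import nltk
nltk.download("wordnet", quiet=True)         # dictionary used by the lemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "studies", "better", "was"]:
    print(word, "->", stemmer.stem(word))    # run, studi, better, wa

# The lemmatizer needs a part-of-speech hint ('v' = verb, 'a' = adjective):
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("was", pos="v"))      # be
```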

4. Stop Words Removal 🚫

What are Stop Words?

Stop words are common words that don’t add meaning:

  • the, is, at, which, on, a, an

Why Remove Them?

Imagine searching for “the best pizza in the world”:

  • With stop words: the, best, pizza, in, the, world
  • Without stop words: best, pizza, world

The important words stand out!

Example:

Before: "The cat is sitting on the mat"
After:  "cat sitting mat"

Think of it like: Reading a story vs reading only the important words. You still understand!

Common Stop Words:

a, an, the, is, are, was, were, in, on, at, to, for, of, and, or, but, if, then
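
Filtering them out is just a membership check. The sketch below uses NLTK's built-in English stop-word list as an assumed choice; different libraries ship slightly different lists.

```python
# Stop-word removal with NLTK's English list (an assumed choice).
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["the", "cat", "is", "sitting", "on", "the", "mat"]
print([t for t in tokens if t not in stop_words])   # ['cat', 'sitting', 'mat']
```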


5. N-Grams and Bag of Words 🎒

Bag of Words (BoW)

Imagine throwing all words into a bag and counting them. Order doesn’t matter!

Sentence: "I love pizza and I love pasta"

Bag of Words:
- I: 2
- love: 2
- pizza: 1
- and: 1
- pasta: 1

Like counting candies in a jar — you know what you have, but not the order.

N-Grams (Word Groups)

N-grams capture groups of consecutive words:

Unigrams (1 word):

"I love cats" → ["I", "love", "cats"]

Bigrams (2 words):

"I love cats" → ["I love", "love cats"]

Trigrams (3 words):

"I love cats" → ["I love cats"]

Why Use N-Grams?

Single words lose context:

  • “not good” → Bag of Words sees not and good separately (loses meaning!)
  • Bigram captures “not good” together (keeps meaning!)
graph TD A["Text: I love cats"] --> B["Unigrams"] A --> C["Bigrams"] A --> D["Trigrams"] B --> E["I, love, cats"] C --> F["I love, love cats"] D --> G["I love cats"]

6. TF-IDF Representation 📊

The Problem with Counting Words

If “the” appears 100 times and “pizza” appears 2 times, is “the” more important?

No! Common words aren’t special.

TF-IDF to the Rescue!

TF = Term Frequency (how often a word appears in one document)

IDF = Inverse Document Frequency (how rare a word is across all documents)

TF-IDF = TF × IDF

Simple Example:

Document 1: “I love pizza”
Document 2: “I love pasta”
Document 3: “I hate homework”

| Word | TF (Doc 1) | IDF | TF-IDF |
|---|---|---|---|
| I | High | Low (common) | Low |
| love | High | Medium | Medium |
| pizza | High | High (unique) | HIGH |

Pizza gets the highest score because it’s unique to Document 1!

Real-World Use:

Google uses TF-IDF to find relevant pages. If you search “pizza recipes”, pages with unique pizza-related words rank higher!
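
The scoring itself is one call with scikit-learn's TfidfVectorizer (an assumed library choice). Exact numbers depend on smoothing and normalization settings, but the ranking matches the table above.

```python
# TF-IDF for the three toy documents, using scikit-learn (an assumed choice).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love pizza", "I love pasta", "I hate homework"]
vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")   # keep the one-letter "i"
weights = vec.fit_transform(docs).toarray()

vocab = vec.get_feature_names_out().tolist()
print(dict(zip(vocab, weights[0].round(2).tolist())))   # scores for Document 1
# 'pizza' scores highest for Document 1; 'i' scores lowest of the words that
# appear there, because it shows up in every document.
```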


7. Subword Tokenization 🧬

The Problem with Regular Tokenization

What happens with:

  • New words: COVID-19, unfriendly
  • Rare words: antidisestablishmentarianism
  • Typos: pizzza

Regular tokenization says: “I don’t know this word!” 😕

The Solution: Break Words into Pieces!

Subword tokenization splits unknown words into known parts:

"unfriendly" → ["un", "friend", "ly"]
"playing" → ["play", "ing"]
"unhappiness" → ["un", "happi", "ness"]

Popular Methods:

BPE (Byte Pair Encoding):

  • Starts with characters
  • Merges common pairs
  • l + o = lo, lo + w = low

WordPiece:

  • Used by BERT and Google
  • playing → play + ##ing
  • The ## means “attached to previous piece”

Why It’s Amazing:

| Method | Unknown Word | Result |
|---|---|---|
| Regular | unhappily | ??? |
| Subword | unhappily | un + happy + ly |

Now the computer understands new words by recognizing their parts!

graph TD A["unhappiness"] --> B["un"] A --> C["happi"] A --> D["ness"] B --> E["means: not"] C --> F["means: happy"] D --> G["means: state of"]

The Complete Pipeline 🚀

Here’s how all steps work together:

graph TD A["Raw Text"] --> B["Text Cleaning"] B --> C["Tokenization"] C --> D["Stop Words Removal"] D --> E["Stemming/Lemmatization"] E --> F["Create Features"] F --> G["Bag of Words"] F --> H["TF-IDF"] F --> I["Subword Tokens"]

Full Example:

Input: "The DOGS are running!!! 🐕🐕🐕"

  1. Clean: "the dogs are running"
  2. Tokenize: ["the", "dogs", "are", "running"]
  3. Remove Stop Words: ["dogs", "running"]
  4. Lemmatize: ["dog", "run"]
  5. Create TF-IDF: {dog: 0.7, run: 0.7}

Now the computer understands!
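
Putting it all together, a sketch of the full example might look like the code below. NLTK and scikit-learn are assumed tool choices, and lemmatizing every token as a verb (pos="v") is a simplification; real pipelines tag part of speech first.

```python
# End-to-end sketch of the pipeline above (assumed tools: NLTK + scikit-learn).
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

for pkg in ("punkt", "stopwords", "wordnet"):    # one-time NLTK resources
    nltk.download(pkg, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())         # 1. clean
    tokens = nltk.word_tokenize(text)                      # 2. tokenize
    tokens = [t for t in tokens if t not in stop_words]    # 3. remove stop words
    return [lemmatizer.lemmatize(t, pos="v") for t in tokens]  # 4. lemmatize

tokens = preprocess("The DOGS are running!!! 🐕🐕🐕")
print(tokens)                                              # ['dog', 'run']

vec = TfidfVectorizer()                                    # 5. TF-IDF features
weights = vec.fit_transform([" ".join(tokens)])
print(dict(zip(vec.get_feature_names_out().tolist(),
               weights.toarray()[0].round(2).tolist())))   # {'dog': 0.71, 'run': 0.71}
```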


Key Takeaways 🎯

  1. Text Cleaning — Wash your data (remove junk)
  2. Tokenization — Cut text into pieces
  3. Stemming/Lemmatization — Find root words
  4. Stop Words — Remove “filler” words
  5. N-Grams — Capture word groups
  6. TF-IDF — Score word importance
  7. Subword — Handle unknown words

Remember: Just like a chef preps ingredients before cooking, we prep text before AI can understand it!


You’ve Got This! 💪

Text preprocessing might seem like many steps, but each one is simple:

  • Clean the mess
  • Cut into pieces
  • Remove extras
  • Organize smartly

Now you understand how computers learn to read. That’s amazing! 🌟
