🏷️ Pandas Categorical Data: Sorting Your Toy Box!
The Big Idea (In One Sentence!)
Imagine you have a magic label maker that helps you organize your toys into groups like “Cars,” “Dolls,” and “Blocks” — that’s exactly what Categorical data does in Pandas!
🧸 Our Story: The Messy Toy Room
Picture this: You walk into your room, and there are toys EVERYWHERE! Some are red cars, some are blue blocks, some are green dolls. Your mom says, “Please organize these!”
Instead of writing “red car” a hundred times, wouldn’t it be smarter to make labels and just stick them on?
That’s what Pandas Categorical does! It takes repeated text values (like “red,” “blue,” “green”) and turns them into smart, efficient labels.
📦 What is Categorical Data Type?
Think of it like this:
| Without Categories | With Categories |
|---|---|
| “Apple” written 1000 times | Label #1 = Apple |
| “Banana” written 1000 times | Label #2 = Banana |
| Uses LOTS of memory | Uses TINY memory! |
Why Does This Matter?
- 🚀 Faster: Computer doesn’t read “Apple” 1000 times
- 💾 Smaller: Takes less space in memory
- 🎯 Smarter: Can sort in special orders!
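Want to see the savings for yourself? Here is a small sketch that measures both versions with memory_usage(deep=True). The toy names and the repeat count are made up for illustration, and the exact byte counts depend on your Pandas version, so none are shown here.

```python
import pandas as pd

# Made-up toy box: 3 labels repeated 1000 times each
toys = pd.Series(['Car', 'Doll', 'Block'] * 1000)
toys_cat = toys.astype('category')

# deep=True asks Pandas to count the stored text too
text_bytes = toys.memory_usage(deep=True)
label_bytes = toys_cat.memory_usage(deep=True)

print(f"Plain text:  {text_bytes} bytes")
print(f"Categorical: {label_bytes} bytes")
```

The savings are biggest when a few labels repeat many times. If almost every value is unique, a categorical column may not save anything.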
Simple Example
import pandas as pd
# Regular text column
fruits = pd.Series(['Apple', 'Banana',
                    'Apple', 'Cherry'])
# Convert to categorical
fruits_cat = fruits.astype('category')
print(fruits_cat)
Output:
0     Apple
1    Banana
2     Apple
3    Cherry
dtype: category
Categories (3, object): ['Apple', 'Banana', 'Cherry']
See? Pandas found 3 unique labels automatically! 🎉
🔧 The Categorical Accessor: .cat
Every superhero has a sidekick. For categorical data, that sidekick is .cat!
The .cat accessor is like a special toolbox that ONLY works with categorical columns.
graph LR
  A["Your Categorical Column"] --> B[".cat accessor"]
  B --> C["See all categories"]
  B --> D["Add new categories"]
  B --> E["Remove categories"]
  B --> F["Reorder categories"]
  B --> G["Get category codes"]
What Can .cat Do?
| Command | What It Does |
|---|---|
| .cat.categories | Shows all labels |
| .cat.codes | Shows number codes |
| .cat.ordered | Is it ordered? True/False |
| .cat.add_categories() | Add new labels |
| .cat.remove_categories() | Remove labels |
Example: Using .cat
colors = pd.Series(['Red', 'Blue', 'Red'])
colors = colors.astype('category')
# See all categories
print(colors.cat.categories)
# Output: Index(['Blue', 'Red'], dtype='object')
# Check if ordered
print(colors.cat.ordered)
# Output: False
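Remember that the .cat toolbox ONLY opens for categorical columns. The sketch below (with made-up color data) shows what happens if you try it on plain text first: Pandas raises an AttributeError. The exact wording of the error message may differ between versions, so it is printed rather than quoted.

```python
import pandas as pd

plain = pd.Series(['Red', 'Blue', 'Red'])   # ordinary text column

try:
    plain.cat.categories      # no toolbox on plain text
    worked = True
except AttributeError as err:
    worked = False
    print("Oops:", err)

# After converting, the toolbox opens
labeled = plain.astype('category')
print(list(labeled.cat.categories))
# Output: ['Blue', 'Red']
```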
🎨 Creating Categorical Columns
There are 3 main ways to create categorical data. Let’s learn each one!
Way 1: Convert Existing Column
You already have data? Just transform it!
sizes = pd.Series(['S', 'M', 'L', 'S', 'M'])
# Convert to category
sizes = sizes.astype('category')
print(sizes.dtype)
# Output: category
Way 2: Use pd.Categorical() Directly
Want more control? Use the Categorical constructor!
grades = pd.Categorical(
    ['A', 'B', 'A', 'C', 'B'],
    categories=['A', 'B', 'C', 'D', 'F']
)
print(grades)
# Even 'D' and 'F' are categories
# (just not used yet!)
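One place where those unused labels show up is counting. As a rough sketch, wrapping the same grades in a Series and calling value_counts() reports the empty categories too, with a count of zero:

```python
import pandas as pd

grades = pd.Categorical(
    ['A', 'B', 'A', 'C', 'B'],
    categories=['A', 'B', 'C', 'D', 'F']
)

# Count how often each label appears, unused labels included
counts = pd.Series(grades).value_counts()

print(counts.loc['A'])   # 2
print(counts.loc['D'])   # 0
print(len(counts))       # 5 (one row per category)
```

This is handy for report cards: a grade nobody earned still gets its own row.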
Way 3: Specify When Creating DataFrame
Build it right from the start!
df = pd.DataFrame({
    'product': ['Shirt', 'Pants', 'Shirt'],
    'size': pd.Categorical(['M', 'L', 'S'])
})
print(df['size'].dtype)
# Output: category
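Two small variations on these ways, sketched with made-up data (the price column is only there for illustration): you can pass dtype='category' when building a Series, and you can convert just one column of an existing DataFrame by giving astype() a dictionary.

```python
import pandas as pd

# Ask for the category type right at creation time
sizes = pd.Series(['S', 'M', 'L', 'S'], dtype='category')
print(sizes.dtype)
# Output: category

# Convert only the 'product' column; 'price' stays numeric
df = pd.DataFrame({
    'product': ['Shirt', 'Pants', 'Shirt'],
    'price': [10, 20, 10]
})
df = df.astype({'product': 'category'})
print(df['product'].dtype)
# Output: category
```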
📊 Ordered Categories: First, Second, Third!
Here’s where it gets REALLY cool!
Question: Is “Small” less than “Medium”? Is “Bronze” worse than “Gold”?
With ordered categories, you can tell Pandas: “Yes! These have a specific order!”
The T-Shirt Size Example
sizes = pd.Categorical(
    ['M', 'S', 'L', 'XL', 'S'],
    categories=['S', 'M', 'L', 'XL'],
    ordered=True  # <-- Magic word!
)
print(sizes)
Output:
['M', 'S', 'L', 'XL', 'S']
Categories (4, object): ['S' < 'M' < 'L' < 'XL']
See those < symbols? Pandas now KNOWS the order!
You Can Compare Them!
# Create a Series
shirt_sizes = pd.Series(sizes)
# Find all sizes bigger than 'M'
big_shirts = shirt_sizes[shirt_sizes > 'M']
print(big_shirts)
# Output: rows 2 and 3, holding 'L' and 'XL'
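Comparing is not the only trick. Once the order is set, sorting and finding the smallest or biggest value follow YOUR order instead of the alphabet. A minimal sketch using the same T-shirt sizes:

```python
import pandas as pd

shirt_sizes = pd.Series(pd.Categorical(
    ['M', 'S', 'L', 'XL', 'S'],
    categories=['S', 'M', 'L', 'XL'],
    ordered=True
))

# Sorting follows S < M < L < XL, not A-Z
print(list(shirt_sizes.sort_values()))
# Output: ['S', 'S', 'M', 'L', 'XL']

# min() and max() work on ordered categories
print(shirt_sizes.min(), shirt_sizes.max())
# Output: S XL
```

Plain alphabetical sorting would have put 'L' first, which is not what a clothing shop wants.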
Make Existing Categories Ordered
medals = pd.Series(['Gold', 'Silver', 'Bronze'])
medals = medals.astype('category')
# Set the order
medals = medals.cat.set_categories(
    ['Bronze', 'Silver', 'Gold'],
    ordered=True
)
print(medals.cat.ordered)
# Output: True
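Another route to the same result, offered here as an alternative sketch rather than the one true way, is to describe the labels and their order once with pd.CategoricalDtype and then convert in a single step. The name medal_type is just made up for this example.

```python
import pandas as pd

# Describe the labels and their order once
medal_type = pd.CategoricalDtype(
    categories=['Bronze', 'Silver', 'Gold'],
    ordered=True
)

# Convert text straight into the ordered type
medals = pd.Series(['Gold', 'Silver', 'Bronze']).astype(medal_type)

print(medals.cat.ordered)
# Output: True
print(medals.max())
# Output: Gold
```

A reusable type like this is convenient when several columns should share the same label order.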
🔢 Category Codes: The Secret Numbers
Behind every category label is a secret number!
Pandas converts each category to a number (0, 1, 2…) to work faster.
graph LR
  A["Apple = 0"] --> D["Faster Math!"]
  B["Banana = 1"] --> D
  C["Cherry = 2"] --> D
See The Codes
fruits = pd.Categorical(
    ['Banana', 'Apple', 'Cherry', 'Apple']
)
print(fruits.codes)
# Output: [1 0 2 0]
Wait, why is Banana = 1 and Apple = 0?
Because you didn’t list the categories yourself, Pandas sorted them alphabetically!
- Apple → 0
- Banana → 1
- Cherry → 2
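The alphabet is only the default. As a small sketch, if you list the categories yourself, the codes follow your list instead, and you can always turn codes back into labels by looking them up in .categories:

```python
import pandas as pd

fruits = pd.Categorical(
    ['Banana', 'Apple', 'Cherry', 'Apple'],
    categories=['Cherry', 'Banana', 'Apple']   # our own order
)

# Codes are positions in OUR list: Cherry=0, Banana=1, Apple=2
print(fruits.codes.tolist())
# Output: [1, 2, 0, 2]

# Decode: look each code up in the category list
labels = [fruits.categories[int(c)] for c in fruits.codes]
print(labels)
# Output: ['Banana', 'Apple', 'Cherry', 'Apple']
```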
Missing Values Get Code -1
data = pd.Categorical(['A', None, 'B', 'A'])
print(data.codes)
# Output: [ 0 -1  1  0]
The -1 means “no label here!”
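The same -1 also appears when a value is not one of the allowed labels. In this sketch (the survey answers are made up), 'Maybe' is not in the category list, so Pandas treats it as missing, just like None:

```python
import pandas as pd

answers = pd.Categorical(
    ['Yes', 'No', 'Maybe', None],
    categories=['Yes', 'No']   # 'Maybe' is not an allowed label
)

print(answers.codes.tolist())
# Output: [0, 1, -1, -1]

# isna() finds every spot without a label
print(answers.isna().tolist())
# Output: [False, False, True, True]
```

So a surprise -1 can mean a typo in your data, not just an empty cell.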
➕ Adding New Categories
Your toy box can grow! Add new labels even before you have toys for them.
Add One Category
colors = pd.Categorical(['Red', 'Blue'])
# Add 'Green' as an option
colors = colors.add_categories('Green')
print(colors.categories)
# Output: ['Blue', 'Red', 'Green']
Add Multiple Categories
colors = colors.add_categories(
    ['Yellow', 'Purple']
)
print(colors.categories)
# Output: ['Blue', 'Red', 'Green',
# 'Yellow', 'Purple']
Using .cat Accessor on Series
s = pd.Series(['A', 'B']).astype('category')
s = s.cat.add_categories(['C', 'D'])
print(s.cat.categories)
# Output: Index(['A', 'B', 'C', 'D'])
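Why add a label before you have a toy for it? Because a categorical only accepts values that already have a label. The sketch below tries to store 'Green' too early and gets rejected. Both TypeError and ValueError are caught here on the assumption that the exact error type can differ between Pandas versions.

```python
import pandas as pd

colors = pd.Categorical(['Red', 'Blue'])

try:
    colors[0] = 'Green'      # 'Green' is not a label yet
    rejected = False
except (TypeError, ValueError) as err:
    rejected = True
    print("Rejected:", err)

# Add the label first, then the assignment works
colors = colors.add_categories('Green')
colors[0] = 'Green'
print(list(colors))
# Output: ['Green', 'Blue']
```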
➖ Removing Categories
Time to clean up! Remove labels you don’t need anymore.
Remove Specific Categories
sizes = pd.Categorical(
    ['S', 'M', 'L'],
    categories=['XS', 'S', 'M', 'L', 'XL']
)
# Remove unused XS and XL
sizes = sizes.remove_categories(['XS', 'XL'])
print(sizes.categories)
# Output: ['S', 'M', 'L']
Remove Unused Categories Automatically
data = pd.Categorical(
    ['Cat', 'Dog'],
    categories=['Cat', 'Dog', 'Bird', 'Fish']
)
# Remove Bird and Fish (not used)
data = data.remove_unused_categories()
print(data.categories)
# Output: ['Cat', 'Dog']
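Unused labels often sneak in after filtering. In this small sketch with made-up pets, keeping only the cats still leaves the old 'Bird' and 'Dog' labels behind until you tidy up:

```python
import pandas as pd

pets = pd.Series(['Cat', 'Dog', 'Bird', 'Cat']).astype('category')

# Keep only the cats
only_cats = pets[pets == 'Cat']
print(list(only_cats.cat.categories))
# Output: ['Bird', 'Cat', 'Dog']

# The old labels are still there, so clean them up
tidy = only_cats.cat.remove_unused_categories()
print(list(tidy.cat.categories))
# Output: ['Cat']
```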
⚠️ What Happens to Removed Values?
If you remove a category that’s being USED, those values become NaN (missing)!
pets = pd.Categorical(['Cat', 'Dog', 'Cat'])
pets = pets.remove_categories('Dog')
print(pets)
# Output: ['Cat', NaN, 'Cat']
Dog is gone, so that spot becomes empty!
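To avoid surprises, you can count how many values use a label before removing it. This is one possible safety check, sketched with the same pets:

```python
import pandas as pd

pets = pd.Series(['Cat', 'Dog', 'Cat']).astype('category')

# How many rows would turn into NaN?
in_use = (pets == 'Dog').sum()
print(in_use)
# Output: 1

# Remove the label, then confirm the number of missing values
trimmed = pets.cat.remove_categories('Dog')
print(trimmed.isna().sum())
# Output: 1
```

If the count is zero, the label is safe to remove without losing any data.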
🎯 Quick Summary
| Concept | What It Does | Example |
|---|---|---|
| Categorical | Efficient labels for repeated text | astype('category') |
| .cat accessor | Special tools for categories | .cat.categories |
| Creating | 3 ways to make categories | pd.Categorical() |
| Ordered | Categories with rank | ordered=True |
| Codes | Secret numbers behind labels | .cat.codes |
| Add | Expand your label set | .add_categories() |
| Remove | Clean up labels | .remove_categories() |
🏆 You Did It!
You now understand how Pandas Categorical data works!
Think of it as your smart label maker:
- 📁 Organize repeated text efficiently
- 🔢 Use secret codes for speed
- 📊 Order labels when needed (S < M < L)
- ➕ Add new labels anytime
- ➖ Remove old labels to clean up
Now go organize some data like a pro! 🎉
