Categorical Data

🏷️ Pandas Categorical Data: Sorting Your Toy Box!

The Big Idea (In One Sentence!)

Imagine you have a magic label maker that helps you organize your toys into groups like “Cars,” “Dolls,” and “Blocks” — that’s exactly what Categorical data does in Pandas!


🧸 Our Story: The Messy Toy Room

Picture this: You walk into your room, and there are toys EVERYWHERE! Some are red cars, some are blue blocks, some are green dolls. Your mom says, “Please organize these!”

Instead of writing “red car” a hundred times, wouldn’t it be smarter to make labels and just stick them on?

That’s what Pandas Categorical does! It takes repeated text values (like “red,” “blue,” “green”) and turns them into smart, efficient labels.


📦 What is Categorical Data Type?

Think of it like this:

Without Categories | With Categories
“Apple” written 1000 times | Label #1 = Apple
“Banana” written 1000 times | Label #2 = Banana
Uses LOTS of memory | Uses TINY memory!

Why Does This Matter?

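  • 🚀 Faster: the computer compares small number codes instead of reading “Apple” 1000 times
  • 💾 Smaller: each repeated value is stored as a small number, and the text itself is stored only once
  • 🎯 Smarter: labels can sort in a special order that you choose!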
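Want to check the "Smaller" claim yourself? Here is a small sketch. The data and the variable names (`plain`, `labeled`) are made up for illustration, and the exact byte counts depend on your pandas version, so the code only compares the two numbers.

```python
import pandas as pd

# Three fruit names, each repeated 1000 times (made-up data)
plain = pd.Series(['Apple', 'Banana', 'Cherry'] * 1000)
labeled = plain.astype('category')

# deep=True also counts the bytes used by the text itself
plain_bytes = plain.memory_usage(deep=True)
labeled_bytes = labeled.memory_usage(deep=True)

print(plain_bytes, labeled_bytes)
print(labeled_bytes < plain_bytes)  # True
```

The savings shrink when almost every value is different, because then there is little repetition to label.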

Simple Example

import pandas as pd

# Regular text column
fruits = pd.Series(['Apple', 'Banana',
                    'Apple', 'Cherry'])

# Convert to categorical
fruits_cat = fruits.astype('category')

print(fruits_cat)

Output:

0     Apple
1    Banana
2     Apple
3    Cherry
dtype: category
Categories (3, object): ['Apple', 'Banana', 'Cherry']

See? Pandas found 3 unique labels automatically! 🎉


🔧 The Categorical Accessor: .cat

Every superhero has a sidekick. For categorical data, that sidekick is .cat!

The .cat accessor is like a special toolbox that ONLY works with categorical columns.

Diagram: Your Categorical Column → .cat accessor → see all categories, add new categories, remove categories, reorder categories, get category codes

What Can .cat Do?

Command | What It Does
.cat.categories | Shows all labels
.cat.codes | Shows number codes
.cat.ordered | Is it ordered? True/False
.cat.add_categories() | Add new labels
.cat.remove_categories() | Remove labels

Example: Using .cat

colors = pd.Series(['Red', 'Blue', 'Red'])
colors = colors.astype('category')

# See all categories
print(colors.cat.categories)
# Output: Index(['Blue', 'Red'],
#         dtype='object')

# Check if ordered
print(colors.cat.ordered)
# Output: False
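
The toolbox really is locked for regular columns. A minimal sketch, assuming a made-up `plain` text column and a helper flag `toolbox_opened` that exists only for this demo:

```python
import pandas as pd

plain = pd.Series(['Red', 'Blue', 'Red'])   # regular text column
colors = plain.astype('category')           # categorical column

# The toolbox opens for the categorical column
print(list(colors.cat.categories))  # ['Blue', 'Red']
print(colors.cat.codes.tolist())    # [1, 0, 1]

# ...but stays locked for the plain text column
try:
    plain.cat
    toolbox_opened = True
except AttributeError as err:
    toolbox_opened = False
    print('No .cat here:', err)
```

So if .cat complains, check the column's dtype first and convert it with astype('category').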

🎨 Creating Categorical Columns

There are 3 main ways to create categorical data. Let’s learn each one!

Way 1: Convert Existing Column

You already have data? Just transform it!

sizes = pd.Series(['S', 'M', 'L', 'S', 'M'])

# Convert to category
sizes = sizes.astype('category')

print(sizes.dtype)
# Output: category

Way 2: Use pd.Categorical() Directly

Want more control? Use the Categorical constructor!

grades = pd.Categorical(
    ['A', 'B', 'A', 'C', 'B'],
    categories=['A', 'B', 'C', 'D', 'F']
)

print(grades)
# Even 'D' and 'F' are categories
# (just not used yet!)
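
One thing to watch with this way: a value that is NOT in your categories list does not raise an error, it quietly becomes missing. A small sketch, where the grade 'E' is a made-up value used only to show the effect:

```python
import pandas as pd

grades = pd.Categorical(
    ['A', 'B', 'E'],                       # 'E' is not in the list below
    categories=['A', 'B', 'C', 'D', 'F']
)

# The unknown 'E' turned into a missing value
print(grades.isna().tolist())  # [False, False, True]
print(grades.codes.tolist())   # [0, 1, -1]
```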

Way 3: Specify When Creating DataFrame

Build it right from the start!

df = pd.DataFrame({
    'product': ['Shirt', 'Pants', 'Shirt'],
    'size': pd.Categorical(['M', 'L', 'S'])
})

print(df['size'].dtype)
# Output: category
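
A variation on Way 1 that is not shown above: instead of the string 'category', you can hand astype() a pd.CategoricalDtype object that already lists the labels. This is only a sketch, and the name `size_type` is made up for illustration.

```python
import pandas as pd

# A reusable "label sheet": the allowed sizes, in order
size_type = pd.CategoricalDtype(
    categories=['S', 'M', 'L'],
    ordered=True
)

df = pd.DataFrame({
    'product': ['Shirt', 'Pants', 'Shirt'],
    'size': ['M', 'L', 'S']
})

# Apply the same label sheet to the column
df['size'] = df['size'].astype(size_type)

print(df['size'].dtype == size_type)    # True
print(list(df['size'].cat.categories))  # ['S', 'M', 'L']
```

This can be handy when several columns or DataFrames should share exactly the same labels.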

📊 Ordered Categories: First, Second, Third!

Here’s where it gets REALLY cool!

Question: Is “Small” less than “Medium”? Is “Bronze” worse than “Gold”?

With ordered categories, you can tell Pandas: “Yes! These have a specific order!”

The T-Shirt Size Example

sizes = pd.Categorical(
    ['M', 'S', 'L', 'XL', 'S'],
    categories=['S', 'M', 'L', 'XL'],
    ordered=True  # <-- Magic word!
)

print(sizes)

Output:

['M', 'S', 'L', 'XL', 'S']
Categories (4, object): ['S' < 'M' < 'L' < 'XL']

See those < symbols? Pandas now KNOWS the order!

You Can Compare Them!

# Create a Series
shirt_sizes = pd.Series(sizes)

# Find all sizes bigger than 'M'
big_shirts = shirt_sizes[shirt_sizes > 'M']
print(big_shirts)
# Output: keeps rows 2 ('L') and 3 ('XL')
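
Ordering also changes how sorting works. Here is a small sketch that compares a plain alphabetical sort with a sort on the ordered sizes (the variable names are made up for illustration):

```python
import pandas as pd

raw = ['M', 'S', 'L', 'XL', 'S']

# Plain text sorts alphabetically, which puts 'L' first
print(sorted(raw))  # ['L', 'M', 'S', 'S', 'XL']

shirt_sizes = pd.Series(pd.Categorical(
    raw,
    categories=['S', 'M', 'L', 'XL'],
    ordered=True
))

# Ordered categories sort by the order you gave
in_order = shirt_sizes.sort_values().tolist()
print(in_order)  # ['S', 'S', 'M', 'L', 'XL']

# Smallest and biggest also follow that order
print(shirt_sizes.min(), shirt_sizes.max())  # S XL
```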

Make Existing Categories Ordered

medals = pd.Series(['Gold', 'Silver', 'Bronze'])
medals = medals.astype('category')

# Set the order
medals = medals.cat.set_categories(
    ['Bronze', 'Silver', 'Gold'],
    ordered=True
)

print(medals.cat.ordered)
# Output: True
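
Another option, not shown above, is .cat.reorder_categories(). It does the same job here but complains if your new list forgets a label, which can catch typos. A minimal sketch, assuming the made-up names `ranked` and `caught_mistake`:

```python
import pandas as pd

medals = pd.Series(['Gold', 'Silver', 'Bronze'])
medals = medals.astype('category')

# Same labels, new order, now ranked
ranked = medals.cat.reorder_categories(
    ['Bronze', 'Silver', 'Gold'],
    ordered=True
)
print(ranked.cat.ordered)            # True
print((ranked > 'Bronze').tolist())  # [True, True, False]

# Forgetting 'Gold' is treated as a mistake
try:
    medals.cat.reorder_categories(['Bronze', 'Silver'], ordered=True)
    caught_mistake = False
except ValueError:
    caught_mistake = True
print(caught_mistake)  # True
```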

🔢 Category Codes: The Secret Numbers

Behind every category label is a secret number!

Pandas converts each category to a number (0, 1, 2…) to work faster.

Diagram: Apple = 0, Banana = 1, Cherry = 2 → Faster Math!

See The Codes

fruits = pd.Categorical(
    ['Banana', 'Apple', 'Cherry', 'Apple']
)

print(fruits.codes)
# Output: [1 0 2 0]

Wait, why is Banana = 1 and Apple = 0?

Because when you don’t list the categories yourself, Pandas sorts the unique labels (alphabetically, for text) and numbers them in that order!

  • Apple → 0
  • Banana → 1
  • Cherry → 2
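
You can turn the secret numbers back into labels yourself. This is just a sketch to show how the two pieces fit together; `code_to_label` and `rebuilt` are made-up names.

```python
import pandas as pd

fruits = pd.Categorical(
    ['Banana', 'Apple', 'Cherry', 'Apple']
)

# Each code is a position in the categories list
code_to_label = dict(enumerate(fruits.categories))
print(code_to_label)  # {0: 'Apple', 1: 'Banana', 2: 'Cherry'}

# Look up every code to get the original values back
rebuilt = [code_to_label[int(code)] for code in fruits.codes]
print(rebuilt)  # ['Banana', 'Apple', 'Cherry', 'Apple']
```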

Missing Values Get Code -1

data = pd.Categorical(['A', None, 'B', 'A'])

print(data.codes)
# Output: [ 0 -1  1  0]

The -1 means “no label here!”
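
A quick way to double-check this: the spots with code -1 should be exactly the spots that isna() reports as missing. A small sketch with made-up variable names:

```python
import pandas as pd

data = pd.Categorical(['A', None, 'B', 'A'])

# Two ways to find the empty spots
missing_by_code = (data.codes == -1).tolist()
missing_by_isna = data.isna().tolist()

print(missing_by_code)                     # [False, True, False, False]
print(missing_by_code == missing_by_isna)  # True

# The missing value never becomes a label of its own
print(list(data.categories))  # ['A', 'B']
```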


➕ Adding New Categories

Your toy box can grow! Add new labels even before you have toys for them.

Add One Category

colors = pd.Categorical(['Red', 'Blue'])

# Add 'Green' as an option
colors = colors.add_categories('Green')

print(colors.categories)
# Output: Index(['Blue', 'Red', 'Green'],
#         dtype='object')

Add Multiple Categories

colors = colors.add_categories(
    ['Yellow', 'Purple']
)

print(colors.categories)
# Output: Index(['Blue', 'Red', 'Green',
#         'Yellow', 'Purple'],
#         dtype='object')

Using .cat Accessor on Series

s = pd.Series(['A', 'B']).astype('category')

s = s.cat.add_categories(['C', 'D'])

print(s.cat.categories)
# Output: Index(['A', 'B', 'C', 'D'],
#         dtype='object')
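
Why add a label before you have a toy for it? Because a categorical refuses values that have no label yet. A minimal sketch: the flag `assigned_early` is made up for the demo, and the code catches both TypeError and ValueError because the exact error type has differed between pandas versions.

```python
import pandas as pd

colors = pd.Categorical(['Red', 'Blue'])

# 'Green' has no label yet, so this is rejected
try:
    colors[0] = 'Green'
    assigned_early = True
except (TypeError, ValueError):
    assigned_early = False
print(assigned_early)  # False

# Add the label first, then the value fits
colors = colors.add_categories('Green')
colors[0] = 'Green'
print(list(colors))  # ['Green', 'Blue']
```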

➖ Removing Categories

Time to clean up! Remove labels you don’t need anymore.

Remove Specific Categories

sizes = pd.Categorical(
    ['S', 'M', 'L'],
    categories=['XS', 'S', 'M', 'L', 'XL']
)

# Remove unused XS and XL
sizes = sizes.remove_categories(['XS', 'XL'])

print(sizes.categories)
# Output: Index(['S', 'M', 'L'],
#         dtype='object')

Remove Unused Categories Automatically

data = pd.Categorical(
    ['Cat', 'Dog'],
    categories=['Cat', 'Dog', 'Bird', 'Fish']
)

# Remove Bird and Fish (not used)
data = data.remove_unused_categories()

print(data.categories)
# Output: Index(['Cat', 'Dog'],
#         dtype='object')
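
This cleanup is often useful after filtering, because filtered-out labels stay attached to the column. A small sketch with made-up pet data and variable names:

```python
import pandas as pd

pets = pd.Series(['Cat', 'Dog', 'Bird', 'Cat'])
pets = pets.astype('category')

# Keep only the cats
cats_only = pets[pets == 'Cat']

# The old labels are still hanging around
print(list(cats_only.cat.categories))  # ['Bird', 'Cat', 'Dog']

# Drop every label that no row uses anymore
trimmed = cats_only.cat.remove_unused_categories()
print(list(trimmed.cat.categories))    # ['Cat']
```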

⚠️ What Happens to Removed Values?

If you remove a category that’s being USED, those values become NaN (missing)!

pets = pd.Categorical(['Cat', 'Dog', 'Cat'])

pets = pets.remove_categories('Dog')

print(pets)
# Output: ['Cat', NaN, 'Cat']

Dog is gone, so that spot becomes empty!


🎯 Quick Summary

Concept | What It Does | Example
Categorical | Efficient labels for repeated text | astype('category')
.cat accessor | Special tools for categories | .cat.categories
Creating | 3 ways to make categories | pd.Categorical()
Ordered | Categories with rank | ordered=True
Codes | Secret numbers behind labels | .cat.codes
Add | Expand your label set | .add_categories()
Remove | Clean up labels | .remove_categories()

🏆 You Did It!

You now understand how Pandas Categorical data works!

Think of it as your smart label maker:

  • 📁 Organize repeated text efficiently
  • 🔢 Use secret codes for speed
  • 📊 Order labels when needed (S < M < L)
  • ➕ Add new labels anytime
  • ➖ Remove old labels to clean up

Now go organize some data like a pro! 🎉
