Data Ordering and Cleaning

Loading concept...

๐Ÿงน Data Ordering & Cleaning in Pandas

The Messy Toy Box Story

Imagine you have a giant toy box with hundreds of toys thrown in randomly. Some toys are broken, some are duplicates (you have three identical red cars!), and everything is jumbled up.

Pandas is like having a magical cleaning robot that helps you:

  • Sort toys by size, color, or name
  • Find and remove duplicate toys
  • Replace broken parts with new ones

Letโ€™s learn how to tidy up our data toy box!


๐ŸŽฏ Sorting by Values

Think of it like: Arranging books on a shelf by their titles.

When you sort by values, you pick a column and arrange all rows based on that columnโ€™s data.

import pandas as pd

# Our messy data
df = pd.DataFrame({
    'name': ['Zara', 'Anna', 'Mike'],
    'age': [25, 30, 22]
})

# Sort by age (smallest to biggest)
sorted_df = df.sort_values('age')

Result:

name age
Mike 22
Anna 30
Zara 25

Waitโ€ฆ thatโ€™s wrong! Let me fix:

name age
Mike 22
Zara 25
Anna 30

โœจ Now ages go: 22 โ†’ 25 โ†’ 30


๐Ÿ”ข Sorting by Index

Think of it like: Arranging your music playlist by track number.

Every DataFrame has an invisible โ€œrow numberโ€ called the index. Sorting by index puts rows back in their original order.

# After some shuffling, reset order
df_sorted = df.sort_index()

When to use:

  • After filtering, rows might be out of order
  • You want to restore original sequence
  • Working with time-series data
graph TD A[Shuffled Data<br/>Index: 3, 1, 2] --> B[sort_index] B --> C[Ordered Data<br/>Index: 1, 2, 3]

๐Ÿ“š Sorting by Multiple Columns

Think of it like: Organizing a library first by genre, then by author name.

Sometimes one column isnโ€™t enough. You want to sort by Country first, then by City within each country.

df = pd.DataFrame({
    'country': ['USA', 'USA', 'UK', 'UK'],
    'city': ['Boston', 'Austin', 'London', 'Bath'],
    'pop': [700, 950, 9000, 90]
})

# Sort by country, then city
df.sort_values(['country', 'city'])

Result:

country city pop
UK Bath 90
UK London 9000
USA Austin 950
USA Boston 700

๐Ÿ“ First sorted by country (UK before USA), then cities within each country are alphabetical!


โฌ†๏ธโฌ‡๏ธ Ascending and Descending Sort

Think of it like: Counting 1-2-3 (ascending) or 3-2-1 (descending).

  • Ascending = smallest to biggest (Aโ†’Z, 1โ†’100)
  • Descending = biggest to smallest (Zโ†’A, 100โ†’1)
# Ascending (default) - low to high
df.sort_values('price', ascending=True)

# Descending - high to low
df.sort_values('price', ascending=False)

Pro tip for multiple columns:

# Country Aโ†’Z, but price highโ†’low
df.sort_values(
    ['country', 'price'],
    ascending=[True, False]
)
graph TD A[Unsorted: 50, 10, 30] --> B{Direction?} B -->|ascending=True| C[10, 30, 50] B -->|ascending=False| D[50, 30, 10]

๐Ÿ† Ranking Data

Think of it like: Giving medals in a race โ€” 1st place, 2nd place, 3rd place.

Ranking assigns a position number based on values. Unlike sorting, it keeps rows in place but adds a rank column.

df = pd.DataFrame({
    'student': ['Ali', 'Bob', 'Cat'],
    'score': [85, 92, 85]
})

# Add rank (highest score = rank 1)
df['rank'] = df['score'].rank(
    ascending=False
)

Result:

student score rank
Ali 85 2.5
Bob 92 1.0
Cat 85 2.5

Wait, why 2.5?

Ali and Cat tied! By default, Pandas gives them the average of ranks 2 and 3 = 2.5

Other ranking methods:

  • method='min' โ†’ Both get rank 2
  • method='max' โ†’ Both get rank 3
  • method='first' โ†’ First one gets 2, second gets 3

๐Ÿ” Finding Duplicates

Think of it like: Spotting twin toys in your toy box.

Before removing duplicates, you need to find them first!

df = pd.DataFrame({
    'fruit': ['apple', 'banana', 'apple'],
    'price': [1.0, 0.5, 1.0]
})

# Find ALL duplicate rows
df.duplicated()

Result:

0    False  โ† First apple (original)
1    False  โ† Banana (unique)
2     True  โ† Second apple (duplicate!)

Check specific columns:

# Only check 'fruit' column
df.duplicated(subset=['fruit'])

Count total duplicates:

df.duplicated().sum()  # Returns: 1

๐Ÿ—‘๏ธ Removing Duplicates

Think of it like: Keeping only one of each twin toy.

Once found, duplicates are easy to remove!

# Remove duplicate rows
clean_df = df.drop_duplicates()

Control which duplicate to keep:

# Keep first occurrence (default)
df.drop_duplicates(keep='first')

# Keep last occurrence
df.drop_duplicates(keep='last')

# Remove ALL duplicates
df.drop_duplicates(keep=False)

Remove based on specific columns:

# Keep first row for each unique fruit
df.drop_duplicates(subset=['fruit'])
graph TD A[3 Rows<br/>apple, banana, apple] --> B[drop_duplicates] B --> C[2 Rows<br/>apple, banana]

๐Ÿ”„ Replacing Values

Think of it like: Swapping broken toy parts with new ones.

Sometimes data has wrong or messy values. Replace fixes them!

Simple replacement:

df = pd.DataFrame({
    'status': ['active', 'inactive', 'actve']
})

# Fix typo: 'actve' โ†’ 'active'
df['status'].replace('actve', 'active')

Multiple replacements:

# Replace several values at once
df.replace({
    'actve': 'active',
    'inactve': 'inactive'
})

Replace with dictionary mapping:

# Convert codes to labels
df['grade'].replace({
    'A': 'Excellent',
    'B': 'Good',
    'C': 'Average'
})

Replace in specific columns:

df.replace({
    'status': {'old': 'new'},
    'grade': {'F': 'Fail'}
})

๐ŸŽ Quick Reference Box

Task Code
Sort by column df.sort_values('col')
Sort by index df.sort_index()
Sort multiple df.sort_values(['a','b'])
Descending ascending=False
Rank values df['col'].rank()
Find duplicates df.duplicated()
Remove duplicates df.drop_duplicates()
Replace values df.replace('old','new')

๐ŸŒŸ You Did It!

You now know how to:

  • โœ… Sort data any way you want
  • โœ… Rank items like a competition judge
  • โœ… Hunt down sneaky duplicates
  • โœ… Clean up messy values

Your data toy box is now perfectly organized! ๐ŸŽ‰

Next time you see messy data, remember: Pandas is your magical cleaning robot. Just tell it what to sort, find, remove, or replace โ€” and watch the magic happen!

Loading story...

No Story Available

This concept doesn't have a story yet.

Story Preview

Story - Premium Content

Please sign in to view this concept and start learning.

Upgrade to Premium to unlock full access to all content.

Interactive Preview

Interactive - Premium Content

Please sign in to view this concept and start learning.

Upgrade to Premium to unlock full access to all content.

No Interactive Content

This concept doesn't have interactive content yet.

Cheatsheet Preview

Cheatsheet - Premium Content

Please sign in to view this concept and start learning.

Upgrade to Premium to unlock full access to all content.

No Cheatsheet Available

This concept doesn't have a cheatsheet yet.

Quiz Preview

Quiz - Premium Content

Please sign in to view this concept and start learning.

Upgrade to Premium to unlock full access to all content.

No Quiz Available

This concept doesn't have a quiz yet.