🎯 GroupBy Operations: Sorting Your Toy Collection!
The Big Idea
Imagine you have a huge toy box with hundreds of toys mixed together. Cars, dolls, blocks, action figures — all jumbled up!
Now, what if someone asked: “How many cars do you have?” or “What’s the total value of all your dolls?”
You’d need to:
- Group all similar toys together (cars in one pile, dolls in another)
- Count or measure each pile
That’s exactly what GroupBy does in Pandas! It’s like having a magical helper that instantly sorts your messy toy box into neat piles and tells you interesting facts about each pile.
🧠 The GroupBy Concept
What is GroupBy?
GroupBy is a three-step process:
graph TD A["🎲 Your Messy Data"] --> B["✂️ SPLIT: Make Groups"] B --> C["⚙️ APPLY: Do Something"] C --> D["🧩 COMBINE: Get Results"]
Split → Apply → Combine
Think of it like this:
- SPLIT: Separate toys by type (cars here, dolls there)
- APPLY: Count each pile, or add up their prices
- COMBINE: Put all the answers together in a nice report
Real Example
Let’s say you have sales data:
import pandas as pd
# Your messy data
sales = pd.DataFrame({
'Store': ['Mall', 'Mall', 'Downtown', 'Downtown'],
'Item': ['Apple', 'Banana', 'Apple', 'Banana'],
'Sold': [50, 30, 40, 25]
})
| Store | Item | Sold |
|---|---|---|
| Mall | Apple | 50 |
| Mall | Banana | 30 |
| Downtown | Apple | 40 |
| Downtown | Banana | 25 |
📦 GroupBy with Single Column
The Simplest Way to Group
Want to know total sales per store? Group by the Store column!
# Group by Store, sum up Sold
result = sales.groupby('Store')['Sold'].sum()
print(result)
Output:
Store
Downtown 65
Mall 80
Name: Sold, dtype: int64
What Just Happened?
graph TD A["All Sales Data"] --> B{Group by Store} B --> C["📍 Downtown Pile<br>40 + 25 = 65"] B --> D["🏬 Mall Pile<br>50 + 30 = 80"]
It’s like asking: “Put all Downtown receipts in one pile, all Mall receipts in another, then add up each pile!”
More Single Column Examples
# Average sales per store
sales.groupby('Store')['Sold'].mean()
# Count items per store
sales.groupby('Store')['Sold'].count()
# Maximum sale per store
sales.groupby('Store')['Sold'].max()
🎨 GroupBy with Multiple Columns
Going Deeper with Two Keys
What if you want to know sales for each item at each store?
Group by BOTH columns!
# Group by Store AND Item
result = sales.groupby(['Store', 'Item'])['Sold'].sum()
print(result)
Output:
Store Item
Downtown Apple 40
Banana 25
Mall Apple 50
Banana 30
Name: Sold, dtype: int64
The Magic of Multiple Keys
graph TD A["All Sales"] --> B{Group by Store + Item} B --> C["Downtown + Apple: 40"] B --> D["Downtown + Banana: 25"] B --> E["Mall + Apple: 50"] B --> F["Mall + Banana: 30"]
It’s like sorting toys twice:
- First, by toy type (cars, dolls)
- Then, by color (red cars, blue cars)
Now you have super-specific piles!
Getting a Clean Table Back
# Reset index for a regular table
result = sales.groupby(
['Store', 'Item']
)['Sold'].sum().reset_index()
| Store | Item | Sold |
|---|---|---|
| Downtown | Apple | 40 |
| Downtown | Banana | 25 |
| Mall | Apple | 50 |
| Mall | Banana | 30 |
⚡ GroupBy Aggregation with agg()
One Function to Rule Them All
The .agg() method lets you apply any aggregation function!
# Using agg() with a single function
result = sales.groupby('Store')['Sold'].agg('sum')
Why Use agg()?
Because you can use custom functions!
# Custom function: range of sales
def sales_range(x):
return x.max() - x.min()
result = sales.groupby('Store')['Sold'].agg(sales_range)
Output:
Store
Downtown 15 # (40 - 25)
Mall 20 # (50 - 30)
Name: Sold, dtype: int64
Built-in Functions You Can Use
| Function | What It Does |
|---|---|
'sum' |
Add all values |
'mean' |
Average |
'count' |
How many? |
'min' |
Smallest |
'max' |
Biggest |
'std' |
How spread out? |
'first' |
First value |
'last' |
Last value |
🎭 Multiple Aggregation Functions
Ask Multiple Questions at Once!
Why ask one question when you can ask many?
# Multiple aggregations on one column
result = sales.groupby('Store')['Sold'].agg(
['sum', 'mean', 'count']
)
Output:
| sum | mean | count | |
|---|---|---|---|
| Downtown | 65 | 32.5 | 2 |
| Mall | 80 | 40.0 | 2 |
Different Functions for Different Columns
# Different aggregations per column
result = sales.groupby('Store').agg({
'Sold': ['sum', 'mean'],
'Item': 'count'
})
This is like asking:
- For Sold: Give me total and average
- For Item: Just count how many
The Power Move
# Full control with a dictionary
sales.groupby('Store').agg({
'Sold': ['sum', 'mean', 'max', 'min'],
'Item': ['count', 'nunique']
})
nunique = how many unique items (no repeats counted)
🏷️ Named Aggregation
The Cleanest Way to GroupBy
Named aggregation lets you:
- Choose which column to aggregate
- Pick the function to use
- Give it a nice name
All in one beautiful line!
result = sales.groupby('Store').agg(
total_sold=('Sold', 'sum'),
avg_sold=('Sold', 'mean'),
num_items=('Sold', 'count')
)
Output:
| Store | total_sold | avg_sold | num_items |
|---|---|---|---|
| Downtown | 65 | 32.5 | 2 |
| Mall | 80 | 40.0 | 2 |
The Syntax Explained
new_column_name=('source_column', 'function')
It’s like saying:
“Create a column called
total_soldby taking theSoldcolumn and applyingsum”
Why Named Aggregation is Amazing
Before (messy column names):
Sold
sum mean
Store
Mall 80 40
After (clean and clear):
total_sold avg_sold
Store
Mall 80 40
Complete Example
# Professional-looking summary
summary = sales.groupby('Store').agg(
total_revenue=('Sold', 'sum'),
average_sale=('Sold', 'mean'),
best_sale=('Sold', 'max'),
worst_sale=('Sold', 'min'),
transaction_count=('Sold', 'count')
).reset_index()
🎯 Quick Reference
| Want to… | Code |
|---|---|
| Group by one column | df.groupby('col') |
| Group by multiple | df.groupby(['a', 'b']) |
| Sum a column | .['col'].sum() |
| Use any function | .agg('func') |
| Multiple functions | .agg(['sum', 'mean']) |
| Named results | .agg(name=('col', 'func')) |
🚀 You Did It!
You just learned how to:
✅ Group data by one or more columns ✅ Apply aggregation functions like sum, mean, count ✅ Combine multiple functions in one go ✅ Name your results for clean, professional output
GroupBy is like having a super-smart assistant who can instantly organize any messy data and answer questions about each group. Now you have that power too!
Remember: Split → Apply → Combine
That’s all there is to it! 🎉
