🐼 Pandas Performance & Configuration: Make Your Data Fly!
Imagine your data is a big backpack. If you stuff it with heavy rocks when you only need pebbles, you’ll walk slow. Let’s learn to pack smart and move fast!
🎒 The Backpack Analogy
Think of your computer’s memory like a backpack for a hike:
- Heavy backpack = slow walking = slow code
- Light backpack = fast walking = fast code
Pandas helps you pack your data backpack smartly!
📊 Memory Usage: Know What’s In Your Backpack
Before packing smarter, you need to see what’s already inside.
How to Check Memory
import pandas as pd
df = pd.read_csv('my_data.csv')
# See total memory used
df.info(memory_usage='deep')  # info() prints its summary directly - no print() needed
# Get exact bytes
memory_bytes = df.memory_usage(deep=True)
print(memory_bytes)
# Total in megabytes
total_mb = memory_bytes.sum() / 1024**2
print(f"Total: {total_mb:.2f} MB")
What You’ll See
Something like this (your numbers will differ):
Column   Memory
───────  ──────
name     2.4 MB
age      0.8 MB
city     4.1 MB
Total:   7.3 MB
Why it matters: If your backpack weighs 7 MB when it could weigh 2 MB, you’re carrying extra weight for nothing!
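Want that per-column breakdown in megabytes, like the table above? Here’s a small sketch using the same df as before:
# Per-column memory in MB, heaviest columns first
per_column_mb = df.memory_usage(deep=True) / 1024**2
print(per_column_mb.sort_values(ascending=False).round(2))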
🪨➡️🪶 Memory Downcasting: Turn Rocks Into Pebbles
The Problem
Pandas sometimes uses a big container for small things:
Storing the number 5:
├── int64 uses 8 bytes (like a big box)
└── int8 uses 1 byte (like a tiny box)
That’s 8x more space than needed!
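Don’t take the box sizes on faith - NumPy will tell you exactly how many bytes each container uses:
import numpy as np

# Bytes per value for each integer "box"
print(np.dtype('int64').itemsize)  # 8 bytes
print(np.dtype('int8').itemsize)   # 1 byte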
The Solution: Downcast!
# Before: whole numbers are stored as int64 by default (big)
print(df['age'].dtype)  # int64

# After: downcast to the smallest type that fits (tiny)
df['age'] = pd.to_numeric(
    df['age'],
    downcast='integer'
)
print(df['age'].dtype)  # int8, if every age fits in one byte
Downcast Cheatsheet
| Data Type | Use When | Command |
|---|---|---|
| integer | Whole numbers (1, 2, 99) | downcast='integer' |
| unsigned | Only positive (0, 1, 2) | downcast='unsigned' |
| float | Decimals (3.14, 9.99) | downcast='float' |
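If you want to shrink every numeric column at once, here’s a minimal sketch (it assumes you’re happy downcasting all integer and float columns in df):
# Shrink all integer columns, then all float columns
for col in df.select_dtypes(include='integer').columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')

for col in df.select_dtypes(include='float').columns:
    df[col] = pd.to_numeric(df[col], downcast='float')

df.info(memory_usage='deep')  # check the lighter backpack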
Category Type: The Secret Weapon
For columns with repeated values (like city names):
# Before: Each "New York" stored separately
# Uses: 1000 copies × full string
# After: Store once, reference many times
df['city'] = df['city'].astype('category')
# Uses: 1 copy + tiny numbers
graph TD
    A["1000 rows with 'New York'"] --> B{Category Type}
    B --> C["Store 'New York' once"]
    B --> D["Use number 1 for each row"]
    D --> E["Huge memory savings!"]
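To see the savings on your own data, a quick before/after check (assuming a repetitive city column like in the example):
before = df['city'].memory_usage(deep=True)
df['city'] = df['city'].astype('category')
after = df['city'].memory_usage(deep=True)
print(f"Before: {before / 1024**2:.2f} MB, after: {after / 1024**2:.2f} MB")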
🚀 Vectorized Operations: The Conveyor Belt
The Slow Way: One by One
Imagine checking each apple for bruises, one at a time:
# SLOW - like picking apples one by one
result = []
for price in df['price']:
    result.append(price * 1.1)
df['new_price'] = result
The Fast Way: All At Once
Imagine a magic machine that checks ALL apples instantly:
# FAST - like a conveyor belt
df['new_price'] = df['price'] * 1.1
Speed difference (rough numbers - your machine will vary; you can measure it yourself with the sketch below):
- Loop: around 10 seconds for 1 million rows
- Vectorized: around 0.01 seconds for 1 million rows
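Here’s a simple timing sketch you can run yourself (it assumes df has a numeric price column like the examples above):
import time

# Time the slow loop
start = time.perf_counter()
result = []
for price in df['price']:
    result.append(price * 1.1)
df['new_price'] = result
print(f"Loop: {time.perf_counter() - start:.3f} s")

# Time the vectorized version
start = time.perf_counter()
df['new_price'] = df['price'] * 1.1
print(f"Vectorized: {time.perf_counter() - start:.3f} s")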
Common Vectorized Operations
# Math on whole columns (instant!)
df['total'] = df['price'] * df['quantity']
df['discounted'] = df['price'] * 0.9
df['rounded'] = df['price'].round(2)
# Conditions on whole columns
df['expensive'] = df['price'] > 100

# np.where needs NumPy imported
import numpy as np
df['category'] = np.where(
    df['price'] > 50,
    'Premium',
    'Budget'
)
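If you need more than two buckets, np.select keeps things vectorized too - here’s a sketch with made-up price bands:
import numpy as np

# Conditions are checked top to bottom; first match wins
conditions = [
    df['price'] > 100,
    df['price'] > 50,
]
choices = ['Luxury', 'Premium']
df['category'] = np.select(conditions, choices, default='Budget')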
🧮 Eval Method: Write Math Like a Human
The Problem
Complex math looks messy in code:
# Hard to read!
df['result'] = (
    (df['a'] + df['b']) * df['c']
    / (df['d'] - df['e'])
)
The Solution: eval()
# Easy to read - like writing on paper!
df['result'] = df.eval(
    '(a + b) * c / (d - e)'
)
Why Use eval()?
- Cleaner code - reads like math
- Often faster and lighter on memory for big DataFrames - pandas can evaluate the whole expression at once (using the numexpr engine when it’s installed) instead of building large temporary arrays
- Multiple columns at once:
df.eval('''
profit = revenue - cost
margin = profit / revenue
is_good = margin > 0.2
''', inplace=True)
Query: eval’s Cousin for Filtering
# Instead of this:
big_sales = df[
    (df['amount'] > 1000) &
    (df['region'] == 'West')
]
# Write this:
big_sales = df.query(
    'amount > 1000 and region == "West"'
)
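query() can also read local Python variables with an @ prefix, which keeps numbers and names out of the string. A small sketch:
min_amount = 1000
target_region = 'West'

# @variable pulls in the local Python value
big_sales = df.query(
    'amount > @min_amount and region == @target_region'
)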
🚶 Row Iteration Methods: When You Must Walk
Sometimes you NEED to look at each row. Here are your options from fastest to slowest:
Option 1: itertuples() - The Fast Walker 🏃
for row in df.itertuples():
    print(row.name, row.age)
# Access columns by name: row.name, row.age
# Access by position: row[1] (row[0] is the index)
Option 2: iterrows() - The Slow Walker 🐢
for index, row in df.iterrows():
    print(row['name'], row['age'])
# Returns a Series (slower)
Speed Comparison
graph LR
    A["1 Million Rows"] --> B["itertuples: 2 sec"]
    A --> C["iterrows: 30 sec"]
    A --> D["apply: 5 sec"]
    B --> E["Winner!"]
The Golden Rule
❌ Avoid loops when possible
✅ Use vectorized operations first
⚡ If you must loop, use itertuples()
When Looping Is Okay
- Complex logic that can’t be vectorized
- When you need the index AND row data
- Processing rows with external APIs (see the sketch below)
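For example, sending each row to an external service forces a loop. A minimal sketch, where send_to_api is a hypothetical stand-in for your real call:
results = []
for row in df.itertuples():
    # send_to_api is hypothetical - replace it with your actual API call
    response = send_to_api(row.name, row.age)
    results.append(response)
df['api_response'] = results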
⚙️ Display Option Settings: Control What You See
The Problem
Pandas sometimes hides your data:
A B ... Z
0 1 2 ... 26
.. .. .. ... ..
999 1 2 ... 26
[1000 rows × 26 columns]
Take Control!
import pandas as pd
# See more rows
pd.set_option('display.max_rows', 100)
# See more columns
pd.set_option('display.max_columns', 50)
# Wider display
pd.set_option('display.width', 200)
# More decimal places
pd.set_option('display.precision', 4)
# Don't truncate long text
pd.set_option('display.max_colwidth', None)
Handy Options Table
| Option | What It Does | Example |
|---|---|---|
| max_rows | Rows shown | 100 |
| max_columns | Columns shown | 50 |
| width | Total display width | 200 |
| precision | Decimal places | 4 |
| max_colwidth | Column text length | None (all) |
Temporary Changes
# Change just for one block
with pd.option_context(
    'display.max_rows', 10,
    'display.precision', 2
):
    print(df)
# Back to normal after!
Reset Everything
# Oops! Reset all options
pd.reset_option('all')
# Reset just one
pd.reset_option('display.max_rows')
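Forgot what an option does or what its default is? Pandas can tell you:
# Print the documentation, current value, and default for one option
pd.describe_option('display.max_rows')

# Or search by keyword - matches every option containing 'colwidth'
pd.describe_option('colwidth')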
🎯 Quick Summary
graph LR
    A["Pandas Performance"] --> B["Memory"]
    A --> C["Speed"]
    A --> D["Display"]
    B --> B1["Check with info"]
    B --> B2["Downcast numbers"]
    B --> B3["Use category"]
    C --> C1["Vectorize first!"]
    C --> C2["Use eval for math"]
    C --> C3["itertuples if needed"]
    D --> D1["set_option"]
    D --> D2["option_context"]
    D --> D3["reset_option"]
🌟 Remember This!
| Goal | Do This | Avoid This |
|---|---|---|
| Check memory | df.info(memory_usage='deep') | Guessing |
| Save memory | downcast='integer' | Default int64 |
| Fast math | df['a'] * df['b'] | For loops |
| Clean math | df.eval('a * b') | Messy brackets |
| Loop data | itertuples() | iterrows() |
| See more | set_option() | Truncated output |
You did it! 🎉
Your Pandas backpack is now lighter, your code is faster, and you can see exactly what you need. Go make your data fly!
