Performance and Configuration

🐼 Pandas Performance & Configuration: Make Your Data Fly!

Imagine your data is a big backpack. If you stuff it with heavy rocks when you only need pebbles, you’ll walk slow. Let’s learn to pack smart and move fast!


🎒 The Backpack Analogy

Think of your computer’s memory like a backpack for a hike:

  • Heavy backpack = slow walking = slow code
  • Light backpack = fast walking = fast code

Pandas helps you pack your data backpack smartly!


📊 Memory Usage: Know What’s In Your Backpack

Before packing smarter, you need to see what’s already inside.

How to Check Memory

import pandas as pd

df = pd.read_csv('my_data.csv')

# See total memory used (info() prints its report directly)
df.info(memory_usage='deep')

# Get exact bytes
memory_bytes = df.memory_usage(deep=True)
print(memory_bytes)

# Total in megabytes
total_mb = memory_bytes.sum() / 1024**2
print(f"Total: {total_mb:.2f} MB")

What You’ll See

Column      Memory
─────────   ──────
name        2.4 MB
age         0.8 MB
city        4.1 MB
Total:      7.3 MB

Why it matters: If your backpack weighs 7 MB when it could weigh 2 MB, you’re carrying extra weight for nothing!


🪨➡️🪶 Memory Downcasting: Turn Rocks Into Pebbles

The Problem

Pandas sometimes uses a big container for small things:

Storing the number 5:
├── int64 uses 8 bytes (like a big box)
└── int8 uses 1 byte (like a tiny box)

That’s 8x more space than needed!
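You can check those container sizes yourself. A minimal sketch (the values are made up; the exact byte counts include a little index overhead):

import numpy as np
import pandas as pd

# How many bytes one value takes in each container
print(np.dtype('int64').itemsize)  # 8
print(np.dtype('int8').itemsize)   # 1

# The same 1,000 small numbers packed two ways
ages_big = pd.Series([5] * 1_000, dtype='int64')
ages_small = ages_big.astype('int8')

print(ages_big.memory_usage(deep=True))    # ~8,000 bytes of data (plus index)
print(ages_small.memory_usage(deep=True))  # ~1,000 bytes of data (plus index)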

The Solution: Downcast!

# Before: numbers load as int64 by default (big)
print(df['age'].dtype)   # int64

# After: shrink to the smallest integer type that fits (tiny)
df['age'] = pd.to_numeric(
    df['age'],
    downcast='integer'
)

Downcast Cheatsheet

Data Type   Use When                    Command
─────────   ─────────────────────────   ─────────────────────
integer     Whole numbers (1, 2, 99)    downcast='integer'
unsigned    Only positive (0, 1, 2)     downcast='unsigned'
float       Decimals (3.14, 9.99)       downcast='float'

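Here is a small sketch that applies each row of the cheatsheet; the column names and values are made up for illustration:

import pandas as pd

df = pd.DataFrame({
    'age': [25, 31, 47],          # whole numbers
    'visits': [0, 12, 99],        # never negative
    'height': [1.62, 1.85, 1.70]  # decimals
})

# 'integer' -> smallest signed integer type that fits (int8 here)
df['age'] = pd.to_numeric(df['age'], downcast='integer')

# 'unsigned' -> smallest unsigned integer type that fits (uint8 here)
df['visits'] = pd.to_numeric(df['visits'], downcast='unsigned')

# 'float' -> float32 instead of the default float64
df['height'] = pd.to_numeric(df['height'], downcast='float')

print(df.dtypes)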
Category Type: The Secret Weapon

For columns with repeated values (like city names):

# Before: Each "New York" stored separately
# Uses: 1000 copies × full string

# After: Store once, reference many times
df['city'] = df['city'].astype('category')
# Uses: 1 copy + tiny numbers

How it works: 1,000 rows containing "New York" → store "New York" once → keep a tiny integer code for each row → huge memory savings!
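Want proof? A quick sketch you can run; the city list is made up, but the before/after pattern is the point:

import pandas as pd

# 300,000 rows but only 3 distinct city names
cities = pd.Series(['New York', 'Los Angeles', 'Chicago'] * 100_000)

as_object = cities.memory_usage(deep=True)
as_category = cities.astype('category').memory_usage(deep=True)

print(f"object:   {as_object / 1024**2:.2f} MB")
print(f"category: {as_category / 1024**2:.2f} MB")
print(f"~{as_object / as_category:.0f}x smaller")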

🚀 Vectorized Operations: The Conveyor Belt

The Slow Way: One by One

Imagine checking each apple for bruises, one at a time:

# SLOW - like picking apples one by one
result = []
for price in df['price']:
    result.append(price * 1.1)
df['new_price'] = result

The Fast Way: All At Once

Imagine a magic machine that checks ALL apples instantly:

# FAST - like a conveyor belt
df['new_price'] = df['price'] * 1.1

Speed difference:

  • Loop: 10 seconds for 1 million rows
  • Vectorized: 0.01 seconds for 1 million rows
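Those numbers are illustrative; the real gap depends on your machine. A rough timing sketch you can run yourself (the price data is random, purely for the benchmark):

import time
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': np.random.rand(1_000_000) * 100})

# Slow way: plain Python loop over every value
start = time.perf_counter()
result = []
for price in df['price']:
    result.append(price * 1.1)
df['new_price_loop'] = result
loop_seconds = time.perf_counter() - start

# Fast way: one vectorized multiplication
start = time.perf_counter()
df['new_price_vec'] = df['price'] * 1.1
vec_seconds = time.perf_counter() - start

print(f"loop:       {loop_seconds:.3f} s")
print(f"vectorized: {vec_seconds:.3f} s")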

Common Vectorized Operations

import numpy as np  # needed below for np.where

# Math on whole columns (instant!)
df['total'] = df['price'] * df['quantity']
df['discounted'] = df['price'] * 0.9
df['rounded'] = df['price'].round(2)

# Conditions on whole columns
df['expensive'] = df['price'] > 100
df['category'] = np.where(
    df['price'] > 50,
    'Premium',
    'Budget'
)

🧮 Eval Method: Write Math Like a Human

The Problem

Complex math looks messy in code:

# Hard to read!
df['result'] = (
    (df['a'] + df['b']) * df['c']
    / (df['d'] - df['e'])
)

The Solution: eval()

# Easy to read - like writing on paper!
df['result'] = df.eval(
    '(a + b) * c / (d - e)'
)

Why Use eval()?

  1. Cleaner code - reads like math
  2. Lighter and often faster on big data - with the numexpr engine installed, eval() avoids building large intermediate arrays
  3. Multiple columns at once:
df.eval('''
    profit = revenue - cost
    margin = profit / revenue
    is_good = margin > 0.2
''', inplace=True)
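One more handy trick: eval() can reference plain Python variables with @. A minimal sketch with made-up column names and a made-up tax_rate:

import pandas as pd

df = pd.DataFrame({'revenue': [120.0, 80.0, 200.0],
                   'cost': [90.0, 70.0, 120.0]})

tax_rate = 0.2  # an ordinary Python variable

# @name pulls the local variable into the expression
df['after_tax_profit'] = df.eval('(revenue - cost) * (1 - @tax_rate)')
print(df)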

Query: eval’s Cousin for Filtering

# Instead of this:
big_sales = df[
    (df['amount'] > 1000) &
    (df['region'] == 'West')
]

# Write this:
big_sales = df.query(
    'amount > 1000 and region == "West"'
)
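query() understands the same @ trick, plus handy keywords like in. A small sketch with made-up data:

import pandas as pd

df = pd.DataFrame({
    'amount': [500, 1500, 2500, 800],
    'region': ['West', 'West', 'East', 'South'],
})

threshold = 1000

# @threshold pulls in a Python variable; `in` checks list membership
big_sales = df.query('amount > @threshold and region in ["West", "East"]')
print(big_sales)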

🚶 Row Iteration Methods: When You Must Walk

Sometimes you NEED to look at each row. Here are your options from fastest to slowest:

Option 1: itertuples() - The Fast Walker 🏃

for row in df.itertuples():
    print(row.name, row.age)
    # Access by name: row.name
    # Access by position: row[1]

Option 2: iterrows() - The Slow Walker 🐢

for index, row in df.iterrows():
    print(row['name'], row['age'])
    # Returns a Series (slower)

Speed Comparison

For 1 million rows (rough numbers):

  • itertuples: ~2 seconds  ← winner!
  • apply: ~5 seconds
  • iterrows: ~30 seconds

The Golden Rule

❌ Avoid loops when possible
✅ Use vectorized operations first
⚡ If you must loop, use itertuples()

When Looping Is Okay

  • Complex logic that can’t be vectorized
  • When you need the index AND row data
  • Processing rows with external APIs (see the sketch below)
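For example, here is a sketch of the external-API case; send_to_api is a hypothetical stand-in for whatever per-row call you actually need to make:

import pandas as pd

def send_to_api(customer_id, amount):
    """Hypothetical placeholder for a real network call that can't be vectorized."""
    return f"sent {customer_id}: {amount}"

df = pd.DataFrame({'customer_id': [101, 102, 103],
                   'amount': [250, 40, 980]})

# itertuples() gives you the index and the row data in one fast pass
responses = []
for row in df.itertuples():
    responses.append(send_to_api(row.customer_id, row.amount))

df['api_response'] = responses
print(df)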

⚙️ Display Option Settings: Control What You See

The Problem

Pandas sometimes hides your data:

     A      B  ...  Z
0    1      2  ...  26
..  ..     ..  ...  ..
999  1      2  ...  26

[1000 rows × 26 columns]

Take Control!

import pandas as pd

# See more rows
pd.set_option('display.max_rows', 100)

# See more columns
pd.set_option('display.max_columns', 50)

# Wider display
pd.set_option('display.width', 200)

# More decimal places
pd.set_option('display.precision', 4)

# Don't truncate long text
pd.set_option('display.max_colwidth', None)

Handy Options Table

Option         What It Does           Example
─────────────  ─────────────────────  ───────────────
max_rows       Rows shown             100
max_columns    Columns shown          50
width          Total display width    200
precision      Decimal places         4
max_colwidth   Column text length     None (show all)
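Not sure what an option is currently set to, or what it even does? Pandas also ships get_option and describe_option; a tiny sketch:

import pandas as pd

# Read the current value of an option
print(pd.get_option('display.max_rows'))

# Print the documentation for one option
pd.describe_option('display.precision')

# Describe every option whose name matches a pattern
pd.describe_option('display.max')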

Temporary Changes

# Change just for one block
with pd.option_context(
    'display.max_rows', 10,
    'display.precision', 2
):
    print(df)
# Back to normal after!

Reset Everything

# Oops! Reset all options
pd.reset_option('all')

# Reset just one
pd.reset_option('display.max_rows')

🎯 Quick Summary

  • Memory: check with info(), downcast numbers, use the category dtype
  • Speed: vectorize first, use eval() for math, itertuples() if you must loop
  • Display: set_option(), option_context(), reset_option()

🌟 Remember This!

Goal           Do This                          Avoid This
─────────────  ──────────────────────────────   ────────────────
Check memory   df.info(memory_usage='deep')     Guessing
Save memory    downcast='integer'               Default int64
Fast math      df['a'] * df['b']                For loops
Clean math     df.eval('a * b')                 Messy brackets
Loop data      itertuples()                     iterrows()
See more       set_option()                     Truncated output

You did it! 🎉

Your Pandas backpack is now lighter, your code is faster, and you can see exactly what you need. Go make your data fly!
