String Operations

🧵 Pandas String Operations: Your Text Toolkit

Imagine you have a magical toolbox. Inside are special tools that can clean, search, split, and transform any text. Pandas gives you this toolbox through the .str accessor!


🎯 The Big Picture

Think of a Pandas Series as a list of text messages. Sometimes these messages are messy:

  • Extra spaces
  • Wrong capitalization
  • Names stuck together
  • Hidden patterns you need to find

The .str accessor is your magic wand that transforms messy text into clean, useful data.


🔑 The String Accessor: .str

What Is It?

The .str accessor is like a key that unlocks text powers. Without it, your text just sits there. With it, you can do amazing things!

import pandas as pd

names = pd.Series(['Alice', 'Bob', 'Charlie'])

# Without .str - this raises AttributeError (a Series has no .upper method)
# names.upper()  ❌

# With .str - magic happens! ✨
names.str.upper()
# Output: ['ALICE', 'BOB', 'CHARLIE']

Why Do We Need It?

Simple Example:

Without .str, you’d write a loop. With .str, one line does it all!

emails = pd.Series(['JOHN@MAIL.COM', 'jane@mail.com'])
emails.str.lower()
# Output: ['john@mail.com', 'jane@mail.com']
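
To make the "loop versus one line" point concrete, here is a minimal sketch that builds the same result both ways. The variable names (`looped`, `vectorized`, `with_gap`) are made up for illustration.

```python
import pandas as pd

emails = pd.Series(['JOHN@MAIL.COM', 'jane@mail.com'])

# Manual approach: loop over the values and rebuild the Series
looped = pd.Series([e.lower() for e in emails], index=emails.index)

# .str approach: one vectorized call
vectorized = emails.str.lower()

print(looped.equals(vectorized))  # True

# Bonus: .str passes missing values through instead of crashing
with_gap = pd.Series(['JOHN@MAIL.COM', None])
print(with_gap.str.lower().isna().tolist())  # [False, True]
```

The loop version would fail on the missing value, because `None` has no `.lower()` method. The `.str` version simply leaves the gap in place.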

🔍 String Matching Methods

Finding Needles in Haystacks

Sometimes you need to ask questions about your text:

  • Does it start with something?
  • Does it contain a word?
  • Does it end with something?

The Methods

| Method | Question It Answers |
|---|---|
| .str.contains() | Does it have this inside? |
| .str.startswith() | Does it begin with this? |
| .str.endswith() | Does it finish with this? |
| .str.match() | Does it match this pattern from the start? |

Real Example

foods = pd.Series(['apple pie', 'banana', 'apple juice', 'cherry'])

# Find all apple items
foods.str.contains('apple')
# Output: [True, False, True, False]

# Filter to get only apple items
foods[foods.str.contains('apple')]
# Output: ['apple pie', 'apple juice']

More Examples

websites = pd.Series(['google.com', 'amazon.org', 'github.com'])

# Which ones end with .com?
websites.str.endswith('.com')
# Output: [True, False, True]

# Which ones start with 'g'?
websites.str.startswith('g')
# Output: [True, False, True]
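
The examples above skip `.str.match()` and a couple of details that often trip people up. The sketch below uses a made-up `items` Series to compare `contains` with `match`, and shows the `na` and `regex` parameters.

```python
import pandas as pd

items = pd.Series(['apple pie', 'pineapple', None, 'a.b'])

# contains: the pattern may appear anywhere in the string.
# na=False treats missing values as "no match" so the result is safe for filtering.
anywhere = items.str.contains('apple', na=False)
print(anywhere.tolist())     # [True, True, False, False]

# match: the pattern must match starting at the first character
at_start = items.str.match('apple', na=False)
print(at_start.tolist())     # [True, False, False, False]

# contains treats its pattern as regex by default, where '.' means "any character".
# regex=False searches for a literal dot instead.
literal_dot = items.str.contains('.', regex=False, na=False)
print(literal_dot.tolist())  # [False, False, False, True]
```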

🎣 Regex Extraction with .str.extract()

What Is Regex?

Regex (Regular Expression) is like a search template. Instead of looking for exact words, you describe a pattern.

Think of it like this:

  • “Find apple” → finds only “apple”
  • “Find any fruit” → finds apple, banana, cherry…

The Extract Method

.str.extract() pulls out the text captured by each group in your pattern, using the first match in every string.

data = pd.Series(['Price: $100', 'Cost: $250', 'Value: $75'])

# Extract just the numbers (returns a DataFrame with one column, named 0)
data.str.extract(r'(\d+)')
# Output:
#      0
# 0  100
# 1  250
# 2   75

How It Works

The pattern (\d+) means:

  • \d = any digit (0-9)
  • + = one or more
  • () = capture this part
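
Two behaviors are worth seeing in action: what `.str.extract()` returns, and what happens when a string has no match. This is a small sketch; the third value (`'No price'`) is added here purely for illustration.

```python
import pandas as pd

data = pd.Series(['Price: $100', 'Cost: $250', 'No price'])

# Default (expand=True): a DataFrame with one column per capture group
as_frame = data.str.extract(r'(\d+)')
print(type(as_frame).__name__)     # DataFrame

# expand=False with a single group: a plain Series
as_series = data.str.extract(r'(\d+)', expand=False)
print(as_series.tolist()[:2])      # ['100', '250']

# Strings with no match become missing values
print(as_series.isna().tolist())   # [False, False, True]
```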

Named Groups

You can even name what you extract!

info = pd.Series(['John-25', 'Jane-30', 'Bob-22'])

# Extract name and age separately
info.str.extract(r'(?P<name>\w+)-(?P<age>\d+)')
# Output:
#    name  age
# 0  John   25
# 1  Jane   30
# 2   Bob   22
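
One thing to keep in mind: extracted values are still text, even when they look like numbers. A likely next step, sketched here with a hypothetical `people` variable, is converting the column before doing math on it.

```python
import pandas as pd

info = pd.Series(['John-25', 'Jane-30', 'Bob-22'])
people = info.str.extract(r'(?P<name>\w+)-(?P<age>\d+)')

# The age column holds strings like '25'; convert it to integers
people['age'] = people['age'].astype(int)

print(people['age'].tolist())  # [25, 30, 22]
print(people['age'].max())     # 30
```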

✂️ String Split Method

Cutting Text Into Pieces

The .str.split() method is like scissors for text. You tell it where to cut, and it gives you pieces.

full_names = pd.Series(['John Smith', 'Jane Doe', 'Bob Wilson'])

# Split by space
full_names.str.split(' ')
# Output: [['John', 'Smith'], ['Jane', 'Doe'], ['Bob', 'Wilson']]

Getting Specific Parts

Use expand=True to get a neat table:

full_names.str.split(' ', expand=True)
# Output:
#       0       1
# 0  John   Smith
# 1  Jane     Doe
# 2   Bob  Wilson
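
If you only need one piece, you can index into the split result instead of expanding it. The sketch below also names the expanded columns; the column names `first` and `last` are just an assumption for this example.

```python
import pandas as pd

full_names = pd.Series(['John Smith', 'Jane Doe', 'Bob Wilson'])

parts = full_names.str.split(' ')
first = parts.str[0]       # first piece of each list
last = parts.str.get(-1)   # last piece of each list
print(first.tolist())      # ['John', 'Jane', 'Bob']
print(last.tolist())       # ['Smith', 'Doe', 'Wilson']

# Give the expanded table readable column names
df = full_names.str.split(' ', expand=True)
df.columns = ['first', 'last']
print(df.columns.tolist())  # ['first', 'last']
```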

Limit the Splits

text = pd.Series(['a-b-c-d', 'e-f-g-h'])

# Split only at the first 2 separators
text.str.split('-', n=2)
# Output: [['a', 'b', 'c-d'], ['e', 'f', 'g-h']]
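
The `n` limit counts from the left. When the piece you want sits at the end, the related `.str.rsplit()` method counts from the right instead. A short sketch:

```python
import pandas as pd

text = pd.Series(['a-b-c-d', 'e-f-g-h'])

# Split only once, starting from the right-hand side
from_right = text.str.rsplit('-', n=1)
print(from_right.tolist())  # [['a-b-c', 'd'], ['e-f-g', 'h']]
```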

🔄 String Replace Method

Swap Old for New

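The .str.replace() method finds text and swaps it with something else. In pandas 2.0 and later the pattern is treated as literal text by default; pass regex=True to use a regular expression.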

greetings = pd.Series(['Hello World', 'Hello Python', 'Hello Pandas'])

# Replace Hello with Hi
greetings.str.replace('Hello', 'Hi')
# Output: ['Hi World', 'Hi Python', 'Hi Pandas']

With Regex Power

prices = pd.Series(['$100', '$250', '$75'])

# Remove dollar signs
prices.str.replace(r'\$', '', regex=True)
# Output: ['100', '250', '75']
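
In regex, `$` normally means "end of string", which is why it is escaped above. Here is a sketch of two alternatives plus the usual follow-up step of converting to numbers. The value `'$1,250'` is invented for this example.

```python
import pandas as pd

prices = pd.Series(['$100', '$1,250', '$75'])

# Option 1: skip regex entirely and replace the literal character
literal = prices.str.replace('$', '', regex=False)
print(literal.tolist())  # ['100', '1,250', '75']

# Option 2: a character class removes several symbols at once
# (inside [...] the dollar sign is an ordinary character)
cleaned = prices.str.replace(r'[$,]', '', regex=True)
amounts = cleaned.astype(int)
print(amounts.tolist())  # [100, 1250, 75]
print(amounts.sum())     # 1425
```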

Multiple Replacements

messy = pd.Series(['cat_dog', 'bird_fish'])

# Replace every underscore with a space (all occurrences in each string are replaced)
messy.str.replace('_', ' ')
# Output: ['cat dog', 'bird fish']

🔠 Case and Whitespace Methods

Changing Letter Case

| Method | What It Does | Example |
|---|---|---|
| .str.lower() | all lowercase | ‘HELLO’ → ‘hello’ |
| .str.upper() | ALL UPPERCASE | ‘hello’ → ‘HELLO’ |
| .str.title() | Title Case | ‘hello world’ → ‘Hello World’ |
| .str.capitalize() | First letter uppercase, the rest lowercase | ‘hello’ → ‘Hello’ |
| .str.swapcase() | Flip the case | ‘Hello’ → ‘hELLO’ |

names = pd.Series(['jOHN', 'JANE', 'bob'])

names.str.title()
# Output: ['John', 'Jane', 'Bob']

Cleaning Whitespace

Extra spaces are sneaky bugs in data!

| Method | What It Does |
|---|---|
| .str.strip() | Remove whitespace (spaces, tabs, newlines) from both ends |
| .str.lstrip() | Remove whitespace from the left |
| .str.rstrip() | Remove whitespace from the right |

messy = pd.Series(['  hello  ', '  world', 'python  '])

messy.str.strip()
# Output: ['hello', 'world', 'python']

Real World Example

usernames = pd.Series(['  John  ', '  JANE', 'bob  '])

# Clean and standardize
usernames.str.strip().str.lower()
# Output: ['john', 'jane', 'bob']
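
The strip methods only touch the ends of a string. For extra spaces in the middle, one common approach is to pair `.str.strip()` with a regex replace. The sketch below uses an invented `raw` Series to show the idea.

```python
import pandas as pd

raw = pd.Series(['  John   Smith ', '\tJANE  DOE\n', 'bob   wilson'])

clean = (
    raw.str.strip()                            # trim both ends (tabs and newlines too)
       .str.replace(r'\s+', ' ', regex=True)   # collapse runs of whitespace to one space
       .str.title()                            # standardize the case
)
print(clean.tolist())  # ['John Smith', 'Jane Doe', 'Bob Wilson']
```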

📏 String Len Method

Counting Characters

The .str.len() method counts how many characters are in each string.

words = pd.Series(['cat', 'elephant', 'dog'])

words.str.len()
# Output: [3, 8, 3]

Why Is This Useful?

Example 1: Find short passwords

passwords = pd.Series(['abc', 'secure123', 'hi', 'longpassword'])

# Find passwords shorter than 6 characters
weak = passwords[passwords.str.len() < 6]
# Output: ['abc', 'hi']

Example 2: Validate data

codes = pd.Series(['ABC123', 'XY99', 'ABCD1234'])

# Find codes that are exactly 6 characters
valid = codes[codes.str.len() == 6]
# Output: ['ABC123']
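
Because `.str.len()` returns numbers, you can use any numeric tool on the result. A small sketch with a made-up word list:

```python
import pandas as pd

words = pd.Series(['cat', 'elephant', 'dog', 'hippopotamus'])
lengths = words.str.len()
print(lengths.tolist())   # [3, 8, 3, 12]

# Find the longest word
longest = words[lengths == lengths.max()]
print(longest.tolist())   # ['hippopotamus']

# Keep words between 4 and 8 characters (inclusive)
medium = words[lengths.between(4, 8)]
print(medium.tolist())    # ['elephant']
```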

🗺️ How It All Connects

graph TD
    A["Raw Text Data"] --> B[".str accessor"]
    B --> C["Match & Find"]
    B --> D["Extract Patterns"]
    B --> E["Split Text"]
    B --> F["Replace Text"]
    B --> G["Change Case"]
    B --> H["Measure Length"]
    C --> I["Clean Data!"]
    D --> I
    E --> I
    F --> I
    G --> I
    H --> I
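
To see several of these tools working together, here is one possible end-to-end sketch. The input data, the `' - '` separator, and the column names are all invented for this example, so treat it as an illustration rather than a recipe.

```python
import pandas as pd

raw = pd.Series(['  alice SMITH - $1200 ', 'BOB jones - $950', ' carol White - $2010'])

cleaned = raw.str.strip()                       # whitespace
parts = cleaned.str.split(' - ', expand=True)   # split into two columns
parts.columns = ['name', 'amount']

parts['name'] = parts['name'].str.title()       # case
parts['amount'] = (
    parts['amount'].str.replace('$', '', regex=False)   # replace
                   .astype(int)
)
parts['name_length'] = parts['name'].str.len()  # length

# Match-style filtering now works on clean data
big = parts[parts['amount'] > 1000]
print(big['name'].tolist())  # ['Alice Smith', 'Carol White']
```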

🎯 Quick Reference

| Task | Method | Example |
|---|---|---|
| Access string methods | .str | series.str.lower() |
| Check if contains | .str.contains() | series.str.contains('a') |
| Check start | .str.startswith() | series.str.startswith('A') |
| Check end | .str.endswith() | series.str.endswith('z') |
| Extract with regex | .str.extract() | series.str.extract(r'(\d+)') |
| Split text | .str.split() | series.str.split(',') |
| Replace text | .str.replace() | series.str.replace('a', 'b') |
| Lowercase | .str.lower() | series.str.lower() |
| Uppercase | .str.upper() | series.str.upper() |
| Title case | .str.title() | series.str.title() |
| Remove spaces | .str.strip() | series.str.strip() |
| Count characters | .str.len() | series.str.len() |

🚀 You Did It!

You now have a complete toolkit for handling text in Pandas:

  1. .str - The key that unlocks everything
  2. Matching - Find what you’re looking for
  3. Extract - Pull out patterns with regex
  4. Split - Cut text into pieces
  5. Replace - Swap old for new
  6. Case/Whitespace - Clean and standardize
  7. Len - Measure your text

With these tools, messy text data doesn’t stand a chance! 🎉
