Regular Expressions: The Secret Code Finder
Imagine you have a magical magnifying glass that can find ANY pattern in a mountain of text. That's regex!
The Story: Meet Detective Regex
You're a detective. Your job? Finding specific patterns in huge piles of letters, emails, and documents.
Without regex, you'd read every single word. Boring!
With regex, you write a magic search spell and, BOOM, every match lights up instantly.
Let's learn this superpower!
What We'll Learn
- Regex Basics
- Pattern Matching Functions
- Match Objects and Groups
- Metacharacters
- Quantifiers and Anchors
- Greedy vs Non-Greedy
- Regex Flags
1️⃣ Regex Basics
What is Regex?
Regex = Regular Expression = A pattern you write to find text.
Think of it like a treasure map. The pattern is your map. The text is the jungle. Regex finds the treasure!
Your First Regex in Python
import re
text = "I love cats and dogs"
pattern = "cats"
result = re.search(pattern, text)
print(result)  # <re.Match object; span=(7, 11), match='cats'>  Found it!
What happened?
- We imported the re module (Python's regex tool)
- We wrote a simple pattern: "cats"
- re.search() found "cats" in our text
The r Prefix (Raw Strings)
Always use r before your pattern:
pattern = r"\d+" # Good!
pattern = "\d+" # Risky!
Why? The r tells Python: "Don't mess with my backslashes!"
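To see what the r prefix actually changes, here is a small sketch. The variable names are just for illustration.

```python
import re

# In a normal string, "\n" is ONE character (a newline).
normal = "\n"
# In a raw string, r"\n" is TWO characters: a backslash and the letter n.
raw = r"\n"
print(len(normal), len(raw))  # 1 2

# "\b" in a normal string is a backspace character, not a word boundary,
# so only the raw version behaves like the regex we meant to write.
text = "cat category"
print(re.findall("\bcat\b", text))   # []
print(re.findall(r"\bcat\b", text))  # ['cat']
```

The first search finds nothing because Python turned each \b into a backspace before the regex engine ever saw it.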
2️⃣ Pattern Matching Functions
Python gives us 4 main tools:
re.search() - Find First Match
import re
text = "Call me at 555-1234"
match = re.search(r"\d+", text)
if match:
    print(match.group())  # 555
Finds the first number in the text.
re.match() - Check the Beginning
text = "Hello World"
# This works (starts with Hello)
re.match(r"Hello", text)  # returns a match object
# This fails (World is not at start)
re.match(r"World", text)  # returns None
match() only looks at the beginning!
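A quick side-by-side sketch of the difference, reusing the same text. The names at_start and anywhere are made up for this example.

```python
import re

text = "Hello World"

# re.match only tries position 0; re.search scans the whole string.
at_start = re.match(r"World", text)
anywhere = re.search(r"World", text)

print(at_start)          # None
print(anywhere.group())  # World
print(anywhere.start())  # 6
```

Because match() returns None when nothing is found, check the result before calling .group() on it.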
re.findall() - Find ALL Matches
text = "I have 2 cats and 3 dogs"
numbers = re.findall(r"\d", text)
print(numbers) # ['2', '3']
Returns a list of all matches!
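One thing to watch: \d grabs a single digit, so multi-digit numbers get split up. The sketch below uses a slightly different sentence to show this. It also previews how findall() behaves once a pattern contains groups (groups are covered in section 3).

```python
import re

text = "I have 12 cats and 3 dogs"

# \d matches one digit at a time; \d+ keeps multi-digit numbers together.
print(re.findall(r"\d", text))   # ['1', '2', '3']
print(re.findall(r"\d+", text))  # ['12', '3']

# With two groups, each result is a tuple of the captured pieces.
pairs = re.findall(r"(\d+) (\w+)", text)
print(pairs)  # [('12', 'cats'), ('3', 'dogs')]
```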
re.sub() - Find and Replace
text = "I hate Mondays"
new_text = re.sub(r"hate", "love", text)
print(new_text) # I love Mondays
Like Find-Replace in your text editor!
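re.sub() can do a bit more than a plain swap. This sketch shows two extras from the standard re module: the count argument, and passing a function as the replacement. The helper name shout is hypothetical.

```python
import re

text = "I hate Mondays. I hate traffic."

# count limits how many replacements are made (0, the default, means all).
print(re.sub(r"hate", "love", text, count=1))
# I love Mondays. I hate traffic.

# The replacement can also be a function that receives each match object.
def shout(match):
    return match.group().upper()

print(re.sub(r"hate", shout, text))
# I HATE Mondays. I HATE traffic.
```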
3️⃣ Match Objects and Groups
What's a Match Object?
When regex finds something, it creates a Match Object: a little package of info.
text = "My email is bob@mail.com"
match = re.search(r"\w+@\w+\.\w+", text)
if match:
    print(match.group())  # bob@mail.com
    print(match.start())  # 12 (where it starts)
    print(match.end())    # 24 (where it ends)
    print(match.span())   # (12, 24)
Groups: Capture Parts
Use parentheses () to capture pieces:
text = "Born on 2005-03-15"
pattern = r"(\d{4})-(\d{2})-(\d{2})"
match = re.search(pattern, text)
if match:
    print(match.group(0))  # 2005-03-15 (full)
    print(match.group(1))  # 2005 (year)
    print(match.group(2))  # 03 (month)
    print(match.group(3))  # 15 (day)
Think of it like boxes inside boxes!
+------------------------+
| Full Match (group 0)   |
| +-----+ +----+ +----+  |
| |2005 | | 03 | | 15 |  |
| | (1) | |(2) | |(3) |  |
| +-----+ +----+ +----+  |
+------------------------+
Named Groups
Give your groups names for clarity:
pattern = r"(?P<year>\d{4})-(?P<month>\d{2})"
match = re.search(pattern, "Date: 2024-08")
print(match.group('year')) # 2024
print(match.group('month')) # 08
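If you want all the captured pieces at once, match objects also offer groups() and groupdict(). Here is a sketch that extends the date pattern with a day group, which is an addition for this example.

```python
import re

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
match = re.search(pattern, "Born on 2005-03-15")

if match:
    # groups() returns every captured group as a tuple.
    year, month, day = match.groups()
    print(year, month, day)   # 2005 03 15
    # groupdict() maps each group name to its text.
    print(match.groupdict())  # {'year': '2005', 'month': '03', 'day': '15'}
```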
4️⃣ Metacharacters
Metacharacters are magic symbols with special powers:
The Dot . - Match Any Character
re.findall(r"c.t", "cat cot cut")
# ['cat', 'cot', 'cut']
The . matches any single character (except a newline, unless re.DOTALL is set)!
Character Classes []
# Match a, e, i, o, or u
re.findall(r"[aeiou]", "hello")
# ['e', 'o']
# Match any digit
re.findall(r"[0-9]", "abc123")
# ['1', '2', '3']
Negation [^]
# Match anything EXCEPT vowels
re.findall(r"[^aeiou]", "hello")
# ['h', 'l', 'l']
Shorthand Classes
| Symbol | Meaning | Same As |
|---|---|---|
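| \d | Any digit | [0-9] |
| \D | Not a digit | [^0-9] |
| \w | Word character | [a-zA-Z0-9_] |
| \W | Not word char | [^a-zA-Z0-9_] |
| \s | Whitespace | [ \t\n\r\f\v] |
| \S | Not whitespace | [^ \t\n\r\f\v] |

These equivalents are exact when the re.ASCII flag is set. By default, Python's str patterns also match Unicode digits, letters, and whitespace.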
text = "Call 555-1234 now!"
re.findall(r"\d", text) # ['5','5','5','1'...]
re.findall(r"\w+", text) # ['Call','555','1234','now']
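The "Same As" column assumes plain ASCII text. The sketch below shows how \w treats an accented letter with and without the re.ASCII flag. The sample text is made up for the demo.

```python
import re

text = "caf\u00e9 42"  # the word "café" followed by a number

# By default \w is Unicode-aware for str patterns,
# so the accented letter counts as a word character.
print(re.findall(r"\w+", text))            # ['café', '42']
# With re.ASCII, \w behaves exactly like [a-zA-Z0-9_].
print(re.findall(r"\w+", text, re.ASCII))  # ['caf', '42']
```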
The Pipe | - OR
re.findall(r"cat|dog", "I have a cat and dog")
# ['cat', 'dog']
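The pipe splits the whole pattern unless you fence it in with parentheses. This sketch uses a non-capturing group, (?:...), which is standard re syntax not covered above. The gray/grey text is an invented example.

```python
import re

text = "gray grey"

# Without parentheses, | splits the WHOLE pattern: "gra" OR "ey".
print(re.findall(r"gra|ey", text))      # ['gra', 'ey']
# (?:...) groups without capturing, so | only chooses between "a" and "e".
print(re.findall(r"gr(?:a|e)y", text))  # ['gray', 'grey']
```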
5️⃣ Quantifiers and Anchors
Quantifiers: How Many?
| Symbol | Meaning | Example |
|---|---|---|
| * | 0 or more | a* → "", "a", "aaa" |
| + | 1 or more | a+ → "a", "aaa" |
| ? | 0 or 1 | a? → "", "a" |
| {n} | Exactly n | a{3} → "aaa" |
| {n,} | n or more | a{2,} → "aa", "aaa" |
| {n,m} | n to m | a{2,4} → "aa", "aaa", "aaaa" |
text = "goood morning gooooood day"
re.findall(r"go+d", text)
# ['goood', 'gooooood']
re.findall(r"go{2,4}d", text)
# ['goood'] (only 2-4 o's)
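A few more quantifiers from the table in action. The sample strings are invented for this sketch.

```python
import re

# ? makes the "u" optional, so both spellings match.
print(re.findall(r"colou?r", "color colour"))     # ['color', 'colour']
# {3} demands exactly three digits; \b keeps longer numbers out.
print(re.findall(r"\b\d{3}\b", "7 42 123 4567"))  # ['123']
# * allows zero occurrences, so "ac" matches as well.
print(re.findall(r"ab*c", "ac abc abbbc"))        # ['ac', 'abc', 'abbbc']
```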
Anchors: Where to Look?
| Symbol | Meaning |
|---|---|
| ^ | Start of string |
| $ | End of string |
| \b | Word boundary |
text = "hello world"
re.search(r"^hello", text)  # matches
re.search(r"^world", text)  # no match
re.search(r"world$", text)  # matches
re.search(r"hello$", text)  # no match
Word Boundaries:
text = "cat category caterpillar"
re.findall(r"\bcat\b", text)
# ['cat'] - only the standalone word!
re.findall(r"cat", text)
# ['cat', 'cat', 'cat'] - all occurrences
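Using ^ and $ together is a common way to check that a whole string fits a pattern, not just a piece of it. A minimal sketch with made-up inputs:

```python
import re

# Without anchors, the pattern may match just part of the string.
print(bool(re.search(r"\d{4}", "abc12345")))    # True
# With ^ and $, the whole string must be exactly four digits.
print(bool(re.search(r"^\d{4}$", "abc12345")))  # False
print(bool(re.search(r"^\d{4}$", "2024")))      # True
```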
6️⃣ Greedy vs Non-Greedy
The Hungry Monster (Greedy)
By default, regex is GREEDY. It wants as much as possible!
text = "<h1>Title</h1><p>Text</p>"
# Greedy (default)
re.findall(r"<.*>", text)
# ['<h1>Title</h1><p>Text</p>']
# Ate EVERYTHING between first < and last >
The Polite Monster (Non-Greedy)
Add ? after a quantifier to make it lazy:
# Non-greedy
re.findall(r"<.*?>", text)
# ['<h1>', '</h1>', '<p>', '</p>']
# Takes minimum needed!
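Lazy quantifiers are not the only way to tame the monster. A negated character class gets the same result here, and a lazy group can pull out the text between tags. This is a sketch for simple strings like the one above, not a real HTML parser.

```python
import re

text = "<h1>Title</h1><p>Text</p>"

# [^>]* means "any characters that are not >", so it cannot run past a tag's end.
print(re.findall(r"<[^>]*>", text))         # ['<h1>', '</h1>', '<p>', '</p>']

# Lazy matching with a capture group pulls out the text between the tags.
print(re.findall(r"<h1>(.*?)</h1>", text))  # ['Title']
```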
Visual Comparison
Text:           <b>bold</b>

Greedy <.*>  :  <--------->
                <b>bold</b>

Lazy   <.*?> :  <->    <-->
                <b>    </b>
All Non-Greedy Versions
| Greedy | Non-Greedy |
|---|---|
| * | *? |
| + | +? |
| ? | ?? |
| {n,m} | {n,m}? |
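Here is how a couple of these pairs behave on the same input. The digits string is just a sample for the sketch.

```python
import re

digits = "123456"

print(re.search(r"\d+", digits).group())       # 123456 (as many as possible)
print(re.search(r"\d+?", digits).group())      # 1      (as few as possible)
print(re.search(r"\d{2,4}", digits).group())   # 1234   (takes the maximum, 4)
print(re.search(r"\d{2,4}?", digits).group())  # 12     (takes the minimum, 2)
```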
7️⃣ Regex Flags
Flags change how your pattern works:
re.IGNORECASE (or re.I)
text = "Hello HELLO hello"
re.findall(r"hello", text)
# ['hello']
re.findall(r"hello", text, re.I)
# ['Hello', 'HELLO', 'hello']
re.MULTILINE (or re.M)
Makes ^ and $ work on each line:
text = """Line 1
Line 2
Line 3"""
re.findall(r"^Line", text)
# ['Line'] - only first line
re.findall(r"^Line", text, re.M)
# ['Line', 'Line', 'Line'] - all lines!
re.DOTALL (or re.S)
Makes . match newlines too:
text = "Hello\nWorld"
re.search(r"Hello.World", text)        # no match
re.search(r"Hello.World", text, re.S)  # match!
re.VERBOSE (or re.X)
Write readable patterns with comments:
pattern = r"""
    \d{3}  # First three digits
    -      # Separator
    \d{4}  # Last four digits
"""
re.search(pattern, "555-1234", re.X)
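Verbose patterns pair nicely with re.compile(), which builds the pattern once so you can reuse it. In this sketch the name PHONE and the two capture groups are additions for illustration.

```python
import re

# PHONE is a hypothetical name; re.compile is part of the standard re module.
PHONE = re.compile(
    r"""
    (\d{3})   # first three digits
    -         # separator
    (\d{4})   # last four digits
    """,
    re.VERBOSE,
)

match = PHONE.search("Call 555-1234 now")
print(match.group())   # 555-1234
print(match.groups())  # ('555', '1234')
```

In verbose mode, whitespace inside the pattern is ignored and # starts a comment, so the layout does not change what gets matched.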
Combining Flags
Use the | operator:
re.findall(r"hello", text, re.I | re.M)
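A small sketch of two flags working together, using a made-up three-line text:

```python
import re

text = "hello there\nHELLO again\nsay Hello"

# re.I ignores case; re.M lets ^ match at the start of every line.
print(re.findall(r"^hello", text, re.I | re.M))  # ['hello', 'HELLO']
# With re.I alone, ^ only matches at the very start of the text.
print(re.findall(r"^hello", text, re.I))         # ['hello']
```

The third line is skipped in both cases because its "Hello" is not at the start of the line.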
Quick Reference Flow
graph TD
    A["Start"] --> B{What do you need?}
    B --> C["Find first match"]
    C --> D["re.search"]
    B --> E["Check start only"]
    E --> F["re.match"]
    B --> G["Find all matches"]
    G --> H["re.findall"]
    B --> I["Replace text"]
    I --> J["re.sub"]
Real-World Examples
Validate an Email
pattern = r"^[\w.-]+@[\w.-]+\.\w+$"
re.match(pattern, "user@email.com")  # matches
re.match(pattern, "bad-email")       # no match (returns None)
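One way to wrap this check in a reusable function is sketched below. The names EMAIL and is_valid_email are hypothetical. The pattern is the same simplified one as above, so treat it as a rough filter, not a complete email validator. It uses re.fullmatch(), which requires the entire string to fit and so needs no ^ or $.

```python
import re

# EMAIL and is_valid_email are hypothetical names for this sketch.
EMAIL = re.compile(r"[\w.-]+@[\w.-]+\.\w+")

def is_valid_email(address):
    # fullmatch succeeds only if the ENTIRE string fits the pattern.
    return EMAIL.fullmatch(address) is not None

print(is_valid_email("user@email.com"))    # True
print(is_valid_email("bad-email"))         # False
print(is_valid_email("user@email.com\n"))  # False
```

The last case is the reason to prefer fullmatch() here: $ also matches just before a trailing newline, so a pattern ending in $ would accept that string.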
Extract Phone Numbers
text = "Call 555-123-4567 or 999-876-5432"
pattern = r"\d{3}-\d{3}-\d{4}"
re.findall(pattern, text)
# ['555-123-4567', '999-876-5432']
Clean Extra Spaces
text = "Too    many     spaces"
clean = re.sub(r"\s+", " ", text)
print(clean) # "Too many spaces"
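If the text also has stray whitespace at the ends, adding strip() tidies that up too. A minimal sketch with a messier, made-up input:

```python
import re

text = "  Too    many\t\tspaces \n"

# \s+ collapses any run of whitespace (spaces, tabs, newlines) into one space;
# strip() then removes the single space left at each end.
clean = re.sub(r"\s+", " ", text).strip()
print(repr(clean))  # 'Too many spaces'
```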
You Did It!
You now have regex superpowers!
Remember:
- search() finds first
- findall() finds all
- sub() replaces
- Groups () capture parts
- Flags change behavior
Practice makes perfect. Try building patterns for:
- URLs
- Dates
- Usernames
- Hashtags
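To get you started, here is one possible sketch for hashtags. It assumes a hashtag is a # followed by one or more word characters, which is a simplification. The sample post is made up.

```python
import re

post = "Loving #Python and #regex_tips today! #100DaysOfCode"

# Assumption: a hashtag is "#" followed by one or more word characters.
hashtags = re.findall(r"#\w+", post)
print(hashtags)  # ['#Python', '#regex_tips', '#100DaysOfCode']
```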
Happy pattern hunting!
