🔍 String Matching: Finding Needles in a Haystack
Imagine you lost your favorite toy in a huge pile of similar-looking toys. How would you find it? Would you check every single toy one by one? That works, but it’s slow! What if you had a magic trick to skip the toys you already know don’t match?
That’s exactly what string matching algorithms do! They help computers find a small word (called a pattern) inside a big text (called the haystack) — and they do it super fast.
🎯 What You’ll Learn
- KMP Algorithm — The “Memory Helper” that never forgets
- Rabin-Karp Algorithm — The “Fingerprint Detective”
- Longest Common Prefix (LCP) — Finding what strings have in common
📚 The Story of the Lost Word
Let’s say you have a long story:
TEXT: "ABABDABACDABABCABAB"
And you’re looking for:
PATTERN: "ABABCABAB"
The slow way: Compare letter by letter. If something doesn’t match, slide forward one position and start comparing all over again. This is like re-checking toys you have already looked at!
The smart way: Remember what you already checked. If something doesn’t match, use your memory to skip ahead!
🧠 1. KMP Algorithm (Knuth-Morris-Pratt)
The “Memory Helper” Trick
Imagine this: You’re reading a book looking for the word “ABABC”. You’re at “ABAB” and the next letter is “D” instead of “C”.
❌ Dumb approach: Go back to the beginning and start over.
✅ KMP approach: Wait! “AB” at the end of “ABAB” looks just like “AB” at the beginning! I can continue from there!
How KMP Works
KMP builds a “failure function” (also called LPS — Longest Prefix Suffix) that tells us: “If we fail here, where can we safely continue?”
graph TD
    A[Start Matching] --> B{Characters Match?}
    B -->|Yes| C[Move Both Pointers Forward]
    C --> B
    B -->|No| D{Pattern at Start?}
    D -->|Yes| E[Move Text Pointer Only]
    D -->|No| F[Use LPS to Skip Ahead]
    F --> B
    E --> B
Building the LPS Array
For pattern “ABABCABAB”:
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| Char | A | B | A | B | C | A | B | A | B |
| LPS | 0 | 0 | 1 | 2 | 0 | 1 | 2 | 3 | 4 |
What does this mean?
- LPS[3] = 2 means: At position 3, the longest prefix that’s also a suffix is length 2 (“AB”)
- If we fail after position 3, we can skip back to position 2 instead of starting over!
Java Code Example
public int[] buildLPS(String pattern) {
    int[] lps = new int[pattern.length()];
    int len = 0;   // length of the current longest prefix that is also a suffix
    int i = 1;     // lps[0] is always 0, so start at index 1
    while (i < pattern.length()) {
        if (pattern.charAt(i) == pattern.charAt(len)) {
            len++;
            lps[i] = len;
            i++;
        } else if (len > 0) {
            // Fall back to the previous, shorter prefix-suffix instead of restarting
            len = lps[len - 1];
        } else {
            lps[i] = 0;
            i++;
        }
    }
    return lps;
}
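Building the LPS array is only half of KMP; the other half is the search loop that uses it. Here is a minimal sketch of that loop, assuming the buildLPS method above (the name kmpSearch and the printed output are choices made for this illustration):

public void kmpSearch(String text, String pattern) {
    if (pattern.isEmpty()) return;   // nothing to search for
    int[] lps = buildLPS(pattern);
    int i = 0;  // position in the text
    int j = 0;  // position in the pattern
    while (i < text.length()) {
        if (text.charAt(i) == pattern.charAt(j)) {
            i++;
            j++;
            if (j == pattern.length()) {
                System.out.println("Found at: " + (i - j));
                j = lps[j - 1];  // keep going to find overlapping matches
            }
        } else if (j > 0) {
            j = lps[j - 1];      // reuse what we already matched; i stays put
        } else {
            i++;                 // mismatch at the pattern start: just advance in the text
        }
    }
}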
Simple Example
Finding “ABA” in “ABABA”:
Text:    A B A B A
Pattern: A B A
         ✓ ✓ ✓        → Found at index 0!

Shift the pattern forward using the LPS array...

Text:    A B A B A
Pattern:     A B A
             ✓ ✓ ✓    → Found at index 2!
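Running a search like the kmpSearch sketch above on the story from the start of this section would report a single match:

// Using the text and pattern from "The Story of the Lost Word"
kmpSearch("ABABDABACDABABCABAB", "ABABCABAB");
// Prints: Found at: 10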
Why KMP is Fast: Time complexity is O(n + m), where n = text length and m = pattern length. The text pointer never moves backward!
🔐 2. Rabin-Karp Algorithm
The “Fingerprint Detective” Trick
Imagine: Instead of comparing every letter, what if each word had a fingerprint (a number)? You could just compare fingerprints!
- If fingerprints don’t match → Words definitely don’t match ❌
- If fingerprints match → Check the actual letters (might be a false alarm!) ✓
The Magic: Rolling Hash
The secret sauce is the rolling hash. When you slide your window by one character:
Window 1: [A B C] D E → Hash = 123
Window 2: A [B C D] E → Hash = 234 (calculated from 123!)
You don’t recalculate everything! You just:
- Remove the old first character’s contribution
- Multiply by the base (this “shifts” every remaining character one place)
- Add the new last character
Java Code Example
public void rabinKarp(String text, String pattern) {
    int d = 256; // alphabet size
    int q = 101; // a prime number
    int m = pattern.length();
    int n = text.length();
    int h = 1;
    // h = d^(m-1) % q
    for (int i = 0; i < m - 1; i++)
        h = (h * d) % q;
    // Calculate initial hashes
    int p = 0, t = 0;
    for (int i = 0; i < m; i++) {
        p = (d * p + pattern.charAt(i)) % q;
        t = (d * t + text.charAt(i)) % q;
    }
    // Slide and compare
    for (int i = 0; i <= n - m; i++) {
        if (p == t) {
            // Verify actual characters
            if (text.substring(i, i + m).equals(pattern))
                System.out.println("Found at: " + i);
        }
        // Calculate next hash
        if (i < n - m) {
            t = (d * (t - text.charAt(i) * h) + text.charAt(i + m)) % q;
            if (t < 0) t += q;
        }
    }
}
Simple Example
Finding “AB” in “AABAB”:
Let’s use a toy base-10 hash: A = 1, B = 2
Pattern "AB" → Hash = 1×10 + 2 = 12
Text windows:
Index 0: [AA]BAB → 1×10 + 1 = 11 ≠ 12 ❌
Index 1: A[AB]AB → 1×10 + 2 = 12 ✓ verify letters → Match at 1!
Index 2: AA[BA]B → 2×10 + 1 = 21 ≠ 12 ❌
Index 3: AAB[AB] → 1×10 + 2 = 12 ✓ verify letters → Match at 3!
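Here is the same rolling update written out as a tiny Java sketch, sticking to the toy base-10 hash from this example (the variable names are purely illustrative):

int base = 10;                                 // toy base from the example (A=1, B=2)
int hashAA = 1 * base + 1;                     // window "AA" → 11
int hashAB = (hashAA - 1 * base) * base + 2;   // drop the leading 'A', shift, add 'B' → 12
System.out.println(hashAA + " -> " + hashAB);  // prints 11 -> 12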
Why Rabin-Karp is Cool:
- Great for multiple pattern matching (search for many words at once; see the sketch below!)
- Average time: O(n + m)
- Worst case (many hash collisions): O(nm)
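The multiple-pattern point in the first bullet deserves a sketch. One common approach, assumed here rather than taken from the code above, is to hash every pattern once, keep the hashes in a set, and check each text window against that set; for simplicity, all patterns are assumed to have the same length m:

import java.util.HashSet;
import java.util.Set;

// Sketch: search a text for any of several equal-length patterns in one pass
public void rabinKarpMulti(String text, String[] patterns, int m) {
    if (text.length() < m) return;
    int d = 256, q = 101;
    int h = 1;
    for (int i = 0; i < m - 1; i++) h = (h * d) % q;   // h = d^(m-1) % q

    // Hash every pattern once and remember the hashes
    Set<Integer> patternHashes = new HashSet<>();
    for (String p : patterns) {
        int hp = 0;
        for (int i = 0; i < m; i++) hp = (d * hp + p.charAt(i)) % q;
        patternHashes.add(hp);
    }

    // Roll a single hash across the text and check it against the set
    int t = 0;
    for (int i = 0; i < m; i++) t = (d * t + text.charAt(i)) % q;
    for (int i = 0; i + m <= text.length(); i++) {
        if (patternHashes.contains(t)) {
            String window = text.substring(i, i + m);  // verify to rule out collisions
            for (String p : patterns)
                if (window.equals(p)) System.out.println(p + " found at: " + i);
        }
        if (i + m < text.length()) {
            t = (d * (t - text.charAt(i) * h) + text.charAt(i + m)) % q;
            if (t < 0) t += q;
        }
    }
}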
🔗 3. Longest Common Prefix (LCP)
Finding What Strings Share
Question: What do these words have in common at the start?
"flower"
"flow"
"flight"
Answer: “fl” — That’s the Longest Common Prefix!
Two Approaches
Approach 1: Horizontal Scanning
Compare strings two at a time:
graph TD
    A["flower, flow, flight"] --> B["LCP(flower, flow) = flow"]
    B --> C["LCP(flow, flight) = fl"]
    C --> D["Result: fl"]
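A minimal Java sketch of this horizontal approach might look like the following (the method name is illustrative; only the vertical-scanning version below is the one this section walks through):

public String longestCommonPrefixHorizontal(String[] strs) {
    if (strs == null || strs.length == 0) return "";
    String prefix = strs[0];
    for (int i = 1; i < strs.length; i++) {
        // Shrink the running prefix until the next string starts with it
        while (!strs[i].startsWith(prefix)) {
            prefix = prefix.substring(0, prefix.length() - 1);
            if (prefix.isEmpty()) return "";
        }
    }
    return prefix;
}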
Approach 2: Vertical Scanning
Compare column by column:
f l o w e r
f l o w
f l i g h t
↓ ↓ ↓
f l ✗ → Stop! LCP = "fl"
Java Code: Vertical Scanning
public String longestCommonPrefix(String[] strs) {
    if (strs == null || strs.length == 0)
        return "";
    // Walk the first string column by column
    for (int i = 0; i < strs[0].length(); i++) {
        char c = strs[0].charAt(i);
        // Compare this column against every other string
        for (int j = 1; j < strs.length; j++) {
            if (i >= strs[j].length() || strs[j].charAt(i) != c) {
                // Mismatch (or a shorter string ran out): the prefix ends here
                return strs[0].substring(0, i);
            }
        }
    }
    // The entire first string is the common prefix
    return strs[0];
}
Simple Example
String[] words = {"interstellar", "internet", "internal"};
// Column check:
// i: ✓  n: ✓  t: ✓  e: ✓  r: ✓  → "inter" so far
// Next column (index 5):
//   "interstellar" has 's'
//   "internet"     has 'n' ❌ STOP!
// Result: "inter"
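As a quick sanity check, calling the vertical-scanning method above on both example inputs gives:

System.out.println(longestCommonPrefix(
        new String[]{"flower", "flow", "flight"}));              // prints "fl"
System.out.println(longestCommonPrefix(
        new String[]{"interstellar", "internet", "internal"}));  // prints "inter"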
🎯 When to Use What?
| Algorithm | Best For | Time | Space |
|---|---|---|---|
| KMP | Single pattern, guaranteed linear-time search | O(n+m) | O(m) |
| Rabin-Karp | Multiple patterns, plagiarism detection | O(n+m) avg, O(nm) worst | O(1) |
| LCP | Finding common beginnings in word lists | O(S) | O(1) |
S = total number of characters across all strings
🌟 Real-World Magic
These algorithms power things you use every day:
- 🔍 Search engines — Finding your query in billions of pages
- 📝 Text editors — Ctrl+F (Find and Replace)
- 🧬 DNA matching — Finding gene sequences
- 🔒 Plagiarism checkers — Finding copied text
- 🌐 Autocomplete — Suggesting words as you type (uses LCP!)
💡 Key Takeaways
- KMP = Build a “memory map” (LPS array) so you never check the same thing twice
- Rabin-Karp = Use “fingerprints” (hashes) for quick comparisons
- LCP = Scan character-by-character to find shared beginnings
Remember the metaphors:
- 🧠 KMP is your smart friend with perfect memory
- 🔍 Rabin-Karp is a detective with fingerprint files
- 🔗 LCP is finding the family resemblance between words
🚀 You Did It!
You now understand how computers find words at lightning speed! These algorithms might seem tricky at first, but remember:
- KMP: “If I’ve seen this before, I can skip ahead!”
- Rabin-Karp: “Let me check the fingerprint first!”
- LCP: “What do we all have in common?”
Go forth and search strings like a pro! 🎉