String Matching🔗
Library Solution🔗
Modern programming languages provide library solutions for basic string matching. These solutions are highly optimized and can be used directly for simple problems
C++🔗
C++ offers several built-in functions in the
find()
: Returns the index of the first occurrence of a substring.substr()
: Extracts a portion of the string.
string text = "hello world";
string pattern = "world";
size_t pos = text.find(pattern);
if (pos != string::npos) {
cout << "Pattern found at index " << pos << endl;
} else {
cout << "Pattern not found" << endl;
}
Python🔗
Python simplifies string matching with built-in methods:
find()
: Returns the index of the first occurrence of a substring or -1 if not found.in
operator: Checks for substring presence.
text = "hello world"
pattern = "world"
if pattern in text:
print(f"Pattern found at index {text.find(pattern)}")
else:
print("Pattern not found")
String Hashing🔗
String hashing is a preprocessing technique that converts a string into a numeric value (hash). This value allows efficient comparisons of substrings.
Algorithm
- Choose a base (e.g., 31) and a modulus (e.g., ).
-
Compute the hash of the string using the formula:
- Use prefix hashes to compute substring hashes efficiently.
vector<long long> compute_hash(string s, int base, int mod) {
vector<long long> hash(s.size() + 1, 0);
long long power = 1;
for (int i = 0; i < s.size(); i++) {
hash[i + 1] = (hash[i] * base + (s[i] - 'a' + 1)) % mod;
power = (power * base) % mod;
}
return hash;
}
Rabin-Karp🔗
The Rabin-Karp algorithm uses hashing to find all occurrences of a pattern in a text.
Algorithm
- Precompute the hash of the pattern and the initial substring of the text of the same length.
- Slide over the text:
- If the hashes match, verify the strings.
- Recompute the hash for the next substring in O(1)
- Complexity: O(N + M) on average.
- Things to consider while implementation
- Always verify with string comparison after hash match
- Consider double hashing for critical applications
- Birthday Paradox - The Birthday Paradox refers to the counterintuitive probability result that in a group of just 23 people, there is about a 50% chance that at least two people share the same birthday. Shows the vulnerability of hash functions to collisions, leading to the "birthday attack" where finding two inputs that hash to the same output is easier than expected.
bool rabin_karp(string text, string pattern) {
int n = text.size(), m = pattern.size();
if (m > n) return false; // Pattern longer than text
int base = 31;
int mod = 1000000007;
long long pattern_hash = 0, text_hash = 0, power = 1;
// Precompute power = base^(m-1) % mod
for (int i = 0; i < m - 1; i++) {
power = (power * base) % mod;
}
// Compute initial hashes for pattern and first window of text
for (int i = 0; i < m; i++) {
pattern_hash = (pattern_hash * base + (pattern[i] - 'a' + 1)) % mod;
text_hash = (text_hash * base + (text[i] - 'a' + 1)) % mod;
}
for (int i = 0; i <= n - m; i++) {
// Check hash match and then verify substring to avoid false positives
if (pattern_hash == text_hash && text.substr(i, m) == pattern) {
return true;
}
// Compute hash for next window (if any)
if (i < n - m) {
text_hash = (text_hash - ((text[i] - 'a' + 1) * power) % mod + mod) % mod; // Remove leading char
text_hash = (text_hash * base + (text[i + m] - 'a' + 1)) % mod; // Add trailing char
}
}
return false;
}
Z-Function🔗
The Z-function computes an array where Z[i] is the length of the longest substring starting from i that matches the prefix of the string.
Applications
- Pattern Matching (Exact Match) : concatenate P and T :
P + '$' + T
, compute Z-function for such string. If , the pattern occurs starting at ) - Finding Periods in Strings
- Counting Unique Substrings
- String Compression
Algorithm
- Initialize Z[0] = 0 .
- Maintain a window [L, R] that matches the prefix.
- For each i :
- If , use previously computed values.
- Otherwise, compute Z[i] directly
- Complexity: O(N)
vector<int> compute_z(string s) {
int n = s.size();
vector<int> z(n, 0);
int l = 0, r = 0;
for (int i = 1; i < n; i++) {
if (i <= r) z[i] = min(r - i + 1, z[i - l]);
while (i + z[i] < n && s[z[i]] == s[i + z[i]]) z[i]++;
if (i + z[i] - 1 > r) {
l = i;
r = i + z[i] - 1;
}
}
return z;
}
Prefix Function (KMP) Algorithm🔗
The Knuth-Morris-Pratt (KMP) algorithm avoids redundant comparisons by precomputing a prefix function(lps).
Algorithm
- Compute the prefix function:
- is the length of the longest prefix that is also a suffix for the substring ending at i .
- Use to skip unnecessary comparisons
- Complexity: O(N + M).
vector<int> compute_prefix_function(string pattern) {
int m = pattern.size();
vector<int> pi(m, 0);
for (int i = 1, j = 0; i < m; i++) {
while (j > 0 && pattern[i] != pattern[j]) j = pi[j - 1];
if (pattern[i] == pattern[j]) j++;
pi[i] = j;
}
return pi;
}
// using above function to perform KMP Search
int strStr(string haystack, string needle) {
int n = needle.size();
if (n == 0) return 0;
vector<int> lps = compute_prefix_function(needle);
// Perform the KMP search
int i = 0;
j = 0;
while (i < haystack.size()) {
if (haystack[i] == needle[j]) {
i++;
j++;
}
if (j == n) {
return i - n; // Pattern found at index (i - n)
}
if (i < haystack.size() && haystack[i] != needle[j]) {
if (j > 0) {
j = lps[j - 1];
} else {
i++;
}
}
}
return -1; // Pattern not found
}