Skip to content

String Matching🔗

Library Solution🔗

Modern programming languages provide library solutions for basic string matching. These solutions are highly optimized and can be used directly for simple problems

C++🔗

C++ offers several built-in functions in the library for string matching:

  • find(): Returns the index of the first occurrence of a substring.
  • substr(): Extracts a portion of the string.
string text = "hello world";
string pattern = "world";
size_t pos = text.find(pattern);
if (pos != string::npos) {
    cout << "Pattern found at index " << pos << endl;
} else {
    cout << "Pattern not found" << endl;
}

Python🔗

Python simplifies string matching with built-in methods:

  • find(): Returns the index of the first occurrence of a substring or -1 if not found.
  • in operator: Checks for substring presence.
text = "hello world"
pattern = "world"
if pattern in text:
    print(f"Pattern found at index {text.find(pattern)}")
else:
    print("Pattern not found")

String Hashing🔗

String hashing is a preprocessing technique that converts a string into a numeric value (hash). This value allows efficient comparisons of substrings.

Algorithm

  • Choose a base (e.g., 31) and a modulus (e.g., ).
  • Compute the hash of the string using the formula:

  • Use prefix hashes to compute substring hashes efficiently.
vector<long long> compute_hash(string s, int base, int mod) {
    vector<long long> hash(s.size() + 1, 0);
    long long power = 1;
    for (int i = 0; i < s.size(); i++) {
        hash[i + 1] = (hash[i] * base + (s[i] - 'a' + 1)) % mod;
        power = (power * base) % mod;
    }
    return hash;
}

Rabin-Karp🔗

The Rabin-Karp algorithm uses hashing to find all occurrences of a pattern in a text.

Algorithm

  • Precompute the hash of the pattern and the initial substring of the text of the same length.
  • Slide over the text:
    • If the hashes match, verify the strings.
    • Recompute the hash for the next substring in O(1)
  • Complexity: O(N + M) on average.
bool rabin_karp(string text, string pattern) {
    int n = text.size(), m = pattern.size();
    int base = 31, mod = 1e9 + 7;
    long long pattern_hash = 0, text_hash = 0, power = 1;

    for (int i = 0; i < m; i++) {
        pattern_hash = (pattern_hash * base + (pattern[i] - 'a' + 1)) % mod;
        text_hash = (text_hash * base + (text[i] - 'a' + 1)) % mod;
        if (i > 0) power = (power * base) % mod;
    }

    for (int i = 0; i <= n - m; i++) {
        if (pattern_hash == text_hash && text.substr(i, m) == pattern)
            return true;
        if (i < n - m) {
            text_hash = (text_hash - (text[i] - 'a' + 1) * power) % mod;
            text_hash = (text_hash * base + (text[i + m] - 'a' + 1)) % mod;
            if (text_hash < 0) text_hash += mod;
        }
    }
    return false;
}

Z-Function🔗

The Z-function computes an array where Z[i] is the length of the longest substring starting from i that matches the prefix of the string.

Applications

  • Pattern Matching (Exact Match) : concatenate P and T : P + '$' + T, compute Z-function for such string. If , the pattern occurs starting at )
  • Finding Periods in Strings
  • Counting Unique Substrings
  • String Compression

Algorithm

  • Initialize Z[0] = 0 .
  • Maintain a window [L, R] that matches the prefix.
  • For each i :
    • If , use previously computed values.
    • Otherwise, compute Z[i] directly
  • Complexity: O(N)
vector<int> compute_z(string s) {
    int n = s.size();
    vector<int> z(n, 0);
    int l = 0, r = 0;

    for (int i = 1; i < n; i++) {
        if (i <= r) z[i] = min(r - i + 1, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) z[i]++;
        if (i + z[i] - 1 > r) {
            l = i;
            r = i + z[i] - 1;
        }
    }
    return z;
}

Prefix Function (KMP) Algorithm🔗

The Knuth-Morris-Pratt (KMP) algorithm avoids redundant comparisons by precomputing a prefix function(lps).

Algorithm

  • Compute the prefix function:
    • is the length of the longest prefix that is also a suffix for the substring ending at i .
  • Use to skip unnecessary comparisons
  • Complexity: O(N + M).
vector<int> compute_prefix_function(string pattern) {
    int m = pattern.size();
    vector<int> pi(m, 0);

    for (int i = 1, j = 0; i < m; i++) {
        while (j > 0 && pattern[i] != pattern[j]) j = pi[j - 1];
        if (pattern[i] == pattern[j]) j++;
        pi[i] = j;
    }
    return pi;
}
// using above function to perform KMP Search
int strStr(string haystack, string needle) {
    int n = needle.size();
    if (n == 0) return 0;

    vector<int> lps = compute_prefix_function(needle);

    // Perform the KMP search
    int i = 0; 
    j = 0; 
    while (i < haystack.size()) {
        if (haystack[i] == needle[j]) {
            i++;
            j++;
        }
        if (j == n) {
            return i - n; // Pattern found at index (i - n)
        }
        if (i < haystack.size() && haystack[i] != needle[j]) {
            if (j > 0) {
                j = lps[j - 1];
            } else {
                i++;
            }
        }
    }

    return -1; // Pattern not found
}

Suffix Array🔗

A suffix array is a sorted array of all suffixes of a string.

Algorithm

  • Sort suffixes of the string using:
    • Radix Sort or precomputed ranks.
  • Store starting indices of sorted suffixes.
  • Complexity: .
vector<int> build_suffix_array(string s) {
    int n = s.size();
    vector<int> suffix_array(n), rank(n), temp(n);

    for (int i = 0; i < n; i++) suffix_array[i] = i, rank[i] = s[i];

    for (int k = 1; k < n; k *= 2) {
        auto compare = [&](int i, int j) {
            if (rank[i] != rank[j]) return rank[i] < rank[j];
            int ri = (i + k < n) ? rank[i + k] : -1;
            int rj = (j + k < n) ? rank[j + k] : -1;
            return ri < rj;
        };
        sort(suffix_array.begin(), suffix_array.end(), compare);

        temp[suffix_array[0]] = 0;
        for (int i = 1; i < n; i++) {
            temp[suffix_array[i]] = temp[suffix_array[i - 1]] + compare(suffix_array[i - 1], suffix_array[i]);
        }
        rank = temp;
    }
    return suffix_array;
}