Suffix Tree🔗
- A suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values.
- Allows particularly fast implementations of many important string operations
- The Suffix tree for a text
X
of sizen
from an alphabet of sized
- stores all the
n(n-1)/2
suffixes ofX
inO(n)
space. - can be constructed in
O(dn)
time - supports arbitrary pattern matching and prefix matching queries in
O(dm)
time, wherem
is length of the pattern
- stores all the
Example - Given a string banana$
create its suffix tree. NOTE: $
: represents string termination.
Generating the suffixes by hand would give us following : banana$, anana$, nana$, ana$, na$, a$ , $
Construct a trie for the data, which is a space-intensive method to represent it as a tree. Instead, create a suffix tree from the data.
![]() |
---|
Above representation is not efficient in terms of space, we can represent a suffix as shown below. |
![]() |
The number in the leaf refers to the index into s where the suffix represented by the leaf begins. |
- Actual algorithm work in following way, To add a new suffix to tree, we walk down the current tree, until we come to a place where the path leads off of the current tree.
- This could happen at anywhere in tree, middle of the edge, or at an already existing node,
- In first cast we split the edge in two and add a new node with branching factor of 2 in the middle
- second case simply add a new edge from an already existing node.
- Thus number of node in the tree is O(n) (NOTE: Naive construction above is still which can be optimized)
- NOTE: Ukkonen’n Algorithm (linear time) for creating suffix tree. Link
Suffix Array🔗
- in practice, people rarely implement full suffix trees, especially not Ukkonen’s algorithm
- In a practical scenario : Suffix Array + LCP Array = Practical Suffix Tree Substitute
- A suffix array is simply a sorted list of all suffixes of the string. Taking previous example.
Suffixes: Sorted Order:
0: banana$ 6: $
1: anana$ 5: a$
2: nana$ 3: ana$
3: ana$ 1: anana$
4: na$ 0: banana$
5: a$ 4: na$
6: $ 2: nana$
- Suffix Array would be :
[6, 5, 3, 1, 0, 4, 2]
- LCP Array is Longest Common Prefix : It tells the length of the common prefix between consecutive suffixes in the sorted suffix array. Here LCP =
[0, 1, 3, 0, 0, 2, 0]
Feature | Suffix Tree | Suffix Array + LCP |
---|---|---|
Construction Time | O(n) with Ukkonen’s algo | O(n log n) or O(n) with tricks |
Space Usage | High (nodes, pointers) | Low (arrays only) |
Ease of Implementation | Very hard | Easy to moderate |
Pattern Matching | O(m) | O(m log n) |
Use in Practice | Rarely used directly | Common (bioinformatics, search engines, etc.) |
Suffix Array Implementation🔗
- Build Suffix Array in
- Build LCP Array using Kasai’s algorithm in
- Supports
- Pattern Matching
- Longest Repeated Substring
- Count of distinct substring
#include <bits/stdc++.h>
using namespace std;
class SuffixStructure {
private:
string s;
int n;
vector<int> sa, rank, lcp;
void buildSA() {
sa.resize(n);
vector<int> tmp(n), cnt(max(256, n));
iota(sa.begin(), sa.end(), 0);
rank.resize(n);
for (int i = 0; i < n; ++i) rank[i] = s[i];
for (int k = 1; k < n; k <<= 1) {
auto cmp = [&](int i, int j) {
if (rank[i] != rank[j]) return rank[i] < rank[j];
int ri = (i + k < n) ? rank[i + k] : -1;
int rj = (j + k < n) ? rank[j + k] : -1;
return ri < rj;
};
sort(sa.begin(), sa.end(), cmp);
tmp[sa[0]] = 0;
for (int i = 1; i < n; ++i)
tmp[sa[i]] = tmp[sa[i - 1]] + cmp(sa[i - 1], sa[i]);
rank = tmp;
}
}
void buildLCP() {
lcp.resize(n);
vector<int> inv(n);
for (int i = 0; i < n; ++i) inv[sa[i]] = i;
int k = 0;
for (int i = 0; i < n; ++i) {
if (inv[i] == 0) continue;
int j = sa[inv[i] - 1];
while (i + k < n && j + k < n && s[i + k] == s[j + k]) k++;
lcp[inv[i]] = k;
if (k) k--;
}
}
public:
SuffixStructure(const string &str) : s(str + "$"), n(str.size() + 1) {
buildSA();
buildLCP();
}
// Find if pattern exists in O(m log n)
bool contains(const string &pattern) {
int l = 0, r = n - 1;
while (l <= r) {
int m = (l + r) / 2;
string suffix = s.substr(sa[m], pattern.size());
if (suffix == pattern) return true;
else if (suffix < pattern) l = m + 1;
else r = m - 1;
}
return false;
}
// Longest repeated substring
string longestRepeatedSubstring() {
int len = 0, pos = 0;
for (int i = 1; i < n; ++i) {
if (lcp[i] > len) {
len = lcp[i];
pos = sa[i];
}
}
return s.substr(pos, len);
}
// Number of distinct substrings
long long countDistinctSubstrings() {
long long total = 1LL * n * (n - 1) / 2; // sum of lengths of all suffixes
long long repeated = accumulate(lcp.begin(), lcp.end(), 0LL);
return total - repeated;
}
// Getters
vector<int> getSA() { return sa; }
vector<int> getLCP() { return lcp; }
};
Problems🔗
Problem | Problem Type | Optimal with | Notes |
---|---|---|---|
#336. Palindrome Pairs | Substring checks | Trie / Suffix tree idea | Though typically solved with tries, suffix structures help in understanding efficient string lookups. |
#472. Concatenated Words | Substring parsing | Trie / DP | Suffix tree logic helps if asked to optimize DP further. |
#920. Number of Music Playlists | Rare use case | DP + Combinatorics | Suffix trees aren’t a fit here, but LCP might be used to optimize substring checks. |
#1044. Longest Duplicate Substring | Direct application | Suffix Array + LCP | This is the clearest modern Suffix Tree replacement problem on LeetCode. |
#1153. String Compression II | Optimization | DP + Preprocessing | Advanced string structure can help, but not common in brute-force. |
#1326. Minimum Number of Taps to Open to Water a Garden | Interval covering | Segment Tree | Not suffix-related, but string segment problems sometimes align with this. |
#1178. Number of Valid Words for Each Puzzle | Bitmask / Trie | Trie / hashing | Again, tries are more practical here. |
Summary🔗
- Most of technical interviews will not ask for Suffix Tree/Suffix Array unless you are going for competitive coding, in that case Suffix Array + LCP would be enough to answer most of the questions