Suffix Trees |
|
A Suffix Tree is a data-structure that allows many problems on strings
(sequences of characters) to be solved quickly.
If
The suffix tree for ' A given suffix tree can be used to search for a substring,
If the non-empty suffixes are sorted:
pat must be a prefix of a suffix
of txt , if it occurs in txt .tree substrings tree-->|---mississippi m .. mississippi | |---i-->|---ssi-->|---ssippi i .. ississippi | | | | | |---ppi issip,issipp,issippi | | | |---ppi ip, ipp, ippi | |---s-->|---si-->|---ssippi s .. ssissippi | | | | | |---ppi ssip, ssipp, ssippi | | | |---i-->|---ssippi si .. sissippi | | | |---ppi sip, sipp, sippi | |---p-->|---pi p, pp, ppi | |---i p, pi --- Suffix Tree for
"mississippi" --- Each edge (arc) of the suffix tree is labelled
with a substring of Suffix Tree DemonstrationChange the Text NB. If the string is "short", a simple sort routine is run first to sort the suffices the slow way for comparison with the tree; this is not done if the string is "long". If the termination of Building a Suffix Tree, (a) SlowlyWe show how a suffix tree might be built "by hand".
Three dots, ' tree1 tree-->----m... Adding the second character to get 'mi' there are now suffixes 'mi' and 'i': tree2 tree-->|---mi... | |---i... Next 'mis' tree3 tree-->|---mis... | |---is... | |---s... There is no need to add any more splits for 'miss' because 's' is part of 'ss'. tree4 tree-->|---miss... | |---iss... | |---ss... However, with 'missi' there must be a split because one 's' is followed by 'i', the other by 's' tree5 tree-->|---missi... | |---issi... | |---s-->|---si... | |---i... The 6th character, 's', brings us to 'missis' and no split because both 'i's are followed by 's's. tree6 tree-->|---missis... | |---issis... | |---s-->|---sis... | |---is... 'mississ' tree7 tree-->|---mississ... | |---ississ... | |---s-->|---siss... | |---iss... 'mississi' tree8 tree-->|---mississi... | |---ississi... | |---s-->|---sissi... | |---issi... A lot suddenly happens for 'mississip', because it brings the first 'p', and causes the third 'i' to be followed by 'p' where the other two are followed by 'ssi'. Consequently one of the 'ssi' is followed by 'p', the other by 'ssip', ditto 'si'. tree9 tree-->|---mississip... | |---i-->|---ssi-->|---ssip... | | | | | |---p... | | | |---p... | |---s-->|---si-->|---ssip... | | | | | |---p... | | | |---i-->|---ssip... | | | |---p... | |---p... By comparison 'mississipp' is very quiet tree10 tree-->|---mississipp... | |---i-->|---ssi-->|---ssipp... | | | | | |---pp... | | | |---pp... | |---s-->|---si-->|---ssipp... | | | | | |---pp... | | | |---i-->|---ssipp... | | | |---pp... | |---pp... 'mississippi' is an anti-climax tree11 tree-->|---mississippi | |---i-->|---ssi-->|---ssippi | | | | | |---ppi | | | |---ppi | |---s-->|---si-->|---ssippi | | | | | |---ppi | | | |---i-->|---ssippi | | | |---ppi | |---p-->|---pi | |---i and we are done. The challenge, to a computer scientist, is to make sure treei is updated to treei+1 efficiently. This can be done (Ukkonen 1992, 1995) so that treen can be built, starting from tree0, in O(n)-time overall. (b) FasterThe following terminology is adapted from Ukkonen (1995).
When treei-1 is expanded into treei, character txt[i] must be dealt with. This is done during a traversal of the boundary path. Any state on the boundary path before sj is a leaf and could be extended by adding txt[i] to the incoming arc, but this can be done for free by representing arcs to leaves by (L,∞) where '∞' is 'infinity'. So it it is only necessary to examine states from the active point sj and prior to the end-point sj' . "[states from sj and before sj'
create entirely new branches that start from states sh,
j<=h<j'.
// almost JavaScript (try view-source)
function upDate(s, k, i)
// (s, (k, i-1)) is the canonical reference pair for the active point
{ var oldr = root;
var (endPoint, r) = test_and_split(s, k, i-1, Txt.charAt(i));
while (!endPoint)
{ r.addTransition(i, infinity, new State());
if (oldr != root) oldr.sLink = r; // build suffix-link active-path
oldr = r;
var (s,k) = canonize(s.sLink, k, i-1)
(endPoint, r) = test_and_split(s, k, i-1, Txt.charAt(i))
}
if(oldr != root) oldr.sLink = s;
return new pair(s, k);
}//upDate
Note that Where necessary,
function test_and_split(s, k, p, t)
{ if(k<=p)
{ // find the t_k transition g'(s,(k',p'))=s' from s
// k1 is k' p1 is p' in Ukkonen '95
var ((k1,p1), s1) = s[Txt.charAt(k)];
if (t == Txt.charAt(k1 + p - k + 1))
return new pair(true, s);
else
{ var r = new State()
s.addTransition(k1, k1+p-k, r); // s---->r---->s1
r.addTransition( k1+p-k+1, p1, s1);
return new pair(false, r)
}
}
else // k > p; ? is there a t-transition from s ?
return new pair(s[t] != null, s);
}//test_and_split
function canonize(s, k, p) // s--->...
{ if(p < k) return new pair (s, k);
// find the t_k transition g'(s,(k',p'))=s' from s
// k1 is k', p1 is p' in Ukk' '95
var ((k1,p1), s1) = s[Txt.charAt(k)]; // s--(k1,p1)-->s1
while(p1-k1 <= p-k) // s--(k1,p1)-->s1--->...
{ k += p1 - k1 + 1; // remove |(k1,p1)| chars from front of (k,p)
s = s1;
if(k <= p)
{ ((k1,p1), s1) = s[Txt.charAt(k)]; // s--(k1,p1)-->s1
}
}
return new pair(s, k);
}//canonize
The main controlling routine repeatedly takes the next character, updates the sites on the active path and finds and canonizes the new active point: function ukkonen95()// construct suffix tree for Txt[0..N-1]
{ var s, k, i;
var bt;
root = new State();
bt = new State(); // bt (bottom or _|_)
// Want to create transitions for all possible chars
// from bt to root
for (i=0; i < Txt.length; i++)
bt.addTransition(i,i, root);
root.sLink = bt;
s=root; k=0; // NB. k=0, unlike Ukkonen our strings are 0 based
for(i=0; i < Txt.length; i++)
{ var (s,k) = upDate(s, k, i); // follow path from active-point
(s,k) = canonize(s, k, i);
}
}//ukkonen95
It relies upon the fact (lemma 2 Ukkonen (1995)) that if (s,(k,i-1)) is a reference pair for the end-point, sj' , of treei-1 then (s,(k,i)) is a reference pair for the active point of treei. Suffix Tree ApplicationsSuffix Trees can be used to solve a large number of string problems that occur in text-editing, free-text search, computational biology, and other application areas. Some examples are given below. String SearchSearching for a substring,
Longest Repeated SubstringAdd a special ''end of string'' character, e.g. '$',
to Longest Common SubstringThe longest common substring of two strings,
Equivalently, one can build a (basic) suffix tree for the string
Note that the 'longest common substring problem' is
different to the 'longest common subsequence problem'
which is closely related to the
'edit-distance
problem':
An instance of a subsequence
can have gaps where it appears in
PalindromesA palindrome is a string, P, such that P=reverse(P). e.g. 'abba'=reverse('abba'). e.g. 'ississi' is the longest palindrome in 'mississippi'. The longest palindrome of -- L.A.
Suffix Tree NotesSignificant developments are due to Weiner (1973), McCreight (1976) and Ukkonen (1992,1995).
|
|
↑ © L. Allison, www.allisons.org/ll/ (or as otherwise indicated). Created with "vi (Linux)", charset=iso-8859-1, fetched Monday, 02-Oct-2023 21:36:30 UTC. Free: Linux, Ubuntu operating-sys, OpenOffice office-suite, The GIMP ~photoshop, Firefox web-browser, FlashBlock flash on/off. |