|
|
e.g. Given the string `mississippi',
`miss' is a prefix,
`ippi' is a suffix, &
`issi' is a substring.
Note that a substring is a prefix of a suffix, and
a suffix of a prefix.
|
A Suffix Tree is a data-structure that allows many problems on strings
(sequences of characters) to be solved quickly.
If
txt=t1t2...ti...tn
is a string, then
Ti=titi+1...tn
is the suffix of txt that starts at position i, e.g.
T1 = mississippi = txt
T2 = ississippi
T3 = ssissippi
T4 = sissippi
T5 = issippi
T6 = ssippi
T7 = sippi
T8 = ippi
T9 = ppi
T10 = pi
T11 = i
T12 = (empty)
The suffix tree for `txt' is a
Trie-like or
PATRICIA-like
data structure that represents the suffixes of txt.
A given suffix tree can be used to search for a substring,
pat[1..m] in O(m) time.
There are n(n+1)/2 substrings in txt[1..n]
so it is rather surprising that a suffix tree can be built in O(n) time.
Adding just one character to txt increases the number of
substrings by n+1, but they are not independent.
Weiner (1973) gave the first algorithm and
McCreight (1976) gave a more readable account for constructing the
suffix tree while processing txt from right to left.
Only much later did Ukkonen (1992, 1995) give a left-to-right
on-line algorithm,
i.e. an algorithm that maintains a suffix tree
for txt[1..i] at each step as i is increased from 1 to n.
If the non-empty suffixes are sorted:
T11 = i
T8 = ippi
T5 = issippi
T2 = ississippi
T1 = mississippi
T10 = pi
T9 = ppi
T7 = sippi
T4 = sissippi
T6 = ssippi
T3 = ssissippi
it becomes obvious that some of them (may) share common prefixes.
Here there are substrings starting with `i', `m', `p' and `s',
but all of those starting `is', in fact start `issi'.
Two or more common prefixes share
a common path from the root
of the suffix tree (as in a PATRICIA tree).
Now, a search (sub)string pat must be a prefix of a suffix
of txt, if it occurs in txt.
| |
tree substrings
tree-->|---mississippi m .. mississippi
|
|---i-->|---ssi-->|---ssippi i .. ississippi
| | |
| | |---ppi issip,issipp,issippi
| |
| |---ppi ip, ipp, ippi
|
|---s-->|---si-->|---ssippi s .. ssissippi
| | |
| | |---ppi ssip, ssipp, ssippi
| |
| |---i-->|---ssippi si .. sissippi
| |
| |---ppi sip, sipp, sippi
|
|---p-->|---pi p, pp, ppi
|
|---i p, pi
|
--- Suffix Tree for
"mississippi" ---
|
Each edge (arc) of the suffix tree is labelled
with a substring of txt
which is implemented by pointers to the start and end of the substring,
e.g. `ssi' by <3,5>.
One of the observation in Ukkonen's algorithm is that
an edge, <i,n>, leading to a leaf can be implemented by
<i,∞> where `∞', i.e. infinity,
means `to the end of the string'.
Suffix Tree Demonstration
Change the Text txt=... in the
HTML FORM below,
and click on `go'; experiment with different text strings:
NB. If the string is "short", a simple sort routine is run first
to sort the suffices the slow way for comparison with the tree;
this is not done if the string is "long".
If the termination of txt is important,
this can be indicated by a special terminating character
often denoted by `$' in papers on strings
(~zero char in C/Unix).
Building a Suffix Tree, (a) Slowly
We show how a suffix tree might be built "by hand".
Three dots, `...', are used to show the current
end of any suffix that will grow as more characters are processed.
Starting with the empty suffix tree,
consider the string `m':
Adding the second character to get `mi'
there are now suffixes `mi' and `i':
| |
tree2
tree-->|---mi...
|
|---i...
|
Next `mis'
| |
tree3
tree-->|---mis...
|
|---is...
|
|---s...
|
There is no need to add any more splits for `miss'
because `s' is part of `ss'.
| |
tree4
tree-->|---miss...
|
|---iss...
|
|---ss...
|
However, with `missi' there must be a split
because one `s' is followed by `i', the other by `s'
| |
tree5
tree-->|---missi...
|
|---issi...
|
|---s-->|---si...
|
|---i...
|
The 6th character, `s', brings us to `missis'
and no split because both `i's are followed by `s's.
| |
tree6
tree-->|---missis...
|
|---issis...
|
|---s-->|---sis...
|
|---is...
|
`mississ'
| |
tree7
tree-->|---mississ...
|
|---ississ...
|
|---s-->|---siss...
|
|---iss...
|
`mississi'
| |
tree8
tree-->|---mississi...
|
|---ississi...
|
|---s-->|---sissi...
|
|---issi...
|
A lot suddenly happens for `mississip',
because it brings the first `p', and
causes the third `i' to be followed by `p'
where the other two are followed by `ssi'.
Consequently one of the `ssi' is followed by `p', the other by `ssip',
ditto `si'.
| |
tree9
tree-->|---mississip...
|
|---i-->|---ssi-->|---ssip...
| | |
| | |---p...
| |
| |---p...
|
|---s-->|---si-->|---ssip...
| | |
| | |---p...
| |
| |---i-->|---ssip...
| |
| |---p...
|
|---p...
|
By comparison `mississipp' is very quiet
| |
tree10
tree-->|---mississipp...
|
|---i-->|---ssi-->|---ssipp...
| | |
| | |---pp...
| |
| |---pp...
|
|---s-->|---si-->|---ssipp...
| | |
| | |---pp...
| |
| |---i-->|---ssipp...
| |
| |---pp...
|
|---pp...
|
`mississippi' is an anti-climax
| |
tree11
tree-->|---mississippi
|
|---i-->|---ssi-->|---ssippi
| | |
| | |---ppi
| |
| |---ppi
|
|---s-->|---si-->|---ssippi
| | |
| | |---ppi
| |
| |---i-->|---ssippi
| |
| |---ppi
|
|---p-->|---pi
|
|---i
|
and we are done.
The challenge, to a computer scienctist, is to
make sure treei is updated to treei+1
efficiently. This can be done (Ukkonen 1992, 1995)
so that treen can be built,
starting from tree0, in O(n)-time overall.
(b) Faster
The following terminology is adapted from Ukkonen (1995).
- If `x' is a substring of txt then `x' represents the
state (i.e. location) in the suffix-tree
found by tracing out the characters of x from the root.
Note that x might be part-way along an edge of the tree.
- A vertex (node) of the suffix-tree is
called an explicit state.
- A substring x=txt[L..R] can be represented by (L,R).
- If `v' is a vertex of the suffix-tree,
the pair `(v,x)', equivalently (v,(L,R)),
represents the state (location) in the suffix-tree found
by tracing out the characters of x from v.
- (v,x) is canonical if v is the last explit state
on the path from v to (v,x).
NB. (v,empty) is canonical.
- A special vertex called `bottom' is added and is denoted _|_.
- The transition function, g( ), is defined as follows:
- g(_|_, a) = root, for all characters `a'.
- g(x, a) = y where y=xa, for character `a'.
- f( ):
- f(root)=_|_
- f(x)=y, if x~=empty and x=ay
- The suffix function f'( ) is defined as follows:
- f'(root)=_|_.
- If vertex v=x where x~=empty then f'(v)=y
where x=ay for some character `a' and substring y (possibly empty).
- The boundary path
s1, s2, ..., si, si+1
of suffix-treei-1:
- s1=(1,i-1),
i.e. the state corresponding to txt[1..i-1]
- s2=(2,i-1)
- ...
- si=root
- si+1=_|_
- The active point
is the first sj on the boundary path that is not a leaf,
and
- the end-point
is the first sj' that has a txt[i]-transition.
When treei-1 is expanded into treei,
character txt[i] must be dealt with.
This is done during a traversal of the boundary path.
Any state on the boundary path before sj is a leaf
and could be extended by adding txt[i] to the incoming arc,
but this can be done for free
by representing arcs to leaves by (L,∞) where `∞' is `infinity'.
So it it is only necessary to examine states
from the active point sj and
prior to the end-point sj' .
"[states from sj and before sj'
create entirely new branches that start from states sh,
j<=h<j'. ...
They are found along the boundary path of [treei-1] using
reference pairs and suffix links."
- Ukkonen (1995).
// almost JavaScript (try view-source)
function upDate(s, k, i)
// (s, (k, i-1)) is the canonical reference pair for the active point
{ var oldr = root;
var (endPoint, r) = test_and_split(s, k, i-1, Txt.charAt(i));
while (!endPoint)
{ r.addTransition(i, infinity, new State());
if (oldr != root) oldr.sLink = r; // build suffix-link active-path
oldr = r;
var (s,k) = canonize(s.sLink, k, i-1)
(endPoint, r) = test_and_split(s, k, i-1, Txt.charAt(i))
}
if(oldr != root) oldr.sLink = s;
return new pair(s, k);
}//upDate
|
Note that r.addTransition(...) adds
an edge from state r,
labelling the edge with a substring.
New txt[i]-transitions must be "open" transitions
of the form (L,∞).
Where necessary,
test_and_split(...) replaces edges
s--->s1 with s--->r--->s1
for a new node r.
This makes r=(s,(k,p)) explicit.
function test_and_split(s, k, p, t)
{ if(k<=p)
{ // find the t_k transition g'(s,(k',p'))=s' from s
// k1 is k' p1 is p' in Ukkonen '95
var ((k1,p1), s1) = s[Txt.charAt(k)];
if (t == Txt.charAt(k1 + p - k + 1))
return new pair(true, s);
else
{ var r = new State()
s.addTransition(k1, k1+p-k, r); // s---->r---->s1
r.addTransition( k1+p-k+1, p1, s1);
return new pair(false, r)
}
}
else // k > p; ? is there a t-transition from s ?
return new pair(s[t] != null, s);
}//test_and_split
|
Canonize(...) takes (s,w)=(s,(k,p))
and steps over intermediate nodes by spelling
out the characters of w=txt[k..p] for as far as possible.
function canonize(s, k, p) // s--->...
{ if(p < k) return new pair (s, k);
// find the t_k transition g'(s,(k',p'))=s' from s
// k1 is k', p1 is p' in Ukk' '95
var ((k1,p1), s1) = s[Txt.charAt(k)]; // s--(k1,p1)-->s1
while(p1-k1 <= p-k) // s--(k1,p1)-->s1--->...
{ k += p1 - k1 + 1; // remove |(k1,p1)| chars from front of (k,p)
s = s1;
if(k <= p)
{ ((k1,p1), s1) = s[Txt.charAt(k)]; // s--(k1,p1)-->s1
}
}
return new pair(s, k);
}//canonize
|
The main controlling routine
repeatedly takes the next character,
updates the sites on the active path
and finds and canonizes the new active point:
function ukkonen95()// construct suffix tree for Txt[0..N-1]
{ var s, k, i;
var bt;
root = new State();
bt = new State(); // bt (bottom or _|_)
// Want to create transitions for all possible chars
// from bt to root
for (i=0; i < Txt.length; i++)
bt.addTransition(i,i, root);
root.sLink = bt;
s=root; k=0; // NB. k=0, unlike Ukkonen our strings are 0 based
for(i=0; i < Txt.length; i++)
{ var (s,k) = upDate(s, k, i); // follow path from active-point
(s,k) = canonize(s, k, i);
}
}//ukkonen95
|
It relies upon the fact
(lemma 2 Ukkonen (1995))
that if (s,(k,i-1)) is a reference pair for
the end-point, sj' ,
of treei-1 then
(s,(k,i)) is a reference pair for the active point of treei.
Suffix Tree Applications
Suffix Trees can be used to solve a large number of string problems
that occur in text-editing, free-text search, computational biology,
and other application areas. Some examples are given below.
String Search
Searching for a substring,
pat[1..m], in txt[1..n],
can be solved in O(m) time
(after the suffix tree for txt has been built in O(n) time).
Longest Repeated Substring
Add a special ``end of string'' character, e.g. `$',
to txt[1..n] and build a suffix tree;
the longest repeated substring of txt[1..n]
is indicated by the deepest fork node in the suffix tree, where
depth is measured by the number of characters traversed from the root,
i.e. `issi' in the case of `mississippi'.
The longest repeated substring can be found in O(n) time using a suffix tree.
Longest Common Substring
The longest common substring of two strings,
txt1 and txt2,
can be found by building a generalized suffix tree
for txt1 and txt2:
Each node is marked to indicate if it represents a suffix of
txt1 or txt2 or both.
The deepest node marked for both
txt1 and txt2
represents the longest common substring.
Equivalently, one can build a (basic) suffix tree for the string
txt1$txt2#,
where `$' is a special terminator for txt1
and `#' is a special terminator for txt2.
The longest common substring is indicated by the deepest fork node
that has both `...$...' and `...#...' (no $) beneath it.
(Try it using the HTML FORM above.)
Note that the `longest common substring problem' is
different to the `longest common subsequence problem'
which is closely related to the
`edit-distance
problem':
An instance of a subsequence
can have gaps where it appears in
txt1 and in txt2,
but an instance of a substring cannot have gaps.
Palindromes
A palindrome is a string, P, such that P=reverse(P).
e.g. `abba'=reverse(`abba').
e.g. `ississi' is the longest palindrome in `mississippi'.
The longest palindrome of txt[1..n] can be found in O(n) time,
e.g. by building the suffix tree for
txt$reverse(txt)#
or by building the generalized suffix tree for
txt and reverse(txt).
(Try it.)
Suffix Tree Notes
Significant developments are due to
Weiner (1973),
McCreight (1976) and
Ukkonen (1992,1995).
- See also
Tries and
PATRICIA
trees, but remember that these are for implementing lookup-tables
containing many, usually short, strings rather than for finding
substrings of one long text.
- E. M. McCreight.
A Space-Economical Suffix Tree Construction Algorithm.
Jrnl. of Algorithms, 23(2) pp262-272, 1976.
Credited with making Weiner's suffix trees (1973)
more efficient and (a little) more accessible.
Algorithm still right-to-left.
- E. Ukkonen. Constructing Suffix Trees On-Line in Linear Time.
In Algorithms, Software, Architecture, J.v.Leeuwen (ed),
vol#1 of Information Processing 92,
Proc. IFIP 12th World Computer Congress, Madrid, Spain,
Elsevier Sci. Publ.,
pp484-492, 1992.
Gives an on-line algorithm, i.e. it processes the
input text incrementally from left to right and at each stage it has a
suffix tree for the part of the text that has been processed
so far. But see note about "inaccuracies" below!
E. Ukkonen. On-line Construction of Suffix Trees.
Algorithmica, 14(3) pp249-260, 1995.
p260: "J.Karkkainen pointed out some inaccuracies
in the earlier [1992] version of this work."
- P. Weiner. Linear Pattern Matching Algorithms.
Proc. 14th IEEE Annual Symp. on Switching and Automata Theory,
pp1-11, 1973.
Extremely important paper introducing suffix trees
and hence fast ways of solving many problems on strings.
- Also see
[String Searching].
|
|