
e.g. Given the string `mississippi',
`issi' is a substring,
`miss' is a prefix (& a substring), &
`ippi' is a suffix.
Note that a substring is a prefix of a suffix, and
a suffix of a prefix.

A Suffix Tree is a datastructure that allows many problems on strings
(sequences of characters) to be solved quickly.
If
txt=t_{1}t_{2}...t_{i}...t_{n}
is a string, then
T_{i}=t_{i}t_{i+1}...t_{n}
is the suffix of txt that starts at position i, e.g.
T_{1} = mississippi = txt
T_{2} = ississippi
T_{3} = ssissippi
T_{4} = sissippi
T_{5} = issippi
T_{6} = ssippi
T_{7} = sippi
T_{8} = ippi
T_{9} = ppi
T_{10} = pi
T_{11} = i
T_{12} = (empty)
The suffix tree for `txt ' is a
Trielike or
PATRICIAlike
data structure that represents the suffixes of txt .
A given suffix tree can be used to search for a substring,
pat[1..m] in O(m) time.
There are n(n+1)/2 substrings in txt[1..n]
so it is rather surprising that a suffix tree can be built in O(n) time.
Adding just one character to txt increases the number of
substrings by n+1, but they are not independent.
Weiner (1973) gave the first algorithm and
McCreight (1976) gave a more readable account for constructing the
suffix tree while processing txt from right to left.
Only much later did Ukkonen (1992, 1995) give a lefttoright
online algorithm,
i.e., an algorithm that maintains a suffix tree
for txt[1..i] at each step as i is increased from 1 to n.
If the nonempty suffixes are sorted:
T_{11} = i
T_{8} = ippi
T_{5} = issippi
T_{2} = ississippi
T_{1} = mississippi
T_{10} = pi
T_{9} = ppi
T_{7} = sippi
T_{4} = sissippi
T_{6} = ssippi
T_{3} = ssissippi
it becomes obvious that some of them (may) share common prefixes.
Here there are substrings starting with `i', `m', `p' and `s',
but all of those starting `is', in fact start `issi'.
Two or more common prefixes share
a common path from the root
of the suffix tree (as in a PATRICIA tree).
Now, a search (sub)string pat must be a prefix of a suffix
of txt , if it occurs in txt .

tree substrings
tree>mississippi m .. mississippi

i>ssi>ssippi i .. ississippi
  
  ppi issip,issipp,issippi
 
 ppi ip, ipp, ippi

s>si>ssippi s .. ssissippi
  
  ppi ssip, ssipp, ssippi
 
 i>ssippi si .. sissippi
 
 ppi sip, sipp, sippi

p>pi p, pp, ppi

i p, pi

 Suffix Tree for
"mississippi" 

Each edge (arc) of the suffix tree is labelled
with a substring of txt
which is implemented by pointers to the start and end of the substring,
e.g. `ssi' by <3,5>.
One of the observation in Ukkonen's algorithm is that
an edge, <i,n>, leading to a leaf can be implemented by
<i,∞> where `∞', i.e., infinity,
means `to the end of the string'.
Change the Text txt=... in the
HTML FORM below,
and click on `go'; experiment with different text strings:
NB. If the string is "short", a simple sort routine is run first
to sort the suffices the slow way for comparison with the tree;
this is not done if the string is "long".
If the termination of txt is important,
this can be indicated by a special terminating character
often denoted by `$' in papers on strings
(~zero char in C/Unix).
Building a Suffix Tree, (a) Slowly
We show how a suffix tree might be built "by hand".
Three dots, `... ', are used to show the current
end of any suffix that will grow as more characters are processed.
Starting with the empty suffix tree,
consider the string `m':
Adding the second character to get `mi'
there are now suffixes `mi' and `i':

tree_{2}
tree>mi...

i...

Next `mis'

tree_{3}
tree>mis...

is...

s...

There is no need to add any more splits for `miss'
because `s' is part of `ss'.

tree_{4}
tree>miss...

iss...

ss...

However, with `missi' there must be a split
because one `s' is followed by `i', the other by `s'

tree_{5}
tree>missi...

issi...

s>si...

i...

The 6th character, `s', brings us to `missis'
and no split because both `i's are followed by `s's.

tree_{6}
tree>missis...

issis...

s>sis...

is...

`mississ'

tree_{7}
tree>mississ...

ississ...

s>siss...

iss...

`mississi'

tree_{8}
tree>mississi...

ississi...

s>sissi...

issi...

A lot suddenly happens for `mississip',
because it brings the first `p', and
causes the third `i' to be followed by `p'
where the other two are followed by `ssi'.
Consequently one of the `ssi' is followed by `p', the other by `ssip',
ditto `si'.

tree_{9}
tree>mississip...

i>ssi>ssip...
  
  p...
 
 p...

s>si>ssip...
  
  p...
 
 i>ssip...
 
 p...

p...

By comparison `mississipp' is very quiet

tree_{10}
tree>mississipp...

i>ssi>ssipp...
  
  pp...
 
 pp...

s>si>ssipp...
  
  pp...
 
 i>ssipp...
 
 pp...

pp...

`mississippi' is an anticlimax

tree_{11}
tree>mississippi

i>ssi>ssippi
  
  ppi
 
 ppi

s>si>ssippi
  
  ppi
 
 i>ssippi
 
 ppi

p>pi

i

and we are done.
The challenge, to a computer scientist, is to
make sure tree_{i} is updated to tree_{i+1}
efficiently. This can be done (Ukkonen 1992, 1995)
so that tree_{n} can be built,
starting from tree_{0}, in O(n)time overall.
(b) Faster
The following terminology is adapted from Ukkonen (1995).
 If `x' is a substring of txt then `x' represents the
state (i.e., location) in the suffixtree
found by tracing out the characters of x from the root.
Note that x might be partway along an edge of the tree.
 A vertex (node) of the suffixtree is
called an explicit state.
 A substring x=txt[L..R] can be represented by (L,R).
 If `v' is a vertex of the suffixtree,
the pair `(v,x)', equivalently (v,(L,R)),
represents the state (location) in the suffixtree found
by tracing out the characters of x from v.
 (v,x) is canonical if v is the last explit state
on the path from v to (v,x).
NB. (v,empty) is canonical.
 A special vertex called `bottom' is added and is denoted __.
 The transition function, g( ), is defined as follows:
 g(__, a) = root, for all characters `a'.
 g(x, a) = y where y=xa, for character `a'.
 f( ):
 f(root)=__
 f(x)=y, if x~=empty and x=ay
 The suffix function f'( ) is defined as follows:
 f'(root)=__.
 If vertex v=x where x~=empty then f'(v)=y
where x=ay for some character `a' and substring y (possibly empty).
 The boundary path
s_{1}, s_{2}, ..., s_{i}, s_{i+1}
of suffixtree_{i1}:
 s_{1}=(1,i1),
i.e., the state corresponding to txt[1..i1]
 s_{2}=(2,i1)
 ...
 s_{i}=root
 s_{i+1}=__
 The active point
is the first s_{j} on the boundary path that is not a leaf,
and
 the endpoint
is the first s_{j'} that has a txt[i]transition.
When tree_{i1} is expanded into tree_{i},
character txt[i] must be dealt with.
This is done during a traversal of the boundary path.
Any state on the boundary path before s_{j} is a leaf
and could be extended by adding txt[i] to the incoming arc,
but this can be done for free
by representing arcs to leaves by (L,∞) where `∞' is `infinity'.
So it it is only necessary to examine states
from the active point s_{j} and
prior to the endpoint s_{j'} .
"[states from s_{j} and before s_{j'}
create entirely new branches that start from states s_{h},
j<=h<j'. ...
They are found along the boundary path of [tree_{i1}] using
reference pairs and suffix links."
 Ukkonen (1995).
// almost JavaScript (try viewsource)
function upDate(s, k, i)
// (s, (k, i1)) is the canonical reference pair for the active point
{ var oldr = root;
var (endPoint, r) = test_and_split(s, k, i1, Txt.charAt(i));
while (!endPoint)
{ r.addTransition(i, infinity, new State());
if (oldr != root) oldr.sLink = r; // build suffixlink activepath
oldr = r;
var (s,k) = canonize(s.sLink, k, i1)
(endPoint, r) = test_and_split(s, k, i1, Txt.charAt(i))
}
if(oldr != root) oldr.sLink = s;
return new pair(s, k);
}//upDate

Note that r.addTransition(...) adds
an edge from state r ,
labelling the edge with a substring.
New txt[i]transitions must be "open" transitions
of the form (L,∞).
Where necessary,
test_and_split(...) replaces edges
s>s1 with s>r>s1
for a new node r.
This makes r=(s,(k,p)) explicit.
function test_and_split(s, k, p, t)
{ if(k<=p)
{ // find the t_k transition g'(s,(k',p'))=s' from s
// k1 is k' p1 is p' in Ukkonen '95
var ((k1,p1), s1) = s[Txt.charAt(k)];
if (t == Txt.charAt(k1 + p  k + 1))
return new pair(true, s);
else
{ var r = new State()
s.addTransition(k1, k1+pk, r); // s>r>s1
r.addTransition( k1+pk+1, p1, s1);
return new pair(false, r)
}
}
else // k > p; ? is there a ttransition from s ?
return new pair(s[t] != null, s);
}//test_and_split

Canonize(...) takes (s,w)=(s,(k,p))
and steps over intermediate nodes by spelling
out the characters of w=txt[k..p] for as far as possible.
function canonize(s, k, p) // s>...
{ if(p < k) return new pair (s, k);
// find the t_k transition g'(s,(k',p'))=s' from s
// k1 is k', p1 is p' in Ukk' '95
var ((k1,p1), s1) = s[Txt.charAt(k)]; // s(k1,p1)>s1
while(p1k1 <= pk) // s(k1,p1)>s1>...
{ k += p1  k1 + 1; // remove (k1,p1) chars from front of (k,p)
s = s1;
if(k <= p)
{ ((k1,p1), s1) = s[Txt.charAt(k)]; // s(k1,p1)>s1
}
}
return new pair(s, k);
}//canonize

The main controlling routine
repeatedly takes the next character,
updates the sites on the active path
and finds and canonizes the new active point:
function ukkonen95()// construct suffix tree for Txt[0..N1]
{ var s, k, i;
var bt;
root = new State();
bt = new State(); // bt (bottom or __)
// Want to create transitions for all possible chars
// from bt to root
for (i=0; i < Txt.length; i++)
bt.addTransition(i,i, root);
root.sLink = bt;
s=root; k=0; // NB. k=0, unlike Ukkonen our strings are 0 based
for(i=0; i < Txt.length; i++)
{ var (s,k) = upDate(s, k, i); // follow path from activepoint
(s,k) = canonize(s, k, i);
}
}//ukkonen95

It relies upon the fact
(lemma 2 Ukkonen (1995))
that if (s,(k,i1)) is a reference pair for
the endpoint, s_{j'} ,
of tree_{i1} then
(s,(k,i)) is a reference pair for the active point of tree_{i}.
Suffix Tree Applications
Suffix Trees can be used to solve a large number of string problems
that occur in textediting, freetext search, computational biology,
and other application areas. Some examples are given below.
String Search
Searching for a substring,
pat[1..m] , in txt[1..n] ,
can be solved in O(m) time
(after the suffix tree for txt has been built in O(n) time).
Longest Repeated Substring
Add a special ``end of string'' character, e.g. `$',
to txt[1..n] and build a suffix tree;
the longest repeated substring of txt[1..n]
is indicated by the deepest fork node in the suffix tree, where
depth is measured by the number of characters traversed from the root,
i.e., `issi' in the case of `mississippi'.
The longest repeated substring can be found in O(n) time using a suffix tree.
Longest Common Substring
The longest common substring of two strings,
txt_{1} and txt_{2} ,
can be found by building a generalized suffix tree
for txt_{1} and txt_{2} :
Each node is marked to indicate if it represents a suffix of
txt_{1} or txt_{2} or both.
The deepest node marked for both
txt_{1} and txt_{2}
represents the longest common substring.
Equivalently, one can build a (basic) suffix tree for the string
txt_{1}$txt_{2}# ,
where `$' is a special terminator for txt_{1}
and `#' is a special terminator for txt_{2} .
The longest common substring is indicated by the deepest fork node
that has both `...$...' and `...#...' (no $) beneath it.
(Try it using the HTML FORM above.)
Note that the `longest common substring problem' is
different to the `longest common subsequence problem'
which is closely related to the
`editdistance
problem':
An instance of a subsequence
can have gaps where it appears in
txt_{1} and in txt_{2} ,
but an instance of a substring cannot have gaps.
Palindromes
A palindrome is a string, P, such that P=reverse(P).
e.g. `abba'=reverse(`abba').
e.g. `ississi' is the longest palindrome in `mississippi'.
The longest palindrome of txt[1..n] can be found in O(n) time,
e.g. by building the suffix tree for
txt$reverse(txt)#
or by building the generalized suffix tree for
txt and reverse(txt) .
(Try it.)
 L.A.
Suffix Tree Notes
Significant developments are due to
Weiner (1973),
McCreight (1976) and
Ukkonen (1992,1995).
 See also
Tries and
PATRICIA trees,
but remember that they are for implementing lookuptables
containing many, usually short, strings rather than for indexing
substrings of one long text.
 E. M. McCreight.
A SpaceEconomical Suffix Tree Construction Algorithm.
Jrnl. of Algorithms,
23(2) pp.262272, 1976.
Credited with making Weiner's suffix trees (1973)
more efficient and (a little) more accessible (understandable).
"... algorithms due to Weiner ('73) and McCreight ('76)
do not process the string ... in the
natural lefttoright order."
 [Ukk. 1992, p.484].
"... McC. ... adds the suffixes to the tree in
decreasing order of their length."
 [Ukkonen 1995, p.249].
 E. Ukkonen. Constructing Suffix Trees OnLine in Linear Time.
In Algorithms, Software, Architecture, J.v.Leeuwen (ed),
vol.1 of Information Processing 92,
Proc. IFIP 12th World Computer Congress, Madrid, Spain,
Elsevier Sci. Publ.,
pp.484492, 1992.
Gives an online algorithm, i.e., it processes the
input text incrementally from left to right, and at each stage it has a
suffix tree for the part of the text that has been processed so far.
But see the note about "inaccuracies"
below!
 E. Ukkonen.
Online Construction of Suffix Trees.
Algorithmica, 14(3),
pp.249260, 1995.
 p260: "J.Karkkainen pointed out some
inaccuracies in the earlier [1992] version of this work."
 P. Weiner. Linear Pattern Matching Algorithms.
Proc. 14th IEEE Annual Symp. on Switching and Automata Theory,
pp.111, 1973.
Extremely important paper introducing suffix trees
and hence fast ways of solving many problems on strings.
"... algorithms due to Weiner ('73) and McC. ('76)
do not process the string ... in the
natural lefttoright order."
 [Ukk. 1992, p.484].
"... W. ... proceeds right to left and adds the suffixes ...
in increasing order of their length ..."
 [Ukkonen 1995, p.249].
 Also see
[String Searching],
[suffix array],
and
[BWT].

