
 Example:


0  1  2  3  4  5  6  7  8  9  1 0  1 1 
S = 
a  g  c  a  g  c  a  g  a  c  t  $ 
 where the end of sequence pseudosymbol, $,
is less than all proper symbols.

S → BWT(S)
 Sort the suffixes of S;
the Burrows Wheeler transform [BW94] of S, BWT(S),
consists of the symbols before each sorted suffix in turn.
Note that $ comes before S[0].
 Equivalently (with $), sort the rotations of S;
BWT(S) consists of the last symbol of each sorted rotation in turn.


 suffix#  BWT(S)  suffix/rotation  
0  11  t  $agcagcagact  $... 
1  8  g  act$agcagcag 
a... 
2  6  c  agact$agcagc 
3  3  c  agcagact$agc 
4  0  $  agcagcagact$ 
5  5  g  cagact$agcag  c... 
6  2  g  cagcagact$ag 
7  9  a  ct$agcagcaga 
8  7  a  gact$agcagca  g... 
9  4  a  gcagact$agca 
10  1  a  gcagcagact$a 
11  10  c  t$agcagcagac  t... 
 
ch 
$  a  c  g  t 
rank(ch) 
0 
1 
5 
8 
11 

 BWT(agcagcagact) = tgcc$ggaaaac^{ }

 Demonstration:_{ }


 A virtue of the transform is that BWT(S) is likely to
be compressible by quite simple datacompression methods,
unless S is inherently incompressible.
For example, suppose that some word, w=w_{0}...w_{k1},
occurs frequently in S.
Then all of the suffixes starting w_{1}...w_{k1}...
will be sorted in a "block" and there will be a
corresponding "run" of w_{0} in BWT(S).
Similarly for suffixes starting w_{2}...w_{k1}...
and runs of w_{1}, and so on.
Recovering S from BWT(S)
 Surprisingly, it is possible to recover S from BWT(S).
 The first symbol of BWT(S), here t, must be the last symbol of S because
the empty suffix ($) is the first suffix.
 That t starts the suffix t$, at position 11 = rank(t)
in the sorted list.
 It must be preceded in S by symbol 11 of BWT(S), c.
 Suffix ct$ must be in
the range [rank(c), rank(g)) = [5, 8)
in the sorted list; in fact it must be at position 7,
because two other cs precede it in BWT(S).
 Hence ct$ must be preceded by the 7th symbol of BWT(S), a.
 All of S can be recovered in this way.
function recover(bwt)
{ // Recover original string from its transform
var pos = 0,
ans = endChar; // $terminated here
for( var i = 1; i < bwt.length; i++ )
{ ans = bwt.charAt(pos) + ans;
pos = inverse(pos, bwt);
}
return ans;
}//recover(BWT)
function inverse(pos, bwt)
{ // one step of the reverse reconstruction
var ch = bwt.charAt(pos);
var chCode = ch.charCodeAt(0);
return rank[chCode] + occ(ch, bwt, pos);
}//inverse(pos,bwt)
function occ(ch, bwt, pos)
// returns the # of occurences of ch in bwt
// before position pos
 It might seem that occ() must take O(S)time,
function occSlow(ch, bwt, i) // (SLOW, but see occFast)
{ var count = 0;
for( var j = 0; j < i; j++ )
if( bwt.charAt(j) == ch ) count++;
return count;
}//occSlow(ch,bwt,i)
 but
it can be made to take O(step)time by "caching" symbolcounts
at regular intervals of size 'step' along BWT(S).
function occFast(ch, bwt, loc)
{ if( loc < 0 ) return 0;
var bucket = Math.floor(loc/freqBucketSize);
var lo = bucket * freqBucketSize;
var count = freqCache[bucket][ch.charCodeAt(0)];
for(var j = lo; j < loc; j++ )
if( bwt.charAt(j) == ch ) count ++ ;
return count;
}//occFast(ch,bwt,loc)
 Note that [FM00] use two levels of caching, [LD09] one.
Multiplicity
 The number of occurences, if any, of a pattern, pat, in S
can be found, using BWT(S), by a reverse search of pat.
Initially all of BWT(S),
the range [lo, hi) = [0, BWT(S) ), is considered.
 Assuming that pat does occur in S,
one or more suffixes must start with the last symbol of pat, s.
 These suffixes will occur in a block, from rank[s] upwards.
 One or more will be preceded by the 2ndlast symbol of pat, s'.
 Variables lo and hi are changed depending on rank[s'] and the
number of occurrences of s' before lo and before hi respectively.
function multiplicity(pat, bwt)
{ // Return the number of times, if any, that pat occurs
// in refStr where bwt is the transform of refStr.
var lo = 0, hi = bwt.length; // i.e. [lo,hi)
for( var i = pat.length  1; hi = lo && i >= 0; i )
{ var pati = pat.charAt(i);
var patiCode = pati.charCodeAt(0);
lo = rank[patiCode] + occ(pati, bwt, lo);
hi = rank[patiCode] + occ(pati, bwt, hi);
}//for
return hi  lo;
}//multiplicity(pat,bwt)
 This is equivalent to a search on an implicit trie (or similar)
of reversed prefixes.
Note that there is no actual explicit trie but
the BWT can be used as a substitue for one, and
a range, [lo, hi), created during search,
acts like a node in the implicit trie:

sorted reverse prefixes  ~ trie 
agcagcagact$  $ 
gcagcagact$a  $  a 
gcagact$agca  $  agc 
gact$agcagca  $agc 
ct$agcagcaga  $agcagcag 
t$agcagcagac  $agcagcaga  c 
agcagact$agc  $  ag 
agact$agcagc  $agc 
cagcagact$ag  $  ag 
cagact$agcag  $  agc 
act$agcagcag  $agc 
$agcagcagact  $agcagcagact 
 For example, gca occurs twice in agcagcagact.
Search
 The locations of occurrences, if any, of a pattern, pat, in S
can be found by first finding the multiplicity of pat in S
and then locating the positions, in S, that correspond to
each BWT(S) position in the final range, [lo, hi).
In other words the suffix array,
the suffix# column above,
can be reconstructed when needed.
function locations(pat, bwt)
{ var lo = 0, hi = bwt.length; // [lo,hi)
for( var i = pat.length  1; hi > lo && i >= 0; i )
{ var pati = pat.charAt(i);
var patiCode = pati.charCodeAt(0);
lo = rank[patiCode] + occ(pati, bwt, lo);
hi = rank[patiCode] + occ(pati, bwt, hi);
}//for
for( var i = lo; i < hi; i++ )
print( locate(i,bwt) + "," );
println( "." );
return;
}//locations(pat,bwt)
 The naive way to map a position, pos, in BWT(S) onto S
is to recover S from that symbol back to the start of S,
counting the number of steps necessary to do so.
function locateSlow(pos, bwt)
{ pos = inverse(pos, bwt);
var count;
for( count = 0; pos > 0; count ++ )
pos = inverse(pos, bwt);
return count;
}//locateSlow(pos,bwt)
 This is, of course, slow.
It can be speeded up by caching a map from selected points
in BWT(S) to the corresponding positions in S.
function locateFast(pos, bwt) // Fast
{ var count;
for( count = 0; pos % saBucketSize > 0; count++ )
pos = inverse(pos, bwt);
return (count+saCache[pos/saBucketSize]) % bwt.length;
}//locateFast(pos,bwt)
 The process can be halted early when a cached value is available.
 The BWA [LD09] implementation,
which is for DNA,
divides BWT(S) up into equal sized intervals.
These will not in general correspond to equal sized intervals
in S so the worstcase timecomplexity of BWA's search
may be much slower than the best case.
In contrast, the "FMindex" [FM00] gives better performance guarantees.
It divides S, rather than BWT(S),
up into equal sized intervals (buckets), and
one of these boundaries must be encountered in bounded time, but
this requires more sophisticated data structures.
Approximate Search
 Exact search (above) is equivalent to descending a trie of
reversed prefixes as directed by a reverse scan of the search pattern.
The BWtransform of S acts as a compact substitute for the trie.
Just as the trie can be traversed for approximate search,
given an "allowance" of mutations between the pattern and
what is considered an approximate instance of it,
so approximate search can be performed [LD09]
using BWT(S) as a substitute for the trie.
function approx(pat, errsLeft)
{ approxB(pat, errsLeft, pat.length1, 0, bwt.length);
}//approx(pat,errsLeft)
function approxB(pat, errsLeft, loc, lo, hi)
{ if( errsLeft < 0 ) return; // fail, else >=0
if( loc < 0 ) // done all pat, ... are we ok?
{ for( var i = lo; i < hi; i ++ )
print( locate(i,bwt) + ',' );
return;
}//else loc >= 0
approxB(pat, errsLeft1, loc1, lo, hi); //del(*)
var patLoc = pat.charCodeAt(loc);
for( var sy = minCode; sy <= maxCode; sy ++ )
{ var rankSy = rank[sy],
syAsChar = String.fromCharCode(sy);
var lo2 = rankSy + occ(syAsChar, bwt, lo),
hi2 = rankSy + occ(syAsChar, bwt, hi);
if( lo2 < hi2 ) // not a deadend, yet
{ approxB(pat, errsLeft1, loc, lo2, hi2); //ins(*)
var e2 = errsLeft  (sy == patLoc ? 0 : 1);
approxB(pat, e2, loc1, lo2, hi2);
}//if
}//for
return;
}//approxB(,,,,)
 Note that (approximately) matching instances of the pattern in S
may overlap and even have common start and/or end positions.
 A further refinment is that approximate search can be pruned.
The search above fails when another error is required but
the allowance has been used up.
It is possible [LD09] to estimate a useful lower bound on the
number of errors required, at a given position in the pattern,
to successfully complete an approximate search yielding
at least one approximate match.
 L.A., A.K., 3/2009 & 9/2009 with thanks to S.P..
 [BW94] M. Burrows & D. J. Wheeler,
A blocksorting lossless data compression algorithm,
TR124, SRC (Digital, Palo Alto), pp.18, 1994.
 The original BWTransform (BWT) paper.
 [FM00] P. Ferragina & G. Manzini,
Opportunistic data structures with applications,
Proc. 41st Annual Symp. on Foundations of Comp. Sci.,
pp.390, 2000.
 Efficient, but complex, BWTbased index.
 [LD09] H. Li & R. Durbin,
Fast and accurate short read alignment with BurrowsWheeler transform,
J. Bioinformatics, 25(14), pp.17541760, 2009.
 A simpler, "BWA", implementation of
a BWTbased index. (Also see
bioinformatics.)

