|
Given k>2 sequences, the multiple alignment problem is to
pad them out with null characters, '-', so that
they all have the same length and to do this
in such a way as to optimise some criterion --
minimise a cost or maximise a score.
The most popular choices, in no particular order,
for specifying the cost (score) function of mutiple alignment are:
- All-pairs costs: The multiple alignment is "projected" onto all
k×(k-1)/2 possible pairs out of the k sequences,
the cost (score) of every such two-way alignment is calculated, and
they are all summed.
- Star costs: A single hypothetical ancestor
is assumed connected to each one of the given k sequences by an edge.
The minimum possible sum of costs (max score) from the ancestor
to each leaf is calculated.
- Tree-costs: The leaf sequences are related by a
phylogenetic tree
(evolutionary tree, family tree).
The cost is the sum of the costs corresponding to each edge in the tree.
The tree can be unrooted or rooted and, especially in the latter case,
a "molecular clock" may be assumed (or not).
This models the evolutionary process most closely but,
the problem is, where to get the phylogenetic tree?
The simple multiple alignment algorithm for
two sequences
of length ~n takes O(n2)-time.
The corresponding algorithm for
k≥3 sequences
takes O(nk)-time.
This is infeasible if n is "large" and k is more than a "few"
in which case either heuristics or a stochastic search must be used.
For example, each internal hypothetical node in an unrooted
phylogenetic tree has three neighbours so it is possible to do
repeated three-way alignment
in either a greedy or a stochastic way.
Alternatively a k-way alignment may be "projected"
onto any k'<k of those sequences.
Therefore, under tree-costs the phylogenetic tree can be "broken" on any edge
and the two corresponding sub-alignments of leaves can be realigned in
about O(n2)-time.
One can either (Gibbs-) sample
random alignments
or "cool" the process towards an optimal multiple alignment
[Allison & Wallace].
|
|