|
|
Given k>2 sequences, the multiple alignment problem is to
pad them out with null characters, `-', so that
they all have the same length and to do this
in such a way as to optimise some criterion --
minimise a cost or maximise a score.
The most popular choices for specifying the cost (score) function
in a mutiple alignment are:
- All-pairs costs: The multiple alignment is "projected" onto all
k×(k-1)/2 possible pairs out of the k sequences,
the cost (score) of every such two-way alignment is calculated, and
they are all summed.
- Star costs: A single hypothetical ancestor
is assumed connected to each one of the given k sequences by an edge.
The minimum possible sum of costs (max score) from the ancestor
to each leaf is calculated.
- Tree-costs: The leaf sequences are related by a phylogenetic tree.
The cost is the sum of the costs corresponding to each edge in the tree.
The tree can be unrooted or rooted and, especially in the latter case,
a "molecular clock" may be assumed, or not.
This models the evolutionary process most closely, but
a problem is where to get the phylogenetic tree?
The simple alignment algorithm for 2 sequences of length ~n
takes O(N2)-time.
The corresponding algorithm for k≥3 sequences takes O(Nk)-time.
This is infeasible of N is "large" and k is more than a "few"
in which case either heuristics or a stochastic search must be used.
For example, each internal hypothetical node in an unrooted
phylogenetic tree has three neighbours so it is possible to do
repeated three-way alignment
in either a greedy or a stochastic way.
Alternatively a k-way alignment may be "projected"
onto any k'<k of those sequences.
Therefore, under tree-costs the phylogenetic tree can be "broken" on any edge
and the two corresponding sub-alignments of leaves can be realigned in
about O(N2)-time.
One can either (Gibbs-) sample random alignments or "cool" the process
towards an optimal alignment
[Allison & Wallace 1994]
|
|