|
To align two or more sequences is
to pad out each sequence with null "characters", `-',
until they have the same length and
in such a way as to optimise some criterion --
to maximise a score or to minimise a cost.
The sequences can then be written out one above the other, e.g.,
S1: ACGTA-GTACGT
    || || ||| ||
S2: AC-TACGTAGGT
with their characters lining up in columns.
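The layout above can be reproduced with a few lines of Python. This is only a sketch: the helper names `match_line` and `unalign` are made up for illustration. It marks the matching columns and checks that removing the null characters gives back the unpadded sequences.

```python
def match_line(a: str, b: str) -> str:
    """Return '|' under each column where both rows hold the same character."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return "".join("|" if x == y and x != "-" else " " for x, y in zip(a, b))


def unalign(a: str) -> str:
    """Strip the null characters to recover the original sequence."""
    return a.replace("-", "")


s1 = "ACGTA-GTACGT"
s2 = "AC-TACGTAGGT"

print("S1: " + s1)
print("    " + match_line(s1, s2))
print("S2: " + s2)
print(unalign(s1), unalign(s2))
```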
If you care to think of S1 as the ancestor and of S2 as the descendant then
a `-' in S2 indicates a deletion and
a `-' in S1 indicates an insertion;
insertions and deletions are collectively called indels.
However you can just as well consider S2 to be the ancestor and
S1 the descendant.
If S1 and S2 are present-day sequences related by evolutionary history
then they are both descendants of some unknown hypothetical ancestor.
Generally, a match (identical characters) is "good"
and has a high score (low cost), whereas
a mismatch, an insertion or a deletion is "bad" and
has a low score (high cost).
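As a concrete illustration, here is a minimal sketch of a cost function on a two-sequence alignment. The particular costs (0 for a match, 1 for a mismatch, 1 for each indel) are an assumption chosen for simplicity, not values given in the text, and the name `alignment_cost` is hypothetical.

```python
def alignment_cost(a: str, b: str, mismatch: int = 1, indel: int = 1) -> int:
    """Sum the column costs of an alignment; a match costs 0 (assumed costs)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    cost = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column may not contain two null characters")
        if x == "-" or y == "-":
            cost += indel        # insertion or deletion
        elif x != y:
            cost += mismatch     # two different characters
    return cost


print(alignment_cost("ACGTA-GTACGT", "AC-TACGTAGGT"))
```

Under these assumed costs the example alignment has two indels and one mismatch.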
An alignment that optimises the criterion is optimal.
An optimal alignment can be found with a
dynamic programming algorithm [DPA].
In general, there may be more than one optimal alignment.
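The following is a minimal sketch of one such dynamic programming algorithm, in its simplest minimum-cost form (essentially edit distance with a traceback). The costs are again assumed to be 0 for a match and 1 for a mismatch or indel, and the function name `align` is hypothetical. It is not necessarily the formulation meant by [DPA]. When several alignments tie, the traceback simply returns one of them.

```python
def align(s1: str, s2: str, mismatch: int = 1, indel: int = 1):
    """Return (cost, row1, row2) for one minimum-cost global alignment."""
    n, m = len(s1), len(s2)
    # d[i][j] = minimum cost of aligning s1[:i] with s2[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else mismatch
            d[i][j] = min(d[i - 1][j - 1] + sub,   # match or mismatch
                          d[i - 1][j] + indel,     # s1[i-1] against a null
                          d[i][j - 1] + indel)     # s2[j-1] against a null

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = 0 if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else mismatch
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + sub:
            row1.append(s1[i - 1]); row2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + indel:
            row1.append(s1[i - 1]); row2.append("-"); i -= 1
        else:
            row1.append("-"); row2.append(s2[j - 1]); j -= 1
    return d[n][m], "".join(reversed(row1)), "".join(reversed(row2))


cost, a1, a2 = align("ACGTAGTACGT", "ACTACGTAGGT")
print(cost)
print(a1)
print(a2)
```

The table has (n+1) x (m+1) entries, so time and space are both proportional to the product of the sequence lengths.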
If there are more than two sequences to align, the process is called
multiple alignment.
There are various ways of defining a score function or a cost function
on multiple alignments.
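One widely used choice, shown here purely as an example and not as the definition intended by the text, is the sum-of-pairs cost: every column is charged the total of the pairwise costs between its characters. The costs and the names `pair_cost` and `sum_of_pairs_cost` are assumptions made for this sketch. A pair of two nulls is taken to cost nothing.

```python
from itertools import combinations


def pair_cost(x: str, y: str, mismatch: int = 1, indel: int = 1) -> int:
    """Cost of one pair of characters within a column (assumed costs)."""
    if x == "-" and y == "-":
        return 0                      # two nulls: nothing to charge
    if x == "-" or y == "-":
        return indel
    return 0 if x == y else mismatch


def sum_of_pairs_cost(rows, mismatch: int = 1, indel: int = 1) -> int:
    """Sum pair_cost over every pair of rows in every column."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_cost(x, y, mismatch, indel)
               for column in zip(*rows)
               for x, y in combinations(column, 2))


print(sum_of_pairs_cost(["ACGT", "A-GT", "ACCT"]))
```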
Alignment of three sequences
is a useful special case because each internal node
of an unrooted binary phylogenetic tree has three neighbours.
|