Alignment |
|
To align two or more sequences is to pad out each sequence with null "characters", ‘-’, until they have the same length and in such a way as to optimise some criterion -- to maximise a score or to minimise a cost. The sequences can then be written out one above the other, e.g.,
S1: ACGTA-GTACGT
|| || ||| ||
S2: AC-TACGTAGGT
with their characters lining up in columns.
If you care to think of S1 as the ancestor and of S2 as the descendant then a ‘-’ in S2 indicates a deletion and a ‘-’ in S1 indicates an insertion; insertions and deletions are collectively called indels. However you can just as well consider S2 to be the ancestor and S1 the descendant. If S1 and S2 are present day sequences related by evolutionary history then they are both descendants of some unknown hypothetical ancestor. Generally, a match (identical characters) is "good" and has a high score (low cost), and mismatch, insertion & deletion are "bad" and have low scores (high costs). An alignment that optimises the criterion is optimal. An optimal alignment can be found with a dynamic programming algorithm [DPA]. In general, there may be more than one optimal alignment, even a great many. If there are more than two sequences to align the process is called
Alignment Probability DensityIn general there can be a great many optimal and near optimal alignments. It may be impossible to say which is the correct one; the data may not justify selecting a single one. It is possible to calculate an alignment probability density plot, e.g., ![]() Chicken hemoglobin α chain v. β chain, CHKHBAM[120..195] (top) v. CHKHBBM[170..255] (left), three-state machine (linear costs), alignment probability density plot, J. Molec. Evol., 35(1), pp.77-89, 1992 [doi:10.1007/BF00160262]. Such probabilistic alignment plots show the degree of uncertainty in a problem. It is also possible to sum the probability of all alignments (still in O(|S1|*|S2|)-time) and answer the question of whether S1 and S2 are related or not, even in the "grey zone" of distantly related sequences. Even when no single alignment is statistically acceptable, their sum may be. Unequal SymbolsNot all symbols (bases or amino acids) are created equal.
For example, Plasmodium falciparum is ~80% AT-rich,
so a C or a G is much rarer than an A or a T, and is worth much more
Bias in Distance EstimatesNote that optimal aligments give biased estimates of the parameters of evolutionary relatedness, particular for distant relationships, but average alignments give unbiased estimates [JME], [Bioinf]. |
|
|