Alignment

LA home
Computing
Bioinformatics
 Glossary
 Alignment
 Multiple
 Three

  AI2004
  CompJ99
  IPL99
  IPL94
  Bioinf93
  JTB93
  JME92
  IPL92
  AIM90
  BMB90
  IPL86

also see
 DPA

To align two or more sequences is to pad out each sequence with null "characters", ‘-’, until they have the same length and in such a way as to optimise some criterion -- to maximise a score or to minimise a cost. The sequences can then be written out one above the other, e.g.,

  S1: ACGTA-GTACGT
      || || ||| ||
  S2: AC-TACGTAGGT
with their characters lining up in columns.

If you care to think of S1 as the ancestor and of S2 as the descendant then a ‘-’ in S2 indicates a deletion and a ‘-’ in S1 indicates an insertion; insertions and deletions are collectively called indels. However you can just as well consider S2 to be the ancestor and S1 the descendant. If S1 and S2 are present day sequences related by evolutionary history then they are both descendants of some unknown hypothetical ancestor.

Generally, a match (identical characters) is "good" and has a high score (low cost), and mismatch, insertion & deletion are "bad" and have low scores (high costs).

An alignment that optimises the criterion is optimal. An optimal alignment can be found with a dynamic programming algorithm [DPA]. In general, there may be more than one optimal alignment, even a great many.

If there are more than two sequences to align the process is called [multiple alignment]. There are various ways of defining a score function or a cost function on multiple alignments. Alignment of [three sequences] is a useful special case because each internal node of a [phylogenetic tree] (evolutionary tree) has three neighbours.

Alignment Probability Density

In general there can be a great many optimal and near optimal alignments. It may be impossible to say which is the correct one; the data may not justify selecting a single one. It is possible to calculate an alignment probability density plot, e.g.,

visualization of the probabilistic alignment plot of Chicken hemoglobin alpha and beta chains, CHKHBAM and CHKHBBM, J. Molec. Ecol., 35(1), pp.77-89, 1992
Chicken hemoglobin α chain v. β chain,
CHKHBAM[120..195] (top) v. CHKHBBM[170..255] (left),
three-state machine (linear costs),
alignment probability density plot,
J. Molec. Evol., 35(1), pp.77-89, 1992
[doi:10.1007/BF00160262].

Such probabilistic alignment plots show the degree of uncertainty in a problem.

It is also possible to sum the probability of all alignments (still in O(|S1|*|S2|)-time) and answer the question of whether S1 and S2 are related or not, even in the "grey zone" of distantly related sequences. Even when no single alignment is statistically acceptable, their sum may be.

Unequal Symbols

Not all symbols (bases or amino acids) are created equal. For example, Plasmodium falciparum is ~80% AT-rich, so a C or a G is much rarer than an A or a T, and is worth much more ( - log 0.1 v. - log 0.4). The probability of a symbol can also depend on the context. For example, an A can be much more likely after a run of As, ...AAA<?>, or after ...TAT<?>. The alignment and the [compression] of sequences can be combined to model such effects and produce better alignments, [M Align] (modelling alignment).

Bias in Distance Estimates

Note that optimal aligments give biased estimates of the parameters of evolutionary relatedness, particular for distant relationships, but average alignments give unbiased estimates [JME], [Bioinf].

www

free:
Linux operating-sys
OpenOffice office-suite, ver. 3.1+
The GIMP ~photoshop
Firefox web browser
FlashBlock flash on/off

© L. Allison   http://www.allisons.org/ll/   (or as otherwise indicated),
Created with "vi (Linux + Solaris)",  charset=iso-8859-1,  fetched Tuesday, 09-Feb-2010 23:38:57 EST.