Modelling Alignment

Most sequence alignment algorithms assume that all characters (bases, amino acids, residues) are equal. However, it is well known that "low-information content" sequences, i.e., compressible sequences, can cause false-positive matches in sequence alignment. As just one example, the DNA of Plasmodium falciparum (a malaria parasite) is AT-rich, and it is possible to find a seemingly good alignment between two randomly-selected, unrelated sequences of such DNA — there are so many As and Ts that it is easy to find an alignment of the sequences that has many matches.

This problem that low-information sequences cause in alignment is well known. One ad hoc technique to reduce the problem is to shuffle the sequences and realign them. If the original alignment is not much better than that of the shuffled sequences it is judged to be invalid. Another ad hoc technique is to entirely mask-out low-information sections of sequences but this is throwing information away – the sections are of low-information but not zero-information.

Modelling-alignment [APD99] is a better way to solve the problem. It incorporates a statistical model of a family of sequences into the dynamic programming algorithm (DPA) so that it correctly weighs each character in context, and also the benefit of matching it to a character in the other sequence. This allows the information content of (i) the null-hypothesis (the sequences are unrelated) and of (ii) the alignment-hypothesis (the sequences are related in some way) to be calculated and compared. Modelling-alignment performs better than other methods giving fewer false-positives and fewer false-negatives [PAD04].

Below is a Javascript implmentation of the simplest version of modelling-alignment for simple gap costs. (It has also been extended [PAD04] to local- and global-alignment, to linear gap costs, and to the sum-of-probabilities over all alignments in addition to optimal alignment.)

The default values of sequences S and T below are unrelated. Each is a random string of 100 bases generated from the poly-AT-ish model on the right. S and T are both generated from that model but have nothing else in common.

S and T are aligned under two different assumptions. First, it is assumed that an individual sequence is "incompressible", that it is not a "low information" sequence. Second, it is (correctly) assumed that each individual sequence comes from a poly-AT-ish family, that it is a low-information (but not a zero-information) sequence.

You can change the values of S and T in the form.


          S[i]
    A    C    G    T
 .------------------->P(S[i]|S[i-1])
A| 1/12 1/12 1/12 9/12
 |
C| 9/20 1/20 1/20 9/20
 |
G| 9/20 1/20 1/20 9/20
 |
T| 9/12 1/12 1/12 1/12

References

[APD99] L. Allison, D. Powell and T. I. Dix, Compression and approximate matching, Comp. J., 42(1) pp.1-10, 1999: Initial paper on modelling-alignment, incorporating a model of a family of sequences into the dynamic programming algorithm for the optimal, global alignment of two sequences under simple gap costs. The model of the family can be almost any statistical model of sequences.
[PAD04] D. Powell, L. Allison and T. I. Dix, Modelling alignment for non-random sequences, Springer Verlag, LNCS/LNAI, vol.3339, pp.203-214, 2004: Extension of [APD99] to local- and global-alignment, and to linear gap costs, and to optimal-alignment and total probability over all alignments. ROC curves & other results from tests on real and artificial data show modelling-alignment being more accurate than standard algorithms as in FASTA.