Modelling Alignment
Most sequence alignment algorithms treat all characters (bases, amino acids, residues) as equal. However, it is well known that "low-information content" sequences, i.e., compressible sequences, can cause false-positive matches in sequence alignment. As just one example, the DNA of Plasmodium falciparum (a malaria parasite) is AT-rich, and it is possible to find a seemingly good alignment between two randomly selected, unrelated sequences of such DNA: there are so many As and Ts that it is easy to find an alignment of the sequences that has many matches.

One ad hoc technique to reduce the problem is to shuffle the sequences and realign them; if the original alignment is not much better than that of the shuffled sequences, it is judged to be invalid. Another ad hoc technique is to mask out low-information sections of the sequences entirely, but this throws information away: the sections are of low information, not zero information.

Modelling-alignment [APD99] is a better way to solve the problem. It incorporates a statistical model of a family of sequences into the dynamic programming algorithm (DPA) so that each character is weighed correctly in its context, as is the benefit of matching it to a character in the other sequence. This allows the information content of (i) the null hypothesis (the sequences are unrelated) and of (ii) the alignment hypothesis (the sequences are related in some way) to be calculated and compared. Modelling-alignment performs better than other methods, giving fewer false positives and fewer false negatives [PAD04].

Below is a JavaScript implementation of the simplest version of modelling-alignment, for simple gap costs. (It has also been extended [PAD04] to local and global alignment, to linear gap costs, and to the sum of probabilities over all alignments in addition to optimal alignment.)
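
The following is not the implementation referred to above but an independent, minimal sketch of the general idea, written under several assumptions that are not taken from [APD99] or [PAD04]: the sequence model is a simple adaptive order-0 model over the alphabet ACGT, the operation probabilities (pMatch, pMismatch, pIndel) are hypothetical fixed values rather than estimated ones, a matched character is charged at the average of the two models' predictions, and a mismatch is charged for both characters without renormalising for their being different. All function names are illustrative. It computes the cost in bits of the null hypothesis and of the best global alignment with simple gap costs, and reports the difference.

```javascript
// Illustrative sketch only; assumes upper-case DNA over the alphabet ACGT.
const ALPHABET = "ACGT";
const bits = (p) => -Math.log2(p);

// Adaptive order-0 model (an assumption; any predictive sequence model
// could be substituted). Returns P(seq[i] | seq[0..i-1]) for each i,
// with every character count starting at 1.
function charProbs(seq) {
  const counts = {};
  for (const c of ALPHABET) counts[c] = 1;
  let total = ALPHABET.length;
  const probs = [];
  for (const c of seq) {
    probs.push(counts[c] / total);
    counts[c] += 1;
    total += 1;
  }
  return probs;
}

// Null hypothesis: the sequences are unrelated, so each is encoded on its own.
// (The cost of stating the sequence lengths is ignored in this sketch.)
function nullCost(s1, s2) {
  const sum = (ps) => ps.reduce((acc, p) => acc + bits(p), 0);
  return sum(charProbs(s1)) + sum(charProbs(s2));
}

// Alignment hypothesis: cost in bits of an optimal global alignment with
// simple gap costs. The default probabilities are hypothetical and satisfy
// pMatch + pMismatch + 2 * pIndel = 1.
function alignCost(s1, s2, pMatch = 0.7, pMismatch = 0.1, pIndel = 0.1) {
  const p1 = charProbs(s1), p2 = charProbs(s2);
  const m = s1.length, n = s2.length;
  const D = Array.from({ length: m + 1 }, () => new Array(n + 1).fill(Infinity));
  D[0][0] = 0;
  for (let i = 0; i <= m; i++) {
    for (let j = 0; j <= n; j++) {
      if (i > 0) { // delete s1[i-1]: pay for the operation and the character
        D[i][j] = Math.min(D[i][j], D[i - 1][j] + bits(pIndel) + bits(p1[i - 1]));
      }
      if (j > 0) { // insert s2[j-1]
        D[i][j] = Math.min(D[i][j], D[i][j - 1] + bits(pIndel) + bits(p2[j - 1]));
      }
      if (i > 0 && j > 0) {
        const step = s1[i - 1] === s2[j - 1]
          // match: the shared character is stated once, in context
          ? bits(pMatch) + bits((p1[i - 1] + p2[j - 1]) / 2)
          // mismatch: both characters are stated
          : bits(pMismatch) + bits(p1[i - 1]) + bits(p2[j - 1]);
        D[i][j] = Math.min(D[i][j], D[i - 1][j - 1] + step);
      }
    }
  }
  return D[m][n];
}

// Positive values favour the alignment hypothesis over the null hypothesis.
function logOdds(s1, s2) {
  return nullCost(s1, s2) - alignCost(s1, s2);
}
```

Under these assumptions, calling logOdds on two identical poly-A sequences yields much less evidence of relatedness than calling it on two identical sequences of mixed composition and the same length, because the poly-A characters are cheap to state under the null hypothesis as well. This is the sense in which low-information characters are weighed in context rather than masked out.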
References
© L. Allison, www.allisons.org/ll/ (or as otherwise indicated).