The Computer Journal, Volume 42, Issue 1, pp.1-10, 1999.
L. Allison, D. Powell and T. I. Dix
School of Computer Science and Software Engineering,
Monash University, Australia 3168
A population of sequences is called non-random if there is a
statistical model and an associated compression algorithm that
allows members of the population to be compressed, on average.
Any available statistical model of a population should be
incorporated into algorithms for alignment of the sequences and
doing so changes the rank order of possible alignments in general.
The model should also be used in deciding if a resulting approximate match
between two sequences is significant or not.
It is shown how to do this for two plausible interpretations involving
pairs of sequences that might or might not be related.
Efficient alignment algorithms are described for quite
general statistical models of sequences.
The new alignment algorithms are more sensitive to
what might be termed 'features' of the sequences.
A natural significance test is shown to be rarely fooled by
apparent similarities between two sequences that are
merely typical of all or most members of the population,
even unrelated members.
- Also see:
- D. R. Powell,
T. I. Dix,
Modelling-Alignment for Non-Random Sequences,
LNCS/LNAI 3339 (AI2004),
- which generalised the method to
linear gap costs,
local alignment and global alignment,
optimal alignment and the relatedness problem
(i.e., the sum of all probabilistic alignments).
The method makes the symbol-matching
scoring function context-sensitive and
allows ‘features’ of sequences to be weighted appropriately --
above or below average as necessary.