Most sequence alignment algorithms assume that
all characters (bases, amino acids, residues) are equal.
However, it is well known that "low-information content" sequences,
i.e., compressible sequences,
can cause false-positive matches in sequence alignment.
As just one example,
the DNA of Plasmodium falciparum (a malaria parasite) is AT-rich, and
it is possible to find a seemingly good alignment between
two randomly-selected, unrelated sequences of such DNA —
there are so many As and Ts that it is easy to find an alignment of
the sequences that has many matches.
This problem that low-information sequences cause in
alignment is well known.
One ad hoc technique to reduce the problem is to
shuffle the sequences and realign them.
If the original alignment is not much better than that of
the shuffled sequences it is judged to be invalid.
Another ad hoc technique is to entirely mask-out low-information sections
of sequences but this is throwing information away –
the sections are of low-information but not zero-information.
is a better way to solve the problem.
It incorporates a statistical model of a family of sequences into the
dynamic programming algorithm (DPA) so that it correctly weighs
each character in context, and also
the benefit of matching it to a character in the other sequence.
This allows the information content of
(i) the null-hypothesis (the sequences are unrelated) and of
(ii) the alignment-hypothesis (the sequences are related in some way)
to be calculated and compared.
Modelling-alignment performs better than other methods giving
fewer false-positives and
fewer false-negatives [PAD04].
of modelling-alignment for simple gap costs.
(It has also been extended [PAD04]
to local- and global-alignment,
to linear gap costs, and
to the sum-of-probabilities over all alignments
in addition to optimal alignment.)
The default values of sequences S and T below are unrelated.
Each is a random string of 100 bases generated from
the poly-AT-ish model on the right.
S and T are both generated from that model but have nothing else in common.
S and T are aligned under two different assumptions.
First, it is assumed that an individual sequence is "incompressible",
that it is not a "low information" sequence.
Second, it is (correctly) assumed that
each individual sequence comes from a poly-AT-ish family,
that it is a low-information (but not a zero-information) sequence.
You can change the values of S and T in the form.
A C G T
A| 1/12 1/12 1/12 9/12
C| 9/20 1/20 1/20 9/20
G| 9/20 1/20 1/20 9/20
T| 9/12 1/12 1/12 1/12
Initial paper on modelling-alignment,
incorporating a model of a family of sequences into
the dynamic programming algorithm for the
optimal, global alignment of two sequences under simple gap costs.
The model of the family can be almost any statistical model
Extension of [APD99]
to local- and global-alignment, and
to linear gap costs, and
to optimal-alignment and total probability over all alignments.
ROC curves & other results from tests on real and artificial data
show modelling-alignment being more accurate than
standard algorithms as in FASTA.