Most sequence alignment algorithms assume that
all characters (bases, amino acids, residues) are equal.
However, it is well known that "lowinformation content" sequences,
i.e., compressible sequences,
can cause falsepositive matches in sequence alignment.
As just one example,
the DNA of Plasmodium falciparum (a malaria parasite) is ATrich, and
it is possible to find a seemingly good alignment between
two randomlyselected, unrelated sequences of such DNA —
there are so many As and Ts that it is easy to find an alignment of
the sequences that has many matches.
This problem that lowinformation sequences cause in
alignment is well known.
One ad hoc technique to reduce the problem is to
shuffle the sequences and realign them.
If the original alignment is not much better than that of
the shuffled sequences it is judged to be invalid.
Another ad hoc technique is to entirely maskout lowinformation sections
of sequences but this is throwing information away –
the sections are of lowinformation but not zeroinformation.
Modellingalignment [APD99]
is a better way to solve the problem.
It incorporates a statistical model of a family of sequences into the
dynamic programming algorithm (DPA) so that it correctly weighs
each character in context, and also
the benefit of matching it to a character in the other sequence.
This allows the information content of
(i) the nullhypothesis (the sequences are unrelated) and of
(ii) the alignmenthypothesis (the sequences are related in some way)
to be calculated and compared.
Modellingalignment performs better than other methods giving
fewer falsepositives and
fewer falsenegatives [PAD04].
Below is a Javascript implmentation of the simplest version
of modellingalignment for simple gap costs.
(It has also been extended [PAD04]
to local and globalalignment,
to linear gap costs, and
to the sumofprobabilities over all alignments
in addition to optimal alignment.)
The default values of sequences S and T below are unrelated.
Each is a random string of 100 bases generated from
the polyATish model on the right.
S and T are both generated from that model but have nothing else in common.
S and T are aligned under two different assumptions.
First, it is assumed that an individual sequence is "incompressible",
that it is not a "low information" sequence.
Second, it is (correctly) assumed that
each individual sequence comes from a polyATish family,
that it is a lowinformation (but not a zeroinformation) sequence.
You can change the values of S and T in the form.
S[i]
A C G T
.>P(S[i]S[i1])
A 1/12 1/12 1/12 9/12

C 9/20 1/20 1/20 9/20

G 9/20 1/20 1/20 9/20

T 9/12 1/12 1/12 1/12
Initial paper on modellingalignment,
incorporating a model of a family of sequences into
the dynamic programming algorithm for the
optimal, global alignment of two sequences under simple gap costs.
The model of the family can be almost any statistical model
of sequences.
Extension of [APD99]
to local and globalalignment, and
to linear gap costs, and
to optimalalignment and total probability over all alignments.
ROC curves & other results from tests on real and artificial data
show modellingalignment being more accurate than
standard algorithms as in FASTA.