Minh Duc Cao, Trevor I. Dix, and Lloyd Allison
BMC Bioinformatics, 11:599, 2010.
Traditional genome alignment methods consider sequence alignment as
a variation of the string edit distance problem, and
perform alignment by matching characters of the two sequences.
They are often computationally expensive and unable to deal with
low-information regions. Furthermore, they lack a well-principled
objective function for measuring how well a given set of parameters performs.
Since genomic sequences carry genetic information, this article
proposes that the information content of each nucleotide at its position
should be considered in sequence alignment. An information-theoretic
approach for pairwise genome local alignment, namely XMAligner,
is presented. Instead of comparing sequences at the character level,
XMAligner considers a pair of nucleotides from two sequences to
be related if their mutual information in context is significant.
The information content of nucleotides in sequences is measured by
a lossless compression technique.
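
As a rough illustration of this idea, the sketch below costs each nucleotide in bits under a simple adaptive order-k context model and then scores an aligned pair by how many bits the partner nucleotide saves. It is not XMAligner's actual compression model, which this abstract does not describe; the function names, the `order` and `p_match` parameters, and the fixed-probability conditional model are all assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def information_content(seq, order=2):
    """Cost of each nucleotide in bits: -log2 P(symbol | previous `order` symbols).

    Counts are learned adaptively from the part of the sequence already seen,
    starting from a count of 1 per symbol (a stand-in for a real compressor).
    """
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 1))
    costs = []
    for i, sym in enumerate(seq):
        context = seq[max(0, i - order):i]
        table = counts[context]
        costs.append(-math.log2(table[sym] / sum(table.values())))
        table[sym] += 1  # update the model only after costing the symbol
    return costs

def conditional_cost(sym, partner, p_match=0.9):
    """Hypothetical cost of `sym` given its aligned partner: the partner's base
    is predicted with probability p_match, the other three share the rest."""
    p = p_match if sym == partner else (1.0 - p_match) / 3.0
    return -math.log2(p)

def mutual_information(x, y, offset=0, order=2, p_match=0.9):
    """Bits saved at each position i of x by knowing y[i + offset].

    Positive values suggest the pair is related; positions with no partner
    in y score 0.
    """
    base = information_content(x, order)
    scores = []
    for i, sym in enumerate(x):
        j = i + offset
        if 0 <= j < len(y):
            scores.append(base[i] - conditional_cost(sym, y[j], p_match))
        else:
            scores.append(0.0)
    return scores

if __name__ == "__main__":
    repeat = "A" * 20
    print(information_content(repeat)[:4])
    print(mutual_information(repeat, repeat)[-1])
```

In this toy model a matching pair inside a long run of one base earns far fewer bits than a matching pair in an unpredictable context, which is the sense in which a low-information region contributes little evidence of relatedness.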
Experiments on both simulated and real data show that
XMAligner outperforms conventional methods, especially on
distantly related sequences and statistically biased data.
XMAligner can align sequences of eukaryote genome size with
only a modest hardware requirement. Importantly, the method
has an objective function which can obviate the need to choose
parameter values for high quality alignment.
The alignment results from XMAligner can be integrated into
a visualisation tool for viewing.
The information-theoretic approach to sequence alignment is shown
to overcome the problems of conventional character-matching
alignment methods described above. The article shows that,
as genomic sequences are meant to carry information,
considering the information content of nucleotides is
helpful for genomic sequence alignment.
Downloadable binaries, documentation and data can be found