Minh Duc Cao, Trevor I. Dix, and Lloyd Allison
- Springer Verlag, LNCS 5476/2009, PAKDD09,
Substitution matrices describe the rates at which one character in a
biological sequence mutates to another, and are important for many
knowledge discovery tasks such as phylogenetic analysis and sequence alignment.
Computing substitution matrices for very long genomic sequences of divergent
or even unrelated species requires sensitive algorithms that can take into
account differences in composition of the sequences.
We present a novel algorithm that addresses this by computing a
nucleotide substitution matrix specifically for the two genomes being aligned.
The method is founded on information theory and operates within the
expectation maximisation framework.
The algorithm iterates: it uses compression to align the sequences, estimates
the matrix from the alignment, and then applies the matrix to find a better
alignment, repeating until convergence. Our method reconstructs, with high accuracy,
the substitution matrix for synthesised data generated from a known matrix with
introduced noise. The model is then successfully applied to real data for
various malaria parasite genomes, which have differing phylogenetic distances
and composition that lessens the effectiveness of standard statistical analysis