Compression and Approximate Matching

LA home
Computing
 COMPJ99

Also see
 M-Align
Bioinformatics

The Computer Journal, Volume 42, Issue 1, pp.1-10, 1999.

L. Allison, D. Powell and T. I. Dix

School of Computer Science and Software Engineering, Monash University, Australia 3168

Abstract

A population of sequences is called non-random if there is a statistical model and an associated compression algorithm that allows members of the population to be compressed, on average. Any available statistical model of a population should be incorporated into algorithms for alignment of the sequences and doing so changes the rank order of possible alignments in general. The model should also be used in deciding if a resulting approximate match between two sequences is significant or not. It is shown how to do this for two plausible interpretations involving pairs of sequences that might or might not be related. Efficient alignment algorithms are described for quite general statistical models of sequences. The new alignment algorithms are more sensitive to what might be termed 'features' of the sequences. A natural significance test is shown to be rarely fooled by apparent similarities between two sequences that are merely typical of all or most members of the population, even unrelated members.

Link: Computer Journal [doi:10.1093/comjnl/42.1.1]['05] and [pdf]['05].


Also see: D. R. Powell, L. Allison, T. I. Dix, Modelling-Alignment for Non-Random Sequences, 17th ACS Australian Joint Conf. on Artificial Intelligence (AI2004), Springer-Verlag, LNCS/LNAI 3339, isbn:3-540-24059-4, pp.203-214, 2004, (including software).


The method makes the character-matching scoring function context dependent and allows "features" of sequences to be weighted appropriately.

www

free:
Linux operating-sys
OpenOffice office-suite, ver. 2.4+
The GIMP ~photoshop
Firefox web browser
FlashBlock flash on/off

© L. Allison   http://www.allisons.org/ll/   (or as otherwise indicated),
Created with "vi (Linux + Solaris)",  charset=iso-8859-1,  fetched Thursday, 20-Nov-2008 22:25:18 EST.