A Simple Statistical Algorithm for Biological Sequence Compression

Minh Duc Cao, Trevor I. Dix, Lloyd Allison, Chris Mears

Abstract: This paper introduces a novel algorithm for biological sequence compression that makes use of both statistical properties and repetition within sequences. A panel of experts is maintained to estimate the probability distribution of the next symbol in the sequence to be encoded. Expert probabilities are combined to obtain the final distribution. The resulting information sequence provides insight for further study of the biological sequence. Each symbol is then encoded by arithmetic coding. Experiments show that our algorithm outperforms existing compressors on typical DNA and protein sequence datasets while maintaining a practical running time.
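The pipeline in the abstract (experts predict, predictions are blended, the per-symbol information content is recorded, an arithmetic coder consumes the blended distribution) can be illustrated with a minimal sketch. This is not the XM implementation. The `OrderKExpert` class, the add-one smoothing, and the Bayesian weight update are assumptions chosen for illustration; the abstract does not specify which experts the paper uses or how it combines them. The arithmetic coder itself is omitted, since its output length is close to the sum of the per-symbol costs computed here.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumed DNA alphabet; proteins would use 20 symbols


class OrderKExpert:
    """Hypothetical expert: order-k context counts with add-one smoothing."""

    def __init__(self, k):
        self.k = k
        # Every context starts with a count of 1 per symbol (uniform prior).
        self.counts = defaultdict(lambda: {s: 1.0 for s in ALPHABET})

    def _context(self, history):
        # The last k symbols; order 0 uses a single empty context.
        return history[-self.k:] if self.k > 0 else ""

    def predict(self, history):
        c = self.counts[self._context(history)]
        total = sum(c.values())
        return {s: c[s] / total for s in ALPHABET}

    def update(self, history, symbol):
        self.counts[self._context(history)][symbol] += 1.0


def information_sequence(seq, experts):
    """Return the cost in bits of each symbol under the blended prediction."""
    weights = [1.0 / len(experts)] * len(experts)
    info = []
    for i, symbol in enumerate(seq):
        history = seq[:i]
        preds = [e.predict(history) for e in experts]
        # Blend the expert distributions using the current weights.
        mixed = {s: sum(w * p[s] for w, p in zip(weights, preds))
                 for s in ALPHABET}
        # Ideal code length of this symbol; an arithmetic coder gets close.
        info.append(-math.log2(mixed[symbol]))
        # Assumed Bayesian update: reward experts that predicted the symbol.
        weights = [w * p[symbol] for w, p in zip(weights, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
        for e in experts:
            e.update(history, symbol)
    return info
```

Calling `information_sequence("ACGT" * 50, [OrderKExpert(k) for k in (0, 1, 2)])` returns one cost per symbol. The first symbol costs exactly 2 bits because every expert starts uniform, and the costs fall as the context experts learn the repeat. Regions where the cost stays high are the ones a model of this kind finds hard to predict, which is the sense in which the information sequence can guide further study.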
 
Minh Duc Cao, Trevor I. Dix, Lloyd Allison, Chris Mears.
A Simple Statistical Algorithm for Biological Sequence Compression.
IEEE Data Compression Conference (DCC), pp. 43-52, 2007.
doi:10.1109/DCC.2007.7
On the eXpert Model (XM) for data compression of DNA and protein sequences.


© L. Allison   http://www.allisons.org/ll/   (or as otherwise indicated)