Compression

LA home
Computing
Bioinformatics
 Glossary
 Compression
  DCC07
  BMCbio07
  AI04
  MBP01
  CC00
  CompJ99
  ISMB98

Also see
MML

The information in learning of an event E of probability pr(E) is -log2(pr(E)) bits. For example, if the four DNA bases {A,C,G,T} each occured 1/4 of the time, an optimal code would be

A 00,
C 01,
G 10,
T 11

Note that -log2(0.25)=4: Each base would be worth 2-bits of information. However, if the probabilities of the bases were A 1/2, C 1/4, G 1/8, T 1/8, say, then

A 0,
C 10,
G 110,
T 111

would be optimal; note -log2(1/8)=3, etc.. In this case the average code length would be

0.5×1 + 0.25×2 + 0.125×3 + 0.125×3 = 13/4-bits/base

which is less than before. A G or a T contains more information than a C which contains more information than an A. There is more information on information under [MML].

In general the probability of the next symbol, S[i], of a sequence, S, may depend on previous symbols, and then we deal with conditional probabilities pr(S[i] | S[1..i-1]). This gives much better compression of DNA and protein sequences [Cao et al 07] [Allison et al 98].

Information content can be used to discover patterns, repeats, gene duplications and the like in sequences. It can also give a distance between DNA sequences or protein sequences, for classification or for inference of phylogenetic (evolutionary) trees, without aligning the sequences. And "costing" the symbols in an alignment according to the symbols' information content gives an alignment algorithm that is more accurate in detecting genuine relatedness in populations of non-random (repetitive, low information content, compressible) sequences [Allison 1999] [Powell et al 2004].

www

free:
Linux operating-sys
OpenOffice office-suite, ver. 2.4+
The GIMP ~photoshop
Firefox web browser
FlashBlock flash on/off

© L. Allison   http://www.allisons.org/ll/   (or as otherwise indicated),
Created with "vi (Linux + Solaris)",  charset=iso-8859-1,  fetched Saturday, 22-Nov-2008 14:49:44 EST.