Compression
The information gained in learning of an event E of probability pr(E) is -log2(pr(E)) bits. For example, if the four DNA bases {A,C,G,T} each occurred 1/4 of the time, an optimal code would be a fixed-length 2-bit code, for example A->00, C->01, G->10, T->11.
Note that -log2(0.25)=2:
Each base would be worth 2 bits of information.
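The definition above can be checked numerically. The following is a minimal sketch; the function name `information_bits` and the dictionary `uniform` are illustrative assumptions, not part of any library.

```python
import math

def information_bits(p: float) -> float:
    """Information, in bits, gained on learning of an event of probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    # log2(1/p) is the same quantity as -log2(p).
    return math.log2(1.0 / p)

# Assumed uniform model: each of the four bases has probability 1/4.
uniform = {base: 0.25 for base in "ACGT"}
bits_per_base = {base: information_bits(p) for base, p in uniform.items()}
print(bits_per_base)
```

Under the uniform model every base costs the same number of bits, which is why a fixed-length code is optimal in that case.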
However, if the probabilities of the bases were
would be optimal; note -log2(1/8)=3, etc. In this case the average code length would be
which is less than before.
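The average code length for a skewed distribution can be worked through in a few lines. The probabilities and code words in this sketch are assumptions chosen only for illustration (they are consistent with the remark that -log2(1/8)=3), and the helper names `average_code_length`, `entropy_bits` and `encode` are hypothetical.

```python
import math

# Assumed, illustrative base probabilities (all powers of 1/2).
probs = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}

# An assumed prefix code whose word lengths equal -log2(pr(base)).
code = {"A": "0", "C": "10", "G": "110", "T": "111"}

def average_code_length(probs, code):
    """Expected number of bits per symbol under the given code."""
    return sum(p * len(code[s]) for s, p in probs.items())

def entropy_bits(probs):
    """Expected information per symbol, in bits."""
    return sum(p * math.log2(1.0 / p) for p in probs.values() if p > 0)

def encode(seq, code):
    """Concatenate the code words of the symbols in seq."""
    return "".join(code[s] for s in seq)

print(average_code_length(probs, code))  # bits per base under the assumed code
print(entropy_bits(probs))               # lower bound for any code
print(encode("AACAG", code))
```

With these assumed values the average code length equals the entropy, so no code can do better on average, and both are below the 2 bits per base of the uniform case.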
In general the probability of the next symbol, S[i], of a sequence, S, may depend on previous symbols, and then we deal with conditional probabilities, pr(S[i] | S[1..i-1]).

Information content can be used to discover patterns, repeats, gene duplications and the like in sequences. It can also give a distance between DNA sequences or protein sequences, for classification or for inference of phylogenetic (evolutionary) trees, without aligning the sequences. And "costing" the symbols in an alignment according to the symbols' information content gives an alignment algorithm that is more accurate in detecting genuine relatedness in populations of non-random (repetitive, low information content, compressible) sequences [Allison 1999] [Powell et al 2004].
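The idea of conditional probabilities can be illustrated with a small adaptive Markov model that estimates the probability of each base from the bases just before it and sums the resulting information. This is a minimal sketch under stated assumptions, not the method of [Allison 1999] or [Powell et al 2004]: the function name `sequence_information`, the context length `order`, and the add-one starting counts are all illustrative choices.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def sequence_information(seq: str, order: int = 2) -> float:
    """Total bits to encode seq under an adaptive order-`order` Markov model.

    Assumption: every context starts with a count of 1 for each base,
    and counts are updated after each symbol is costed.
    """
    counts = defaultdict(lambda: {b: 1 for b in ALPHABET})
    total_bits = 0.0
    for i, symbol in enumerate(seq):
        # Context = up to `order` preceding symbols (shorter at the start).
        context = seq[max(0, i - order):i]
        ctx_counts = counts[context]
        # Estimated conditional probability of this symbol given its context.
        p = ctx_counts[symbol] / sum(ctx_counts.values())
        total_bits += math.log2(1.0 / p)
        ctx_counts[symbol] += 1  # learn from the symbol just seen
    return total_bits

repetitive = "ACGT" * 25
print(sequence_information(repetitive) / len(repetitive))  # bits per base
```

On a highly repetitive sequence such as the one above, the model quickly learns which base follows each context, so the cost per base falls well below the 2 bits of the uniform model. That drop is the sense in which such a sequence is compressible and of low information content.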