
L. Allison^{a}, L. Stern^{b},
T. Edgoose^{a} and T.I. Dix^{a}
(a) School of Computer Science and Software Engineering,
Monash University, Melbourne, 3168 Australia
(b) Department of Computer Science and Software Engineering,
The University of Melbourne, Melbourne, 3052 Australia
Received 7 August 1998; accepted 18 February 1999
Full paper: doi:10.1016/S0097-8485(00)80006-6 (also on ScienceDirect).
Abstract:
A new statistical model for DNA considers a sequence to be a mixture of
regions with little structure and regions that are approximate repeats of
other subsequences, i.e. instances of repeats do not need to match
each other exactly. Both forward and reverse-complementary repeats
are allowed. The model has a small number of parameters which are
fitted to the data. In general there are many explanations for a
given sequence, and it is shown how to compute the total probability
of the data given the model. Computer algorithms are described
for these tasks. The model can be used to compute the
information content of a sequence, either in total or base by base.
This amounts to looking at sequences from a data-compression point of
view and it is argued that this is a good way to tackle
intelligent sequence analysis in general.
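
To make "information content, either in total or base by base" concrete, here is a minimal sketch. It does not implement the approximate-repeats model of the paper; as a stand-in it uses a much simpler adaptive order-k Markov model, and the function name `per_base_information`, the `order` parameter and the add-one initial counts are all assumptions made for illustration. The cost of a base is -log2 of the probability the model gives it, and the total information content is the sum of those costs.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: input contains only these four bases


def per_base_information(seq, order=2):
    """Cost in bits of each base under an adaptive order-k Markov model.

    Illustrative stand-in only, not the repeats model of the paper.
    Every context starts with a count of 1 per base, and counts are
    updated only after a base has been costed, so a decoder seeing the
    same prefix could rebuild the same probabilities.
    """
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 1))
    bits = []
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]   # up to `order` preceding bases
        ctx_counts = counts[context]
        p = ctx_counts[base] / sum(ctx_counts.values())
        bits.append(-math.log2(p))           # information content of this base
        ctx_counts[base] += 1                # learn from the base just seen
    return bits


if __name__ == "__main__":
    seq = "ACGT" * 10
    bits = per_base_information(seq)
    print("total bits:", round(sum(bits), 2), "for", len(seq), "bases")
```

A sequence with no learnable structure costs about 2 bits per base under such a model, while a repetitive one costs less as the model adapts; plotting the per-base costs along the sequence shows where the compressible regions lie.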
Keywords:
Algorithm; DNA; Complexity; Entropy; Pattern discovery; Sequence analysis
(Also see
ISMB'98, pp.8-16, 1998 and
Mol. Biochem. Parasitology, 18(2), pp.175-186, 2001.)

