Linda Stern (a), Lloyd Allison (b),
Ross L. Coppel (c) and Trevor I. Dix (b)
(a) Department of Computer Science and Software Engineering,
The University of Melbourne, Melbourne, Victoria 3010, Australia.
(b) School of Computer Science and Software Engineering, Monash University,
Clayton, Victoria 3800, Australia.
(c) Department of Microbiology, Monash University,
Clayton, Victoria 3800, Australia.
A method has been developed for discovering patterns in DNA sequences.
Loosely based on the well-known Lempel-Ziv model for text compression,
the model detects repeated sequences in DNA. The repeats can be forward
or inverted, and they need not be exact. The method is particularly useful
for detecting distantly related sequences, and for finding patterns in
sequences of biased nucleotide composition, where spurious patterns are
often observed because the bias leads to coincidental nucleotide matches.
We show here the utility of the method by applying it to genomic sequences
of Plasmodium falciparum. A single scan of chromosomes 2 and 3
of P. falciparum, using our method and no other a priori information
about the sequences, reveals regions of low complexity in both telomeric and
central regions, long repeats in the subtelomeric regions, and shorter repeat
areas in dense coding regions. Application of the method to a recently
sequenced contig of chromosome 10 that has a particularly biased base
composition detects a long internal repeat more readily than does the
conventional dot matrix plot. Space requirements are linear in the sequence length, so the method
can be used on large sequences. The observed repeat patterns may be related
to large-scale chromosomal organization and control of gene expression.
The method has general application in detecting patterns of potential interest
in newly sequenced genomic material.
Keywords: Compression; information theory; repeated sequences; pattern discovery.
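
As a rough illustration of the idea in the abstract, the sketch below performs a Lempel-Ziv-style greedy scan that reports, for each position, the longest substring that has already occurred earlier in the sequence, either as a forward copy or as an inverted (reverse-complement) copy. It is not the authors' model: it handles exact repeats only, uses no probabilistic costs, and makes no attempt at efficiency. The function names and the `min_len` threshold are assumptions introduced for this example.

```python
from typing import List, Tuple

# Translation table for the DNA complement (illustrative; upper-case ACGT only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_earlier_match(seq: str, pos: int) -> Tuple[int, str]:
    """Longest substring starting at pos that occurs wholly within seq[:pos],
    either directly ("forward") or as its reverse complement ("inverted")."""
    prefix = seq[:pos]
    best_len, best_kind = 0, "none"
    length = 1
    while pos + length <= len(seq):
        word = seq[pos:pos + length]
        if word in prefix:
            best_len, best_kind = length, "forward"
        elif reverse_complement(word) in prefix:
            best_len, best_kind = length, "inverted"
        else:
            # If a word has no earlier copy, no extension of it has one either.
            break
        length += 1
    return best_len, best_kind


def parse_repeats(seq: str, min_len: int = 5) -> List[Tuple[int, int, str]]:
    """Greedy left-to-right scan; returns (start, length, kind) for each
    repeat of at least min_len bases (hypothetical threshold)."""
    events = []
    pos = 0
    while pos < len(seq):
        length, kind = longest_earlier_match(seq, pos)
        if length >= min_len:
            events.append((pos, length, kind))
            pos += length  # skip over the bases explained by the repeat
        else:
            pos += 1  # treat this base as a literal
    return events


if __name__ == "__main__":
    print(parse_repeats("ACCGTATTACCGTA"))   # forward repeat of ACCGTA
    print(parse_repeats("ACCGTAGGTACGGT"))   # inverted repeat of ACCGTA
```

A fixed length threshold such as `min_len` is exactly what becomes unreliable in sequences of biased composition, where short coincidental matches are common. The method described above instead weighs repeats in a compression framework.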