Pattern Discovery

Linda Stern^a, Lloyd Allison^b, Ross L. Coppel^c and Trevor I. Dix^b, Molecular and Biochemical Parasitology, 118(2), pp.175-186, doi:10.1016/S0166-6851(01)00388-7, 2001

(a) Department of Computer Science and Software Engineering, The University of Melbourne, Melbourne, Victoria 3010, Australia.
(b) School of Computer Science and Software Engineering, Monash University, Clayton, Victoria 3800, Australia.
(c) Department of Microbiology, Monash University, Clayton, Victoria 3800, Australia.

Abstract: A method has been developed for discovering patterns in DNA sequences. Loosely based on the well-known Lempel-Ziv model for text compression, the model detects repeated sequences in DNA. The repeats can be forward or inverted, and they need not be exact. The method is particularly useful for detecting distantly related sequences, and for finding patterns in sequences of biased nucleotide composition, where spurious patterns are often observed because the bias leads to coincidental nucleotide matches. We show here the utility of the method by applying it to genomic sequences of Plasmodium falciparum. A single scan of chromosomes 2 and 3 of P. falciparum, using our method and no other a priori information about the sequences, reveals regions of low complexity in both telomeric and central regions, long repeats in the subtelomeric regions, and shorter repeat areas in dense coding regions. Application of the method to a recently sequenced contig of chromosome 10 that has a particularly biased base composition detects a long internal repeat more readily than does the conventional dot matrix plot. Space requirements are linear, so the method can be used on large sequences. The observed repeat patterns may be related to large-scale chromosomal organization and control of gene expression. The method has general application in detecting patterns of potential interest in newly sequenced genomic material.
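
The abstract's remark that biased composition "leads to coincidental nucleotide matches" can be illustrated with a little arithmetic. The sketch below is not the authors' compression model; it only computes, for an assumed independent-base model, the chance that two random bases agree and the entropy in bits per base. The function names and the AT-rich frequencies (0.4 for A and T, 0.1 for C and G) are illustrative assumptions, not figures taken from the paper.

```python
import math

def match_probability(freqs):
    """Chance that two bases drawn independently from `freqs` are identical."""
    return sum(p * p for p in freqs.values())

def entropy_bits(freqs):
    """Shannon entropy in bits per base under an independent-base model."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)

# Hypothetical compositions, chosen only for illustration.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40}

for name, freqs in (("uniform", uniform), ("AT-rich", at_rich)):
    print(name, round(match_probability(freqs), 3), round(entropy_bits(freqs), 3))
```

Under these assumed frequencies the per-base match probability rises from 0.25 to 0.34, and the entropy falls below 2 bits per base. This is why a dot matrix plot of a biased sequence fills with chance matches, and why a repeat has to be judged against the sequence's own composition rather than against a uniform background.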

Keywords: Compression; information theory; repeated sequences; pattern discovery.

Link: [doi:10.1016/S0166-6851(01)00388-7] [5/'03] with pdf.
