|
- Java source file:
[Biased.java], and class files:
[1], [2]
- To run, e.g.
- java Biased "AT" 0.77 dataFileName
- for a particular figure (here 0.77 "AT"), or ...
- java Biased "AT" -1 dataFileName
- to run through a range of values.
- NB. Make sure that you have a just-in-time compiler if you want to
analyse long sequences.
- NB. Characters other than acgtuACGTUrnyRNY are skipped but
this is trivial to change.
- Example of running [time (click)].

-- Figure: cummulative sum, first-below & last-above --
Sung Kwon Kim suggested a 3-pass, (always) linear-time algorithm
(personnal communication, 28/7/2003).
I have coded this up in a program that runs both algorithms:
[new code .java].
|
21/10/2002: Examples from PlasmoDB C.D.,
seq's dated 2002.8.19 [19/8/2002]
NB. Plasmodium falciparum causes malaria.
Its DNA is 80% AT, contains repetitive patterns and is highly
[compressible]
compared to ``most'' other DNA.
The data files of chromosome sequences 10, 11, 14 contain
some non-{A,C,G,T} characters, e.g. M={A,C}, W={A,T}, K={G,T} etc..
For these experiments,
a read-routine was used that does read these characters
(to maintain position numbering),
but they were treated as ``regular'' characters,
not contributing to AT-richness, say, because
(i) it makes only a little difference unless
there are many occurrences of such characters, and
(ii) it is not quite clear what is the best way
to treat some of them anyway
(e.g. does M count towards AT or not?).
In any case, the matter is orthogonal to the algorithm;
the ``special characters'' can be included in the list of
desired characters, e.g. "ATW", at your choice.
Running time, on a 900MHz Pentium, Java, Linux,
for the largest sequence (chr14) is 7 or 8 seconds
for five passes with different values of f
and this includes reading the data.
- Key:
>=<min_frac>:
[<start>..<end>]
(<length>)
<beginning>...<ending>
-
- chr1
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr01-2002-8-19-Sanger
- # 643292bp CTAAACCTAAACCTAA...
>=1.0: | [459134..459485] | (352) | AAAATAATAA...AATAAATAAA |
>=0.95: | [458479..461841] | (3363) | TTAACATTTT...TTTATTATAT |
>=0.9: | [458757..465678] | (6922) | ATGTTTCTAA...TTTATTTAAT |
>=0.85: | [439465..465974] | (26510) | TTAATTGTAA...TCTTTTGATT |
>=0.8: | [8596..625720] | (617125) | ATTTCTACAC...TATATTAAGA |
>=0.75: | [0..643291] | (643292) | CTAAACCTAA...GGAATAGGGT |
- #--- end ---
-
- chr2
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr02-200208019-TIGR
- # 947102bp AACCCTAAACCCTAAACCCTAAACCCTA...
>=1.0: | [449263..449686] | (424) | TTATTTTAAT...TATATTATTT |
>=0.95: | [447292..450582] | (3291) | TACAAAATAA...AAAAAAAATA |
>=0.9: | [445946..451647] | (5702) | TATATTTTAT...GGAAAAAAAT |
>=0.85: | [807673..840316] | (32644) | ATGAAAATTA...ATTTATCATA |
>=0.8: | [0..947101] | (947102) | AACCCTAAAC...TCAGGGTTCA |
- #--- end ---
-
- chr3
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr03-2002-8-19-Sanger
- # 1060087bp TAAACCCTGAACCCTAAACCCT...
>=1.0: | [594855..595306] | (452) | AAAATAAATT...ATTAATAATT |
>=0.95: | [594084..597349] | (3266) | AAATAAATAA...TTTTGTTAAA |
>=0.9: | [590628..597388] | (6761) | AATACTAAAG...ATATACTTAT |
>=0.85: | [76744..100990] | (24247) | AATATTAAAC...AACTCATAAA |
>=0.8: | [0..1060086] | (1060087) | TAAACCCTGA...TCAGGTTTTA |
- #--- end ---
-
- chr4
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr04-2002-8-19-Sanger
- # 1204112bp AACCCTAAACCCTGA...
>=1.0: | [650061..650262] | (202) | TTTTAATAAT...ATAATTAAAA |
>=0.95: | [648539..651828] | (3290) | TATAAACATT...TTTTATATTT |
>=0.9: | [647731..654304] | (6574) | AACATTTAAA...AATAAAAAAA |
>=0.85: | [639114..665595] | (26482) | CAATTTCTAA...TATATTTTTT |
>=0.8: | [41709..1188699] | (1146991) | TTATTTTTTT...AGTTAAGTTA |
>=0.75: | [0..1204111] | (1204112) | AACCCTAAAC...TTAGGGTTTA |
- #--- end ---
-
- chr5
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr05-2002-8-19-Sanger
- # 1343552bp CTAAACCCTGAACCCT...
>=1.0: | [208403..208597] | (195) | AAAAAAAAAT...AATAAAATAA |
>=0.95: | [454931..458053] | (3123) | CAAATTTTTG...TGGAATATAA |
>=0.9: | [453877..459351] | (5475) | TTTATTTTAT...TATAAGAAAT |
>=0.85: | [1304498..1321991] | (17494) | TTTAAAAATG...TGTTTTAATT |
>=0.8: | [0..1343551] | (1343552) | CTAAACCCTG...GGTTCAGGGT |
- #--- end ---
-
- chr6
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr06-2002-8-19-Sanger
- # 1377956bp CCTAAACCCTGAACCCTAA...
>=1.0: | [39936..40118] | (183) | ATAAAATAAA...AAAATTTTAT |
>=0.95: | [528479..529541] | (1063) | ATTAAAATAA...ATTTTTAGAT |
>=0.9: | [1015753..1019942] | (4190) | TTATTATTAA...TTAGTTTTAT |
>=0.85: | [1012661..1025787] | (13127) | CAAAACCACC...TTTATGGATT |
>=0.8: | [0..1377955] | (1377956) | CCTAAACCCT...AGGGTTTAGG |
- #--- end ---
-
- chr7
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr07-2002-8-19-Sanger
- # 1350452bp AAACCCCAACCCTAAACCCTA...
>=1.0: | [714118..714370] | (253) | TTATTTTTAA...TAATTAATAT |
>=0.95: | [713290..715097] | (1808) | AATATATTTA...ATCTTAATTT |
>=0.9: | [105239..108481] | (3243) | AACATAAATA...ATATTAAAAA |
>=0.85: | [67481..86120] | (18640) | TTCACAATTT...AAAATATATT |
>=0.8: | [0..1350451] | (1350452) | AAACCCCAAC...TAAGACAGTA |
- #--- end ---
-
- chr8
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr08-2002-08-19-Sanger
- # 1323195bp TCAAATATTCCGAAAACCCTAA...
>=1.0: | [344230..344437] | (208) | ATTATTTATA...TTTTTTTTTT |
>=0.95: | [1022302..1023765] | (1464) | TTTTATATGA...ATTTAAAAAA |
>=0.9: | [378658..381469] | (2812) | ATTTTGATAA...TTTTTAGTTT |
>=0.85: | [500245..516452] | (16208) | TATTAATTTA...AATATATATT |
>=0.8: | [0..1323194] | (1323195) | TCAAATATTC...TTAGGGTTGA |
- #--- end ---
-
- chr9
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr09-2002-8-19-Sanger
- # 1541723bp AACCCTGAACCCTAAA...
>=1.0: | [1130660..1130874] | (215) | ATAAATTAAT...TATATTAATA |
>=0.95: | [1241781..1244923] | (3143) | TAATTATAAA...AATACATTTA |
>=0.9: | [323390..329864] | (6475) | TAATTATATG...TTATCATAAT |
>=0.85: | [1373697..1416898] | (43202) | TATATTTTGA...TATTTATTTT |
>=0.8: | [0..1541722] | (1541723) | AACCCTGAAC...GTTTAGGGTT |
#--- end ---
-
- chr10
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr10-2002-8-19-TIGR
- # 1694445bp CTAAACCCTGAACCC...
>=1.0: | [937258..937557] | (300) | TATATAATTT...ATATTATATT |
>=0.95: | [936412..938559] | (2148) | ATTTATTAAT...TAAAACTTAA |
>=0.9: | [934530..938484] | (3955) | AATAAAATAT...TTTTTATGAT |
>=0.85: | [87704..106686] | (18983) | AATATGAACA...ACCATATAAA |
>=0.8: | [0..1694444] | (1694445) | CTAAACCCTG...TTTTAGGGTT |
#--- end ---
-
- chr11
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr11-2002-8-19-TIGR
- # 2035250bp AACCCTAAACCC...
>=1.0: | [829543..830061] | (519) | TATTAAATAT...TAAATATTTA |
>=0.95: | [829012..831558] | (2547) | AATGTAAATA...TACTTATATT |
>=0.9: | [827029..831902] | (4874) | ATAAACAAAT...AAATATATAT |
>=0.85: | [114132..140968] | (26837) | TATTTGATAA...TATTTTTTAA |
>=0.8: | [0..2035249] | (2035250) | AACCCTAAAC...TTGGGTTTAG |
- #--- end ---
-
- chr12
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr12-2002-8-19-Stanford
- # 2271477bp CTGAACCCTAAACCCTA...
>=1.0: | [1282796..1283108] | (313) | AAATAAATAA...AAATAATATT
| >=0.95: | [1281883..1285109] | (3227) | AAAAAAAACG...ATATATATTT
| >=0.9: | [1279598..1285959] | (6362) | ATTTTTTTTA...ATTTTAAATA
| >=0.85: | [2156101..2180543] | (24443) | TAATTAGAAG...TATATTGTAT
| >=0.8: | [0..2271476] | (2271477) | CTGAACCCTA...AGAGTAAGTA
|
- #--- end ---
-
- chr13_1 (NB)
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr13_1-2002-8-19-Sanger
- # 2729159bp AACCCTAAACCTAAACCCTAA...
>=1.0: | [918090..918461] | (372) | AAATATTATA...TAAATATATA |
>=0.95: | [1766404..1767423] | (1020) | ATATTATATT...TTATTATTAT |
>=0.9: | [1817357..1822403] | (5047) | AATTAATATA...ATAATTTTTA |
>=0.85: | [1809128..1827314] | (18187) | ATTTTTACTT...TACCGTAATT |
>=0.8: | [0..2729158] | (2729159) | AACCCTAAAC...GTTTAGGGTT |
- #--- end --
-
- chr13_2 (NB)
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr13_2-2002-8-19-Sanger
- # 18168bp TATTTTAGTTTATATTTATTACA...
>=1.0: | [10716..10816] | (101) | TTAATTATAA...ATTAAAATAT |
>=0.95: | [10604..10816] | (213) | AAAAGTATTT...ATTAAAATAT |
>=0.9: | [9293..9869] | (577) | AATAAAAAAA...TATATAAATT |
>=0.85: | [9070..12951] | (3882) | TTACTATATA...AATAAGATAT |
>=0.8: | [8084..18162] | (10079) | AATGCCCCCA...TAAAAAAATA |
>=0.75: | [0..18167] | (18168) | TATTTTAGTT...AAATAGCAAC |
- #--- end ---
-
- chr14
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = chr14-2002-8-19-TIGR
- # 3291006bp CTGAACCCTAAACCCTAAAC...
>=1.0: | [1072223..1072590] | (368) | TATATTAATA...AAATATAATT |
>=0.95: | [1071769..1073716] | (1948) | TATAAAATAT...TTGTTTTATT |
>=0.9: | [2716237..2720216] | (3980) | TTTTTAATAA...CTCACAGAAT |
>=0.85: | [2792750..2814996] | (22247) | ATCTAAAATG...TTTTTTTTTT |
>=0.8: | [0..3291005] | (3291006) | CTGAACCCTA...GTTTAGGGTT |
- #--- end ---
-
17/10/2002: Examples from Malaria Information Resource (CD) Release 5
- Key: >=<min_frac>: [<start>..<end>]
(<length>) <beginning>...<ending>
-
- Example chr5: /mnt/cdrom1/genome/sanger/chr5/z010701f.txt
of 26 Aug 2001
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = c5-z010701f.txt
- # 1343552bp CTAAACCCTGAACCCTA...
>=1.0: | [208403..208597] | (195) | AAAAAAAAAT...AATAAAATAA |
>=0.95: | [454931..458053] | (3123) | CAAATTTTTG...TGGAATATAA |
>=0.9: | [453877..459351] | (5475) | TTTATTTTAT...TATAAGAAAT |
>=0.85: | [1304498..1321991] | (17494) | TTTAAAAATG...TGTTTTAATT |
>=0.8: | [0..1343551] | (1343552) | CTAAACCCTG...GGTTCAGGGT |
- #--- end ---
-
- Example chr10: /mnt/cdrom1/genome/tigr/chr10/pfa1_10.txt
of 26 Aug 2001
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = pfa1_10.txt
- # 1694401bp CTAAACCCTGAACC...
>=1.0: | [937258..937557] | (300) | TATATAATTT...ATATTATATT |
>=0.95: | [936412..938559] | (2148) | ATTTATTAAT...TAAAACTTAA |
>=0.9: | [934530..938484] | (3955) | AATAAAATAT...TTTTTATGAT |
>=0.85: | [87704..106686] | (18983) | AATATGAACA...ACCATATAAA |
>=0.8: | [0..1694400] | (1694401) | CTAAACCCTG...TTTTAGGGTT |
- #--- end ---
-
- Example chr11: /mnt/cdrom1/genome/tigr/chr11/pfa1_11.txt
of 26 Aug 2001
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[1] = -1
- # argv[2] = pfa1_11.txt
- # 2035244bp AACCCTAAACCCTGAACCCTGAACCCTGAA...
>=1.0: | [829539..830057] | (519) | TATTAAATAT...TAAATATTTA |
>=0.95: | [829008..831554] | (2547) | AATGTAAATA...TACTTATATT |
>=0.9: | [827025..831898] | (4874) | ATAAACAAAT...AAATATATAT |
>=0.85: | [114130..140966] | (26837) | TATTTGATAA...TATTTTTTAA |
>=0.8: | [0..2035243] | (2035244) | AACCCTAAAC...TTGGGTTTAG |
- #--- end ---
-
- Example chr12: /mnt/cdrom1/genome/stanford/chr12/z010524.txt
(>Chr12Contig01.010524 2270790 bp) of 26 Aug 2001
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = c12-z010524f.txt
- # 2270790bp CTAAACCCTGAACCCTAAAC...
>=1.0: | [1281901..1282213] | (313) | AAATAAATAA...AAATAATATT |
>=0.95: | [1280988..1284214] | (3227) | AAAAAAAACG...ATATATATTT |
>=0.9: | [1278703..1285042] | (6340) | ATTTTTTTTA...TTAGAAGAAA |
>=0.85: | [2155396..2179822] | (24427) | TAATTAGAAG...GAAAATGAAT |
>=0.8: | [0..2270789] | (2270790) | CTAAACCCTG...GTAAGGAATG |
- #--- end ---
-
- Example chr14: /mnt/cdrom1/genome/tigr/chr14/pfa1_14.txt
of 26 Aug 2001
- #--- Biased.java, L.A., CSSE, Monash, .au ---
- # argv[0] = AT
- # argv[1] = -1
- # argv[2] = pf-chr14.txt
- # 3290890bp CTGAACCCTAAACCCTAAACCCTAAACCCTAA...
>=1.0 : | [1072219..1072586] | (368) | TATATTAATA...AAATATAATT |
>=0.95: | [1071765..1073712] | (1948) | TATAAAATAT...TTGTTTTATT |
>=0.9 : | [2716123..2720102] | (3980) | TTTTTAATAA...CTCACAGAAT |
>=0.85: | [2792661..2814887] | (22227) | ATCTTGTAAA...TTTCCCCATA |
>=0.8 : | [0..3290889] | (3290890) | CTGAACCCTA...GTTTAGGGTT |
- #--- end ---
-
- Time
(chr14, bayes 900mHz Intel + Linux):
real 0m7.799s;
user 0m7.250s;
sys 0m0.340s.
Note that the time to read the input file was
a significant part of the elapsed time.
|
|