Modelling Is More Versatile Than Shuffling 

L. Allison, D. Powell and T. I. Dix,

...indels... ...become...   ...indels... ...become... (i) (ii) Figure 1: High and Low Information Matches. 
In the simplest version of the shuffling technique, two sequences, S1 and S2, are aligned and get a certain cost C (or alternatively a score). S1 and S2 are then permuted at random to give S1' and S2' respectively. S1' and S2' are aligned and get a cost C'. The original alignment of S1 and S2 is only considered to be `acceptable' if C is better than C' by some safety margin. The underlying assumption must be that the sequences are generated by zeroorder Markov models and that S1 and S1' are typical sequences from a population modelled by the same zeroorder model, similarly S2 and S2'. Permutation destroys any higher order statistical structure in S1 and S2. Altschul and Erickson [Alt85] do show how to randomize a sequence while maintaining dinucleotide and codon usage frequencies, but it is hard to see how to randomize a sequence while maintaining the statistical properties of an arbitrary population model. In the English text example, one would at least want to preserve the appearance of plausible English words and names.
Wootton [Woo97] proposes `masking out' regions of low compositional complexity before performing alignment. Compositional complexity is defined as the entropy of the characters within a sliding window, i.e. a moving average, under a multistate distribution [Bou69]. Masking out is drastic because low information content regions do provide nonzero amounts of information which should not be thrown away. In fairness, it was proposed as a preprocessing step to searches against large databases where speed is of the essence. In the English text example it amounts to ignoring `become' as an example of a `stop word', as is often done in information retrieval [vRi79,p18].
In what follows, we examine an alternative to shuffling, and to masking out. This alternative calculates probabilities for an alignment of two sequences and for the null hypothesis, i.e. that the sequences are not related by descent. This has previously been achieved for random sequences and recently extended to nonrandom, that is medium to lowinformation content, sequences. Here we include models for populations of sequences that S1 and S2 might be drawn from. The resulting algorithms produce alignments that weight unusual `features' of sequences heavily, provide a significance test for alignments, and can identify which populations sequences probably came from.
The generic dynamic programming algorithm (DPA), shown in Figure 2, can be made to calculate the editdistance [Lev65][Sel74], longest common subsequence [Hir75], most probable alignment, or sum of probability over all alignments, for sequences S1 and S2 by suitable choices for z, c, g and f; later we show how to modify it to take account of the information content of S1 and S2.
M[0,0] = z M[i,0] = f(M[i1,0], c(S1(S1[i], null)), i=1..S1 M[0,j] = f(M[0,j1], c(S2[null, S2[j])), j=1..S2 M[i,j] = g(f(M[i1,j1], c(S1[i],S2[j])), match/change f(M[i1,j ], c(S1[i], null)), delete f(M[i ,j1], c(null, S2[j])) ) insert i=1..S1 and j=1..S2 Figure 2: Dynamic Programming Algorithm (DPA). 
Linear and piecewise linear gap costs [Got82] can be included by elaborating the `state' information stored in M and by changes to f and g. Since there are a great many variations of this kind, we study the simplest case of a 1state system as the canonical example. The new method of incorporating the information content of the sequences, described in this paper, can clearly be applied to more complex multistate edit models.
If probabilities are assigned to the basic mutations, a probability can be assigned to two sequences under the hypothesis that they are related by such mutations [All92], see Figure 3. Mutation probabilities can be estimated by an expectation maximization [Bau67][Bau70] process when they are not known in advance.
z = 1 g = max f = multiply c(x,null) = P(delete).P(x) c(null,y) = P(insert).P(y) c(x,x) = P(match) .P(x) c(x,y) = P(change).P(x,y), x!=y Figure 3: For DPA, Most Probable Alignment. 
The null hypothesis in alignment, is that two given sequences are not related by descent, i.e. that they are independent members of the population, here the uniform population. In that case, the probability of a DNA sequence, S, of a given length is 4**(S). The difference in message lengths between an alignment and the null hypothesis is their negative log odds ratio. No alignment with a message length greater than the null hypothesis is acceptable. These notions are extended to nonrandom sequences below.
A population of sequence, F, is called `nonrandom' if there is a compression algorithm, m, such that m^{1}(m(S))=S for all S in F, and the length of m(S), measured in bits, is less than the length of S, on average for S in F. Many compression algorithms now consist of a `predictor', which yields P(S[i]S[1..i1]), and an encoder. The decompression algorithm uses an identical predictor with a decoder. The encoding/decoding problem has essentially been solved by arithmetic coding [Lan84] which can get arbitrarily close to the theoretical limit of log(P(I)) bits for an item I of probability P(I). Much data compression research is now concentrated on statistical modelling and on prediction algorithms for various sources. The best model is the one that leads to the greatest data compression. For example, loworder Markov models compress "typical" DNA sequences to about 1.9 bits per character but more sophisticated models (Grumbach and Tahi 1994, Loewenstern and Yianilos 1997, Allison et al 1998) [Gru94][Loe97][All98a] do much better.
In order to find a most probable alignment, the DPA of Figures 2 and 3 needs estimates for the probabilities of characters involved in matches, changes, inserts and deletes. When the given sequences are assumed to come from particular models, the models provide predictions for P(S1[i]S1[1..i1]) and for P(S2[j]S2[1..j1]) and these predictions should be used in some efficient and effective way. The formulae given in Figure 4 define the simplest method of achieving these aims [All99]. Note in particular that `match' and `change' treat S1 and S2 symmetrically. Renormalisation takes place in a change because two necessarily different character values are involved. Note that there are almost no constraints on how P(S1[i]S1[1..i1]), say, is estimated, i.e. on the form of models of populations. The models' predictions for S1[i] and S2[j] can either be computed `on the fly' in the DPA or can be precomputed and cached, whichever is most convenient and efficient.
P(delete).P(xS1[1..i1]), where S1[j]=x P(insert).P(yS2[1..j1]), where S2[j]=y P(match).(P(xS1[1..i1])+P(xS2[1..j1]))/2, where S1[i]=S2[j]=x P(change).(P(xS1[1..i1]) * P(yS2[1..j1])) / (1/(1P(yS1[1..i1]))+1/(1P(xS2[1..j1))) / 2, where S1[i]=x!=y=S2[j] Figure 4: DPA Probabilities for NonRandom Sequences. 
Unrelated pairs and related pairs of artificial DNA sequences were generated from various populations. 100 pairs of unrelated sequences of length 200 were generated from each of the uniform, MMf and MMg models as in Figure 5. These models were chosen purely on the grounds of simplicity and the method is not limited to them. MMf is an (AT)rich zeroorder model and MMg is an ATrich firstorder model. For a pair of related sequences, one sequence S1 of length 200 was first generated from the relevant model  uniform, MMf or MMg. Sequence S2 was formed by copying S1 with a certain probability of mutation. The probabilities of change, insertion and deletion were kept in the ratios 2:1:1 for simplicity. When a character was inserted at S2[j], it was chosen from the distribution implied by the model given S2[1..j1]. When a character was changed it was chosen from the distribution renormalised after the original character value was removed. S2 is thus from a similar, but not necessarily identical, model to S1. The probability of mutation was varied from a low 10% to a high 50%. 100 pairs of sequences were generated for each parameter setting.
For each set of pairs of sequences, each pair in the set was aligned with the modified DPA assuming (a) the uniform model for S1 and S2, (b) unknown zeroorder models for S1 and S2, and (c) unknown firstorder models for S1 and S2. Note that the parameters of the models must be inferred from the data in (b) and (c) and that these parameters must be stated, to optimum accuracy, as part of their respective hypotheses [Bou69]. The information content of the null hypothesis was also calculated under each of these three population assumptions, giving six hypotheses in total. The means and standard deviations of the hypotheses' message lengths, in bits per character, were calculated over a number of trials. The number of times that each hypothesis had the shortest message length, i.e. highest posterior probability, were also counted.
As a comparison, each sequence in a pair was shuffled and the shuffled sequences were aligned under the uniform model. The standard deviation, SD, of the message lengths of the alignments of the shuffled sequences was also calculated. The number of times that an alignment of two original sequences bettered the alignment of their shuffled selves (under the uniform model) by margins of one, two and three times SD, i.e. was considered acceptable by these safety margins, were also recorded. The use of the quantity bits per character allowed for slight differences in sequence lengths. The results of the experiments are summarised in table 1.
A:9/20 C:1/20 G:1/20 T:9/20 Figure 5(a): ZeroOrder Model MMf. 
A C G T .> S[i] A  1/12 1/12 1/12 9/12 S[i1] C  9/20 1/20 1/20 9/20 G  9/20 1/20 1/20 9/20 T  9/12 1/12 1/12 1/12 Figure 5(b): FirstOrder Model MMg. 
MMf is an (AT)rich zeroorder model  Figure 5(a). The effective alphabet is less than four, typical sequences being compressible to 1.5 to 1.6 bits per character. Sequences are successfully identified as coming from a zeroorder model. Almost all related pairs are identified as being related up to a mutation probability of 0.2. A significant number of optimal alignments are unacceptable at a mutation probability of 0.3 and almost all are unacceptable beyond this level. This is consistent with Deken's bounds of 0.61 to 0.78 for the LCS of unrelated sequences from a ternary alphabet. In contrast, a margin of three standard deviations is needed for shuffling to return almost no false positives on unrelated sequences. With that margin however, over one quarter of alignments are still considered acceptable at a mutation probability of 0.5, which is surely unreasonable given the alphabet's small effective size. Note that MMf is a zeroorder model and should suit this instance of shuffling well.
The message length criterion correctly identifies the population class of data from MMg  Figure 5(b). The transition from optimal alignment to (firstorder) null hypothesis being the more probable occurs at a mutation probability of about 0.3. In contrast, shuffling often considers optimal alignments of even unrelated sequences to be acceptable at a safety margin as large as three standard deviations. Of course its assumption here, that the data comes from a zeroorder population, is incorrect and this confirms that shuffling is invalid in general when it does not preserve the statistics of the population. This is not surprising but it does raise the question of how to use shuffling when the nature of the population is not simple and known in advance? The message length test can, on the other hand, select between a number of population models, including very general ones, while also giving a significance test for alignments.
With reference to the tests above, note that the uniform model is a special case of a zeroorder model which is in turn a special case of a firstorder model. A zeroorder model over DNA has three parameters and a firstorder model over DNA has 12. The number of parameters rises with model complexity. Overfitting would be a danger if model costs were not included in the hypotheses. Data from a uniform population will contain some chance correlations that a zero or firstorder model will detect but, when the costs of the increasingly complex models are included in their message lengths, these correlations are shown to be chance and not significant. When the sequence lengths are short, say 50 characters, MMg sequences are sometimes classified as zeroorder data which is a reasonable outcome when only given two short sequences.
It is natural to ask what is the best model of sequences to incorporate into alignment algorithms. The message length criterion can identify the best out of a given collection of alternatives but it cannot invent the best model out of thin air. To answer that question would in general require a search over all models and this is infeasible, although various subsets can be searched systematically. Fortunately there is usually some prior biological knowledge about the possible nature of a population of sequences and one or more models can be based on this. Some general guidelines can also be given: A model should be simple; in particular the number of parameters to be fitted to the data should be small compared to the length of the sequences. If sequences are expected to be nonhomogeneous, e.g. mixtures of uniform and repetitive sections say, then an `adaptive' model that uses recent observations to predict the next character can perform well. Research on the compression of biological sequences (Grumbach and Tahi 1994, Loewenstern and Yianilos 1997, Allison et al 1998) [Gru94][Loe97][All98a] can also suggest models to use.
We have argued that statistical models of populations of sequences should be used in alignment algorithms. Doing so can change the rank order of alignments and, in particular, change which are considered to be optimal. It gives a simple and efficient significance test for alignments of high or low information content sequences. A number of alternative models can be compared when the true nature of the population is not known. Tests show a good ability to infer this information from given data. The method is versatile and can be used with any model of sequences that makes predictions, i.e. yields probability estimates P(S[i]=xS[1..i1]) where x ranges over the alphabet. The question of what is a good model to use must depend on the sequences being analysed. Fortunately biologists often have prior knowledge which can be used to design a model. It is not crucial that this is correct in every detail provided that some simple and plausible alternatives are also included.
In contrast, the well known shuffling technique cannot change the rank order of alignments; it only changes the criterion for acceptability after the alignment process. It can also give incorrect results when the assumptions of its implicit model are invalid and this situation can be hard to identify.
The `masking out' technique discards useful information that is present in even low information subsequences; modelling during alignment can use this information.
Table 1: Alignment and Null, x 3 Models v. Shuffling
modelling  

2 x uniform  2 x order 0  2 x order 1  shuffling  
align  null  align  null  align  null  1SD  2SD  3SD  
uniform data:  
10%  1.40+/.06  2.0  1.44+/0.6  2.04+/.01  1.51+/.06  2.09+/.01  100  100  100 
[100]  
20%  1.66+/.06  2.0  1.70+/.06  2.04+/.01  1.76+/.06  2.09+/.01  100  100  100 
[100]  
30%  1.84+/.06  2.0  1.88+/.06  2.04+/.01  1.94+/.06  2.09+/.01  100  100  100 
[100]  
40%  1.99+/.05  2.0  2.03+/.05  2.04+/.01  2.09+/.05  2.09+/.01  99  93  76 
[61]  [39]  
50%  2.08+/.04  2.0  2.12+/.04  2.04+/.01  2.17+/.04  2.09+/.01  73  43  23 
[4]  [96]  
unrel  2.13+/.04  2.0  2.17+/.04  2.04+/.01  2.23+/.04  2.09+/.01  22  5  4 
[100]  
MMf data:  
10%  1.39+/.06  2.0  1.16+/.06  1.53+/.06  1.23+/.06  1.60+/.06  100  100  100 
[100]  
20%  1.62+/.06  2.0  1.36+/.07  1.53+/.07  1.43+/.07  1.60+/.07  100  100  100 
[99]  [1]  
30%  1.78+/.05  2.0  1.49+/.06  1.53+/.06  1.57+/.06  1.60+/.06  100  100  100 
[74]  [26]  
40%  1.90+/.04  2.0  1.60+/.06  1.55+/.06  1.67+/.06  1.61+/.05  99  92  75 
[7]  [93]  
50%  1.97+/.04  2.0  1.66+/.07  1.55+/.07  1.73+/.07  1.62+/.06  81  51  27 
[100]  
unrel  2.02+/.03  2.0  1.63+/.06  1.51+/.04  1.70+/.05  1.58+/.04  24  15  1 
[100]  
MMg data:  
10%  1.39+/.06  2.0  1.25+/.06  1.68+/.05  1.14+/.07  1.43+/.08  100  100  100 
[100]  
20%  1.64+/.06  2.0  1.50+/.06  1.71+/.05  1.39+/.07  1.51+/.08  100  100  100 
[100]  
30%  1.80+/.06  2.0  1.65+/.06  1.71+/.05  1.54+/.07  1.53+/.07  100  100  100 
[40]  [60]  
40%  1.92+/.04  2.0  1.78+/.06  1.73+/.04  1.67+/.06  1.57+/.06  100  97  95 
[3]  [97]  
50%  1.99+/.04  2.0  1.84+/.05  1.73+/.05  1.73+/.06  1.57+/.06  92  77  47 
[100]  
unrel  1.93+/.04  2.0  1.80+/.05  1.67+/.04  1.58+/.07  1.36+/.06  99  93  74 
[100] 
Key:  unrel=unrelated pairs  units: bits/character 
10% etc: related Pr(mutation)  [##] times best, where nonzero  
shuffling: # of "acceptable" alignments by margins of 1, 2 or 3 SD's  
length of S1: 200. 100 trials per setting 
www: 
