When is a String Like a String? 

L. Allison, C. S. Wallace and C. N. Yee.
Department of Computer Science, Monash University, Victoria, Australia 3168.
Supported by Australian Research Council grant A48830856.
Version: 15 Dec 1989.
Presented at the International Symposium on Artificial Intelligence and Mathematics (AIM), Ft. Lauderdale, Florida.
Abstract. The question of whether or not two strings are related and, if so, of how they are related, and the problem of finding a good model of string relation, are treated as inductive inference problems. A method of inductive inference known as minimum message length encoding is applied to them. It allows the posterior odds-ratio of two theories or hypotheses to be computed. New string comparison algorithms and methods are derived and existing algorithms are placed in a unifying framework. The connection between string comparison algorithms and models of relation is made explicit. The posterior probability of two strings' being related can be calculated, giving a test of significance. The methods are relevant to DNA and to other biological macromolecules.

Keywords: string, inductive inference, minimum message length, pattern matching, DNA, macromolecule.

1. Introduction

The inference of similarities, pattern and structure is a hallmark of intelligence. We apply a method of inductive inference known as minimum message length (MML) encoding to problems over strings of characters. The original motivation for this work[1] comes from the analysis of DNA strings. A DNA sequence is a string over the alphabet of four bases {A,C,G,T}. The remarkable success of molecular biologists in sequencing macromolecules, which is likely to accelerate with the human genome project, has created ample applications. The paper is framed in terms of DNA strings but the techniques apply to strings over other alphabets, such as proteins over the alphabet of twenty amino acids, and ASCII strings. Strings and languages are central to computer science and related problems occur in file comparison, for example. It is assumed that any one string is random in the sense of Kolmogorov complexity[5,11,14].
This is not quite the case for biological macromolecules but it is a good approximation to the truth and it is certainly very difficult to encode a typical DNA string in much less than two bits per character. This assumption is not essential to the MML method but makes its application easier. Even though an individual string is random, it may be clear that two strings are similar in some way. This raises many questions, for example:
We give algorithms to solve the first three problems for simple but useful models of relation. We discuss how the fourth problem can be approached. In addition, existing string comparison algorithms are placed in a unifying framework. The relation models considered are relevant to simple models of DNA mutation, but have more general application. String comparison is used in more complex problems; notably, several algorithms for the inference of a phylogenetic tree for a set of strings rely on having a matrix of pairwise distances[3].

To illustrate the MML method in string comparison, consider a transmitter who must send two strings to a receiver. The two parties may previously have agreed on any theories, algorithms and conventions. If the two strings are unrelated, the transmitter can do no better than transmit one string and then the other. If the strings are related there must be some relationship between the characters of the two strings. If the transmitter can find a "good" theory or hypothesis about such a relationship then it may be possible to encode the two strings succinctly. However, for the message to be intelligible it must include details of the theory that have not been agreed on, and this tends to increase the total message length. A very detailed theory may allow a very short encoding of the strings proper but itself adds a great deal to the message length. MML encoding defines an optimal degree of detail in this trade-off. The transmitter-receiver paradigm keeps us honest in that anything not agreed to a priori must be included in the message.

If accurate prior probabilities for strings and relations are known, an optimal code can be designed for them. The message length is minus log_{2} of the probability of the event described. Often such probabilities are not known. However, a model having one or more parameters may be agreed on. A message must include the parameter values; the best parameter values give the shortest message.
Two hypotheses may be compared on the basis of their message lengths. The more plausible hypothesis leads to the shorter message and the difference in message lengths is minus log_{2} of their posterior odds-ratio. If a good null-theory can be identified, this leads to hypothesis testing: no hypothesis having a message length longer than the null-theory is acceptable. The natural null-theory here is that two strings are unrelated, requiring about two bits per base for its message.

There are a number of practical and theoretical advantages to MML encoding as a method of inference. It focuses attention on the language being encoded, on what it can express and on the probabilities of messages. If accurate prior probabilities are not known, nevertheless all the techniques and heuristics of data compression and of coding theory[10,12,25] can be used to get a good code (theory). MML encoding is safe in that the use of a suboptimal encoding for a proposed theory cannot make the theory appear more acceptable than it should be. Where parameters over continuous domains are involved, MML encoding defines an optimal degree of precision for their estimates.

Central to the string problems is the definition of relatedness. There can only be one good universal definition and it must be related to the Kolmogorov complexity of a program to produce the two strings. This is not computable in general but fortunately simpler and yet plausible models exist. These are for the most part based on finite-state machines or other simple machines. It turns out that commonly used string comparison algorithms correspond to such models and we have been able to formalise this practice in the MML framework. The MML method enables two theories or hypotheses to be compared objectively. It does not say how to invent such a theory, although algorithms and heuristics may be apparent in some cases. In general, problem knowledge, such as biochemical knowledge, can and should be used to formulate new theories.
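The comparison of two hypotheses by their message lengths can be sketched as follows; the function names and the numeric lengths are illustrative only, not part of the paper's notation.

```python
def posterior_log2_odds(len_h1, len_h2):
    """log2 of the posterior odds-ratio of hypothesis 1 against
    hypothesis 2, given their message lengths in bits: a hypothesis
    whose message is one bit shorter is twice as plausible."""
    return len_h2 - len_h1

def acceptable(len_hypothesis, len_null):
    """No hypothesis with a message longer than the null-theory's
    is acceptable."""
    return len_hypothesis <= len_null

# Illustrative lengths in bits: a theory 3 bits shorter than the
# null-theory is favoured 8:1.
assert posterior_log2_odds(100.0, 103.0) == 3.0
assert acceptable(100.0, 103.0)
assert not acceptable(105.0, 103.0)
```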
Note that the algorithms described here do not actually encode any messages although we often write as though they do. Rather, they calculate only the lengths of messages.

Solomonoff[20] proposed MML encoding as a basis for inductive inference. Wallace and Boulton[22] produced a practical classification algorithm using the technique. A good introduction to the MML technique was given by Georgeff and Wallace[8]. Wallace and Freeman[23] described the statistical foundations of MML encoding. It has had many applications; for example, Chong et al[6] have employed it in classifying images of clouds. In string comparison, Reichert et al[7,15,26] considered the brief encoding of pairs of strings for a single model of relation. Bishop et al[2] gave a maximum-likelihood method to infer the time since two related strings diverged which is very similar to our method for a one-state model. Sankoff and Kruskal[17], Waterman[24] and Bishop and Rawlings[3] give surveys of string comparison methods and applications.

In what follows, many messages must include one or more integers such as string lengths. In order to encode them, a prior probability distribution for integers must be assumed. When there is no information to the contrary, we use Rissanen's[16] universal log_{2}^{*} prior as the default. Some alternatives are discussed in [1].

2. One-State Models

There are at least two general models of relation between strings based on simple translation machines, which may be called the generative and mutation models. In the generative model, a simple machine reads an input string of generation instructions in a special alphabet and outputs the two given strings in parallel. In the mutation model, a different simple machine reads two inputs in parallel. One input is one of the given strings. The other input is a string of mutation instructions in a special alphabet. The machine outputs the second given string.
For the restricted relation models we consider, it can be shown that to every generative model there corresponds a mutation model and vice versa. In what follows we adopt the generative form, but our use of the words `insert', `delete' and `change' for certain generative instructions is common practice inherited from the mutation form. An indel is either an insert or a delete.
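The generative form can be made concrete with a small sketch; the tuple encoding of instructions below is hypothetical, chosen only for illustration.

```python
def generate(instructions):
    """A one-state generative machine: it reads a sequence of
    instructions and emits the two strings A and B in parallel.
      ('match', x)     append x to both A and B
      ('change', x, y) append x to A and a different character y to B
      ('insert', y)    append y to B only
      ('delete', x)    append x to A only
    """
    A, B = [], []
    for ins in instructions:
        if ins[0] == 'match':
            A.append(ins[1]); B.append(ins[1])
        elif ins[0] == 'change':
            A.append(ins[1]); B.append(ins[2])
        elif ins[0] == 'insert':
            B.append(ins[1])
        elif ins[0] == 'delete':
            A.append(ins[1])
    return ''.join(A), ''.join(B)

# This instruction sequence is one specific alignment of ACGT and AGT:
assert generate([('match', 'A'), ('change', 'C', 'G'),
                 ('delete', 'G'), ('match', 'T')]) == ('ACGT', 'AGT')
```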
A sequence of generative instructions represents a specific relation or alignment between two strings. It is common practice to assign weights or costs to each instruction. The cost of an instruction sequence is the sum of the costs of its instructions. Given strings A and B, an instruction sequence that generates A and B and achieves the minimum possible cost gives an optimal alignment. The minimum cost defines a function D(A,B) on strings. Writing D[i,j] for D(A[1..i],B[1..j]), D can be calculated by the well known dynamic programming algorithm (DPA). Here for simplicity only, we assume that all matches have an equal cost, all changes have an equal cost and all indels have an equal cost.
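The DPA just described can be sketched as follows, under the simplifying assumption above that each kind of instruction has a single cost:

```python
def dpa(A, B, match=0.0, change=1.0, indel=1.0):
    """The basic dynamic programming algorithm: D[i][j] is the minimum
    cost of generating A[:i] and B[:j].  With the default costs (match 0,
    change 1, indel 1) D(A, B) is Sellers' metric; with indel = 0.5 an
    optimal alignment contains a longest common subsequence."""
    D = [[0.0] * (len(B) + 1) for _ in range(len(A) + 1)]
    for i in range(1, len(A) + 1):
        D[i][0] = D[i-1][0] + indel          # boundary: all deletes
    for j in range(1, len(B) + 1):
        D[0][j] = D[0][j-1] + indel          # boundary: all inserts
    for i in range(1, len(A) + 1):
        for j in range(1, len(B) + 1):
            diag = match if A[i-1] == B[j-1] else change
            D[i][j] = min(D[i-1][j-1] + diag,    # match or change
                          D[i-1][j] + indel,     # delete A[i-1]
                          D[i][j-1] + indel)     # insert B[j-1]
    return D[len(A)][len(B)]

assert dpa("ACGT", "AGT") == 1.0    # one delete under Sellers' costs
```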
The choice of minimum made in the general step of the DPA defines a path from D[0,0] to D[A,B] which gives an optimal alignment. One choice of costs for the DPA is to cost matches at zero and changes and indels at one. In this case D(A,B) is a metric known as Sellers'[18] metric. If matches cost zero, changes cost one and indels cost a half, then an optimal alignment contains a longest common subsequence (LCS). Common practice in the choice of costs is ad hoc and there is no objective way to judge what are good costs or what particular costs imply about the relation of strings. Here, we relate costs to probabilities of generative instructions. If something is known of these probabilities, this gives an objective way to choose costs. If probabilities are not known, it allows them to be inferred from given strings.

To encode an alignment of A and B, we transmit a string of match, change, insert and delete instructions. Note that the number of instructions must be included in the message. The probabilities of the instructions are required to design an optimal code. Here we assume a simple one-state machine.
Almost invariably we want p(i)=p(d). The length of (a code word for) an instruction includes a contribution for the characters of the strings. The length of a match or indel is minus log_{2} of the probability of the instruction plus two bits for the character involved. The length of a change is minus log_{2} of the probability of a change plus log_{2}(12) bits for the two differing characters. We use these instruction lengths as costs in the DPA. Given fixed probabilities and two strings, two alignments can be compared on the basis of their message lengths. The more plausible alignment has the shorter message length. The difference in message lengths gives log_{2} of the posterior odds-ratio of the alignments.

Usually, probabilities of instructions are not known in advance and must then be included in the encoding of an alignment. One method is to use an adaptive code based on running counts. Alternatively, a header to the message states the probabilities used to an appropriate accuracy. The two methods give very similar message lengths[4]. The optimality of an alignment depends on the probabilities used. When these are not known in advance, a search is necessary to find good values. Reasonable initial values are assumed and an alignment, optimal for these probabilities, is obtained. Probabilities for the next step are obtained from instruction frequencies in the alignment. The process converges in a few steps. A small number of initial values must be tried because of local minima.

Note that an arbitrary table of costs for generative instructions implies certain probabilities. These can be discovered by normalisation. The optimal alignments produced by the DPA are invariant if all costs are multiplied by a constant k>0 and if a constant j is added to the indel costs and 2j is added to other costs. Starting from some table of costs, e.g.
for Sellers' metric, one can choose j and k to give modified costs which yield the same optimal alignments and which can be interpreted as negative base-2 logarithms of probabilities.
The requirement that the probabilities sum to one gives an expression for k in terms of j. One degree of freedom remains, giving a set of probabilities under which the optimal alignments produced are invariant. Exercising this degree of freedom varies the probabilities of alignments but does not change their rank-order. Normalisation enables the probabilities implied by Sellers' metric, for example, to be calculated.
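The normalisation can be sketched numerically. The code below assumes a four-letter alphabet (two bits per character, log_{2}(12) bits for a changed pair, as in the text) and solves for k by bisection given a chosen j; the function name and the use of bisection are conveniences of this sketch, not part of the paper.

```python
def sellers_implied_probs(j):
    """Probabilities implied by Sellers' metric (match 0, change 1,
    indel 1) on a four-letter alphabet, for a chosen j > 1.

    The optimal alignments are invariant if every cost is multiplied
    by k > 0, with j added to each indel cost and 2j to each match or
    change cost.  Reading the modified costs as code-word lengths:
        match:  2j     = -log2 p(match)  + 2        (2 bits per base)
        change: k + 2j = -log2 p(change) + log2(12) (12 differing pairs)
        indel:  k + j  = -log2 p(indel)  + 2
    Requiring p(match) + p(change) + 2*p(indel) = 1 fixes k for the
    chosen j; one degree of freedom (j itself) remains.
    """
    p_match = 2.0 ** (2 - 2 * j)            # needs j > 1 so p_match < 1

    def total(k):                           # decreasing in k
        return p_match + 12 * 2.0 ** -(k + 2 * j) + 8 * 2.0 ** -(k + j)

    lo, hi = -50.0, 50.0                    # bisect total(k) = 1
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    k = (lo + hi) / 2.0
    p_change = 12 * 2.0 ** -(k + 2 * j)
    p_indel = 4 * 2.0 ** -(k + j)
    return p_match, p_change, p_indel, p_indel

# e.g. j = 1.5 gives p(match) = 0.5; the remaining probability is
# split between the change and the two indel instructions:
probs = sellers_implied_probs(1.5)
assert abs(sum(probs) - 1.0) < 1e-9
```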
Any similar metric, such as that corresponding to the LCS problem, can be normalised in this way.

For given costs, optimal alignments are not unique in general. All optimal alignments and indeed all non-optimal alignments contribute to the probability of A and B being related in some way. This has been recognised by Bishop et al[2]. Even for two identical strings, e.g. ACGT:ACGT, there is a small contribution from suboptimal alignments, and the shorter the strings, the less certain we can be that they are related! In computing an MML alignment, D[i,j] is minus the log_{2} of the probability that A[1..i] and B[1..j] arose through being related in one most probable way. The `min' in the general step of the DPA corresponds to choosing the maximum probability. If probabilities are instead added, for they correspond to exclusive events, then D[i,j] is minus the log_{2} of the probability that A[1..i] and B[1..j], and ultimately A and B, arose through being generatively related in some unspecified way. We call this the r-theory. As before, it is necessary to iterate to discover the probabilities to use in encoding the theory. To do this, frequencies of instructions are computed in D in a weighted fashion according to probabilities of transitions.

The null-theory is that strings A and B arose through being unrelated; this is the complement of the r-theory. The null-theory is encoded by encoding string A and then encoding string B. Note, the lengths of A and B must be included (see later). The odds-ratio of the strings' being related against their not being related is equal to the binary exponential of the null-theory length minus the r-theory length. It is then simple to calculate the posterior probability of the strings being related. Running the r-theory algorithm in reverse, D'[i+1,j+1] is minus the log_{2} of the probability that A[i+1..A] and B[j+1..B] arose through being related.
Thus, D[i,j]+D'[i+1,j+1]-D[A,B] is minus the log_{2} of the probability that the "true" alignment relates A[1..i] to B[1..j] and A[i+1..A] to B[j+1..B], given that A and B are related. A density plot of this function is a good way of visualising possible alignments.
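Replacing the `min' of the DPA by probability addition gives a sketch of the r-theory calculation. The costs dictionary holds illustrative code-word lengths (character bits included); the sketch omits the encoding of instruction counts and parameters discussed in the text.

```python
import math

def log2sum(a, b):
    """Combine two exclusive events held as message lengths in bits:
    returns -log2(2^-a + 2^-b)."""
    m, M = min(a, b), max(a, b)
    return m - math.log2(1.0 + 2.0 ** (m - M))

def r_theory_length(A, B, costs):
    """Length in bits for: A and B are related in some unspecified way.
    Identical to the DPA except that the `min' of the general step is
    replaced by probability addition over the three exclusive ways of
    reaching each cell."""
    INF = float("inf")
    D = [[INF] * (len(B) + 1) for _ in range(len(A) + 1)]
    D[0][0] = 0.0
    for i in range(len(A) + 1):
        for j in range(len(B) + 1):
            if i > 0 and j > 0:
                d = costs['match'] if A[i-1] == B[j-1] else costs['change']
                D[i][j] = log2sum(D[i][j], D[i-1][j-1] + d)
            if i > 0:
                D[i][j] = log2sum(D[i][j], D[i-1][j] + costs['indel'])
            if j > 0:
                D[i][j] = log2sum(D[i][j], D[i][j-1] + costs['indel'])
    return D[len(A)][len(B)]

# With these illustrative costs the single all-match alignment of
# ACGT:ACGT has length 4 * 2.5 = 10 bits; summing over all alignments
# can only add probability, so the r-theory length is below 10:
r = r_theory_length("ACGT", "ACGT", {'match': 2.5, 'change': 6.0, 'indel': 5.0})
assert 0.0 < r < 10.0
```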
For identical strings, there is near but not absolute certainty of the obvious alignment. For non-identical strings the picture can be more interesting:
Note the area of uncertainty marked by `#' where `CG' is aligned with `GC' for the first pair of strings, and the branch for the second pair. The examples above were created to exhibit certain features; in general, results for such short strings should be treated very cautiously.

The r-theory message includes the (weighted) number of generative instructions. The null-theory message must include the length of string A and of string B. If these two lengths are encoded independently then this penalises the null-theory and favours the r-theory for strings of similar length. In some circumstances this may be appropriate, but we do not generally consider strings similar on grounds of length alone; we require similarity of structure also. The generative model with p(i)=p(d) has a built-in expectation that the difference in the lengths of A and B is O(sqrt(A+B)). We build a similar expectation into the null-theory and transmit A+B using the log_{2}^{*} prior and A-B using a binomial prior. If p(i)=p(d)=1/2, the r-theory and the null-theory have the same prior over the difference in lengths. When p(i)=p(d)<1/2 the r-theory has a tighter distribution on the difference. Thus extreme similarity of length still favours the r-theory in the absence of genuine structural similarity, but the message length difference caused by this effect is at most O(log(A+B)) and is only present if abs(A-B) << sqrt(A+B).

Figure 1 plots the average message length of the r-theory against homology for strings of equal length. The homology is the fraction of characters matching in an optimal alignment (in this case a Sellers alignment) and is commonly used as a measure of similarity. When trying to judge the significance of homology it is common practice to scramble the letters in the two strings and recalculate the homology for a baseline figure. This procedure is not necessary with the MML method, which has a built-in and rigorous test of significance.
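The null-theory message length just described (log_{2}^{*} prior on A+B, binomial prior on the split of lengths, two bits per base) might be sketched as follows; the approximation to the log_{2}^{*} code length and its normalising constant are assumptions of this sketch.

```python
import math

def log2_star(n):
    """Approximate code length in bits of Rissanen's universal prior
    for a positive integer: log2(c) + log2 n + log2 log2 n + ... while
    the terms stay positive.  The constant c ~ 2.865 normalising the
    implied probabilities is an assumption of this sketch."""
    length = math.log2(2.865)
    t = float(n)
    while t > 1.0:
        t = math.log2(t)
        length += t
    return length

def null_theory_length(A, B):
    """Message length in bits for: A and B are unrelated.
    |A|+|B| is sent with the log2* prior, the split between |A| and
    |B| with a binomial prior (expecting a length difference of
    O(sqrt(|A|+|B|))), and each base then costs two bits."""
    n = len(A) + len(B)
    # -log2( C(n, |A|) / 2^n ): binomial prior on the length split
    log2_choose = (math.lgamma(n + 1) - math.lgamma(len(A) + 1)
                   - math.lgamma(len(B) + 1)) / math.log(2.0)
    return log2_star(n) + (n - log2_choose) + 2.0 * n
```

For long strings the two-bits-per-base term dominates, in line with the observation that the null-theory costs approximately two bits per character.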
The data for figure 1 were generated by mutating strings with homology one (identical) or zero by various amounts, using a mutation model in which changes, inserts and deletes occur in the ratio 2:1:1. Note that for long strings, the null-theory is encoded in approximately two bits per character. Note also that very low homology (consider AAAA:CCCC) supports the r-theory because the strings are related in a negative way.

3. Finite-State Models

More sophisticated variations on the dynamic programming algorithm are often used in molecular biology. The basic DPA gives a run of L inserts (or deletes) a cost proportional to L. Linear cost functions w(L)=a+b*L are more commonly used for indels. The constants a and b are chosen by experimentation but a=3 and b=1 are typical in work on DNA. As before, matches commonly cost zero and changes cost one. The constant `a' is a start-up cost and discourages large numbers of small indels being inferred. This is found to be plausible biologically. Gotoh[9] gave a fast O(A*B) algorithm for such costs. In brief, the algorithm maintains three values in each cell of the matrix D, one for each possible direction of arrival, and a cell can be computed by examining just three neighbours.

The linear costs can be related to probabilities of instructions for a generative machine using the techniques of the previous section. One possible interpretation is a one-state machine with modified insert and delete instructions. These instructions are now taken to insert and delete strings rather than single characters. The alignments produced by Gotoh's algorithm are invariant when the costs are multiplied by k>0 and when 2j is added to every character match and change and j*L is added to every indel of length L. Requiring the probabilities to sum to one gives an expression for k in terms of j. This enables the probabilities implied by particular costs to be recovered. As before, one degree of freedom remains.
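Gotoh's algorithm with three values per cell can be sketched as follows; the matrix names M, E and F are conventional, not from the text.

```python
def gotoh(A, B, match=0.0, change=1.0, a=3.0, b=1.0):
    """Gotoh's O(|A|*|B|) algorithm for linear indel costs w(L) = a + b*L.
    Each cell holds three values, one per direction of arrival:
    M ends in a match/change, E ends in an insert (gap in A),
    F ends in a delete (gap in B)."""
    INF = float("inf")
    n, p = len(A), len(B)
    M = [[INF] * (p + 1) for _ in range(n + 1)]
    E = [[INF] * (p + 1) for _ in range(n + 1)]
    F = [[INF] * (p + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for j in range(1, p + 1):
        E[0][j] = a + b * j          # boundary: one run of j inserts
    for i in range(1, n + 1):
        F[i][0] = a + b * i          # boundary: one run of i deletes
    for i in range(1, n + 1):
        for j in range(1, p + 1):
            diag = match if A[i-1] == B[j-1] else change
            M[i][j] = min(M[i-1][j-1], E[i-1][j-1], F[i-1][j-1]) + diag
            E[i][j] = min(E[i][j-1] + b,                      # extend run
                          min(M[i][j-1], F[i][j-1]) + a + b)  # start run
            F[i][j] = min(F[i-1][j] + b,
                          min(M[i-1][j], E[i-1][j]) + a + b)
    return min(M[n][p], E[n][p], F[n][p])

assert gotoh("ACGTACGT", "ACGT") == 7.0   # one run of 4 deletes: 3 + 4*1
```

The start-up cost `a` is charged once per run, so one long indel is preferred to many short ones, as the text describes.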
A second interpretation of the linear costs gives a three-state generative machine with simple character instructions. A match or a change always leaves the machine in S_{1}, an insert leaves it in S_{2} and a delete leaves it in S_{3}. Costs are conditional upon the current state.
In general, a three-state model cannot be normalised unless a constant x>0 is subtracted from the cost of entering S_{2} (and S_{3}) and added to the cost of leaving it. This does not change the alignments inferred. Probabilities of instructions in each state sum to one and this gives two equations in j, k and x. One degree of freedom remains, giving a family of values for j, k and x and the implied probabilities.
The second interpretation of linear costs as a three-state machine is preferred as it easily allows an r-theory algorithm to be run in a forward and reverse direction and the results to be combined to give a density plot. In general, the costs in Gotoh's algorithm can be varied to reflect the probabilities of instructions and the algorithm then produces an MML alignment. When probabilities are not known they can be discovered by iteration. We also modify Gotoh's algorithm, rather as the basic DPA was modified, to add probabilities and to produce a weighted sum over all possible alignments. This gives the r-theory of the three-state model. Figure 2 gives an example of a probability density plot.

If any problem-specific knowledge is available, it can and should be used to design new models of relation. First, a language must be designed which has appropriate expressiveness. Second, an optimal (or at least a good non-redundant) coding scheme must be devised for the language. Finally, an algorithm or a heuristic must be designed to infer plausible hypotheses. For example, it is apparent that some pairs of DNA strings have related regions and unrelated regions. The related regions are not necessarily identical. A language to express this possibility might contain character-based instructions as before plus long-insert and long-delete instructions. The latter instructions apply to reasonably large substrings. Assume that long indels are rare events and that their lengths have a geometric distribution with a large mean. The three-state machine can be augmented with two new states to model this situation.
The dynamic programming algorithm can be modified to maintain five values, one for each state, in each cell of the matrix D and to sum probabilities much as before. As before, parameter values can be estimated by iteration. The new algorithm then calculates the r-theory for the new model. Note that a run of inserted characters is not classified absolutely as long or short. Rather, it is treated in a weighted fashion as both long and short. To be sure, an insert of one or two characters almost certainly implies a short insert and one of many characters almost certainly implies a long insert (but only almost). The model just defined also solves the problem of finding substrings or patterns common to two strings. The following special case is of interest in molecular biology and has been examined by Sellers[19] and others:
W and W' are related but V and Y are unrelated, as are X and Z.

4. Other Models

There are a great many models of relation between strings. An important class of model is based on simple finite-state machines. The inference algorithms are variations on the dynamic programming algorithm. Each model is defined by probabilities of machine instructions and by probability distributions on the run-lengths of blocks of matches, changes and indels. The Sellers' and the linear-cost algorithms discussed above imply geometric distributions for run-lengths. A linear cost amounts to using a unary code, which is optimal for a geometric distribution. Miller and Myers[13] gave an O(A*B) alignment algorithm which enables downward-concave cost functions to be used. After normalisation, and modulo the characters, a cost function w(L) implies a probability distribution 2^{-w(L)} on lengths. Any problem-specific knowledge of actual probabilities can be built into a cost function. Conversely, if a particular cost function performs well it gives information about mechanisms applying to strings.

Other types of relation model are possible. For example, Tichy[21] gave a linear-time algorithm to find the minimum number of block-moves necessary to edit one string into another. Block mutations such as duplications, reversals and complements are known to occur in macromolecules. The MML method allows such rival models to be compared on real strings.

5. Conclusions

We used minimum message length encoding to infer and compare theories over strings. MML encoding makes explicit the connection between a string comparison algorithm and a model of relation. For a given model, the odds-ratio of two alignments can be calculated. The r-theory is the weighted union of all possible alignments. Since there is a natural null-theory, the posterior probability of two strings' being related can be calculated. An algorithm to find substrings or patterns common to two strings was also given.
Given enough data, the MML method allows models of relation for "real" strings to be compared and, in principle, the "true" model to be inferred. The methods are clearly applicable to mutating strings such as DNA and other macromolecules. 6. References
[1] L. Allison and C. N. Yee.
Minimum message length encoding and the comparison of macromolecules.
[2] M. J. Bishop, A. E. Friday and E. A. Thompson.
Inference of evolutionary relationships.
[3] M. J. Bishop and C. J. Rawlings (eds). Nucleic Acid and Protein Sequence Analysis, a Practical Approach. IRL Press 1987.
[4] D. M. Boulton and C. S. Wallace.
The information content of a multistate distribution.
[5] G. J. Chaitin.
On the length of programs for computing finite binary sequences.
[6] Y. H. Chong, B. Pham, A. J. Maeder and M. J. Marlin.
Automatic nephanalysis using a minimum message length classification scheme.
[7] D. N. Cohen, T. A. Reichert and A. K. C. Wong.
Matching code sequences utilizing context free quality measures.
[8] M. P. Georgeff and C. S. Wallace.
A general selection criterion for inductive inference.
[9] O. Gotoh.
An improved algorithm for matching biological sequences.
[10] R. W. Hamming. Coding and Information Theory. Prentice Hall 1980.
[11] A. N. Kolmogorov.
Three approaches to the quantitative definition of information.
[12] G. G. Langdon.
An introduction to arithmetic coding.
[13] W. Miller and E. W. Myers.
Sequence comparison with concave weighting functions.
[14] Ming Li and P. M. B. Vitanyi.
Two decades of applied Kolmogorov Complexity.
[15] T. A. Reichert, D. N. Cohen and A. K. C. Wong.
An application of information theory to genetic mutations and the matching
of polypeptide sequences.
[16] J. Rissanen.
A universal prior for integers and estimation by minimum description length.
[17] D. Sankoff and J. B. Kruskal (eds). Time Warps, String Edits and Macromolecules. Addison Wesley 1983.
[18] P. H. Sellers.
On the theory and computation of evolutionary distances.
[19] P. H. Sellers.
The theory and computation of evolutionary distances:
pattern recognition.
[20] R. Solomonoff.
A formal theory of inductive inference, I and II.
[21] W. F. Tichy.
The string-to-string correction problem with block moves.
[22] C. S. Wallace and D. M. Boulton.
An information measure for classification.
[23] C. S. Wallace and P. R. Freeman.
Estimation and inference by compact coding.
[24] M. S. Waterman.
General methods of sequence comparison.
[25] I. H. Witten, R. M. Neal and J. G. Cleary.
Arithmetic coding for data compression.
[26] A. K. C. Wong, T. A. Reichert, D. N. Cohen and B. O. Aygun.
A generalized method for matching informational
macromolecular code sequences.

