Inductive Inference over MacroMolecules 

© L.Allison, C.S.Wallace and C.N.Yee,Technical Report 90/148,

NB. Generative probabilistic finite state automata (models) for a pair of sequences have later been called Pair Hidden Markov Models PHMM. 
There are many ways of explaining or modeling two strings given that they are related. In the mutation model, a simple machine reads one given string and a sequence of mutation instructions while writing the second given string (figure 1).
string A: TATACGTTACAC instructions: copy, copy, ins(A), copy, copy, change(G), change(C), copy, copy, del, copy, copy, copy, ins(A) string B: TAATAGCTTCACAFigure 1: Mutation Model Example. 
In the generative model, a different simple machine reads a sequence of generative instructions while writing the two given strings in parallel (figure 2).
instructions: match(T), match(A), ins(A), match(T), match(A), change(C,G), change(G,C), match((T), match(T), del(A), match(C), match(A), match(C), ins(A) string A: TATACGTTACAC string B: TAATAGCTTCACAFigure 2: Generative Model Example. 
Note that for traditional reasons we use the terms `change' when a different character appears in each string, `delete' (del) when a character appears in string A and not in string B and `insert' (ins) when a character appears in B and not in A under both models but with appropriate meanings. For both models, a given sequence of instructions is equivalent to a traditional alignment, see figure 3.
string A: TATACGTTACAC     string B: TAATAGCTTCACAFigure 3: An Alignment. 
It is therefore clear that the mutation and generative models are equivalent and we usually prefer the generative model on grounds of symmetry. We use the terms `alignment' and `instruction sequence' interchangeably.
Note that alignments and sequences of generative or mutation instructions are not unique. Often we want an optimal explanation, alignment or instruction sequence but even suboptimal explanations may be plausible.
Given two macromolecules, A and B, it is often assumed
that they evolved independently from a common ancestor P.
We could model this by two mutation machines, MA and MB,
which each read P and their own instruction sequence and produced
A and B respectively.
A and B have infinitely many possible ancestors P
but most are wildly improbable.
Summing over all possible P gives the probability
of A and B being related under this model.
If we knew P and the instruction sequences for MA and for MB,
the sequences could be synchronised to derive an alignment or
an instruction sequence for a single generation machine.
An alignment corresponds to a set of possible
ancestors (figure 4).
Alignment 1: TATACGTTACAC     TAATAGCTTCACA Ancestors 1: TA[A]TA[CG][CG]TT[A]CAC[A] Alignment 2: TATACGTTACAC      TAATAGCTTCACA Ancestors 2: T[A]ATA[C]G[C]TT[C]ACA[C]Figure 4: Alignments and Possible Ancestors. 
Therefore the rtheory should not be based on just one alignment but summed over all of them. We can decide if A and B are related but P must remain uncertain unless extra information is provided. Note that there may have been characters in P that were deleted in both A and B and for which no direct evidence remains. Note also that the assumption of a constant rate of evolutionary mutation implies that MA and MB are equal and it is difficult to see how one could make any other assumption given just two strings.
Traditional alignment programs[Bish3][Sank][Wate] select alignments that have a minimum cost, or equivalently a maximum score, under some measure. The minimum message length principle implies that the cost of an alignment should be the length of a message under an optimal code for alignments. To answer the question, are A and B related, all alignments should contribute according to their probability. The corresponding message and its code should transmit A and B given that they are related; it should not transmit a single alignment.
Our models and codes may have parameters for the form of a generative machine and the probabilities of its instructions which must be included in the message. In this way simple and complex models may be compared on an equal footing. In following sections we specify and compare one, three and fivestate models.
The onestate generative machine has a single state and transitions or instructions of the following forms: match(ch), change(ch1,ch2), ins(ch), del(ch). The match instruction generates the same character `ch' in both strings. The change instruction generates ch1 in string A and ch2 in string B. The insert (ins) instruction generates a character in string A and the delete (del) instruction generates a character in string B. A single instruction having probability `p' is optimally encoded in log_{2}(p) bits plus two bits for each character, except that the two characters in a change are different and together require log_{2}(12)<4 bits. Often we require that P(ins)=P(del). The probability of a sequence of instructions of fixed length is the product of the probabilities of individual instructions and the message length of the sequence is the sum of the message lengths of the instructions.
If the probabilities of instructions are known apriori,
a fixed code can be based on them.
The dynamic programming algorithm (DPA)
is modified to compute the message length
of an optimal instruction sequence (figure 5).
Boundary Conditions: D[0, 0] := 0 D[i, 0] := delete * i, i:=1..A D[0, j] := insert * j, j:=1..B General Step, i:=1..A and j:=1..B: D[i,j]:=min(D[i1,j ] + delete, D[i ,j1] + insert, D[i1,j1] + if A[i]=B[j] then match else change)Figure 5: Dynamic Programming Algorithm. 
Each instruction is given a weight based on log_{2} of its probability. Note that a match does not have zero cost. D[i,j] is the message length of an optimal sequence to generate A[1..i] and B[1..j]. It has been shown[Alli2] how to normalise Seller's[Sell1][Sell2] and similar metrics so as to extract the probabilities implicit in them [See also L.Allison, Jrnl. Theor. Bio. 161, pp263269, 1993]. Ignoring the matter of coding lengths and parameter values for now, the algorithm finds an alignment or instruction sequence having the minimum message length.
When probabilities are not known in advance, they must be inferred from the strings. A simple iterative approach finds at least a local optimum. Initial probabilities are guessed and an alignment is found. New probabilities are calculated from the frequencies of instructions in the alignment and used in the next iteration. The process quickly converges. The inferred parameter values are included in the message; Boulton and Wallace[Boul] give the necessary calculations.
The length of a sequence is not known in advance and must be included in any message. In the absence of more information we use Rissanen's[Riss] universal log^{*} prior to transmit the length.
The overhead of transmitting the length of an instruction sequence and the parameter values becomes a smaller fraction of the total message length as string length increases. Since the DPA does not consider the overhead when finding an alignment, the overall result is not guaranteed to be optimal. In practice it is, particularly for reasonably long strings.
Optimal alignments are not unique in general. Even suboptimal alignments make some contribution to the rtheory. The `min' in the DPA (figure 5) chooses the single alignment having the smallest message length, the largest probability. If the probabilities are instead added, for they correspond to exclusive events, then D[i,j] is the (log_{2}) probability that A[1..i] and B[1..j] arose through being generatively related by the machine but in some unspecified way. This yields the probability of A and B given the rtheory. The addition of probabilities is a probabilistic argument but a corresponding code has been devised by Wallace[Wall3]. It reduces the message length of the rtheory below that of a single alignment. When instruction probabilities are not known they must be inferred. The same iterative search is used, weighted frequencies being added when alternative sequences are combined and their probabilities added.
The nulltheory is that A and B arose independently and this is the complement of the rtheory. The nulltheory is encoded by encoding string A and then encoding string B. It must include their lengths. Rather than transmit two lengths, we transmit the sum of the string lengths using the universal prior and the difference of the lengths using a binomial prior. This is reasonable because random insertions and deletions perform a random walk on the difference of the lengths. If P(ins)=P(del)=1/2, the nulltheory and the rtheory have the same prior over AB. If P(ins)=P(del)<1/2, the rtheory has a tighter distribution on AB.
The oddsratio of the rtheory and the nulltheory is the binary exponential of the difference in their message lengths. Since they are complementary, their posterior probabilities can be calculated.
Running the rtheory in reverse,
D'[i+1,j+1] is (log_{2}) the probability of
A[i+1..A] and B[j+1..B] being generated.
Thus, D[i,j]+D'[i+1,j+1] is the (log_{2}) probability
of A[1..i] and B[1..j] being generated
and A[i+1..A] and B[j+1..B] being generated
and is proportional to the probability of the "true"
alignment partitioning the strings in this way.
A density plot of this function
is a good way of visualising possible alignments
(figure 6).
See:
Allison, Wallace & Yee,
[AIM'90], and
for greyscale alignment probability plots. _{T A A T A G C T T C A C A} _{ $   . } T:_{  $ #  . . } A:_{ .  $ $  . . . } T:_{ . .   $   . . } A:_{ . .   $ #  . . } C:_{ .   # + #  . . } G:_{ . . .  # $ + . . } T:_{ . . .  + $ +  . } T:_{ . . . . + $ #  . } A:_{ . . .  #  #  .} C:_{ . .  #  # } A:_{ . .  # + #} C:_{ .  # $} key : $ # +  . MML plus: 0..1..2..4..8..16 bits nulltheory=62.4 rtheory=58.5 probability related=0.94Figure 6: Alignment Densities. 
Note the areas of uncertainty where there are multiple reasonable
explanations, CG:GC and ACAC:CACA.
In this example the rtheory is certainly acceptable
but any single alignment is at best marginally acceptable.
The density plot is a "fuzzy" alignment.
It is a good representation of the possible ancestors of A and B
for use in multiple string alignment
or in phylogenetic tree reconstruction[Alli3].
We note that string alignment algorithms
employing linear costs w(L)=a+b*L for indels of length L
are in common use.
The constant `a' is a startup cost to discourage the inference of
many short block indels
and is thought to be realistic for macromolecules.
This can be modelled by a threestate machine (figure 7).
states: S1: match/change state S2: insert state S3: delete state transitions: S1match>S1 S1change>S1 S1insert>S2 S1delete>S3 S2match>S1 S2change>S1 S2insert>S2 S2delete>S3 S3match>S1 S3change>S1 S3insert>S2 S3delete>S3 conservation: p(mSj)+p(cSj)+p(iSj)+p(dSj)=1, j=1,2,3Figure 7: ThreeState Machine. 
When P(insS1)<P(insS2) there is a low startup probability, a high startup cost, on block indels. A linear cost function corresponds to a geometric prior on indel lengths, for log_{2}(p^{L}q) = log_{2}(q)  log_{2}(p)*L.
Gotoh[Goto] has given an algorithm to infer an optimal alignment for linear weights. Briefly, the DPA is modified to maintain three values in each cell of the D matrix. One value corresponds to each state. As before, we have further modified the algorithm to add probabilities. Again probabilities must be inferred from the strings if they are not known apriori. The iterative scheme described previously is used.
The algorithm is run forward and backwards and results added to produce alignment density plots. The threestate model has more parameters than the onestate model and these add to the length of the header part of the rtheory message.
Gotoh[Goto] also provided an algorithm
to find a single optimal alignment under
piecewise linear indel costs.
Bilinear indel costs can be modelled by a fivestate machine
(figure 8).
states: S1: match/change state S2: (short)insert state S3: (short)delete state S4: longinsert state S5: longdelete state transitions: S1, S2, S3 as before plus S1longins>S4 S1longdel>S5 S4longins>S4 S4endlong>S1 S5longdel>S5 S5endlong>S1 conservation: p(mS1)+p(cS1)+p(iS1)+p(dS1)+ p(longinsS1)+P(longdelS1)=1 etc.Figure 8: FiveState Model. 
We have modified the DPA to maintain five values, one for each state,
in the D matrix.
This kind of model is appropriate if
short indels are thought to be relatively frequent and
long indels rather infrequent which might account for point mutations
and larger events, perhaps introns/exons.
A fullyconnected fivestate machine would have a very large number of
transitions and parameters.
To reduce this we simply augment the threestate machine with
longinsert and longdelete states connected only to
S1.
short: P_{s}(L)=0.098 x 0.50^{(L1)}x0.50 long: P_{l}(L)=0.002 x 0.98^{(L1)}x0.02 combined: P_{c}(L)=P_{s}(L)+P_{l}(L) Figure 9: Combining Length Costs of Long and Short Indels. 
As before, probabilities of alternative sequences are added. Note that because of this a medium length indel is considered to be partly long and partly short. Resulting indel costs are smooth and asymptotically linear; figure 9 gives an example.
The search for optimal fivestate parameter values is more difficult. Because longindels are expected to be rare they are initially given a low probability, a large startup cost, but a high probability of continuation. A further useful strategy is to run the threestate model first and base initial probabilities for S1, S2 and S3 on the results with minor adjustments to conserve probability. The iterative process is used from this starting point.
Note that a special case of fivestate data
is when the long indels occur at the ends of one or both strings,
when string A=A1;A2;A3 and B=B1;B2;B3
where A2 and B2 are related by a threestatemodel and
A1, A3, B1 and B3 are unrelated, possibly null.
This amounts to looking for a common pattern A2:B2 in A and B.
This situation is rather common because interesting
pieces of DNA from a chromosome may have flanking regions
attached.

The one, three and fivestate algorithms were run on both artificial data and real DNA strings.
The cost of stating the parameters of a model is a decreasing fraction
of the total message cost as string lengths increase.
For short strings, this may cause a simpler model to be favoured.
To study this effect, pairs of strings of various lengths
were generated by a fivestate model
with various known parameter values.
For each setting, a collection of 15 to 30 pairs was generated.
For each model and collection, the rtheory algorithm was run
and the message lengths averaged.
This was done for the one, three and fivestate
models.
S1 (match/change): P(match)=0.8 P(change)=0.000001 P(short insert)=P(short delete)=0.098 P(long insert)=P(long delete)=0.002 S2 (short insert): P(match)=0.49 P(change)=0.000001 (return to S1) P(short indel)=0.5 P(short delete)=0.01 S3 (short delete): similar to S2 S4 (long insert): P(long insert)=0.98 P(end long)=0.02 (return to S1) S5 (long delete): similar to S4 average short indel length = 2 average long indel length = 50Table 1: 5State Parameters for Generating Test Data. 
Figure 10(a) [see Allison et al 1992] gives plots of the total message lengths per symbol versus string length for strings generated by a fivestate model having parameters indicated in table 1. In this case the probability of changes was held very low to isolate the interaction between short and long indels. There is a crossover from favouring the threestate to favouring the fivestate model at a string length of 500 to 1000 characters. (The inference for long strings is stronger than might be immediately apparent because the cost per symbol is plotted.) A single pair of strings from an unknown source and shorter than this cannot be identified as coming from a fivestate rather than a threestate model.
A pair of strings longer than 500 to 1000 characters can be identified as coming from a fivestate model. Many pairs of shorter strings, assumed to be generated by the same model and parameter values, can also be so identified. It is sufficient for the parameter values of the pairs to be related rather than identical in this latter situation so that they need not be stated in full for each pair. To indicate what would happen in an extreme case of this situation, figure 10(b) gives message lengths under the three models when parameter costs are excluded. The data is the same as for figure 10(a). In this case, the crossover from the threestate to the fivestate inference is at about 150 characters for the given parameter values. At this point a pair is reasonably likely to have at least one long insert or delete for the given parameter values.
Note that the DNA coding regions for many proteins have lengths in the difficult range of 500 to 1000 characters. Note that the crossovers would occur at shorter lengths for an alphabet size larger than 4.
When the similarity within each pair of strings is increased, the general picture is as described above but the crossovers occur at greater string lengths. Increasing similarity favours simpler models. In the extreme case, two identical strings are best described by the onestate model.
It is clear that to draw any general conclusions about
the "true" model for real DNA, a pair of very long related strings or
a large number of pairs of related strings is required.
Such data are hard to come by
The Blur sequences are members of the large Alu family[Bain].
The typical Blur sequence is about 280 characters long.
1state 3state 5state >model Blur8 :Blur13 1039 946 943 (1016) (912) (909) Blur11 :Blur13 1179 963 957 (1155) (928) (922) Blur11':Blur13' 825 830 836 (801) (800) (802)Lengths in parentheses are exclusive of parameter costs. Table 2: Message Lengths in bits for some Blur Sequences. 
Table 2 gives some results of the one, three and fivestate models on Blurs 8, 11 and 13. When whole strings are compared, the evidence strongly supports either the three or fivestate model rather than the onestate model. The three and fivestate models give message lengths within a few bits of each other which is insufficient to conclude in favour one rather than the other. Blur11' and Blur13' were derived by forming an alignment and truncating any terminal overhanging segment on either string. Results for this pair slightly favour the onestate model suggesting that the overhangs were the dominant factor in the whole strings. The difficulty is probably that the Blur's are too short. There are many Blurs but it is not known how they are related. We suggest that the correct way to explain them is by an MML multiple alignment with respect to a phylogenetic tree (or forest). Algorithms for this problem are the subject of current work [e.g. see Allison & Wallace, Jrnl. Molec. Evol. 39 pp418430, 1994].
The globin genes and pseudogenes are often used in phylogenetic studies
of primates[Holm].
1state 3state 5state  HSBGLOP:ATRB1GLOP 6414 6004 6007 inc' param's (6383) (5960) (5958)exc' param'sGenbank id HSBGLOP: human beta type globin pseudogene 2153bp. Genbank id ATRB1GLOP: owl monkey psi beta 1 globin pseudogene 2124bp. Table 3: Message Lengths (bits) for 2 Globin Pseudogenes. 
Table 3 gives the results of the rtheory algorithms of the three models on human beta type globin pseudogene and owl monkey psi beta 1 globin pseudogene. The three and fivestate models give similar results and the onestate model performs significantly worse.
The above results strongly suggest that the onestate model is not the correct one to use with DNA. They do not distinguish between the three and fivestate models. It is clear that a great many more tests are required to discover if one of these, or some other model, is the correct one to use.
The basic DPA takes O(A*B) time and space.
If a single optimal MML alignment is required,
Hirschberg's[Hirs] divide and conquer technique
can be used to reduce the space complexity to O(B).
This is not applicable if the MML of the rtheory is required.
Observations of alignment density plots (figure 6)
indicate that most of the probability is concentrated in a narrow region.
Fast approximate MML algorithms can take advantage of this
fact.
Figure 11: Message Length v. Window Size (approx'n algorithm). 
An approximate threestate algorithm has been programmed. Firstly, Ukkonen[Ukko] gave an O(A*D(A,B)) editdistance algorithm for Seller's metric. This is fast for similar strings. The algorithm does not calculate the D matrix directly but rather the location of contours in it. The algorithm halts when a contour reaches D[A,B]. We applied Ukkonen's technique to linear indel costs. Secondly, as mentioned in section 3, arbitrary costs in a metric can be normalised to yield instruction probabilities. This process can be reversed to choose integer weights corresponding to "reasonable" probabilities. This is used to find an optimal alignment. Finally, a window containing all D cells that lie within a diagonal distance `w' of the optimal alignment is formed and the rtheory algorithm is run only within this window. All cells outside this window are considered to have probability zero and an infinite message length. The time and space taken for this step is O(max(A,B)*w). A graph of message lengths against w is given in figure 11 [see Allison et al 1992]. The random data was generated by a threestate model. The dominant influence on required window size is the probability of a match in state 1 which is indicated. It can be seen that a small window includes almost all the probability. This windowing technique could be applied to other models.
We have solved the question of two strings being related (the rtheory) for one, three and fivestate models of string relation. The biological validity of such models is open to debate but we note that one and threestate models are implicit in alignment algorithms in common use. A major advantage of a formal model is that its assumptions are explicit and available for scrutiny and modification. We have given an objective method of comparing models of relation over a given set of strings and have applied it to the one, three and fivestate models. The tests performed are only a first step and the models can be elaborated to account for unequal probabilities of changes, effects of codon bias, the genetic code and so on. The assembly of a large body of real biological test data remains a problem.
In general, a single alignment is excessively precise. All alignments are possible; some are more probable than others. The rtheory should include them all.
Modulo normalisation, an arbitrary cost function w(L) on blocks of matches, changes, inserts or deletes of length L implies a prior probability distribution 2^{w(L)} on run lengths. Problem specific knowledge can be used to design a good prior. Conversely if a cost function gives results judged (externally) to be good, it gives information about good priors and mechanisms relating strings.
A piecewise linear cost function on runs of instructions implies a finitestate machine or model. All such models have an O(A*B) MML algorithm for the rtheory and for alignment probability densities. However, the number of transitions in the machine and thus the complexity of the algorithm rapidly increases with the number of linear segments making the cost function. Miller and Myers[Mill] have given an O(A*B) or O(A*B*log(max(A,B))) algorithm to find a single optimal alignment for arbitrary concave functions. There does not seem to be an rtheory algorithm or a density algorithm with this complexity for such functions.
Given a single pair of strings from an unknown source,
over an alphabet of size four
and of length less than 500 to 1000 characters
it is not possible to identify them as being the products of a fivestate
process even if they are.
Longer strings or many pairs of shorter strings generated by
the same fivestate process can be so identified.
More on Computational Biology / Bioinformatics [here].
www: 
