Appears in
Hawaii Int. Conf. Sys. Sci. 25, Vol.1, pp.663674, January 1992.
L. Allison, C. S. Wallace and C. N. Yee,
Department of Computer Science, Monash University, Australia 3168.
Abstract:
A method of Bayesian inference known as
minimum message length encoding is applied
to inference of an evolutionarytree and to
multiplealignment for k>=2 strings.
It allows the posterior oddsratio of two competing hypotheses,
for example two trees,
to be calculated.
A tree that is a good hypothesis
forms the basis of a short message describing the strings.
The mutation process is modelled by a finitestate machine.
It is seen that tree inference and multiplealignment are
intimately connected.
Q: What is the difference between a hypothesis and a theory?
A: Think of a hypothesis as a card.
A theory is a house made of hypotheses.
(From rec.humor.funny, attributed to Marilyn vos Savant.)
Keywords:
inductive inference,
minimum message/ description length,
multiplealignment,
evolutionarytree,
phylogenetic tree.
1. Introduction.
In previous
papers[1, 2],
minimum message length
(MML)
encoding was used to
infer the relation, if any, between 2 strings
and to compare models of mutation or relation.
Here we apply MML encoding
to the problems of multiplealignment and inference of an evolutionarytree
given k>=2 strings of average length n.
MML encoding is a method of Bayesian inference.
We consider the message length of an explanation
of given data D.
An explanation consists of a message which first states
a hypothesis H, then states the data using a code which
would be optimal were H true.
It is a basic fact of coding theory that
the message length (ML), under optimal encoding, of an event of probability p
is log_{2}(p) bits.
From Bayes' theorem in the discrete case:
P(H&D) = P(H).P(DH) = P(D).P(HD)
ML(H&D)= ML(H)+ML(DH)= ML(D)+ML(HD)
P(DH) is called the likelihood of the hypothesis.
It is a function of the data given the hypothesis;
it is not the probability of the hypothesis.
P(H) is the prior probability of the hypothesis
and P(HD) is its posterior probability.
If ML(H) and ML(DH) can be calculated for given H and D
then the posterior oddsratio of 2 hypotheses can be calculated.
Analogous results hold in the continuous case and
imply an optimum accuracy for estimating realvalued parameters,
although the calculations can be difficult.
MML encoding can be used to
compare models, test hypotheses and estimate parameters.
Wallace and Freeman[25] give the statistical foundations.
Maximumlikelihood methods and MML encoding are related.
A significant difference is that the former do not cost
the model or hypothesis complexity, ML(H), and so
tend to favour overly complex or detailed models.
In sequence analysis the issue covers:
the estimation of parameters to optimal accuracy,
the complexity of models of mutation and of evolution and
the fact that a single alignment is too precise a hypothesis.
In cases where model complexity is constant,
it can be ignored and then maximumlikelihood and MML encoding
give the same results.
We next link inference over 2 strings to problems over k strings.
Given strings A and B,
the following questions can be raised:
Q1: Are they related?
Q2: Are they related by a particular model of relation
and parameter values (~ time since divergence)?
Q3: Are they related by a particular alignment?
Q4: Are they related by descent from a particular common ancestor?
An answer to one of these questions can "only" be a hypothesis.
An explanation of the strings can be based on such an answer.
Explanations based on successive questions
carry increasing amounts of information
and must have increasing message lengths.
A and B have infinitely many possible common ancestors.
It is possible to dismiss most as wildly improbable
but impossible to identify a common ancestor with
certainty without extra information.
An alignment implies a subset of possible ancestors in a natural way.
Conventional alignment programs return a single alignment
which is optimal under some explicit or implicit assumptions  a model.
A given model can relate A and B by many different alignments
and we call the answer to Q2 the rtheory (for relation).
It states the inferred parameters for the evolution of A and B
followed by A and B in a code based on all possible alignments.
A byproduct is a probability density plot of all alignments.
The answer to Q1 would require the sum over all models
of the integral over all parameter values.
This calculation is impractical and the result would be of little use;
Q2 is the question usually asked.
There is a natural nulltheory in sequence analysis.
If the strings are unrelated,
it is not possible to do better than transmit them one after the other,
at about 2 bits per character for DNA.
The lengths of the strings must also be transmitted.
Any explanation that has a message length greater
than that of the nulltheory is not acceptable.
The approach to problems over 2 strings
generalises to problems over k>=2 strings.
Questions analogous to Q1Q4 arise
when given k strings which might be related by an evolutionarytree:
R1 (a): Are
they related or, slightly more detailed,
R1 (b): are
they related by a particular treetopology?
Consideration is usually restricted to binary trees
and we follow this convention here.
It is not hard to allow arbitrary branching factors.
Forests should be allowed so as to permit independent families;
the message length
of a forest includes the sum of those of its trees
and the generalisation is obvious.
R2: Are they related by a particular tree
(topology + parameter values for each edge).
Each edge of the tree represents a simple 2string relation
and the inferred parameters for each edge (~ time) are required.
This is the generalised rtheory.
R3: Are
they related by a particular multiplealignment?
R4: Are
they related by descent from particular ancestors?
The chain Q1:Q2:Q3:Q4 matches the chain R1:R2:R3:R4
and in particular Q2:Q3 matches R2:R3,
although this requires some discussion.
Many misunderstandings come from confusing answers to R3 with those to R2.
A tree (topology & edge parameters) implies a
probability distribution on charactertuples and thus on multiplealignments.
The probability of a tree and the strings (R2) is the
product of the prior probability of the tree and the sum of
the probabilities of all such multiplealignments given the tree.
When k>2,
the number of alignments is very large
and heuristics are needed to
make the computations feasible.
R3 can be taken in 2 ways.
We take the word `related' to mean `related by an evolutionarytree'.
A tree has 2k3 edges and thus has O(k) parameters
but there are 5^{k}1 possible tuples in a multiplealignment.
There is a smaller but still exponential number of `patterns'.
Many of the tuples' probabilities are locked together under
an evolutionarytree hypothesis.
If `related' were taken to mean any arbitrary relation,
there would be more degrees of freedom in the form of hypotheses
which might enable the data to be fitted better but at the higher overhead
of stating more parameters.
Reichert et al[21] were the first to apply compact encoding
to the alignment of 2 strings.
They used one particular model of relation and sought a single optimal
alignment.
This was difficult to do efficiently for the model that they chose.
Bishop and Thompson[5] defined maximumlikelihood alignment for what
is a 1state model although they recognised the possibility of other models.
Allison et al[2] objectively compared 1, 3 and 5state models of
mutation using MML encoding.
Felsenstein[11,12] developed a maximumlikelihood method for
inferring evolutionarytrees.
To date, it is the method with the soundest statistical credentials.
His program does not do alignment itself
and requires a multiplealignment as input.
Friday[13] discussed the problem of
assigning figures of merit to inferred trees.
Wallace and Boulton developed MML methods for classification[24,7],
and for hierarchical classification[8],
of arbitrary "things".
In classification, a class usually contains many things.
Cheeseman and Kanefsky[9] applied MML encoding to evolutionarytrees
over macromolecules.
Milosavljevic[20] examined the MML classification of macromolecules.
The methods described in this paper relate to other methods as follows.
They generalise Felsenstein's approach by allowing inserts and deletes
to be inferred and by including the cost of the model.
In contrast to Cheeseman and Kanefsky,
ancestral strings are not inferred, for to do so involves arbitrary choices.
However probabilities can be assigned to possible ancestors
and to possible alignments given the parameters that are inferred.
string A= TAATACTCGGC
string B= TATAACTGCCG
mutation instructions:
copy
change(ch) NB. ch differs from the corr' char in string A.
del
ins(ch)
a mutation sequence:
copy; copy; delete; copy;
copy; ins(A); copy; copy;
del; copy; change(C); copy;
ins(G)
generation instructions:
match(ch)
mismatch(ch_{A},ch_{B}) NB. ch_{A}<>ch_{B}
ins_{A}(ch_{A})
ins_{B}(ch_{B})
a generation sequence:
match(T); match(A); ins_{A}(A);
match(T); match(A); ins_{B}(A);
match(C); match(T); ins_{A}(C);
match(G); mismatch(G,C);
match(C); ins_{B}G
equivalent alignment:
TAATACTCGGC
    
TATAACTGCCG
Figure 1: Basic Models.
2. MutationModels.
We
consider 2 broad models to relate 2 strings A and B (Figure 1).
In the mutationmodel a simple machine
reads string A and a sequence of mutation instructions and outputs
string B.
In the generationmodel a simple machine reads a
sequence of generation instructions and outputs strings A and B
on a single tape 2 characters wide.
There is an obvious correspondence between
a mutation sequence, a generation sequence and an alignment.
Results based on these notions are
equivalent and they can be used interchangeably as convenient.
The machines that we consider are probabilistic finitestate
machines as described in [2];
they are equivalent to hidden Markov models.
We usually consider symmetric machines where
p(ins) = p(del) = p(ins_{A}) = p(ins_{B}).
The message length of an instruction is log_{2} of its probability
plus either 2 bits for 1 character or log_{2}(12) bits
for 2 differing characters;
for simplicity only, we assume that all characters are equally probable.
In highly conserved sequences, the probabilities of those instructions
that do not conserve characters
are low and their message lengths are high.
The probability of an instruction also
depends on the current state, S, of the machine:
ML(match(c)S) = log_{2}(P(matchS)) + 2 ML(mismatch(c,c')S)
= log_{2}(P(mismatchS)) + log_{2}(12)
ML(ins_{A}(c)S) = ML(ins_{B}(c)S) = log_{2}(P(ins_{A}S)) + 2
A 1state machine corresponds to Sellers' edit distance[23] with
adjustable weights.
A 3state machine can correspond to linear indel costs or weights.
Machines with more than 3 states can correspond
to piecewise linear indel costs.
Multiple states allow an instruction to have
different probabilities in different contexts.
eg. For a 3state machine
a run of indels can have a higher probability
of continuing than of beginning, giving a linear message length or cost as a
function of run length.
A message based on one generation sequence (or alignment) states:
(i) the parameter values of the machine
which give its instruction probabilities and
(ii) the sequence of instructions.
The parameters are estimated by an iterative process:
Initial values are chosen and an optimal sequence found.
An optimal code for this sequence
should be based on probability estimates derived
from the observed frequencies of instructions.
This change in the parameter values cannot increase the message length
but can make another sequence optimal,
so iteration is necessary and is guaranteed to converge.
A few iterations, typically 4 to 8, are sufficient for convergence.
The process could find only a local minimum in theory but this is not
a problem in practice.
The parameters must be stated to an optimum accuracy;
the instructions are chosen from a multistate distribution and
Boulton and Wallace[6] give the necessary calculations.
The rtheory (Q2) is based on all instruction sequences.
The algorithms[2] that compute the rtheory for the various machines
are dynamic programming algorithms [DPA] that
perform <+,logplus> on message lengths in the inner step
or <*,+> on probabilities,
instead of the usual <+,min> on edit distances
or <+,max> on lengths of common subsequences.
(logplus(log(A), log(B)) = log(A+B))
The iterative process is used to estimate the parameter values
based on weighted frequencies.
TATAA TATAA TATAA
      
TAATA 2 TAATA 2 TAATA 2 etc.
CGGC CGGC CGGC
   
GCCG 3 GCCG 3 GCCG 4
CGGC CGGC CGGC
   
GCCG 4 GCCG 4 GCCG 4
CGGC
 
GCCG 4 etc
Figure 2: Possible Subalignments.
Results
reported in [2] support the widely held view that the 3 and 5state
machines are better models for DNA than the 1state machine.
The 3 and the 5state machines
have increasing numbers of parameters to be stated.
This tends to increase the message length, but is more than made up for
by a saving on the strings.
There are practical difficulties in going beyond about 5states.
We do not yet have enough data to reliably judge between the 3 and
5state machines.
Traditionally,
alignment programs have been based on
Sellers' metric[23],
or on linear and piecewise linear cost functions for indels[14,15],
or on arbitrary concave cost functions[19].
Some of these programs have several parameters requiring adhoc adjustments.
MML encoding provides an objective way
to compare models of differing complexities
and to infer their parameter values.
We now examine the questions Q1Q4 informally on the example strings
A = TAATACTCGGC and B = TATAACTGCCG.
Note, results on short strings
must be taken with a pinch of salt
and we use the 1state model only because it is the simplest to discuss.
Q1: Are strings A and B related?
They probably are related but
it is hard to see exactly how.
They have at the least diverged considerably.
Q2: How far and in what ways have A and B diverged?
The front sections, TATAA:TAATA can be
related in several ways involving either
2 mismatches or 2 indels.
The central CTs probably match.
The end sections, GCCG:CGGC,
can be related with 2 indels and a mismatch,
or 4 indels or 4 mismatches.
On average we expect about 7 matches,
1 or 2 mismatches and a few indels.
Q3: What is the best alignment?
It is only possible to say that
it probably fits the scheme
T[A:AA]T[AA:A]CT[GCCG:CGGC] (Figure 2).
As noted, the 1state model was used only for simplicity.
The 3 and 5state models would penalise isolated indels more heavily.
Q4: What was the most recent common ancestor of A and B?
It is only possible to say that
it probably fits the pattern
TA[A]TA[A]CTalpha
where alpha involved 2 to 4 Cs and/or Gs.
T A T A A C T G C C G
*   .
T:  *   . .
A:  # * +  . .
A: .  * + +   . .
T: .  * *    . .
A: . .  # *    .
C: . .  + * +   .
T: .  + + * # + . .
C: . . . + # + # + .
G: . .  + # # # +
G: . . + # + #
C: . . + # *
1state machine
key: * # +  .
MML plus: 0..1..2..4..8..16 bits
1 avg algn = 64.9 bits
nulltheory = 56.0 bits
rtheory = 53.8 bits
= : <> : indel = 7 : 1 : 5.9
probability related = 0.8
Figure 3: Probability Density Plot.
When the rtheory algorithm is run on strings A and B
the results (Figure 3)
quantify the informal arguments made above.
The optimal alignments stand out
in the density plot
but there are several of them and many other nearoptimal
alignments.
Any 1 alignment has a message length
greater than the nulltheory and is not acceptable
but taking them all together the rtheory is acceptable.
3. Trees over Two Strings.
Given 2 related strings A and B,
we think of them as having evolved from a most recent common ancestor P.
P
. .
M_{A}. .M_{B}
. .
A B
Figure 4: Common Ancestor.
This involves 2 unknown steps or
mutation machines, M_{A} and M_{B} (Figure 4).
The models are reversible and
mutating P into A by M_{A} is equivalent to
mutating A into P by M_{A}'.
The composition of 2 sstate machines M_{A}' and M_{B}
is not in general equivalent to an sstate machine for s>1.
Fortunately, for typical overall indel rates
a single sstate machine is a good approximation to
the composition of 2 or more sstate machines.
It is thus acceptable to relate A and B directly by a single sstate machine.
For example, Figure 5 plots log_{2} probability of
indel length for the composition of 1, 2 and 512 machines,
for machines that can cause short (s)
or long (L) indels and that give asymptotically piecewise linear costs.
The calculations are based on machines that can copy or insert only.
The similarity of form of the curves is important,
not their absolute values.
This also suggests that when considering any 2 out of k strings,
and the path of machines (edges) between them in a tree,
in order to get partial information
on a multiplealignment, it is acceptable to combine 2 or
more adjacent sstate machines into one.
Note that there are 2 forests over A and B.
The rtheory tree (Figure 4) forms 1 forest.
The forest of 2 singleton trees corresponds to the nulltheory.
If the ancestor P and the mutation instruction sequences that M_{A} and
M_{B} executed were known,
it would be possible to form
an equivalent generation sequence (Figure 6)
for a single generation machine G.
However neither P nor the mutation sequences are known.
The idea to circumvent this problem
is to synchronise M_{A} and M_{B} as they read P
and to remove dependency on the unknown P by summing over
all possibilities.
M_{A} and M_{B} are forced to execute
copy, change and delete instructions in parallel;
each of these instructions reads 1 character from P.
Equivalently, G executes
a match, mismatch, ins_{A} or ins_{B} instruction as appropriate
except that if both M_{A} and M_{B} execute a delete, G does nothing.
Note that an insert reads no character from P.
If M_{A} is about to execute an insert instruction and
M_{B} is about to execute some other kind of instruction,
M_{B} is delayed and G executes an ins_{A}, and v.v.
If both M_{A} and M_{B} are about to execute 1 or more inserts,
they and G can be made to do one of several things;
doublecounting can be avoided if this is done carefully.
Instruction. Explanations and Rates.
match(x):
x x x<x>x x<y>x
pA'(c).pB'(c) pA'(ch).pB'(ch)/3
mismatch(x,y):
x y x<x>y x<y>y
pA'(c).pB'(ch) pA'(ch).pB'(c)
x<z>y
pA'(ch).pB'(ch)2/3
ins_{A}(x):
x _ x<x>_ x<y>_
pA'(c).pB'(d) pA'(ch).pB'(d)
x<_>_
pA'(i)
ins_{B}(x):
_ x _<x>x _<y>x
pA'(d).pB'(c) pA'(d).pB'(ch)
_<_>x
pB'(i)
_ _ _<x>_ "invisible",
pA'(d).pB'(d) not seen
ccopy, chchange, iinsert, ddelete
define pA'(instn) = pA(instn)/(1pA(i))
pB'(instn) = pB(instn)/(1pB(i))
rates per character of P
Note pA'(c)+pA'(ch)+pA'(d)
= pB'(c)+pB'(ch)+pB'(d) = 1
sum rates = (1+pA'(i)+pB'(i))
= (1pA(i).pB(i))/((1pA(i)).(1pB(i)))
Figure 6: Explanations for Generation Instructions.
The inserts can be interleaved in different ways,
eg. G can execute
an ins_{A} and then an ins_{B} or v.v.
Each of these options corresponds to a part of an alignment.
Some, such as an ins_{A} immediately followed by an ins_{B}
are of low, but nonzero, probability and are not usually seen in alignments.
For example, a single optimal alignment always
explains a pair of characters <A_{i},B_{j}>
as `mismatch(A_{i},B_{j})' and never
as `ins_{A}(A_{i});ins_{B}(B_{j})'
and thus gives biased estimates.
We expect P, A and B to have the same number of characters on average
if pA(i)=pA(d) and pB(i)=pB(d).
However we expect the alignment to be longer than P,
for a tuple is "gained" whenever M_{A} or M_{B} does an insert
but a character is only "lost" when both M_{A} and M_{B} delete it.
For this reason, rates of instructions per character of P,
ie. pA' and pB' in figure 6,
have a denominator of (1pA(i)) or (1pB(i)).
The expected number of (visible and invisible pairs)
in the alignment per character of P
is 1+pA'(i)+pB'(i).
The above points are generalised to k strings in the
next section.
Note that null is a noncharacter, not a proper character.
Similarly an edge with a null at each end
represents a nonoperation for a machine,
introduced only to synchronise M_{A} and M_{B}.
The parameters of G could be calculated from
those of M_{A} and M_{B}
but neither P, M_{A} nor M_{B} are known.
In this case parameter values for G can still be inferred from A and B
under the rtheory.
The values embody the weighted average over
all possible P and instruction sequences for M_{A} and M_{B}.
If a constant rate of mutation is assumed, M_{A}=M_{B}
and it is possible to infer the parameters of M_{A}.
For 3 or more strings it is possible to get a 3way fix
on all hypothetical strings and
estimate all machines without assuming a constant rate of mutation.
4. Trees and MultipleAlignment.
Of the questions, R1R4, over k>=2 strings,
we consider R2 to be the most important
with R3 a close second, to recap:
R2: Are the strings related by a particular tree?
R3: Are they related [under an evolutionarytree]
by a particular multiplealignment?
The topology of the tree is important
but of little use without the amount and form of mutation
on each edge.
Strictly, R3 is too detailed to judge a tree on
but an optimal alignment is often useful.
As for the 2string case,
an optimal multiplealignment is not unique in general
and even suboptimal alignments may be plausible.
An MML encoding of the strings based on an answer to R2 or R3 consists of
(i) the value of k,
(ii) the tree topology,
(iii) the parameter values for each edge and either
(iv) the strings, based on an all alignments (R2) or
(iv') the strings based on 1 optimal alignment (R3).
The tree that best explains the strings
gives the shortest message length.
The difference in message lengths of 2 trees
gives log_{2} of their posterior oddsratio.
The most colourless assumption is that all topologies are
equally likely so part (ii) is of constant length, given k,
and is only needed for comparison with the nulltheory.
A rooted tree can only
be inferred if a (near)
constant rate of mutation is assumed.
If this assumption is not made, an unrooted tree is inferred.
R2 and R3 can be broken down into subproblems:
S1: What is the best treetopology?
S2: What are the parameters for each edge?
S3: What is the message length of a tuple?
S4: What is the message length of an alignment?
S5: What is the message length of the strings given the tree,
considering all alignments taken together? (R2)
S5': What is the message length
of the strings given the tree and based on 1 optimal alignment? (R3)
The answers to these subproblems are not independent.
The topology and parameter values define the tuple probabilities
and thus the alignment probabilities
and the plausibility of each tree.
Two competing sets of answers to the subproblems can be judged
easily and objectively but
finding the best answers is hard.
On the other hand,
algorithms for the subproblems are largely independent
and can be studied as such.
A prototype program has been written for R3
based on one set of approaches to these problems.
Some other possibilities have also been investigated.
These are discussed below.
4.1. S1: TreeTopology.
There are U_{k} unrooted tree topologies over k strings:
U_{2} = U_{3} = 1
U_{k} = 1.3.5. ... .(2k5), k>=4
Each is considered equally probable apriori
and log_{2}(U_{k}) bits are required to state its identity:
ML(topology)=log_{2}(U_{k})
The prototype program requires the user to specify one or more trees
to be evaluated.
Standard distancematrix methods could be used to propose plausible topologies.
The ideal distance is proportional to time.
It is thought that the parameters derived from the 2string rtheory
give more accurate pairwise distances than a simple editdistance.
Felsenstein[12] describes how
local rearrangements can be used to find at least
a locally optimal tree.
There is no great difficulty in allowing nary trees,
except that there are more of them.
Note that a multiplealignment method that does
not explicitly consider a tree nevertheless embodies
an implicit model of relation of some kind,
possibly a fixed tree.
4.2. S2: Edge Parameters.
Each edge of the tree relates 2, possibly hypothetical, strings
and is represented by a machine.
Any realistic machine requires at least 1 parameter (~time)
to specify instruction probabilities.
The parameter(s) of each of the 2k3 machines must be included in the message.
The instructions are chosen from a multistate distribution
and Boulton and Wallace[6] give the calculations
for the precision with which to state their probabilities.
The prototype program is based on 1state machines,
each having 3 parameters 
P(copy), P(change), P(indel) 
and 2 degrees of freedom.
No correlation between P(change) and P(indel) is assumed.
MML encoding is invariant under smooth monotonic transformations
of parameters so the probabilities are stated rather than being
translated into a measure of time.
If a constant rate of mutation were assumed,
it would be sufficient to describe k1 machines.
This would permit an objective test of this much discussed hypothesis:
There is a fall of approximately 50% in the cost of stating parameters
but presumably a rise in the cost of stating the strings.
An intermediate theory, based on a near constant rate of mutation is
also possible, with different tradeoffs.
An iterative scheme, similar to that used for 2 strings,
is used to estimate the parameter values of the machines (edges).
This is described later.
Actual machine parameters:
copy change indel
m1: 0.9 0.05 0.05
m2: 0.9 0.08 0.02
m3: 0.7 0.2 0.1
m4: 0.8 0.1 0.1
m5: 0.75 0.1 0.15
A A
. .
m1. .m3
. A:0.43 A:0.40 .
.C:0.56 C:0.59.
++
. m5 .
. .
m2. .m4
. .
C C
Probable hypothetical characters.
Prior tuple probability = 0.00027
Estimated operations carried out:
copy change indel
m1: 0.43 0.57
m2: 0.57 0.43
m3: 0.40 0.60
m4: 0.59 0.41
m5: 0.94 0.06
(Probabilities less than 0.01 omitted.)
Figure 7: Example, explanations for tuple ACAC.
4.3. S3: Tuple Costs.
Here we consider the tree topology and parameters to be fixed.
The probability of an alignment depends on the probabilities
of the ktuples of characters forming it.
Each tuple can be thought of as an instruction of a kstring generation
machine that generates all of the k strings in parallel.
This machine is implied by the 2string machines on the edges of the tree.
The probability of a tuple can be found
by assigning its characters, possibly null,
to the leaves of the tree and considering all explanations of the tuple.
An explanation assigns values to the internal, hypothetical
characters of the tree.
This defines an operation on each edge
and a probability can be calculated for the explanation.
The probability of the tuple is the sum of the probabilities
of all explanations and
its message length is log_{2} of its probability.
We first consider the case where the mutationmodel on
2 strings is a 1state model.
Felsenstein[12] gave the necessary algorithm to calculate the probability
of a tuple of characters given the probability of a change
on each edge of the tree.
It takes O(k . alphabet^{2}) time per tuple.
It can be used directly by allowing the possibility of indels.
Care must be taken that only 1 insert is allowed per explanation
of a tuple, as an insert begins a new tuple.
The tuple's probability is computed in a postorder traversal of the tree.
The probabilities of hypothetical characters
and of machine operations, given the tuple, (Figure 7) are computed in a
subsequent preorder traversal.
The program computes the probabilities of
tuples as they are encountered
and stores them in a hashtable for reuse.
There is a considerable saving when n>>alphabet.
Sankoff and Cedergren[22] give an algorithm
that finds the single most probable explanation for a tuple;
this is too detailed a hypothesis in general
and is another case where a max over probabilities should be
replaced with a `+'.
It gives a good approximation when all mutation probabilities are small
and all edges are similar.
In that case the complexity of their algorithm is only O(k) per tuple.
string A:  A  A  A
string B: B   B B 
string C: C C C   
1 2 3 4 5 6
Figure 8: Indel Patterns, k=3.
The situation is complex
when the model relating 2 strings has more than 1 state.
The 3state model, giving linear indel costs, is of particular interest.
There are
2^{k}2 patterns of indels possible in a tuple (Figure 8).
The kstring generation machine therefore has 2^{k}1 states.
(2^{k}1 is also the number of neighbours that each cell depends on
in a kdimensional DPA matrix.)
Each tuple can therefore occur in 2^{k}1 contexts with
a different probability in each one.
This is a serious problem unless a simplification can be found.
There is some hope because
the probabilities of instructions in these contexts are not independent
as there are only O(k) free parameters in the tree.
4.4. S4: Alignment Message Length.
The multiplealignment part of the message
states the length of the alignment
and the tuples in the alignment.
The former is encoded under the log^{*} prior.
The message length of a tuple was given in the previous section.
It is therefore easy to compare 2 alignments
on the basis of message length;
the hard part is to find a good one.
4.5. S5': Optimal MultipleAlignment.
The dynamicprogramming algorithm [DPA]
can be naively extended to k strings to find an optimal multiplealignment.
The kdimensional DPA matrix has O(n^{k}) entries.
Each entry depends on 2^{k}1 neighbours.
If the mutationmodel has 1 state, the extension is straightforward
and the time complexity is O(n^{k}.2^{k}).
"Only" O(n^{k1}) space is needed to calculate the value of the alignment.
The n^{k} factor becomes painful, in both time and space, for
n>=200 and k as small as 3.
It is necessary to reduce the volume
of the DPA matrix that is searched
to make the algorithms practical
and there are a number of possible approaches.
Altschul and Lipman[4] devised a safe method to restrict the volume
investigated for an editdistance based on fixed
treecosts and also for other costs.
(The treecost of a tuple is the minimum number of mutations
required to give the tuple.)
Pairwise alignments are done first and used to delimit a volume
that is guaranteed to contain an optimal multiplealignment.
However, no results are available to indicate how practical
the method is for treecosts.
The method relies on the triangle inequality
and might be applicable to MML calculations.
A related heuristic is to use alignmentregions
as fuzzy strings.
Pairwise alignments are performed on sibling leaf strings.
A threshold (in editdistance or in message length)
is applied to limit the area of each alignment.
Fuzzy strings are aligned in the obvious way until a complete
multiplealignment is obtained.
The method is fast if the threshold is tight
but is not guaranteed to find an optimal alignment.
We have done some experiments with such an algorithm using
fixed treecosts.
With a threshold of +1, up to 8 strings of 200 characters
can be aligned on a DEC5000.
The model of computer
makes only a small difference to the practical limits
for this kind of complexity.
Hein[16] gave a multiplealignment algorithm which
uses directed acyclic graphs (DAGs) to represent
fuzzy ancestral strings.
The algorithm is fast but an optimal alignment is not guaranteed.
Fixed treecosts are used but this is by no means necessary.
The value of DAGs is that in sections where there is consensus
between the strings,
the effective number of neighbours of a cell in the DPA "matrix"
is much reduced.
Long exact matches between strings are fast to find
using hashtables or suffix trees.
Heuristics for multiplealignment
based on splitting the strings on a long exact match,
ideally near their centres,
are clearly possible.
The prototype program operates in 2 stages.
During an initial postorder traversal,
strings and then alignments are aligned recursively
until a complete multiplealignment is obtained:
procedure RA(var Result:Alignment; T:Tree);
var Left, Right :Alignment;
begin if leaf(T) then
StringToAlignment(T^.s, Result)
else RA(Left, T^.left);
RA(Right, T^.right);
Align(Left, Right, Result)
end if
end
The machine parameters are held constant,
at some reasonable but otherwise arbitrary values,
during the initial stage.
An improvement stage follows;
the complete alignment is projected onto subalignments
which are realigned
during another postorder traversal:
procedure Improve(var Algn :Alignment);
var A, B :Alignment; ss :StringSet;
procedure Im(T:Tree; var ss:StringSet);
var ss1, ss2 :StringSet;
begin if not leaf(T) then
Im(T^.left, ss1);
Im(T^.right, ss2); ss := ss1+ss2;
Project(Algn, ss1, A, B);
Align(A, B, Algn);
if ss <> AllStrings then
Project(Algn, ss2, A, B);
Align(A, B, Algn)
end if
else {leaf(T)} ss := [T^.s]
end if
end;
begin Im(T, ss)
end
The machine parameters are continually reestimated during this stage.
The overall process takes O(kn^{2}) time
and global optimality is not guaranteed.
The improvement stage can be iterated.
More research needs to be done on alignment strategies.
Note that if the mutationmodel on 2 strings has 3 states,
to model linear indel costs,
the kstring generation machine has 2^{k}1 states and
each entry in the DPA matrix has 2^{k}1 entries.
This also adds
a multiplicative factor of 2^{k}1 to the time complexity.
4.6. S5: Sum over all Alignments.
The part of the message that describes the strings proper
is ideally based on all alignments.
The number of bits required to state the strings is
log_{2} of the sum of the probabilities of all alignments.
This can be computed by a <+,logplus> dynamic programming algorithm.
It suffers from the same combinatorial explosion as
finding an optimal alignment.
Fortunately most of the probability is concentrated in a narrow region
which raises hopes for a fast approximate algorithm.
This has been used to implement a fast, approximate rtheory
algorithm for 2 strings[2].
Feng[3] has performed experiments on 3 strings which give good results.
First, probability density plots are calculated for
pairwise comparisons.
A "threshold" is applied to give a small alignment region.
The regions are projected into the DPA cube.
The rtheory is calculated over this volume.
A further approach that we intend to implement is to
to sum over all alignments in a region
within a distance `w' of a multiplealignment
found as described in the previous section.
Pairs of states must be considered at each transition
if the generation machine has `s' states,
So if linear indel costs are modelled,
a multiplicative factor of (2^{k}1)^{2}, or O(2^{2k}),
is included.
This is a severe problem unless a simplification
or an approximation can be found.
4.7. Estimating Edge Parameters.
The probability of a tuple can be calculated,
as described previously,
if all edge parameters are known.
An iterative scheme is used to estimate unknown parameters.
It is a generalisation of the scheme used for 2 strings:
Initial values are assumed and used to find an optimal alignment (say)
and to compute its message length and hence that of the tree hypothesis.
For each tuple, the frequencies of operations for each machine are calculated.
These frequencies are accumulated and used to calculate the instruction
probabilities for the next iteration.
The new values cannot increase the message length of the alignment but
they may cause another alignment to become optimal.
This must reduce the message length further so the process converges.
A test program has been written to investigate the estimation process alone
for the 1state model.
It assumes that a long and "perfect" alignment is known,
tuples occurring with their predicted frequencies.
It then estimates the true machine parameters
from arbitrary starting values.
Convergence is rapid.
The message length usually converges to within 1% of the optimum value
within 3 iterations.
All parameters are reestimated simultaneously.
Note that the perfect alignment would be suboptimal!
Its statistical properties could only derive from a weighted
average over all alignments.
This test program is also used to calculate limits for
an acceptable hypothesis based on a single alignment
for asymptotically long strings where
the overhead of stating parameter costs and string lengths
can be ignored.
The critical point is where the nulltheory and the alignment
have equal message lengths.
For example, when the machines in a tree are all identical
and P(change) = P(indel) then the following results are obtained:
k: 2 3 4 5
critical P(copy): 0.68 0.79 0.82 0.83
The critical values of P(copy) would be higher for short strings.
They would be lower if the probabilities of all alignments were summed;
recall that a single alignment is an answer to R3 not to R2.
Empirical results for 2 strings are given in [1].
The estimation scheme described above
has been built into the alignment program.
It will also be used in any rtheory algorithm
in which case weighted frequencies will be used for estimation.
The programs do not correct for the "invisible" tuple
which has a probability of approximately (k2).p(del)^{3}.
This results in an underestimate of indels by a similar amount
but this is considerably less than the accuracy with which
parameters can be estimated.
Input:
4  k
acgtacgtacagt  string 1
actgtacgtacgt  2
acgtactagct  3
accgtactgagct  4
((1 2)(3 4))  tree 1
((1 3)(2 4))  2
((1 4)(2 3))  3
Tree 1: M.L.=108 bits *
estimate of parameters:
mc1: 0.88 0.02 0.09
mc2: 0.88 0.02 0.09
mc3: 0.94 0.03 0.03
mc4: 0.81 0.02 0.17
mc5: 0.71 0.18 0.11
s1 s4
. .
.m1 .m4
. m5 .
..
. .
.m2 .m3
. .
s2 s3
s1[1..]: acgtacgtacagt
s2[1..]: actgtacgtacgt
s3[1..]: acgtactagct
s4[1..]: accgtactgagct
Tree 2: M.L.=113 bits
Tree 3: M.L.=122 bits
NullTheory= 124.5 bits
NB. all trees are unrooted.
CPU time = 7 secs/tree (Pyramid)
Figure 9: Example Run.
4.8. NullTheory.
The nulltheory is that the k strings are not related.
Its statement consists of
(i) the value of k,
(ii) the lengths of the strings and
(iii) the characters in the strings.
Part (iii) requires 2 bits per character for DNA if
all characters are equiprobable.
Part (ii) is potentially a large penalty over a tree hypothesis
because the latter states only a single length  of an instruction sequence.
A reasonable approach, which mitigates this penalty,
is to transmit the total length of the strings,
encoded under the log^{*} prior,
followed by
the set of lengths under a knomial prior.
The second part is short if the lengths are similar.
This generalises the binomial distribution used for 2 strings in [2].
The total message length of the nulltheory is therefore:
ML_{null} = log_{2}(P(k))
log_{2}(P(total))
log_{2}(P(length_{1},...,length_{k}total))
+total*2
where total=length_{1}+...+length_{k}
Any hypothesis that has a message length
longer than that of the nulltheory is not acceptable.
The previous section included predictions of critical values of P(copy) for
a single alignment,
asymptotically long strings and some special trees.
Extensive tests have not been done but
when the alignment program is run on artificially generated data
the results of the significance test
are consistent with the predictions.
In particular, random strings are always detected as being unrelated.
4.9. Results.
Figure 9 shows the results of the prototype program
on a small example.
All 3 trees are acceptable hypotheses but tree 1 is the most
plausible by a significant margin.
All trees are unrooted.
Note that little benefit is gained from the hashtable for such short strings.
s1: HUMPALA Human prealbumin
(transthyretin) mRNA, complete cds
s2: RATPALTA Rat prealbumin
(transthyretin) mRNA, complete
s3: MUSPALB Mouse mRNA for prealbumin
s4: sheep Transthyretin cDNA
s5: chicken TTR cDNA transthyretin
s1s4:
Tree ((1 2)(3 4)): M.L.=3693 bits
Tree ((1 3)(2 4)): M.L.=3661 bits
Tree ((1 4)(2 3)): M.L.=3569 bits *
NullTheory = 4781 bits
s1s5:
Tree ((1 4) (2(5 3))): M.L.=4846 bits
Tree ((1 4) ((2 5)3)): M.L.=4844 bits
Tree (((1 5)4) (2 3)): M.L.=4807 bits
Tree ((1(4 5)) (2 3)): M.L.=4758 bits
Tree (((1 4)5) (2 3)): M.L.=4750 bits*
*incl: 16.6 length, 67.1 params,
4655.9 alignment
NullTheory = 6134 bits
s5

s1 m5 s2
.  .
.m1  .m2
. m6  m7 .
s6s8s7
. .
.m4 .m3
. .
s4 s3
estimate of parameters:
copy change indel
mc1: 0.89 0.09 0.02
mc2: 0.84 0.04 0.12
mc3: 0.95 0.05 0.01
mc4: 0.88 0.08 0.05
mc5: 0.74 0.18 0.08
mc6: 0.96 0.04 0.00
mc7: 0.88 0.10 0.03
CPU time approx' 5 min/tree (sparc station).
Figure 10: Transthyretin DNA/RNA sequences.
Figure 10 gives results
on transthyretin DNA and RNA sequences[10].
The grouping ((human sheep) (mouse rat)) is favoured
on those 4 sequences.
Joining chicken to the edge between the 2 pairs is
narrowly the best option.
There is still debate over the branching pattern of human, chimpanzee and
gorilla in the evolutionary tree of the primates.
Figure 11 shows results on some globin pseudogene sequences.
They favour the grouping (gorilla (human chimp))
and are consistent with the conclusions of Holmes[18, Ch 13]
on these sequences.
First, the 3 trees over gorilla, human, chimpanzee and owlmonkey sequences
were evaluated.
Owlmonkey was a consistent outsider.
The orangutan sequence was then added and 3 trees were evaluated.
5. Conclusions.
We have shown how to use MML encoding to infer evolutionarytrees
and multiplealignments.
We have argued that these problems are intimately connected.
The posterior probability of a tree depends on the sum of the implied
probabilities of all multiplealignments.
In this we agree with the sentiment, but not the detailed methods,
of Hogeweg and Hesper[17] that "the alignment of sets
of sequences and the construction of phyletic trees
cannot be treated separately."
The use of a single multiplealignment leads to an underestimate
of the posterior probability of the tree hypothesis and a bias in the parameter
estimates.
Nevertheless it is a reasonable approach when the strings are similar and
a prototype program of this kind has been written for the 1state model.
The MML method can compute the posterior oddsratios
of competing hypotheses such as different trees.
The evolutionarytree hypothesis is explicitly built into the message.
The existence of a natural nulltheory leads to an inbuilt
significance test.
These remarks are qualified with `subject to
the validity of the underlying model'.
The MML method requires the model to be defined explicitly.
In principle, 2 competing models can be compared objectively;
this was done for 2 strings in [2].
1: GGBGLOP:Gorilla
betatype globin pseudogene
2: HSBGLOP:Human betatype globin
pseudogene standard; DNA; PRI; 2153 BP
3: PTBGLOP:Chimpanzee betatype globin
pseudogene standard; DNA; PRI; 2149 BP
4: ATRB1GLOP:Owlmonkey psi beta 1 globin
pseudogene. 2124 bp dsDNA
5: ORAHBBPSE:Orangutan bases 1593 to 3704
s1s4:
Tree ((1 2)(3 4)): M.L.=7241 bits
Tree ((1 3)(2 4)): M.L.=7236 bits
Tree ((1 4)(2 3)): M.L.=7197 bits*
NullTheory = 17193 bits
Elapsed time ~50 min's (Sparc station)
s1s5:
Tree (((1 2) 3)(4 5)): M.L.=8023 bits
Tree (((1 3) 2)(4 5)): M.L.=8031 bits
Tree ((1 (2 3))(4 5)): M.L.=7980 bits*
*incl: 18.7 length, 98.8 params,
7852 alignment
NullTheory = 21426 bits
Elapsed time ~50 min's (Sparc station)
estimate of parameters:
copy change indel
mc1: 0.989 0.011 0.000
mc2: 0.994 0.004 0.003
mc3: 0.988 0.012 0.001
mc4: 0.868 0.087 0.045
mc5: 0.969 0.016 0.015
mc6: 0.996 0.001 0.003
mc7: 0.985 0.011 0.005
s1:GGB

m1
m2 m6  m7 m4
s2:HSBs6s7s8s4:ATR
 
m3 m5
 
s3:PTB 
s5:ORA
Figure 11: Primate Globin Pseudogenes.
The biggest doubts about the model used are as follows.
We have not, as yet, modelled different transition and transversion rates
but this would be straightforward.
It seems to be difficult to bring a multistate
model of mutation properly into multiplealignment.
Large mutation events, such as block transpositions, duplications
and reversals, could be modelled easily but finding
an optimal "alignment" would become even harder.
Finally, what is completely lacking
from our method and from most if not all
sequence analysis programs
is any model of the pressure of selection.
The program discussed above is written in Pascal and is available,
free for nonprofit, nonclassified teaching and research purposes,
by email (only) from:
lloyd at bruce.cs.monash.edu.au.
The alignment programs for 2 strings,
based on 1, 3 and 5state models of relation
and discussed in [2] are written in C and are also available.
A good workstation with floating point hardware is the minimum requirement
for their use.
Acknowledgment:
This work was partially supported by Australian Research Council grant
A49030439.
6. References.
 [1]
Allison L. & C. N. Yee.
Minimum message length encoding and the comparison of macromolecules.
Bull. Math. Biol. 52(3) 431453 1990.
 [2]
Allison L., C. S. Wallace & C. N. Yee.
Inductive inference over macromolecules.
AAAI Spring Symposium on A.I. & Mol. Biol. Stanford
March 1990 (extended abstract),
and TR 90/148 Dept. Comp. Sci. Monash University 1990
[HTML] (in full).
 [3]
Allison L. & Du Xiaofeng.
Relating three strings by minimum message length encoding (abstract).
Genes, Proteins and Computers, Chester, April 1990.
 [4]
Altschul S. F. & D. J. Lipman.
Trees, stars and multiple biological sequence alignment.
SIAM J. Appl. Math. 49(1) 197209 1989.
 [5]
Bishop M. J. & E. A. Thompson.
Maximum likelihood alignment of DNA sequences.
J. Mol. Biol. 90 159165 1986.
 [6]
Boulton D. M. & C. S. Wallace.
The information content of a multistate distribution.
J. Theor. Biol. 23 269278 1969.
 [7]
Boulton D. M. & C. S. Wallace.
A program for numerical classification.
Comp. J. 13 6369 1970.
 [8]
Boulton D. M. & C. S. Wallace.
An information measure for hierarchic classification.
Comp. J. 16 254261 1973.
 [9]
Cheeseman P. & B. Kanefsky.
Evolutionary tree reconstruction (extended abstract).
AAAI Spring Symposium on
the Theory and Application of MinimalLength Encoding Stanford
3539 March 1990.
 [10]
Duan W., M. G. Achen, S. J. Richardson, M. C. Lawrence,
R. E. H. Wettenhall, A. Jaworowski & G. Schreiber.
Isolation, characterization, cDNA cloning and gene expression of an avian
transthyretin: implications for the evolution of the structure and
function of transthyretin in vertebrates.
Euro. J. Biochem. to appear.
 [11]
Felsenstein J.
Evolutionary trees from DNA sequences: a maximum likelihood approach.
J. Mol. Evol. 17 368376 1981.
 [12]
Felsenstein J.
Inferring evolutionary trees from DNA sequences.
Statistical Analysis of DNA Sequence Data.
B. S. Weir (ed) Marcel Dekker 133150 1983.
 [13]
Friday A.
Quantitative aspects of the estimation of evolutionary trees.
Folia Primatol. 53 221234 1989.
 [14]
Gotoh O.
An improved algorithm for matching biological sequences.
J. Mol. Biol. 162 705708 1982.
 [15]
Gotoh O.
Optimal sequence alignment allowing for long gaps.
Bull. Math. Biol. 52(3) 359373 1990.
 [16]
Hein J.
A new method that simultaneously reconstructs ancestral sequences for any
number of homologous sequences when the phylogeny is given.
Mol. Biol. Evol. 6 649668 1989.
 [17]
Hogeweg P. & B. Hesper.
The alignment of sets of sequences and the construction
of phyletic trees: an integrated method.
J. Mol. Evol. 20 175186 1984.
 [18]
Holmes E. C.
Pattern and Process in the Molecular Evolution of the Order Primates.
PhD Thesis, Cambridge University, 1989.
 [19]
Miller W. & E. W. Myers.
Sequence comparison with concave weighting functions.
Bull. Math. Biol. 50(2) 97120 1988.
 [20]
Milosavljevic A. D.
Categorization of macromolecular sequences by minimum length encoding.
University of California at Santa Cruz, UCSC CRL 9041 1990.
 [21]
Reichert T.A., D. N. Cohen & K. C. Wong.
An application of information theory to genetic mutations and the
matching of polypeptide sequences.
J. Theor. Biol. 42 245261 1973.
 [22]
Sankoff D. & R. J. Cedergren.
Simultaneous comparison of three or more sequences related by a tree.
Time Warps, String Edits and Macromolecules:
The Theory and Practice of Sequence Comparison.
D. Sankoff & J. B. Kruskal (eds) Addison Wesley 253263 1983.
 [23]
Sellers P. H.
On the theory and computation of evolutionary distances.
SIAM J. Appl. Math. 26(4) 787793 1974.
 [24]
Wallace C. S. & D. M. Boulton.
An information measure for classification.
Comp. J. 11 185195 1968.
[HTML]
 [25]
Wallace C. S. & P. R. Freeman.
Estimation and inference by compact coding.
J. Royal Stat. Soc. B 49(3) 240265 1987.
[paper]
Also see
[Bioinformatics]
and
[MML].
