
L. Allison & C. S. Wallace,
Department of Computer Science, Monash University, Australia 3168.
This work is partly supported by an Australian Research Council grant.
June 1993, revised March 1994, accepted May 1994.
Abstract:
It is shown how to sample alignments from their posterior probability
distribution given two strings.
This is extended to sampling alignments of more than two strings.
The result is firstly applied to the estimation of the edges
of a given evolutionary tree over several strings.
Secondly, when used in conjunction with simulated annealing,
it gives a stochastic search method for an optimal multiple alignment.
Key words and phrases:
alignment, estimation, evolutionary tree, Gibbs sampling, inductive inference,
Monte-Carlo method, multiple alignment, sampling, stochastic.
Introduction
Estimating the evolutionary "distances" between some given strings,
for example DNA sequences,
and finding an alignment for them,
are examples of inductive inference problems.
We rarely know the "right" answer to a question of evolution and must
be satisfied with a "good" hypothesis.
In previous papers
(Allison et al 1992a, Yee & Allison 1993),
minimum message length encoding was used to infer the relation, if any,
between two strings and to compare models of evolution or relation.
The algorithms given there cannot be extended directly to K>2 strings
because the algorithmic complexity increases rapidly.
A stochastic process is one possible way around this difficulty.
Such a process is described for two problems given a set of K>=2 strings
and given an evolutionary tree over them:
firstly estimating the evolutionary "distances"
on the edges or arcs of the tree
and secondly finding a good alignment of all of the strings.
The method relies on sampling from the posterior probability distribution of
alignments.
In an inductive inference problem
we are given some data D and wish to form a "good" hypothesis H
about the data.
The language of hypotheses should be tailored to the problem we wish to solve.
The following standard definitions are useful:
Definitions:
P(H&D) = P(H).P(D|H) = P(D).P(H|D)   joint probability
P(H|D) = posterior probability of H given D
P(H) = prior probability of H
P(D|H) = likelihood of H
ML(E) = -log_{2}(P(E))   message length of event E
ML(H&D) = ML(H)+ML(D|H) = ML(D)+ML(H|D)   joint message length
ML(H|D)-ML(H'|D) = ML(H)-ML(H')+ML(D|H)-ML(D|H')
= -log_{2} posterior odds ratio of H & H'
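The odds-ratio identity can be checked numerically; a toy sketch with made-up probabilities (not values from the paper):

```python
import math

def ml(p):
    """Message length in bits of an event with probability p: -log2(p)."""
    return -math.log2(p)

# Hypothetical priors and likelihoods, for illustration only.
ml_H, ml_D_given_H = ml(0.25), ml(0.125)      # ML(H) = 2,  ML(D|H) = 3
ml_H2, ml_D_given_H2 = ml(0.5), ml(0.03125)   # ML(H') = 1, ML(D|H') = 5

# ML(H|D) - ML(H'|D) = ML(H) - ML(H') + ML(D|H) - ML(D|H')
diff = (ml_H - ml_H2) + (ml_D_given_H - ml_D_given_H2)
print(diff)  # -1.0: H is 2 times more probable a posteriori than H'
```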
In the present context, the data consist of some given strings.
Alignments and estimates of evolutionary distances are
different kinds of hypotheses about those strings.
P(D|H) is called the likelihood of the hypothesis H;
it is a function of the data given the hypothesis
and it is not the probability of the hypothesis.
P(H) is the prior probability of the hypothesis H.
P(H|D) is the posterior probability of H.
The message length (ML) of an event E is the minimal length, in bits,
of a message to transmit E using an optimal code.
A long message indicates a low probability
and a short message indicates a high probability.
It is convenient to work with message lengths,
rather than probabilities, for a number of reasons:
Typical probability values can be very small
and the message lengths are of a more convenient magnitude
for handling by computer and by person.
The message paradigm reinforces the view that there should be no hidden
parameter values associated with the hypothesis itself.
All such values are costed explicitly in ML(H).
All real-valued parameters must be stated to some optimal
but finite accuracy, as described by Wallace and Boulton (1968).
In sequence comparison there is a natural null-theory
whose message length is that required to state the
given strings individually;
this takes approximately two bits per character in the case of DNA.
It provides a method of hypothesis testing.
The null-theory assumes that there is no pattern or structure in the data.
It includes the assumption that an individual string is random.
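As a minimal sketch, assuming a uniform model over the four DNA characters and ignoring the small cost of stating string lengths:

```python
import math

def null_theory_ml(strings, alphabet_size=4):
    """Bits to state each string independently under a uniform model:
    log2(alphabet_size) bits per character (2 bits for DNA).
    The small cost of stating the lengths themselves is ignored here."""
    return sum(len(s) * math.log2(alphabet_size) for s in strings)

print(null_theory_ml(["ACGT", "AACCGGTT"]))  # 24.0 bits for 12 characters
```

Any hypothesis whose total message length beats this baseline has found some structure in the strings.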
Wallace and Freeman (1987) gave the statistical foundations
of minimum message length encoding.
We often wish to find the "best" hypothesis,
from a specified class of hypotheses.
(Depending on the application, this might be the best evolutionary tree,
a good estimate of the evolutionary "distance" between strings,
or the best alignment.)
It is generally possible to calculate,
or at least to give a good approximation to,
ML(H) and ML(D|H) under reasonable assumptions.
ML(H) can even be ignored if it is a constant for all
members of a class of hypotheses.
It is not often possible to calculate ML(D),
which is unfortunate for it would yield ML(H|D).
However, by subtracting ML(H'&D) from ML(H&D)
it is possible to get a posterior log_{2} odds-ratio for two
competing hypotheses H and H';
the shorter message indicates the more likely hypothesis.
If one hypothesis is the null-theory this also gives the hypothesis test.
An evolutionary tree which is a good hypothesis
makes good predictions about the sorts of tuples of characters
that occur in good alignments.
(A tuple consists of the characters that appear in a column of
a conventional alignment.)
For example, given four related strings s1 to s4
and assuming that all edges are similar,
the tree ((s1 s2)(s3 s4)) predicts
that tuples of the form xxyy appear more often than xyxy,
whereas ((s1 s3)(s2 s4)) favours xyxy over xxyy.
Similarly, a good alignment contains many highly probable tuples
which leads to a short message length for the data.
In the extreme case of the strings being identical
they can all be transmitted for little more than the cost of transmitting one.
An optimal alignment of a set of strings can be used to infer
an estimate of evolutionary distances and is sometimes used primarily
for that purpose.
However a single alignment is a much more detailed sort of hypothesis than
an estimate of distances because it also
states which characters of the descendant strings are related.
In statistical terms, the optimal alignment problem has many
nuisance parameters, if the question is one of evolutionary distances.
The message for the data given the estimate of distances should not be based
on one alignment.
It is not that distance estimates are better than
alignments or vice versa; they are answers to different questions.
Alignments are useful in their own right, for some purposes.
Yee and Allison (1993) showed
that in order to obtain an unbiased estimate of
the evolutionary "distance" between two strings
it is necessary to use a weighted average of estimates from all
alignments whereas the use of a single optimal alignment gives a biased
estimate.
The average can be computed in a few steps,
each step taking time proportional to the product of the string lengths.
This is feasible for two strings but not more unless the strings are short.
In this paper,
a stochastic process is used to average over not all but many
alignments of K strings so as to get the estimates of the distances
on the edges of a tree over the strings in an acceptable time.
This is an example of Gibbs sampling or a Monte-Carlo method (Hastings 1970).
Random alignments are generated from the posterior probability distribution
of alignments.
When used in conjunction with simulated annealing
this also gives a stochastic search process for a good alignment.
We use treecosts because these correspond to explicit
evolutionary hypotheses;
the edges of a tree are modelled individually.
We note that Lawrence et al (1993) describe a Gibbs sampling strategy
for finding ungapped signals in a set of protein sequences.
That work relates each protein to a central model which implicitly represents
the constraints on the typical member of the set.
It uses a form of star-costs which is probably more suitable
for proteins, particularly if they are only distantly related.
There is important prior
work in the treatment of alignment
and evolutionary trees as inductive inference problems.
Bishop and Thompson (1986) first cast pairwise alignment
as a maximum-likelihood problem, summing the probabilities of all alignments.
Thorne et al (1991, 1992) extended the maximum likelihood method to
more general models of evolution, including conserved and variable regions,
and related probabilities of mutation to time.
Allison et al (1992a) included the model or hypothesis cost in
pairwise alignment and compared evolutionary models of different complexities
on an equal footing.
Felsenstein (1981, 1983) treated the inference of an evolutionary tree
from a given multiple alignment as a maximum likelihood problem.
All these are part of a large and important trend to make models of evolution
explicit for better scrutiny
and to place subsequent inferences on a sound statistical footing.
Models of Evolution
We model the evolution of a parent string A
into a child string B by a finite-state
mutation-machine.
Such a machine can copy a character,
change a character, insert a character or delete
a character.
Inserts and deletes are collectively called indels.
The machine reads A and a sequence of mutation instructions and produces B.
If we wish to consider A and B to be of equal standing,
we model their relation under a finite-state generation-machine.
Such a machine can generate the same character in A and in B (match),
generate different characters in A and in B (mismatch),
generate a character in A only (ins_{A}) or
generate a character in B only (ins_{B}).
The machine reads a sequence of generation instructions and produces A and B.
Traditionally, a (pairwise) alignment is used to show a hypothetical
relationship between two strings.
Each string is padded out with zero or more null characters `_'
until they have the same lengths.
The padded strings are then written out one above the other.
The two characters in a column are called a pair,
a two-tuple, or just a tuple.
The all-null tuple is not allowed.
It is convenient to write a pair as <x,y> or just `xy' in text.
There is an obvious correspondence between alignments of two strings,
sequences of generation instructions and
sequences of mutation instructions, as illustrated in
figure 1,
and they will be used interchangeably as convenient.
Given a finite-state generation-machine, say,
the probability of a particular sequence of instructions
of a given length is the product of the instructions' individual probabilities
and the message length of the sequence
is the sum of their individual message lengths.
This depends on P(match), P(mismatch), P(ins_{A}) and P(ins_{B})
which are called the parameters of the machine.
The parameters might be given a priori,
but in general they must be estimated from the strings.
They correspond to the evolutionary "distance" of the strings.
Note that an estimate can be got by counting instructions
in one, or more, instruction sequences.
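Both calculations can be sketched as follows; the parameter values and the instruction names are illustrative, not from the paper:

```python
import math
from collections import Counter

def sequence_ml(instructions, params):
    """Message length of an instruction sequence: the sum of the
    instructions' individual message lengths, -log2 of each probability."""
    return sum(-math.log2(params[op]) for op in instructions)

def estimate_params(instructions):
    """Estimate the machine parameters by counting instructions."""
    counts = Counter(instructions)
    return {op: n / len(instructions) for op, n in counts.items()}

seq = ["match", "match", "mismatch", "match", "ins_A"]
params = {"match": 0.6, "mismatch": 0.2, "ins_A": 0.1, "ins_B": 0.1}
print(round(sequence_ml(seq, params), 3))
print(estimate_params(seq))  # {'match': 0.6, 'mismatch': 0.2, 'ins_A': 0.2}
```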
The parameters consist of several values;
a machine has more than one degree of freedom.
If it is necessary to reduce them to a single number
representing, say, the number of years since divergence,
then a rates-model in the style of Thorne et al (1991, 1992)
could be fitted to them in some way.
Our use of machines differs from the use of hidden Markov models
by Haussler et al (1993) in that a machine models an evolutionary process
rather than a family of sequences or the typical member of a family.
We believe that the former use best captures evolutionary relations
and the latter best captures functional constraints.
The advantage of using finitestate machines
is that there are standard techniques for manipulating them.
The composition of two or more such machines is itself another such machine.
Figure 2
illustrates the evolution of string A from string P as modelled by
a mutation machine mA and of B from P as modelled by a mutation machine mB.
If we know the parameters of mA and mB then
we can calculate the parameters of an equivalent
generation machine m=mA.mB which relates A and B directly.
There are a number of possible explanations for each generation instruction
of m in terms of instructions for mA and mB.
We expect mA and mB to execute more instructions than
there are characters in P because each copy, change and delete acts on one
character of P but an insert does not.
The dashed quantities, pA'(c) etc., are the rates of instructions
per character of P.
Summing the rates of all explanations for all of m's instructions,
we get a rate of (1+pA'(i)+pB'(i)) instructions per character of P for m,
the excess over 1 being due to the insertion operations of mA and mB.
A further normalisation step gives the parameters of m.
Similar calculations are used when several machines are combined
in the alignment of more than two strings in section 4.
Note that we do not allow <x,_> followed by <_,y> to
be an explanation of <x,y>, for example,
because the x and the y are not related in that explanation.
The dynamic programming algorithm (DPA) can be used to compare
two strings under a variety of costs calculating, for example,
the edit-distance as in figure 3a.
We take an optimal alignment to correspond to a most probable
instruction sequence.
An instruction sequence having the highest probability,
and the minimum message length,
can be found by a DPA using costs based on the message lengths
of instructions, as in figure 3b.
(Strictly, a term should be included for the number of instructions
in the sequence but all plausible sequences have similar numbers of
instructions and the log_{2}'s of those numbers are very similar.)
The alignment can be recovered by a traceback through the D[,] matrix
or by Hirschberg's (1975) technique.
If on the other hand the object is to infer the machine parameters,
rather than an optimal alignment,
something slightly different should be done.
There are many optimal, nearoptimal and not so optimal
instruction sequences or alignments in general.
An estimate, E, of the machine parameters is itself a hypothesis.
The best estimate minimises ML(E)+ML(A&BE),
the parameters being stated to some optimal and finite accuracy.
ML(A&B|E) is not the message length of any one instruction sequence.
It is minus the log_{2} of the probability that the machine generated strings A and B,
ie. minus the log_{2} of the sum of the probabilities over all possible
instruction sequences or alignments that yield A and B.
The hypothesis is that A and B are related by the inferred machine
in some unspecified way and it is called the r-theory,
the `r' standing for related.
ML(A&B|E) can be calculated by a modified DPA which
uses `logplus' instead of `min' in its general step, as in
figure 3c,
effectively adding the probabilities of alignments.
For example, logplus(m,m) = m-1.
We now have D[i,j] = ML(A[1..i]&B[1..j]|E).
Note that two different alignments represent exclusive explanations
and their probabilities can therefore be added.
Full details are given in Allison et al (1992a).
A code for efficiently transmitting A and B given E, using ML(A&B|E) bits
and not based on a single alignment, can be devised but it is sufficient
just to know what the message length would be in such a code.
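The general step can be sketched as follows; the per-tuple message lengths here are illustrative stand-ins (1 bit match, 3 bits mismatch, 4 bits indel), not the machine-derived costs of figure 3c:

```python
import math

INF = float("inf")

def logplus(a, b):
    """Combine the message lengths of two exclusive events:
    -log2(2**-a + 2**-b).  Note logplus(m, m) == m - 1."""
    if a == INF:
        return b
    if b == INF:
        return a
    lo, hi = min(a, b), max(a, b)
    return lo - math.log2(1.0 + 2.0 ** (lo - hi))

def rtheory_ml(A, B, ml_match=1.0, ml_mismatch=3.0, ml_indel=4.0):
    """D[i][j] = ML(A[1..i] & B[1..j] | E): the DPA with logplus in
    place of min in the general step, so that the probabilities of
    all alignments are effectively added."""
    m, n = len(A), len(B)
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0:                        # ins_A: character of A only
                D[i][j] = logplus(D[i][j], D[i - 1][j] + ml_indel)
            if j > 0:                        # ins_B: character of B only
                D[i][j] = logplus(D[i][j], D[i][j - 1] + ml_indel)
            if i > 0 and j > 0:              # match or mismatch
                pair = ml_match if A[i - 1] == B[j - 1] else ml_mismatch
                D[i][j] = logplus(D[i][j], D[i - 1][j - 1] + pair)
    return D[m][n]

print(round(rtheory_ml("ACGT", "AGT"), 3))
```

Because summing over alignments can only add probability, `rtheory_ml("ACGT", "ACGT")` is slightly less than the 4.0 bits of the single all-match alignment.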
The machine parameters must be estimated if they are not known a priori.
An iterative approach is used; it may give only a local optimum.
Assume we require an optimal alignment.
Initial parameter estimates are "guessed".
An alignment is found that is optimal for the current estimates
using the DPA of figure 3b.
New estimates are derived from the alignment.
The process is iterated until it converges, which it must do,
usually within a few steps.
If on the other hand the objective is to estimate the machine parameters,
rather than to find an optimal alignment, a similar approach is used but
with the DPA of figure 3c and
this is further modified to accumulate weighted averages of instruction
frequencies when alignments are combined in the general step.
These values are used in the next iteration.
In what follows we consider only the simplest
model of evolution, the 1-state model,
where the probabilities of instructions are independent of context.
In particular, linear indel or gap costs require at least a three-state model.
It is also assumed that P(insert) = P(delete) and that all characters
are equally probable, all inserts are equally probable
and all changes are equally probable.
In that case, we have for DNA and the generation machine:
ML(match(x)) = -log_{2}(P(match))+log_{2}(4) = -log_{2}(P(match))+2
ML(mismatch(x,y)) = -log_{2}(P(mismatch))+log_{2}(12)
ML(ins_{A}(x)) = ML(ins_{B}(x)) = -log_{2}(P(ins_{A}))+2 = -log_{2}(P(ins_{B}))+2
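Evaluating these with illustrative parameter values (in practice they are estimated from the strings):

```python
import math

# Illustrative parameters satisfying P(match)+P(mismatch)+2*P(ins) = 1.
p_match, p_mismatch, p_ins = 0.85, 0.05, 0.05

ml_match = -math.log2(p_match) + 2                   # + log2(4): which character
ml_mismatch = -math.log2(p_mismatch) + math.log2(12) # 12 ordered pairs with x != y
ml_ins = -math.log2(p_ins) + 2                       # ML(ins_A) = ML(ins_B)

print(round(ml_match, 3), round(ml_mismatch, 3), round(ml_ins, 3))
```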
These assumptions are made in the interests of simplicity
as we are primarily interested in the underlying algorithms and methods.
The model is simple but is not unacceptable for DNA.
A more general model of changes would have to be incorporated if
the methods were to be used on protein sequences.
It is in any case the indels that are the main source of
algorithmic complexity in string comparison.
The Posterior Distribution of Alignments
The stochastic processes to be described are based on
a method of sampling alignments (or machine instruction sequences),
given two strings A and B,
with the correct frequency as determined by the posterior
probability distribution of alignments.
A sampling procedure that achieves this aim is sketched below
with more details being given in the appendix.
It is extended to multiple alignments of more than two strings
in later sections.
It strongly resembles Hirschberg's (1975) linear-space version of the DPA
in its use of a recursive divide-and-conquer technique.
Assume that A is of length m and B is of length n.
Informally, the procedure divides A at i=m div 2 into
A1=A[1..i] and A2=A[i+1..m].
It then chooses a point, j,
and divides B into B1=B[1..j] and B2=B[j+1..n];
j might or might not equal n div 2.
The procedure is then called recursively on A[1..i] and B[1..j]
and on A[i+1..m] and B[j+1..n] until the base case is reached.
The base case is that A contains just one character.
It is straightforward to enumerate all the allowable ways of
generating A and B in the base case and to sample from them according
to probability.
The choice of j in the general case is made according to the probability
distribution of values taken over all alignments or
instruction sequences.
The DPA of figure 3c
calculates the probability of A and B being
generated  by summing over all possible alignments of A and B.
(It actually works with the log_{2} of probabilities.)
It can also be used to calculate the probabilities of A[1..i] and
each prefix of B being generated;
these values are contained in one row of D[,].
Running the DPA "in reverse" gives the probabilities of A[i+1..m] and
each suffix of B being generated.
These results are combined and normalised to give the probability
distribution of j given i, and j is sampled accordingly.
This is closely related to the notion of alignment density plots
in Allison et al (1992a, fig 6, fig 11).
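The combine-normalise-sample step can be sketched as follows; the row values below are hypothetical, standing in for the forward and reverse DPA results:

```python
import random

def sample_split(fwd_row, bwd_row, rng):
    """fwd_row[j] = ML(A[1..i] & B[1..j]); bwd_row[j] = ML(A[i+1..m] & B[j+1..n]).
    Sample j in proportion to 2**-(fwd_row[j] + bwd_row[j])."""
    totals = [f + b for f, b in zip(fwd_row, bwd_row)]
    best = min(totals)
    weights = [2.0 ** (best - t) for t in totals]  # rescale to avoid underflow
    r = rng.random() * sum(weights)
    for j, w in enumerate(weights):
        r -= w
        if r <= 0.0:
            return j
    return len(weights) - 1

# Hypothetical rows: j = 1 dominates by about 58 bits.
rng = random.Random(1)
fwd = [30.0, 1.0, 30.0]
bwd = [30.0, 1.0, 30.0]
print([sample_split(fwd, bwd, rng) for _ in range(5)])  # almost surely all 1
```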
The complete sampling procedure is just twice as slow as the DPA
that it uses, ie. it takes O(|A|×|B|) time.
This is because it solves one full-size (|A|×|B|) DPA problem,
two problems whose total size (|A|×|B|/2) is half that of
the original problem, four problems whose total size is one quarter
that of the original problem, and so on.
In Hirschberg's algorithm the divide-and-conquer technique was used
to reduce the use of space to O(|B|), as it still does here,
because only two rows of a matrix of length |B| are required to calculate
the values needed.
K >= 2 Strings
A (multiple) alignment of K strings is formed by
padding out each string with zero or more null characters
so that they all have the same lengths.
The padded strings are then written out one above another.
Each column of characters is called a K-tuple or just a tuple.
The all-null tuple is not allowed.
There are two parts to the optimal alignment of several strings.
The first is the search algorithm for finding a good alignment.
The second is the cost function to be applied to alignments.
As before, we take an optimal alignment to be a most probable alignment
of the strings.
The naive extension of the DPA to K strings,
each of approximately n characters,
would require O(2^{K}n^{K}) running time which is infeasible
for even modest values of K and n.
However the DPA can be extended so as to align two alignments.
A string can be considered to be a trivial case of an alignment
and its characters to be one-tuples.
Given an alignment AS over a set of L strings S
and an alignment AT over a set of M different strings T,
AS and AT can be aligned to give an alignment of K=L+M
strings S ∪ T.
The algorithm aligns AS, a string of L-tuples, with AT, a string of M-tuples.
All that is necessary is that a cost be given to each K-tuple
within the DPA; this is described below.
The final alignment may not be optimal for the L+M strings
but this algorithm can be used as an iterative step to improve
a multiple alignment to at least a local optimum.
This kind of deterministic heuristic is quite common
and an example has been described by Allison et al (1992b):
Given K>2 strings and an evolutionary tree over them,
an initial K-way alignment is found by some suboptimal process.
The tree is then "broken" on some edge
which partitions the strings into two disjoint sets,
S of size L and T of size M=K-L.
The K-way alignment is projected onto these two sets of strings
to give two subalignments, AS over S and AT over T, which
are realigned with the DPA to give a new overall K-way alignment.
The process is iterated and terminates when there is no further improvement
in the full K-way alignment.
The results are usually good although
the heuristic is not guaranteed to give an optimal K-way alignment.
It may get stuck in local optima and results may depend
on the initial alignment and the order in which edges are chosen
during improvement.
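The outer loop of this heuristic can be sketched generically; `project`, `realign` and `cost` are hypothetical interfaces standing for the operations described above, exercised here by a toy numeric stand-in:

```python
def improve_alignment(alignment, edges, project, realign, cost):
    """Break the tree on each edge, project the K-way alignment into two
    sub-alignments, realign them with the DPA, and keep the result if its
    message length improves; stop at a full pass with no improvement."""
    improved = True
    while improved:
        improved = False
        for edge in edges:
            sub_a, sub_b = project(alignment, edge)
            candidate = realign(sub_a, sub_b)
            if cost(candidate) < cost(alignment):
                alignment = candidate
                improved = True
    return alignment

# Toy stand-in: "alignments" are numbers and each "realignment" lowers the cost.
result = improve_alignment(
    5,
    edges=[1, 2],
    project=lambda a, e: (a, e),
    realign=lambda a, e: max(a - e, 0),
    cost=lambda a: a,
)
print(result)  # converges to the toy optimum, 0
```

The same skeleton can get stuck in a local optimum, exactly as the text notes: it only ever accepts strict improvements.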
The alignment sampling process of the following section
provides a way around these and other difficulties.
The other part of the optimal alignment problem is the assignment
of a cost to an alignment of K strings.
We use tree costs and model each edge of the evolutionary tree
by a separate mutation-machine.
The machines can be combined so as to calculate the probability,
and hence the message length, of each K-tuple examined by the DPA.
The DPA can then find a good K-way alignment
and calculate its total message length by summing K-tuple message lengths.
A particular K-tuple of descendant characters can have
several evolutionary explanations.
If we knew the hypothetical ancestral characters at the internal nodes of
the tree it would be a simple matter to combine
the probabilities of the implied machine instructions on the edges.
Since we do not, it is necessary to sum
over all possible assignments to internal nodes in the style of
figure 2.
Fortunately the combinatorial possibilities can be dealt with efficiently;
Felsenstein (1981, 1983) describes the necessary algorithm.
It involves calculating probabilities for each possible
character value at each internal node of the tree;
an example is given in figure 4.
The probabilities of the various instructions for a given tuple can
also be calculated and can be used to estimate instruction
probabilities for each machine from a given K-way alignment.
All this can be done by traversing the tree, in O(K) time, for each tuple
but it is a significant computation and should not be repeated unnecessarily.
Therefore results are stored in a lookup-table for fast access
if ever the tuple is met again, as it probably will be.
The lookup-table speeds up the algorithm considerably.
Two alignments of 4 strings of length 500 can be aligned,
to give an 8-way alignment, in 15 seconds on a SPARCstation.
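The lookup-table amounts to a memo cache keyed by tuple; a sketch with a hypothetical stand-in for the O(K) tree traversal:

```python
from functools import lru_cache

calls = 0  # counts how often the expensive traversal actually runs

@lru_cache(maxsize=None)
def tuple_ml(ktuple):
    """Stand-in for the O(K) Felsenstein-style traversal; here a toy
    score of 1 bit per distinct non-null character in the tuple."""
    global calls
    calls += 1
    return float(len({c for c in ktuple if c != "_"}))

# The repeated tuple ('A', 'A', '_', 'A') is computed only once.
alignment = [("A", "A", "_", "A"), ("C", "G", "C", "C"), ("A", "A", "_", "A")]
total = sum(tuple_ml(t) for t in alignment)
print(total, calls)  # 4.0 2
```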
Sampling Alignments of K > 2 Strings
It is infeasible to average over all K-way alignments of K strings
for the purpose of estimating the edges of a given evolutionary tree
but it can be sufficient to average over a large sample of alignments.
Unfortunately it is also infeasible to extend the alignment sampling
procedure of section 3 directly to K>2 strings
for the same algorithmic reasons that the DPA cannot be directly extended.
However, we can think of an alignment as the state of
a complicated system having many degrees of freedom.
It is sufficient to hold many of
those degrees of freedom fixed while sampling the remainder from
the conditional posterior distribution so that is what is done:
Given a multiple alignment, the tree is "broken" on a random edge
which partitions the strings into two disjoint sets, as in section 4.
The multiple alignment is projected onto the two sets of strings
to give two subalignments.
A random realignment is sampled
from the posterior distribution of alignments (of the subalignments)
as described for just two strings in section 3 and using the costs
for K-tuples as described in section 4.
The realignment is sampled conditional on the subalignments.
Each subalignment and the degrees of freedom that it specifies remain
fixed during the realignment.
Only the relation between the two subalignments is sampled in this step
but the process is iterated many times, choosing random edges.
The machine parameters are estimated for each multiple alignment sampled.
Results from all samples are averaged to give the final
estimate of the machine parameters.
Standard deviations are also calculated and give an indication
of the accuracy of estimates.
A working-estimate of the machine parameters is
needed to calculate the distributions in the algorithm
and a weighted average from "recent" alignments is used for this purpose;
the algorithm seems to be insensitive to the details of how this is done.
To begin the sampling, an initial multiple alignment is found
by the deterministic heuristic described previously.
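The sampling loop can be sketched as a generic skeleton; `project`, `sample_realignment` and `estimate` are hypothetical interfaces for the steps above, exercised here by a toy stand-in:

```python
import math
import random

def gibbs_estimate(alignment, edges, project, sample_realignment, estimate,
                   n_samples, rng):
    """Break the tree on a random edge, sample a realignment of the two
    sub-alignments, estimate the machine parameters, and return the mean
    and standard deviation of the estimates over all samples."""
    total = total_sq = 0.0
    for _ in range(n_samples):
        edge = rng.choice(edges)
        sub_a, sub_b = project(alignment, edge)
        alignment = sample_realignment(sub_a, sub_b, rng)
        est = estimate(alignment)
        total += est
        total_sq += est * est
    mean = total / n_samples
    var = max(total_sq / n_samples - mean * mean, 0.0)
    return mean, math.sqrt(var)

# Toy stand-in: the "estimate" from each sampled state is uniform on [0, 1].
mean, sd = gibbs_estimate(
    alignment=0.0,
    edges=[0, 1, 2],
    project=lambda a, e: (a, e),
    sample_realignment=lambda sa, sb, rng: rng.random(),
    estimate=lambda a: a,
    n_samples=1000,
    rng=random.Random(42),
)
print(round(mean, 2), round(sd, 2))  # near 0.5 and 0.29 for a uniform variate
```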
Simulated Annealing
The alignment sampling procedure of the previous section
is trivially modified for use in a simulated annealing approach
to optimal multiple alignment.
At the heart of the sampling procedure
a point j of string or alignment B is chosen to correspond to
the midpoint i of string or alignment A.
The point j is sampled from the probability distribution of possible values.
If the values of the probability distribution are raised to
some power p>=0, and the distribution is renormalised,
the sampling of j, and hence of alignments, is modified in a useful way.
When p=0, j is chosen from a uniform distribution.
When p=1, j is chosen from the posterior distribution implied by alignments.
When p>1, j is biased towards the more probable values.
When p is very large, j is chosen so as to lead to
an optimal (pairwise) alignment;
the procedure becomes Hirschberg's algorithm in effect.
Increasing p implements lowering the temperature in simulated annealing.
Since the algorithm actually uses message lengths,
the message lengths are multiplied by p
which is equivalent to raising the probabilities to the power p.
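A sketch of the powered sampling step; the message lengths below are toy values:

```python
import random

def sample_with_power(mls, p, rng):
    """Sample an index from the distribution whose probabilities are
    proportional to (2**-ml)**p = 2**-(p*ml).  p=0 gives a uniform
    choice, p=1 the posterior, and a large p the minimum-ML choice."""
    scaled = [p * m for m in mls]
    best = min(scaled)
    weights = [2.0 ** (best - s) for s in scaled]
    r = rng.random() * sum(weights)
    for j, w in enumerate(weights):
        r -= w
        if r <= 0.0:
            return j
    return len(weights) - 1

rng = random.Random(7)
mls = [5.0, 1.0, 5.0]  # hypothetical message lengths for j = 0, 1, 2
print(sample_with_power(mls, 100.0, rng))  # 1: a 400-bit gap decides it
print(sample_with_power(mls, 0.0, rng))    # uniform over 0, 1, 2
```

Raising p during the run turns the same routine from a posterior sampler into a near-deterministic optimiser, which is the annealing schedule described above.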
This strategy is very different in action from
the simulated annealing strategy of Ishikawa et al(1992),
which makes small perturbations to a multiple alignment.
The present strategy can make large perturbations to an alignment,
especially when p is small.
Results
A program was written
containing the alignment sampling procedure
and the simulated annealing method described above.
It was first tested on artificially generated DNA data.
Figure 5
shows the artificial evolution of three generations of strings.
The original ancestor, s1, is a random string of 500 characters.
The firstgeneration descendants are s2 and s3;
s3 is identical with s1 to make all edges similar
because we are dealing with unrooted trees.
Every edge in the tree corresponds to a mutation-machine with the following
parameters: P(copy)=0.9, P(change)=0.05, P(insert)=P(delete)=0.025.
One expects something like 10% mutation from parent to child
but it could be more or less as the evolutionary process is random.
One expects something like
20% mutation between s8 and s9 say,
30% mutation between s4 and s6,
40% mutation between s8 and s10 and
50% mutation between s8 and s12.
Note that alignments can be found for s8 and s12 with more than 50% matches
(their edit-distance is 189 not 250)
and that an alignment with 60% to 70% matches can be
found even for two random, unrelated DNA sequences of similar length.
In three separate trials,
first, second and thirdgeneration strings
were used as data, each with the correct evolutionary tree.
The tree for trial 3 includes that for trial 2 which includes the
trivial tree for trial 1.
Three analyses were performed in each trial.
First, the deterministic heuristic was used to find a good alignment
and parameters were estimated from this alone.
The alignment was used as a starting point for the next two analyses.
Second, parameters were estimated from 1000 stochastically sampled alignments.
Third, simulated annealing was used
over 1000 trials with the message length multiplier
increasing linearly from 1.0 to 4.0.
The results of these trials are summarised in
Table 1.
The figures marked `actual' give information from
the evolution of the strings which is unknown to the analysis program.
During evolution the machine on each edge had
parameters P(copy)=0.9, P(change)=0.05, P(insert)=0.025, P(delete)=0.025
but there is variation and so the actual figures are given.
The figures marked `det' give the parameter estimates from
the putative optimal alignment found by the deterministic heuristic
in each trial.
There is the beginning of a trend to overestimate P(copy) and,
with the exception of e7 and e10, to underestimate P(indel).
This is consistent with previous results on two strings (Yee & Allison 1993).
In order to avoid repeated qualification in what follows,
we often refer to an alignment found
by the deterministic heuristic or by simulated annealing as
an `optimal alignment' even though it may be only a near-optimal
alignment.
The figures marked `Gibbs' give the estimates from the 1000 sampled
alignments in each trial.
(The standard deviation of the estimate of P(indel) is the largest
and is the only one reproduced.)
The actual proportion of indels on each edge lies
within about two standard deviations of the estimate of P(indel).
Note that the standard deviation of the estimates for e1,
which is common to each trial, increases as the data gets farther from e1,
roughly doubling with each extra generation,
and that the estimates for this edge are the worst.
The deterministic heuristic proved hard to beat in the search for an
optimal alignment at this moderate level of mutation.
It was not beaten by simulated annealing in any of the above trials
although simulated annealing found many alignments
with a message length just two bits more than that found by the heuristic.
From other trials it also seems that 4-way alignment
might be rather easy, in that the heuristic was not beaten in several trials.
On 8-way alignment with 10% mutation per edge,
the heuristic was sometimes beaten, but never by
more than a few bits on artificial data.
Possibly the simulated annealing was cooled insufficiently
or was cooled too quickly.
It seems that there is a very large number indeed of near-optimal
alignments and that the search-space is hardly informative close to them.
It would take extremely time-consuming tests
to map out the alignment landscape thoroughly.
The search for the marginally "best" alignment may be rather pointless
in any case.
Simulated annealing
beat the deterministic heuristic by a significant 34 bits
when the level of mutation was increased to 15% per edge.
Table 2
gives results for a tree with the topology of
figure 5
where the machine on each edge had
parameters P(copy)=0.85, P(change)=0.075, P(insert)=P(delete)=0.0375.
Strings such as s8 and s12 are only tenuously related here.
Estimates are given from single alignments by the deterministic heuristic (det)
and simulated annealing (SA)
and from stochastic sampling of 1000 alignments (Gibbs).
There is an increased tendency for (near) optimal alignments to
underestimate P(indel).
This effect is most marked on "internal" edges of the tree,
as illustrated by the means over different sets of edges.
For example, simulated annealing gives an average of 0.025
against a real figure of 0.073 over e2 to e5.
Sampling gives an average of 0.070.
Note that much of the improvement in message length
in going from the heuristic to simulated annealing
seems to be due to the latter "explaining away" more indels as changes on
the inner edges.
(A similar effect has been noted with algorithms under
development for the most parsimonious alignment problem.)
The standard deviations in sampling's parameter estimates
increase with the level of mutation as is to be expected.
Some tests were also done on unbalanced trees
with edges of different lengths.
Sampling continued to yield better estimates of actual edges
although accuracy decreased and
standard deviations increased on the longer edges.
Various tests were done to study the asymptotic behaviour of the
algorithms and some results are given in
table 3.
In order to reduce computer time,
only the deterministic heuristic was used to find (near) optimal alignments
to compare with sampling.
Firstly, 10 data sets were generated for the tree of
figure 5 at
20% mutation per edge.
This is quite a high level of mutation;
across the 10 data sets the message lengths range
from 7700 to 7900 bits for an optimal alignment,
from 8000 to 8150 bits for the null theory and
from 9100 to 9900 bits for an average alignment.
(Note that the message length of the r-theory, if it could be calculated,
would be less than that of an optimal alignment.)
Averages over all edges and all 10 data sets of actual and estimated parameters
are shown in
table 3, line (a).
There is an increased tendency for optimal alignments to underestimate
P(indel) at the 20% mutation level, particularly on inner edges of the tree.
In going from
table 2
to table 3
line (a), the average frequency of indels
per edge has risen from 0.072 to 0.099 but the estimate from
optimal alignment has remained at 0.048.
Stochastic sampling gives good average estimates,
within 0.01 of the actual figures.
However, for some data sets the estimates of some edges by sampling have
standard deviations of over 0.06, implying that little more than
one decimal place of such an estimate may be usable.
Subsequently,
10 data sets were generated for the tree of three leaves,
(s1 (s2 s3)),
at each of 10%, 20% and 30% mutation per edge.
Averages over all edges and all 10 data sets of actual and estimated parameters
are shown for each level of mutation in
table 3 lines (b) to (d).
Sampling gives good estimates although
standard deviations of estimates rise with the mutation level.
At the same level of mutation per edge, the estimates from optimal alignment
are better for the three-leaf tree than for the 8-leaf tree,
presumably because the former has fewer degrees of freedom
available for the maximisation of alignment probability.
The mutation level of 30% per edge is high and an optimal three-way alignment
typically fails to be an acceptable hypothesis by a margin of
30 to 60 bits.
The results also suggest that there may be a small bias in
the sampling program to overestimate P(indel) at high levels of mutation.
This possibility is being further investigated.
Transthyretin is a protein expressed in the brain.
It is also expressed in the liver of some animals.
Amongst other things, it is relevant to the evolution of birds,
reptiles and mammals (Schreiber et al 1992).
The cds sequences for transthyretin (Duan et al 1991)
from human, sheep, mouse, rat, chicken and
reptile (T.rugosa) were obtained from Genbank.
There is little doubt about the correct topology
of the evolutionary tree, which is shown
in figure 6
annotated with the estimates of the parameters
for each edge as averaged from sampling 1000 alignments.
There must be considerable pressure of selection on these sequences,
some relationships being close, and only the edge joining birds and
reptiles to mammals shows significant numbers of indels.
The standard deviations of estimates are low as the alignment is constrained.
An optimal alignment gives similar estimates.
Huggins et al (1992) constructed a tree of
chloramphenicol acetyltransferase (CAT) monomers from various bacteria.
Tests on five of the corresponding DNA sequences (Allison & Wallace 1994)
revealed an interesting possibility.
The tree (((CATQ CATB) (CATP CATD)) CCOLI) is weakly preferred
on the basis of the message length of an optimal alignment.
However, the tree ((CATQ CATB) ((CATP CCOLI) CATD)) is preferred
on the basis of the average message length of sampled alignments,
although the difference between the trees
is less than the sum of the standard deviations in message lengths.
This implies some uncertainty in the placing of CCOLI.
Subsequently a sixth DNA sequence, PC221, was added to the data set.
All test results are clear on how PC221 should be linked to each
of the two trees above.
For the six strings, the
tree (((CATQ CATB) PC221) ((CATP CCOLI) CATD)) is supported
over (((CATQ CATB) PC221) ((CATP CATD) CCOLI))
by both the optimal alignment and the average alignment criteria,
although only weakly by the former.
These trees are also closer,
in terms of the average message length of alignments,
than the trees on five sequences.
As it happens the correct tree is almost certainly
(((CATQ CATB) PC221) ((CATP CATD) CCOLI))
which is a subtree
of the tree that Huggins et al inferred using protein sequences and
using sequences from more organisms.
The situation is illustrated in
figure 7.
The annotations on the edges come from Gibbs sampling 1000 alignments
of the six sequences under either tree as convenient;
the two trees had good agreement on common edges.
The significance of the above to the particular case of CAT genes is
probably not great as the analysis is based on a simple model of evolution
that is not the best for coding sequences and takes no account
of expert biological knowledge.
However, the results do illustrate the important general point
that optimal alignment message length and average alignment message length
may support different trees.
An interesting question is which one should be believed in such cases.
The best answer is neither.
The best criterion for choosing a tree
would be based on the message length of the r-theory:
the -log_{2} of the probability of getting the strings given the tree,
i.e. the -log_{2} of the sum of the probabilities of all alignments
given the tree.
This could support a different tree entirely.
Unfortunately it is infeasible to calculate its message length.
The message length of an optimal alignment provides an upper bound
on that of the r-theory, and
a good alignment contributes more to the probability of the r-theory
than an average one does,
but there are many more of the latter.
It is only sensible to exercise caution in such cases.
Conclusions
An alignment of a set of strings can give an estimate of the edges of
an evolutionary tree over the strings.
However the use of a good or optimal alignment gives a biased estimate of
the true values, particularly on the inner edges of the tree.
Forming the weighted average of estimates from all alignments would
give an unbiased estimate of the edges.
This can be done efficiently for two strings
but is infeasible for more than two.
However, averaging the estimates from many
alignments sampled from their posterior probability
distribution gives a good approximation and is feasible.
In addition, sampling from the probability distribution raised to
an increasing power (or from message lengths with an increasing multiplier)
effects a simulated annealing search for an optimal alignment.
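The effect of the multiplier can be sketched in Python (this is our own
illustration, not the paper's implementation; the message lengths below are
invented for the example). Sampling in proportion to the probability raised
to a power is the same as multiplying the message lengths by that power:

```python
import random

def sample_by_message_length(msg_lens, multiplier=1.0, rng=random):
    """Sample index i with probability proportional to
    2**(-multiplier * msg_lens[i])."""
    m = min(msg_lens)                   # subtract the minimum for stability
    weights = [2.0 ** (-multiplier * (ml - m)) for ml in msg_lens]
    total = sum(weights)
    r = rng.random() * total
    for i, w in enumerate(weights):
        r -= w
        if r <= 0.0:
            return i
    return len(weights) - 1             # guard against rounding

# three hypothetical alignments with message lengths in bits
lens = [100.0, 101.0, 105.0]
random.seed(1)
# multiplier 1 samples from the posterior itself; a large multiplier
# (a low temperature) all but forces the shortest message, index 0
assert all(sample_by_message_length(lens, 50.0) == 0 for _ in range(20))
```

Raising the multiplier during the run gives the cooling schedule of
simulated annealing.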
A computer program embodying these ideas has been implemented
for the simplest, 1-state model of evolution.
We intend to extend the model although this is not trivial
for more than two strings.
The current implementation is practical for 10 strings
of several hundred characters when used on a good workstation.
With some optimisation each limit could be increased somewhat.
It is tempting to reduce the time-complexity of the underlying
DPA from quadratic to near-linear by a windowing technique under
the assumption that most of the probability in alignmentspace
is concentrated near the current alignment.
However this may be a trap because the assumption may be invalid,
particularly if the strings contain repeated and transposed sections.
The sampling method is certainly a good candidate for
implementation on a parallel computer.
The tuning of simulated annealing algorithms is a difficult area
and more work needs to be done on tuning the one described here.
It would be useful to be able to handle many more than 10 strings.
To do this it will probably be necessary to use
a method, related to the one described here,
that stochastically samples definite strings
for the internal, hypothetical nodes of the tree.
(The current method makes only implicit,
probabilistic assignments to internal nodes.)
The problem can then be treated as an iterated three-descendant problem.
Each step will be relatively simple but more of
them will probably be required to explore the searchspace adequately.
The resulting program would be a stochastic version
of a method proposed by Sankoff et al (1976).
A program of this type is under development.
Our sampling and simulated annealing techniques could be used with
other cost functions, such as star and all-pairs costs, in multiple alignment,
provided that the costs can be interpreted
as something like the -log_{2} of a probability.
It would be sufficient for the rank-ordering of the better alignments
to be correct at low temperature.
Simulated annealing could help with the problem of local optima
that also affects alignment under such costs.
However, it is not clear what the results of stochastic sampling would mean
for these costs as they do not seem to have an obvious
evolutionary interpretation.
Appendix: Sampling Alignments of Two Strings
The alignment sampling procedure of section 3
is recursive and is in general called
to act on a substring of string A and a substring of string B.
If either substring is empty there is only one possible alignment.
Otherwise, suppose we are given substrings A[k..m] of string A where m>k
and B[h..n] of string B where n>=h.
Let i be the middle position, (k+m) div 2, of A[k..m].
Consider an arbitrary alignment of A[k..m] and B[h..n].
A[i] appears somewhere in the alignment.
It either occurs alone as <A[i],->, which is equivalent to delete(A[i]),
or as <A[i],B[j]> for some j, which is equivalent to copy(A[i]) or to
change(A[i],B[j]).
In either case let B[j] be the last character of B[h..n], if any,
that occurs with or before A[i]; h-1<=j<=n.
There is a certain probability distribution over the possible values
of j given i.
The probability of a particular value `j' is proportional to the sum
of the probabilities of all alignments that involve its choice,
i.e. events (i) and (ii) below.
(i)  A[k..i-1]&B[h..j]; <A[i],->; A[i+1..m]&B[j+1..n]
     ... del A[i] ...
     for some j, h-1<=j<=n
(ii) A[k..i-1]&B[h..j-1]; <A[i],B[j]>; A[i+1..m]&B[j+1..n]
     ... copy A[i] or ...
     ... change A[i] to B[j] ...
     for some j, h<=j<=n
Divide-and-Conquer Cases of A[k..m]&B[h..n], m>k, n>=h.
Now P(A[p..q]&B[r..s]) can be calculated for all `s' by
the modified (logplus) DPA of
figure 3c.
Therefore the probabilities of each possible value of j
can be calculated with two calls to the DPA:
one on the forward strings A[k..i]&B[h..n] and
one on the reversed strings A[i+1..m]&B[h..n].
The forward run is used to calculate the probabilities of A[k..i] and each
prefix of B[h..n] being generated together, by any instruction sequence
whose final instruction includes A[i].
The reverse run calculates the probabilities of A[i+1..m] and each
suffix of B[h..n] being generated together in any way.
Combining the results, using logplus, gives the log_{2} odds ratios
of all possible ways of partitioning an instruction sequence for A and B
at the instruction that includes A[i].
This allows j to be sampled from its correct probability distribution.
The sampling procedure is then called recursively on
A[k..i]&B[h..j] and on A[i+1..m]&B[j+1..n].
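The combination and sampling step can be sketched as follows (our own
illustration; fwd[t] and rev[t] stand for the message lengths, in bits,
delivered by the forward and reversed DPA runs for each candidate value
of j, and logplus is as in figure 3c):

```python
import math, random

def logplus(a, b):
    """Given a = -log2(p) and b = -log2(q), return -log2(p+q)."""
    if a > b:
        a, b = b, a
    return a - math.log2(1.0 + 2.0 ** (a - b))

def sample_split(fwd, rev, rng=random):
    """Sample a split point t with probability proportional to
    2**-(fwd[t]+rev[t]), i.e. from the posterior over candidate j."""
    ml = [f + r for f, r in zip(fwd, rev)]   # message length of each choice
    total = ml[0]
    for x in ml[1:]:
        total = logplus(total, x)            # -log2 of the summed probability
    r = rng.random()
    acc = 0.0
    for t, x in enumerate(ml):
        acc += 2.0 ** (total - x)            # P(t) = 2**-(x - total)
        if r <= acc:
            return t
    return len(ml) - 1                       # guard against rounding

# e.g. two candidate values of j, the first overwhelmingly more probable
assert all(sample_split([1.0, 30.0], [0.0, 30.0]) == 0 for _ in range(10))
```

With fwd and rev filled in by the two DPA runs of the text, the chosen index
gives a j drawn from its correct conditional distribution.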
The coordinates (i,j) are called an internal terminal.
The coordinates (|A|,|B|) are called the external terminal.
Since the first half of the overall alignment was
forced to end with either <A[i],-> or <A[i],B[j]> and not <-,B[j]>,
any subsequent reversed DPAs are forced to begin from an internal terminal
with one of these alternatives by modification of the boundary conditions.
The lengths of the sections of A and B shrink with each recursive call
to the sampling procedure.
Eventually a single character of A remains and this leads to the
base cases of the procedure.
If the call is for an internal terminal,
A[i] must occur at the end of the mini-alignment.
There are two possibilities:
(i)  -    ...  -    A[i]
     B[j] ... B[k]  -
     ins  ... ins   del
(ii) -    ...  -      A[i]
     B[j] ... B[k-1]  B[k]
                      copy or
     ins  ... ins     change
Internal-Terminal Base-Cases of A[i]&B[j..k], k>=j.
The probability of each possibility is easily calculated
and they are sampled accordingly.
If the base case is for the external terminal
there are more possibilities
but each is no more complex than before
and they are easily sampled:
(i)  A[i] -    ...  -     or   -    A[i] -      ...  -     etc.
     -    B[j] ... B[k]        B[j] -    B[j+1] ... B[k]
     del  ins  ... ins         ins  del  ins    ... ins
(ii) A[i] -      ...  -    or  -    A[i]  -      ...  -    etc.
     B[j] B[j+1] ... B[k]      B[j] B[j+1] B[j+2] ... B[k]
     copy/                          copy/
     chng ins    ... ins       ins  chng   ins   ... ins
External-Terminal Base-Cases of A[i]&B[j..k].
Essentially the same procedure is used to sample K-way alignments
of K strings, as described in section 5.
References
Allison L. & C. S. Wallace. (1994)
An information measure for the string to string correction problem with applications.
17^{th} Australian Comp. Sci. Conf., Christchurch, New Zealand, 659-668.
Allison L., C. S. Wallace & C. N. Yee. (1992a)
Finite-state models in the alignment of macromolecules.
J. Mol. Evol. 35 77-89.
(Also see
[90/148].)
Allison L., C. S. Wallace & C. N. Yee. (1992b)
Minimum message length encoding, evolutionary trees and multiple alignment.
25^{th} Hawaii Int. Conf. Sys. Sci. 1 663-674.
Bishop M. J. & E. A. Thompson. (1986)
Maximum likelihood alignment of DNA sequences.
J. Mol. Biol. 190 159-165.
Duan W., M. G. Achen, S. J. Richardson, M. C. Lawrence,
R. E. H. Wettenhall, A. Jaworowski & G. Schreiber. (1991)
Isolation, characterisation, cDNA cloning and gene expression of an
avian transthyretin: implications for the evolution of structure
and function of transthyretin in vertebrates.
Eur. J. Biochem. 200 679-687.
Felsenstein J. (1981)
Evolutionary trees from DNA sequences: a maximum likelihood approach.
J. Mol. Evol. 17 368-376.
Felsenstein J. (1983)
Inferring evolutionary trees from DNA sequences.
In Statistical Analysis of DNA Sequence Data,
B. S. Weir (ed), Marcel Dekker, 133-150.
Hastings W. K. (1970)
Monte Carlo sampling methods using Markov chains and their applications.
Biometrika 57 97-109.
Haussler D., A. Krogh, S. Mian & K. Sjolander. (1993)
Protein modelling using hidden Markov models:
analysis of globins.
26^{th} Hawaii Int. Conf. Sys. Sci. 1 792-802.
Hirschberg D. S. (1975)
A linear space algorithm for computing maximal common subsequences.
Comm. A.C.M. 18(6) 341-343.
Huggins A. S., T. L. Bannam & J. I. Rood. (1992)
Comparative sequence analysis of the catB gene from Clostridium butyricum.
Antimicrobial Agents and Chemotherapy 36(11) 2548-2551.
Ishikawa M., T. Toya, M. Hoshida, K. Nitta,
A. Ogiwara & M. Kanehisa. (1992)
Multiple sequence alignment by parallel simulated annealing.
Institute for New Generation Computer Technology (ICOT),
TR-730.
Lawrence C. E., S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald
& J. C. Wootton. (1993)
Detecting subtle sequence signals: a Gibbs sampling strategy for multiple
alignment.
Science 262 208-214.
Sankoff D., R. J. Cedergren & G. Lapalme. (1976)
Frequency of insertion-deletion, transversion, and transition
in the evolution of 5S ribosomal RNA.
J. Mol. Evol. 7 133-149.
Schreiber G., A. R. Aldred & W. Duan. (1992)
Choroid plexus, brain protein homeostasis and evolution.
Today's Life Science Sept. 22-28.
Thorne J. L., H. Kishino & J. Felsenstein. (1991)
An evolutionary model for maximum likelihood alignment of DNA sequences.
J. Mol. Evol. 33 114-124.
Thorne J. L., H. Kishino & J. Felsenstein. (1992)
Inching towards reality:
an improved likelihood model of sequence evolution.
J. Mol. Evol. 34 3-16.
Wallace C. S. & D. M. Boulton. (1968)
An information measure for classification.
Comp. J. 11(2) 185-194.
Wallace C. S. & P. R. Freeman. (1987)
Estimation and inference by compact encoding.
J. Royal Stat. Soc. B 49 240-265.
Yee C. N. & L. Allison. (1993)
Reconstruction of strings past.
Comp. Appl. BioSci. 9(1) 1-7.
string A = TAATACTCGGC
string B = TATAACTGCCG
mutation instructions:
copy
change(ch) NB. ch differs from the corr' char in string A.
del
ins(ch)
a mutation sequence:
copy; copy; del; copy; copy; ins(A); copy; copy;
del; copy; change(C); copy; ins(G)
generation instructions:
match(ch)
mismatch(ch_{A},ch_{B}) NB. ch_{A}!=ch_{B}
ins_{A}(ch_{A})
ins_{B}(ch_{B})
a generation sequence:
match(T); match(A); ins_{A}(A); match(T); match(A); ins_{B}(A); match(C);
match(T); ins_{A}(C); match(G); mismatch(G,C); match(C); ins_{B}(G)
equivalent alignment:
TAATA-CTCGGC-
||-||-||-|-|
TA-TAACT-GCCG
Figure 1: Basic Models.
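The mutation sequence of figure 1 can be replayed mechanically to recover
string B from string A; a minimal sketch (the interpreter is our own
illustration, the instruction names follow the figure):

```python
def apply_mutations(parent, instructions):
    """Replay (op, arg) mutation instructions against `parent`;
    op is 'copy', 'change', 'del' or 'ins'; 'change' and 'ins'
    carry the new character in arg."""
    out, i = [], 0                      # i indexes the parent string
    for op, arg in instructions:
        if op == 'copy':
            out.append(parent[i]); i += 1
        elif op == 'change':            # arg differs from parent[i]
            out.append(arg); i += 1
        elif op == 'del':
            i += 1
        else:                           # 'ins'
            out.append(arg)
    assert i == len(parent), "instructions must consume all of the parent"
    return ''.join(out)

# the mutation sequence of figure 1
seq = ([('copy', None)] * 2 + [('del', None)] + [('copy', None)] * 2 +
       [('ins', 'A')] + [('copy', None)] * 2 + [('del', None)] +
       [('copy', None), ('change', 'C'), ('copy', None), ('ins', 'G')])
assert apply_mutations('TAATACTCGGC', seq) == 'TATAACTGCCG'
```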
P  hypothetical, unknown parent
. .
mA. .mB  mutation machines
. .
. .
A B  given descendants
m = mA.mB  equivalent generation machine
m's instns      Explanation &
pair (a,b)      rates per character of P.
----------      -------------------------
match(x)        x<x>x               x<y>x
 x x            pA'(c).pB'(c)       pA'(ch).pB'(ch)/3
mismatch(x,y)   x<x>y               x<y>y               x<z>y
 x y            pA'(c).pB'(ch)      pA'(ch).pB'(c)      pA'(ch).pB'(ch)2/3
ins_{A}(x)      x<x>_               x<y>_               x<_>_
 x _            pA'(c).pB'(d)       pA'(ch).pB'(d)      pA'(i)
ins_{B}(x)      _<x>x               _<y>x               _<_>x
 _ x            pA'(d).pB'(c)       pA'(d).pB'(ch)      pB'(i)
                _<x>_               "invisible",
 _ _            pA'(d).pB'(d)       not seen
c-copy, ch-change, i-insert, d-delete
Define instn rates per character of P:
pA'(instn) = pA(instn)/(1-pA(i)); pB'(instn) = pB(instn)/(1-pB(i))
Note pA'(c)+pA'(ch)+pA'(d) = pB'(c)+pB'(ch)+pB'(d) = 1
Sum of rates = 1+pA'(i)+pB'(i) = (1-pA(i).pB(i))/((1-pA(i)).(1-pB(i)))
Figure 2: Composition of Two Machines, for DNA.
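The normalisation and the sum-of-rates identity in figure 2 can be checked
numerically; a small sketch (the machine parameter values here are our own,
chosen to resemble those used in the experiments):

```python
def rates_per_parent_char(p):
    # p'(instn) = p(instn)/(1 - p(i)), as defined in figure 2
    z = 1.0 - p['i']
    return {k: v / z for k, v in p.items()}

pA = {'c': 0.9, 'ch': 0.05, 'i': 0.025, 'd': 0.025}
pB = {'c': 0.85, 'ch': 0.075, 'i': 0.0375, 'd': 0.0375}
a, b = rates_per_parent_char(pA), rates_per_parent_char(pB)

# p'(c) + p'(ch) + p'(d) = 1 for each machine
assert abs(a['c'] + a['ch'] + a['d'] - 1.0) < 1e-12
assert abs(b['c'] + b['ch'] + b['d'] - 1.0) < 1e-12

# sum of rates = 1 + pA'(i) + pB'(i)
#              = (1 - pA(i).pB(i)) / ((1-pA(i)).(1-pB(i)))
lhs = 1.0 + a['i'] + b['i']
rhs = (1.0 - pA['i'] * pB['i']) / ((1.0 - pA['i']) * (1.0 - pB['i']))
assert abs(lhs - rhs) < 1e-12
```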
Boundary Conditions, i=1..|A|, j=1..|B|:
D[0,0] = 0
D[i,0] = D[i-1,0]+f(A[i], '')
D[0,j] = D[0,j-1]+f('', B[j])
General Step, i=1..|A|, j=1..|B|:
D[i,j] = g( D[i-1,j-1]+f(A[i],B[j]),  -- match or mismatch
            D[i-1,j  ]+f(A[i], ''),   -- ins_{A} A[i]
            D[i,  j-1]+f('', B[j]) )  -- ins_{B} B[j]
(a) For Edit Distance:
f(x,x) = 0; f(x,y) = f(x,'') = f('',y) = 1
g(a,b,c) = min(a,b,c)
(b) For most probable alignment:
f(x,x) = -log_{2}(P(match(x))); f(x,y) = -log_{2}(P(mismatch(x,y)));
f(x,'') = -log_{2}(P(ins_{A}(x))); f('',y) = -log_{2}(P(ins_{B}(y)))
g(a,b,c) = min(a,b,c)
(c) For r-theory:
f( , ) as for (b)
g(a,b,c) = logplus(a, logplus(b,c))
where logplus(-log_{2}(p), -log_{2}(q)) = -log_{2}(p+q)
Figure 3: Dynamic Programming Algorithm (DPA) Variations.
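The variations of figure 3 can be sketched concretely; the Python below is
our own illustration (the instruction probabilities are invented for the
example, and the end-of-string instruction is omitted for brevity):

```python
import math

def logplus(a, b):
    """Given a = -log2(p) and b = -log2(q), return -log2(p+q)."""
    if a > b:
        a, b = b, a
    return a - math.log2(1.0 + 2.0 ** (a - b))

def dpa(A, B, f, g):
    """The generic DPA of figure 3: f costs one aligned pair (one side
    may be ''), g combines the three incoming values."""
    D = [[0.0] * (len(B) + 1) for _ in range(len(A) + 1)]
    for i in range(1, len(A) + 1):
        D[i][0] = D[i - 1][0] + f(A[i - 1], '')
    for j in range(1, len(B) + 1):
        D[0][j] = D[0][j - 1] + f('', B[j - 1])
    for i in range(1, len(A) + 1):
        for j in range(1, len(B) + 1):
            D[i][j] = g(D[i - 1][j - 1] + f(A[i - 1], B[j - 1]),  # (mis)match
                        D[i - 1][j] + f(A[i - 1], ''),            # ins_A
                        D[i][j - 1] + f('', B[j - 1]))            # ins_B
    return D[len(A)][len(B)]

# (a) edit distance: unit costs and g = min
def f_edit(x, y):
    return 0 if x == y else 1

# (b)/(c): message-length costs from illustrative DNA instruction
# probabilities (these numbers are ours, not the paper's)
P_MATCH, P_MISMATCH, P_INDEL = 0.8, 0.1, 0.1
def f_prob(x, y):
    if x and y:
        return -math.log2(P_MATCH / 4 if x == y else P_MISMATCH / 12)
    return -math.log2(P_INDEL / 2 / 4)    # ins_A or ins_B of one char

A, B = 'TAATACTCGGC', 'TATAACTGCCG'       # the strings of figure 1
ml_opt = dpa(A, B, f_prob, min)                                      # (b)
ml_r = dpa(A, B, f_prob, lambda a, b, c: logplus(a, logplus(b, c)))  # (c)
# the r-theory sums over all alignments, so its message length is
# no greater than that of the single most probable alignment
assert ml_r <= ml_opt
```

For example, `dpa('kitten', 'sitting', f_edit, min)` gives the familiar
edit distance of 3.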
Actual machine parameters:
copy change indel
m1: 0.9 0.05 0.05
m2: 0.9 0.08 0.02
m3: 0.7 0.2 0.1
m4: 0.8 0.1 0.1
m5: 0.75 0.1 0.15
A A
. .
m1. .m3
. A:0.43 A:0.40 .
.C:0.56 C:0.59.
++
. m5 .
. .
m2. .m4
. .
C C
Probable hypothetical characters.
Tuple ACAC's probability = 0.00027
Estimated operations carried out:
P(copy) P(change) P(indel)
m1: 0.43 0.57 <0.01
m2: 0.57 0.43 <0.01
m3: 0.40 0.60 <0.01
m4: 0.59 0.41 <0.01
m5: 0.94 0.06 <0.01
Figure 4: Example, explanations for tuple ACAC.
s8 s15
. .
s9 . . s14
. . e6 e13 . .
. . . .
e7 . . . . e12
. . . .
. . . .
s4 s7
. .
. .
e2 . . e5
. .
. e1 .
s2..............s1=s3
. .
. .
. e3 . e4
. .
. .
s5 s6
. . . .
. . . .
e8 . . . . e11
. . . .
. . e9 . e10 .
s10 . . s13
. .
s11 s12
Ancestor, s1=s3, length(s1)=500
1st generation: s2 and s3
2nd generation: s4 to s7
3rd generation: s8 to s15
Probabilities for each edge during evolution:
P(copy)=.9; P(change)=.05; P(insert)=P(delete)=.025
Figure 5: Full Tree.
sheep human
. .
. .
(.911,.081,.008) . .(.913,.086,.001)
. .
+
.
.(.973,.026,.001)
+
. .
(.945,.054,.001). .
. .
. .
mouse............+ .(.838,.136,.026(+-.003))
(.951,.041,.008) . .
. .
.(.945,.055,.001) .
. .
. +......................reptile
rat . (.853,.146,.001)
.
.(.923,.076,.001)
.
.
chicken
key: (P(match),P(change),P(indel))
Genbank ids: HUMPALA(27470), OATTHYRE(12452), MMALBR(27467),
RATPALTA(10453), GDTRTHY(26478), TRTRANST(16468).
Figure 6: Edge Estimates by Sampling for Transthyretin cds.
CATQ
.
(.84,.16,.01). CATB
. .
. .(.91,.09,0)
.
.
.(.84,.14,.02)
.
.
.........................PC221
. (.75 .23,.01)
.
.(.83,.08,.09(+-.01))
(.80,.20,0) .
CCOLI.
+ .
+ .(.81,.18,0)
+ .
(.67,.32,0)+ .
+ . .
. . unmarked => P(copy) >= .97
. .
CATP CATD
Message Lengths (bits):
Tree over 5 sequences                    opt'  Gibbs (SD)
---------------------                    ----  ----------
(((CATQ CATB)(CATP CATD))CCOLI)   ---    4369  4555(+-40)
((CATQ CATB)((CATP CCOLI)CATD))   +++    4376  4502(+-26)
Null: 6449 bits
Tree over 6 sequences                    opt'  Gibbs (SD)
---------------------                    ----  ----------
(((CATQ CATB)PC221)((CATP CATD)CCOLI))   ---   5395  5564(+-36)
(((CATQ CATB)PC221)((CATP CCOLI)CATD))   +++   5391  5519(+-32)
Null: 7752 bits
Data: chloramphenicol resistance gene, cds or orfs from
Genbank-id  organism                 cds range  designation
m55620      Clostridium perfringens  459-1118   CATQ
m28717      Clostridium perfringens  145-768    CATP
x15100      Clostridium difficile    91-726     CATD
m35190      Campylobacter coli       309-932    CCOLI
m93113      Clostridium butyricum    145-800    CATB
spc221      Staphylococcus aureus    2267-2914  PC221
Figure 7: Alternative Trees for Bacterial Sequences.
(i) data = 1st generation, s2 and s3:
edge  actual         det            Gibbs
----  ------         ---            -----
e1: .895 .037 .068 .900 .040 .061 .893 .038 .070(.005)
(ii) data = 2nd generation, s4 to s7:
edge  actual         det            Gibbs
----  ------         ---            -----
e1: .895 .037 .068 .894 .044 .063 .878 .038 .085(.009)
e2: .902 .053 .045 .910 .054 .036 .899 .052 .049(.007)
e3: .919 .041 .039 .916 .057 .026 .917 .048 .035(.005)
e4: .898 .055 .047 .899 .059 .043 .891 .054 .056(.007)
e5: .908 .041 .051 .915 .055 .030 .908 .052 .041(.006)
  
means over subsets of edges:
e25: .907 .048 .046 .910 .056 .034 .904 .052 .045
e15: .904 .045 .050 .907 .054 .040 .899 .049 .053
(iii) data = 3rd generation s8 to s15:
edge  actual         det            Gibbs
----  ------         ---            -----
e1: .895 .037 .068 .878 .060 .062 .863 .043 .094(.016)
e2: .902 .053 .045 .928 .048 .025 .900 .054 .046(.007)
e3: .919 .041 .039 .915 .061 .025 .920 .048 .033(.008)
e4: .898 .055 .047 .920 .038 .043 .906 .036 .058(.009)
e5: .908 .041 .051 .913 .055 .032 .904 .061 .035(.008)
e6: .891 .058 .051 .896 .066 .038 .893 .062 .045(.006)
e7: .926 .035 .039 .924 .035 .042 .918 .031 .052(.007)
e8: .922 .029 .049 .937 .037 .026 .932 .026 .041(.006)
e9: .898 .035 .067 .909 .042 .050 .895 .032 .073(.009)
e10: .893 .060 .048 .898 .048 .055 .879 .052 .069(.009)
e11: .902 .049 .049 .907 .057 .036 .906 .053 .042(.007)
e12: .898 .049 .053 .909 .045 .046 .902 .040 .059(.008)
e13: .916 .049 .035 .912 .064 .024 .906 .064 .030(.007)
  
means over subsets of edges:
e25: .907 .048 .046 .919 .051 .031 .908 .050 .043
e613: .906 .046 .049 .912 .048 .040 .904 .045 .051
e113: .905 .045 .049 .911 .050 .039 .902 .046 .052
Evolution: Ancestor length = 500
P(copy)=0.9; P(change)=0.05; P(insert)=P(delete)=0.025 for each edge
key: P(copy) P(change) P(indel)
actual = frequencies as measured during evolution
det = inferred from a good alignment by deterministic heuristic
Gibbs = averaged from 1000x sampled alignments
Table 1: Estimated Edges from Evolution at 10% Mutation/Edge.
edge  actual         det            SA             Gibbs (S.D.)
----  ------         ---            --             ------------
e1 .858 .056 .086 .854 .089 .057 .876 .073 .051 .850 .073 .078(.015)
e2 .849 .073 .079 .862 .093 .046 .869 .080 .051 .838 .067 .095(.017)
e3 .881 .069 .051 .920 .050 .030 .908 .069 .024 .887 .062 .052(.012)
e4 .839 .085 .076 .863 .103 .034 .864 .119 .018 .836 .107 .058(.014)
e5 .833 .079 .088 .845 .133 .022 .833 .160 .008 .836 .089 .075(.019)
e6 .862 .063 .076 .877 .059 .064 .875 .041 .085 .850 .047 .103(.011)
e7 .871 .069 .061 .880 .079 .041 .876 .093 .031 .877 .084 .040(.009)
e8 .885 .049 .066 .890 .062 .049 .896 .053 .051 .890 .053 .057(.010)
e9 .831 .083 .086 .841 .098 .062 .838 .091 .072 .813 .079 .108(.012)
e10 .873 .066 .060 .895 .045 .060 .880 .047 .074 .877 .052 .071(.009)
e11 .837 .088 .074 .851 .090 .059 .860 .074 .065 .836 .075 .089(.013)
e12 .856 .078 .066 .876 .083 .041 .889 .078 .033 .861 .090 .049(.008)
e13 .844 .084 .072 .866 .083 .051 .854 .079 .066 .846 .075 .079(.011)
   
means over subsets of edges:
e25 :.851 .077 .073 .873 .095 .033 .869 .107 .025 .849 .081 .070
e613:.857 .073 .070 .872 .075 .053 .871 .070 .060 .856 .069 .075
e113:.855 .072 .072 .871 .082 .047 .871 .081 .048 .854 .073 .073
   
Message Lengths (bits):
6735 6701 7556(+-109) mean(SD)
Null: 8045
Evolution: Ancestor length = 500
P(copy)=.85; P(change)=.075; P(insert)=P(delete)=.0375 for each edge
key: frequencies or estimated probabilities of copy, change & indel from
actual  as counted during evolution
det  estimated from a "good" alignment by deterministic heuristic
SA  from an optimal (?) alignment by simulated annealing
Gibbs  from 1000 x Gibbs sampling
Table 2: Estimated Edges from Evolution at 15% Mutation/Edge.
generation  actual   det   Gibbs (std dev's)
----------  ------   ---   -----------------
a) .8 .1 .1 .801 .101 .099 .828 .125 .048 .790 .107 .103 (.008 to .068)
b) .9 .05 .05 .896 .052 .053 .901 .054 .044 .893 .051 .056 (.003 to .008)
c) .8 .1 .1 .795 .105 .101 .815 .124 .061 .785 .104 .111 (.007 to .022)
d) .7 .15 .15 .702 .150 .148 .745 .194 .061 .683 .141 .176 (.012 to .051)
Ancestor length = 500
key: P(copy) P(change) P(indel)
a) : tree = (((s1 s2)(s3 s4))((s5 s6)(s7 s8))) unrooted
b), c) & d) : tree = (s1 (s2 s3)) unrooted
generation : settings for mutation machine on each edge
actual : as counted during evolution
det : estimated from a "good" or optimal alignment
Gibbs : estimated by Gibbs sampling 1000x per data set
std dev's : range of std dev's for sampling, across 10 data sets
Table 3: Averages over all edges & 10 data sets each for four different settings.