
Proc. Int. Conf. on Information, Statistics and Induction in Science
(ISIS), pp. 304–316, 1996
(based on TR 128, Dept. Computer Science, Monash U., 1989)
1. Introduction
Recently there has been interest in a general method of
statistical estimation and inference variously known as
Minimum Message Length
(MML)
(Wallace & Freeman, 1987) or
Minimum Description Length (MDL) (Rissanen 1987). The
method is an application of a general approach to inductive
inference first proposed by Solomonoff (1964), and closely
related to the notion of Algorithmic Complexity due to
Kolmogorov (1965). The method is Bayesian,
and is based on the idea of choosing an
hypothesis and estimates of its parameters which together
allow a concise encoding of the data. The method is of
great generality, and in those specific estimation problems
which have so far been addressed using MML, gives estimates
similar, but often superior in some respect such as bias, to
those obtained by the Maximum Likelihood (ML) method. In
particular, MML handles more gracefully problems where the
number of unknown parameters increases with sample size, and
can automatically select an appropriate hypothesis from a
family of hypotheses of increasing complexity.
[This page was created from what I believe to be the 1996 source file,
hopefully without adding too many errors, to make this MML paper more
readily accessible. LA, 2008.
NB. I have used θ for θ-hat and φ for φ-hat.]

The consistency of MDL for a wide range of problems has been
proved by Barron & Cover (1991), who also derive the ultimate
rate of convergence of the estimates with increasing sample
size for a range of problem types. Unfortunately, little is
known in general about the behaviour of MML estimates for
modest sample sizes. For instance, there have been no
general results on the bias or variance of estimates, or on
the error rates of the inbuilt hypothesis selection.
Wallace and Freeman (op cit) and Rissanen (op cit) have
argued that the estimates are close to sufficient, in the
sense of capturing all the relevant information available in
the data, but the arguments are outside conventional
statistical theory.
In this paper, we attempt to show that MML can be expected
to have good behaviour for modest samples. Consistency
arguments prove good behaviour in the limit of large
samples. We have not entirely avoided the need to appeal to
a limit, but rather than considering the behaviour as an
unbounded volume of data is used in a single estimation, we
consider the behaviour as the method is applied to an
unbounded set of (possibly unrelated) estimation problems,
each using only limited data. That is, we consider the
expected consequences of adopting the strategy of using MML
for all desired inferences. Sections 2 and 3 discuss a
proposed criterion of estimate quality for a single
estimation problem. Section 4 extends the criterion to
cover the behaviour of a method in a sequence of problems,
and relates it to conventional criteria such as bias. These
sections also define an “optimal” method according to this
criterion. Sections 6, 7 and 8 then show that the MML
method approaches this optimum.
Of the several criteria which are commonly used for judging
the quality of statistical estimators, none is entirely
satisfactory, at least by itself. Some, such as
consistency, are applicable only to families of estimators
for an unbounded range of sample sizes. Others, such as
bias and expected variance, are tied to particular
parameterizations of the problem, and may give different
judgements if the parameters to be estimated are subjected
to functional transformation. Others again, such as
sufficiency, can be satisfied by estimators with severe
deficiencies. In this paper, we propose yet another
criterion and investigate some of the properties of
estimators satisfying it. We give a constructive proof that
such estimators generally exist, and also show that Strict
Minimum Message Length (SMML) estimators (Wallace & Freeman
1987) in some circumstances closely approach satisfaction of
the criterion.
The framework of the discussion is Bayesian.
We wish to estimate the value of an unknown parameter θ
within a space Θ, given a prior density h(θ) on
Θ, an observed datum x within a space X, and a
probability density f(x|θ) (x∈X, θ∈Θ).
At times we use the joint density
p(θ,x) = h(θ)f(x|θ),
the marginal density of x,
r(x) = ∫ dθ p(θ, x), and
the posterior density g(θ|x) = p(θ, x) / r(x).
In some parts of the discussion we prefer
to treat x as a discrete datum within a countable set X.
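The quantities defined above can be sketched concretely on a discrete grid; the following is an illustrative sketch in which the particular prior, likelihood and grids are our own choices, not the paper's:

```python
# Illustrative sketch of the Bayesian framework on a small discrete grid:
# a uniform prior h over three parameter values and a Bernoulli likelihood f.
# The specific numbers are ours, purely for illustration.
thetas = [0.2, 0.5, 0.8]            # parameter space Theta (discretized)
xs = [0, 1]                         # data space X (one Bernoulli observation)

def h(theta):                       # prior density h(theta) (uniform)
    return 1.0 / len(thetas)

def f(x, theta):                    # likelihood f(x|theta)
    return theta if x == 1 else 1.0 - theta

def p(theta, x):                    # joint density p(theta, x) = h(theta) f(x|theta)
    return h(theta) * f(x, theta)

def r(x):                           # marginal r(x) = sum over theta of p(theta, x)
    return sum(p(t, x) for t in thetas)

def g(theta, x):                    # posterior g(theta|x) = p(theta, x) / r(x)
    return p(theta, x) / r(x)
```

Observing x = 1, for instance, shifts the posterior towards the larger parameter values.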
2. Empirical Criteria
Definitions
 An estimator is a function from X into Θ,
which we shall write as θ = T(x).
 A stochastic estimator is an estimator whose value
θ = T_{s}(x) for
any datum x is a realization of a random variable with
probability density G_{s}(θ|x).
 A deterministic estimator is not stochastic.
 An honest estimator is an estimator whose value for
given x is independent of the true parameter value θ.
 The oracle is a distinguished estimator T_{t} such
that T_{t}(x) = θ,
the true parameter value. It is
thus a ‘perfect’ estimator, but is of course neither
realizable nor honest. Apart from the oracle, all other
estimators considered will be honest.
 A trial is an instance of an estimation problem described by
Θ, X, h(θ), f(x|θ) yielding a
datum x which is then supplied to two estimators
T_{0} and T_{1} to give two estimates
θ = T_{0}(x),
φ = T_{1}(x).
In a trial, the estimators are treated as black boxes.
That is, the internal mechanisms of the
estimators are invisible, and any comparison of the
estimators can be based only on the datum x and the two
estimates θ and φ.
The comparison may involve use
of the prior information Θ, X, h(θ) and f(x|θ).
Formalizing the notion of a comparison:
 A criterion c is a function from
(X, Θ, Θ) to the real line.
The outcome of a trial using c is
written c(x, θ, φ).
For some criteria, negative
values of c represent a preference for T_{0} and
positive values a preference for T_{1}. That is, for the
observed value of x, c(x, θ, φ) < 0
indicates that the criterion c judges θ to be a
better estimate than φ, for instance because
θ has the higher likelihood. All criteria
are assumed to be independent of the true parameter θ.
 A fair criterion is a criterion such that
for all a, b ∈ Θ and x ∈ X,
c(x,a,b) = −c(x,b,a).
This definition is intended to exclude criteria which have
an inbuilt preference for one of the two estimators. As the
internal workings of the estimators are invisible, no such
preference would be rational.
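As a concrete illustration, the following sketch defines a likelihood-based criterion for a normal model and checks the fairness condition c(x,a,b) = −c(x,b,a); the model and the criterion are our own illustrative choices, not the paper's.

```python
import math

# A fair criterion for x ~ N(theta, 1): the log-likelihood advantage of
# estimate a over estimate b. Negative values prefer the first estimate,
# matching the text's sign convention. The normal model is our choice.
def f(x, theta):                     # likelihood f(x|theta)
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2 * math.pi)

def c(x, a, b):
    # c < 0 exactly when a has the higher likelihood, i.e. a is judged better
    return math.log(f(x, b)) - math.log(f(x, a))
```

Swapping the two estimates reverses the sign, so no inbuilt preference for either black box is possible.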
The above definitions are intended to provide a framework
for the empirical comparison of estimators, i.e. comparison
based on the observable behaviour of estimators rather than
on their structure. Of course, the outcome of a single
trial using some criterion is a function of the value of x
realized in the trial, and so the preference for one
estimator over the other will be subject to the accidents of
the particular trial. Later, we will discuss a class of
estimation problems for which a single trial might be
sufficient to determine a preference reliably. However,
we now retreat somewhat from the ideal of a purely empirical
comparison, and define the expected outcome of a trial.
The expected value of a trial of estimators T_{0}
and T_{1} using criterion c is the expected value of
c(x, T_{0}(x), T_{1}(x)),
written c(T_{0}, T_{1}),
where the expectation is taken over the joint
distribution of x and θ. Where one or both of the
estimators are stochastic, the expectation is also taken
over their value distributions.
 For T_{0}, T_{1} honest and deterministic, the
estimators are independent of θ and the expectation is
simply over the marginal distribution of x.
 c(T_{0}, T_{1})
= ∫ dx
r(x) c(x, T_{0}(x), T_{1}(x))
 For T_{0}, T_{1} honest, T_{1} stochastic,
T_{0} deterministic,
 c(T_{0}, T_{1})
= ∫ dx
∫ dφ
r(x) G_{1}(φ|x)
c(x, T_{0}(x), φ)
 and for T_{0}, T_{1} honest and stochastic
 c(T_{0}, T_{1})
= ∫ dx
∫ dθ
∫ dφ
r(x) G_{0}(θ|x)
G_{1}(φ|x)
c(x, θ, φ)
 For the special case that one estimator, T_{0} say,
is the oracle T_{t} and T_{1} is deterministic,
 c(T_{t}, T_{1})
= ∫ dθ
∫ dx
h(θ) f(x|θ) c(x, θ, T_{1}(x))
 and for T_{0} the oracle, T_{1} stochastic
 c(T_{t}, T_{1})
= ∫ dθ
∫ dx
∫ dφ
h(θ) f(x|θ) G_{1}(φ|x)
c(x,θ, φ)
 Note that in all cases, for c fair,
c(T_{0}, T_{1})
= −c(T_{1}, T_{0}).
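The expectation in the deterministic honest case can be checked by simulation; the following Monte Carlo sketch uses a normal prior and likelihood with two simple estimators and a fair quadratic criterion, all of which are our own illustrative choices.

```python
import math, random

random.seed(0)

# Monte Carlo sketch of the expected trial value c-bar(T0, T1) for two honest
# deterministic estimators, with theta ~ N(0,1) and x ~ N(theta,1).
def T0(x):
    return x            # maximum-likelihood estimate

def T1(x):
    return x / 2.0      # posterior mean for this prior

def c(x, a, b):         # a fair criterion: squared-error-to-x advantage
    return (x - a) ** 2 - (x - b) ** 2

def expected_trial_value(Ta, Tb, n=100000):
    total = 0.0
    for _ in range(n):
        theta = random.gauss(0, 1)      # theta drawn from the prior h
        x = random.gauss(theta, 1)      # x drawn from f(.|theta)
        total += c(x, Ta(x), Tb(x))
    return total / n
```

Here c(x, T0(x), T1(x)) = −x²/4, so the expected trial value is −E[x²]/4 = −0.5, and by fairness the reversed trial gives +0.5.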
3. False Oracles
A false oracle is an estimator T_{f} such that for
all fair criteria c, and for T_{t} the oracle,
c(T_{f}, T_{t}) = 0.
That is, no fair criterion will be expected to prefer the
oracle to a false oracle, or vice versa.
3.1 Theorem:
 For T_{f} a false oracle, T_{t} the oracle,
T_{1} honest and c a criterion not necessarily fair
c(T_{f}, T_{1}) = c(T_{t}, T_{1}).

 Proof: Consider the function
 c*(x,a,b) = c(x,a,T_{1}(x)) − c(x,b,T_{1}(x)),
a,b ∈ Θ
 where, if T_{1} is stochastic, we have the same
realization of T_{1}(x) in both places.
Since T_{1} is honest, c* is not a function of θ.
Also, c*(x,a,b) = −c*(x,b,a).
Hence c* is a fair criterion, so by definition of T_{f},
 c*(T_{t}, T_{f}) = 0
 But
 c*(T_{t}, T_{f})
= E{c(x, T_{t}(x), T_{1}(x))
− c(x, T_{f}(x), T_{1}(x))}
 = E{c(x, T_{t}(x), T_{1}(x))}
− E{c(x, T_{f}(x), T_{1}(x))}
 = c(T_{t}, T_{1})
− c(T_{f}, T_{1})
Q.E.D.
This theorem shows that, while we cannot in practice compare
an estimator with the oracle, since the oracle is
unavailable, we can instead compare the estimator with a
false oracle and get the same result in expectation.
 3.2 Corollary:
With the same conditions as for the theorem,
 c(T_{1}, T_{f})
= c(T_{1}, T_{t})
 Also,
 c(T_{f}, T_{t})
= c(T_{f}, T_{f})
= c(T_{t}, T_{f})
3.3 Corollary:
With the same conditions as for the theorem, the value on a trial of
c(x, T_{f}(x), T_{1}(x)) is as good an estimate of
c(T_{t}, T_{1}) as is the value on a trial of
c(x, T_{t}(x), T_{1}(x))
= c(x, θ, T_{1}(x)).
 Proof:
Define
c_{k}(x, θ, φ)
= (c(x, θ, φ))^{k}
 Then c_{k} is a criterion, so
c_{k}(T_{f}, T_{1})
= c_{k}(T_{t}, T_{1})
 But c_{k}(T_{0}, T_{1})
is the k^{th} moment of the sampling distribution of
c(x, T_{0}(x), T_{1}(x)).
 Hence c(x, T_{f}(x), T_{1}(x)) and
c(x, T_{t}(x), T_{1}(x))
have the same sampling distribution.
3.4 Theorem:
 False oracles exist in sufficiently regular cases.

 The proof is constructive:
 Let T_{p} be the stochastic estimator described by the density
 G_{p}(θ|x)
= g(θ|x) = h(θ) f(x|θ) / r(x)
= p(θ,x) / r(x).
 For any criterion c,
 c(T_{p}, T_{t})
= ∫ dθ
∫ dx
∫ dφ
h(θ) f(x|θ) g(φ|x) c(x,φ,θ)
 = ∫ da
∫ dx
∫ db p(a,x) p(b,x) c(x,b,a) / r(x),
(a = θ ∈ Θ, b = φ ∈ Θ)

 c(T_{t}, T_{p})
= ∫ dθ
∫ dx
∫ dφ
h(θ) f(x|θ) g(φ|x) c(x,θ,φ)
 = ∫ db
∫ dx
∫ da
p(b,x) p(a,x) c(x,b,a) / r(x),
(b = θ, a = φ)

 The expressions for c(T_{p}, T_{t}) and
c(T_{t}, T_{p}) differ only in the order of integration.
Hence, for sufficiently regular h, f and c, they are equal,
 c(T_{t}, T_{p}) = c(T_{p}, T_{t})
 But, if c is fair,
c(T_{t}, T_{p}) = −c(T_{p}, T_{t})
 Hence c(T_{p}, T_{t}) = 0 for all fair c,
so T_{p} is a false oracle.
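Theorem 3.4 can be illustrated by simulation. In the conjugate normal case below (our choice, not the paper's) the posterior is available in closed form, so T_{p} is easy to realize, and a fair criterion's expected trial value against the oracle should vanish.

```python
import math, random

random.seed(1)

# Simulation sketch of theorem 3.4. With prior theta ~ N(0,1) and
# x ~ N(theta,1), the posterior is N(x/2, 1/2); the false oracle T_p draws
# its estimate from that posterior. Model and criterion are our choices.
def c(x, a, b):                      # a fair criterion
    return (x - a) ** 2 - (x - b) ** 2

def mean_trial_value(n=200000):
    total = 0.0
    for _ in range(n):
        theta = random.gauss(0, 1)                    # true parameter
        x = random.gauss(theta, 1)                    # datum
        phi = random.gauss(x / 2.0, math.sqrt(0.5))   # T_p: posterior draw
        total += c(x, phi, theta)                     # T_p versus the oracle
    return total / n
```

The sample mean of the criterion settles near zero, as the order-of-integration argument predicts.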
3.5 The Merit of False Oracles
False oracles deserve to be considered good estimators in
that no fair criterion is expected to show a preference for
the true parameter value over the estimate of a false
oracle. Indeed, the above results show that no test based
only on the information available to the estimator can
reliably distinguish an oracle from a false oracle. With
the more conventional criteria applied to estimators (some
of which will be further discussed below) it is often the
case that an estimator satisfying one criterion (e.g. lack
of bias) will not satisfy others (e.g. maximum likelihood).
However, for the empirical criteria discussed here, we have
shown that there exists a class of estimators which give
results as good as the true values by all empirical tests, since
c(T_{f}, T_{t}) = c(T_{t}, T_{f})
for all c, fair and otherwise.
Note that it is not true that all criteria, or even all fair
criteria, will prefer a false oracle to an honest estimator
which is not a false oracle. Suppose x and θ are
commensurable, and that x is distributed as N(θ,1).
The fair criterion
 c(x, θ, φ)
= (x − θ)^{2}
− (x − φ)^{2}
will always prefer the estimator θ = x
to a false oracle.
It will also, of course, always prefer θ = x to the
oracle's estimate, the true parameter value,
so the import of its preference is dubious.
The construction of the false oracle T_{p} amounts to a
rule for characterizing the posterior distribution of
θ by a value (estimate) drawn randomly from the posterior.
It may seem a strange rule compared with, say, choosing the
mean, median or mode of the posterior, but it has the
advantage of being invariant under arbitrary
measure-preserving transformations of the parameter space.
4. Sequence Trials
We now attempt to define a class of estimation problems, and
associated trials, which allow a comparison between
our ‘empirical’ criteria and some more conventional ones.
A sequence problem is an ordered sequence of
independent instances of the same basic problem. For a
basic problem described by
Θ, X, h(θ), f(x|θ),
a sequence problem of length J has parameter space
Θ^{J} and data space X^{J}.
The data are an ordered sequence of values
{x_{j}, j=1 ... J} from which
an estimator must form an estimate which is an ordered
sequence {θ_{j}, j=1 ... J},
where θ_{j}
is an estimate of the true value θ_{j} of the parameter
in the j^{th} instance of the basic problem.
The prior density for the true sequence parameter
{θ_{j}, j=1 ... J} is
∏_{j} h(θ_{j})
and the conditional data density is
∏_{j} f(x_{j}|θ_{j}).
Note that we assume the instances are wholly
independent: data from early instances do not lead to
revision of the prior or conditional data density in later
instances.
An estimator T for a sequence is a function from X^{J}
into Θ^{J}.
We do not require that the estimator
have the form of J instances of the same function applied
in turn to {x_{j}}. The ‘black box’ model of an
estimator is not required to produce any output until it has
the entire data sequence, so θ_{j}
may depend on all components of
{x_{k}, k=1 ... J}, not just x_{j}.
In a sequence trial of estimators
T_{0}, T_{1} using criterion c,
c is a function
c({x_{j}, θ_{j}, φ_{j}, j=1...J})
which operates on the data sequence and the two estimate sequences to yield a
single value, which need not be expressible as a combination
of J applications of a simpler function.
Since a sequence problem and sequence trial are just special
cases of the problems and trials defined earlier, all
definitions and results carry over, with only the notational
difference that now the symbols Θ, X, h( ) and f( )
refer to the component basic problems of the sequence rather
than to the sequence problem as a whole. Note that, since
the basic problems are independent, the posterior density of
{θ_{j}, j=1...J} given
{x_{j}, j=1...J} is
∏_{j} g(θ_{j}|x_{j}).
Thus the false
oracle T_{p} of theorem 3.4,
which picks an estimate sequence randomly
from this posterior density, may be considered to choose
randomly and independently a value from each basic posterior
g(θ_{j}|x_{j}).
4.1 Sequence Criteria:
Using sequence trials, we can define empirical criteria
corresponding to some conventional criteria. For D a
function from Θ to the real line, v∈Θ, δ
a small constant and J large, define
 D_{v}{x_{j}, θ_{j}, φ_{j}, j=1...J}
=
Σ_{j∈S} D(θ_{j}) / N_{S}
where S is the set
{j : |φ_{j} − v| < δ} and
N_{S} is the cardinality of S.
D_{v} is the
average value of D(θ_{j}) over those
components of the sequence trial in which φ_{j}
approximately equalled v.
Its expected value,
D_{v}(T_{0}, T_{1})
approximates the expected value of D(θ)
given φ = v.
Letting D_{v} be the function
B_{v}(θ) = θ − v,
call
B_{v}(T_{0}, T_{1})
the bias of T_{0} relative to T_{1}.
It is, of course, a function of v.
For T_{1} the oracle,
B_{v}(T_{0}, T_{t})
is simply the expected value of the estimate
θ minus the true value,
given that the true value is v.
Hence B_{v}(T_{0}, T_{t})
corresponds to the usual definition of the bias of T_{0}.
While T_{t} is unavailable,
we have
B_{v}(T_{0}, T_{f})
= B_{v}(T_{0}, T_{t})
for T_{f} a false oracle.
Hence the bias of T_{0} can be
estimated by measuring the bias of T_{0} relative to T_{f}
in a sequence trial.
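A sequence trial of this kind is easy to simulate. In the sketch below the model is a conjugate normal one of our own choosing, and T_{0} is given a deliberate bias of 0.3 so that the measurement has something to detect; the false oracle is the posterior-sampling estimator T_{p}.

```python
import random

random.seed(2)

# Sequence-trial sketch of section 4.1: estimate the bias of T0 relative to
# the false oracle T_p (a posterior draw). Prior N(0,1), x ~ N(theta,1), so
# the posterior is N(x/2, 1/2). All specific choices here are ours.
def relative_bias(v=0.0, delta=0.1, J=200000):
    total, count = 0.0, 0
    for _ in range(J):
        theta = random.gauss(0, 1)
        x = random.gauss(theta, 1)
        est0 = x + 0.3                              # T0: ML estimate plus a bias
        phi = random.gauss(x / 2.0, 0.5 ** 0.5)     # false-oracle estimate
        if abs(phi - v) < delta:                    # component j falls in S
            total += est0 - v
            count += 1
    return total / count
```

The relative bias measured at v = 0 recovers T_{0}'s built-in bias of 0.3, with no access to the true parameter values.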
Letting T_{0} be the false oracle T_{p},
 B_{v}(T_{p}, T_{t})
= B_{v}(T_{p}, T_{p})
= ∫ dx
∫ dθ
f(x|v) g(θ|x) (θ − v)
This is not, in general, zero. However, note that
B_{v}(T_{t}, T_{p})
= B_{v}(T_{p}, T_{t}).
The ‘bias’ of the oracle relative to a false oracle T_{f}
equals the bias of T_{f}, so empirical measurement of
relative bias does not serve to distinguish false and true oracles.
Also, note that
B_{v}(T_{t}, T_{t})
≠ B_{v}(T_{t}, T_{f}),
since T_{t} is not honest.
Of course,
B_{v}(T_{t}, T_{t}) = 0
for all v.
Similarly, letting D_{v} be the function
V_{v}(θ) = (θ − v)^{2},
the usually-defined variance of T_{0}
is V_{v}(T_{0}, T_{t}),
and can be empirically measured by the value of V_{v} in a
sequence trial of T_{0} and a false oracle T_{f}.
For the false oracle T_{p},
 V_{v}(T_{p}, T_{t})
= V_{v}(T_{p}, T_{p})
= ∫ dx
∫ dθ
f(x|v) g(θ|x) (θ − v)^{2}
This expression
shows that the variance of a false oracle is
the variance of the convolution over x of f and g.
For h(θ) slowly varying, it is about twice the
variance of a conventional Bayes estimator with quadratic loss function.
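The factor of two is visible in a simulation. The sketch below uses a nearly flat normal prior of our own choosing, so the posterior variance is about 1 while a posterior-draw estimate has variance about 2 around the true value.

```python
import random

random.seed(3)

# Sketch of the closing claim of section 4.1. With a slowly-varying prior
# (here N(0, 10000), effectively flat over the region of interest) and
# x ~ N(theta, 1), the posterior is N(wx, w) with w = 10000/10001 ~ 1.
# A posterior-draw estimate then varies about the true value v with
# variance ~2: twice the ~1 a quadratic-loss Bayes estimator attains.
def false_oracle_variance(v=0.0, n=200000):
    w = 10000.0 / 10001.0                    # posterior shrinkage and variance
    total = 0.0
    for _ in range(n):
        x = random.gauss(v, 1)               # x ~ f(.|v)
        est = random.gauss(w * x, w ** 0.5)  # draw from posterior N(wx, w)
        total += (est - v) ** 2
    return total / n
```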
5. SMML Estimators
Strict Minimum Message Length (SMML)
estimation has been described elsewhere (Wallace & Freeman 1987).
We use the same Bayesian framework defined by
Θ, X, h(θ), f(x|θ) etc. as before,
but initially it is convenient to consider x to be a discrete variable,
so X is a countable set, and
f(x|θ) and r(x) are probabilities rather than densities.
The SMML estimator θ = m(x) has a range Θ*
= {θ_{i}, i=1, 2, ...}
= {θ : θ = m(x), x∈X}
which is a countable subset of Θ.
Define X_{i} = {x : m(x) = θ_{i}}
and q_{i} = q(θ_{i})
= Σ_{Xi} r(x) for all i.
Then m( ) minimizes
 I_{1} = − Σ_{X} r(x) log[ q(m(x)) f(x|m(x)) ]
 = − Σ_{i} [ q_{i} log q_{i}
+
Σ_{Xi} r(x) log f(x|θ_{i}) ]
The estimator is derived by considering the expected length
of a message encoding x and decodeable given knowledge of
X, Θ, f(x|θ) and h(θ).
By standard coding theory, the expected length cannot be less than
I_{0} = − Σ_{X} r(x) log r(x),
where the base of logarithms depends on the coding alphabet,
e.g., 2 for binary coding.
However, we consider a message having the form of a
statement of an estimate θ, followed by an
encoding of x using a code which would be optimal
(i.e. have least expected length)
were the true parameter θ equal to the stated estimate.
Using optimal coding for the statement of θ, the
length of the message encoding x is
L(x) = − log q(m(x)) − log f(x|m(x)).
(Actually, the length of a
real message might exceed L(x) by as much as one symbol,
since real messages must have integral lengths. However, we
neglect this roundingup effect.) The first term of L(x)
gives the length of the encoding of the estimate
θ = m(x).
Note that q(θ_{i}) is the
unconditional probability that θ_{i} will be the
estimate. The second term is the length of the encoding of x
using a code optimal if θ = m(x).
I_{1} is the expected value of L(x).
Since I_{0} is independent of the estimator, the SMML
estimator minimizes
 I_{1} − I_{0}
= − Σ_{X} r(x) log [q(m(x)) f(x|m(x)) / r(x)]
If the discretization of the data is so fine that we can
consider X as a continuum, we can revert to the framework
of the earlier sections, and the SMML estimator minimizes
 I_{1} − I_{0}
= − ∫_{X}
r(x) log[q(m(x)) f(x|m(x)) / r(x)] dx,
 where
 q(θ_{i})
= ∫_{Xi} r(x)dx.
Θ* remains a countable subset of Θ,
but I_{1} and I_{0} are not individually definable.
Minimization of I_{1} − I_{0}
requires a compromise in the size of the region X_{i}
within which all datum values lead to the same estimate θ_{i}.
Large regions have large values of q_{i}, leading to a short
statement of θ_{i}, but if the region is too large,
it will contain values of x for which
f(x|θ_{i}) is much less than
f(x|θ_{m}) where
θ_{m} is the maximum-likelihood estimate for x,
so the expected length of the statement of x will
be greater than for smaller regions.
We now show that in a sequence of basic estimation problems
such as was described in section 4, the SMML estimator
approximates a false oracle. Two cases are considered.
In the first, which applies to basic problems with
approximately quadratic log-likelihoods, a fairly exact
analysis is possible.
treatment of a more general sequence problem. Throughout,
j indexes the component problems of a sequence of J
basic problems, and i indexes the members of Θ*,
the range of an SMML estimator.
6. A Simple Case
Assume that for all j,
f(x_{j}|θ_{j}) =
(1 / (√(2 π) σ))
e^{−(x_{j}−θ_{j})^{2} / 2σ^{2}},
σ known, and
h(θ_{j}) = a,
a constant over a large range.
If SMML estimation were applied separately to each problem
of the sequence, we would obtain for each j an estimator
θ_{j} = m(x_{j})
with the same form for every j.
We drop the j suffix and consider one estimator m(x).
Its range is simply a set of
equallyspaced values of θ, with spacing s say.
The SMML estimator maps x into the member of the set most
nearly equal to x.
Since, over the range of interest,
r(x) = h(θ) = a,
we have that for every
θ_{i} ∈ Θ*,
X_{i} is the interval
{x: θ_{i} − s/2 <
x < θ_{i} + s/2}
and q_{i} = as.
The estimator minimizes the expected value of
 − log q(m(x)) − log f(x|m(x)) + log r(x).
Since all of the regions X_{i} have the same pattern,
the expectation can be taken over any one of them, giving
 I_{1} − I_{0}
 = − log(as)
+ (1/2) log(2 π)
+ log σ
+ (1/s) ∫_{θ_{i}−s/2}^{θ_{i}+s/2} dx
(x − θ_{i})^{2}
/ (2 σ^{2})
+ log a
 = − log s
+ (1/2) log(2 π)
+ log σ
+ s^{2} / (24 σ^{2})
 Minimization with respect to s gives
 s^{2} = 12 σ^{2},
 I_{1} − I_{0} = (1/2) log (2 π e / 12).
 The result of applying this estimator to all J problems of
the sequence is therefore an expected message length
increment for the whole sequence given by
 I_{1} − I_{0} = (1/2) J log(2 π e / 12).
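The single-problem optimization can be checked numerically; the sketch below minimizes the per-problem increment by grid search, with σ = 1 and a grid resolution that is our own pragmatic choice.

```python
import math

# Numerical check of the single-problem result of section 6, for sigma = 1:
# minimize -log s + (1/2) log(2 pi) + s^2/24 over the spacing s.
def increment(s, sigma=1.0):
    return (-math.log(s) + 0.5 * math.log(2 * math.pi)
            + math.log(sigma) + s * s / (24 * sigma * sigma))

# simple grid search over s in (0, 10); the resolution is an arbitrary choice
best_s = min((k / 1000.0 for k in range(1, 10000)), key=increment)
```

The minimizing spacing satisfies best_s² ≈ 12, and the minimized increment is ½ log(2πe/12) ≈ 0.1765 nats per problem, as derived above.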
This result is not the best that can be achieved.
 An SMML estimator for the whole sequence problem minimizes
the expected value of
 − log q(m(x))
− Σ_{j} log f(x_{j}|m_{j}(x))
+ log r(x)
 where now the estimator m(x)
maps the sequence or vector of data
x = (x_{1}, x_{2} ... x_{j} ...)
into a vector
 θ
= (θ_{1}, θ_{2} ... θ_{j} ...)
= (m_{1}(x), m_{2}(x), ...,
m_{j}(x), ...),
 and r(x) = a^{J}.
 Thus the SMML estimator minimizes the expected value of
 L(x) =
− log q(m(x))
+ (J/2) log (2 π)
+ J log σ
+ (m(x)
− x)^{2} / (2 σ^{2})
+ J log a
Its range Θ* is a countable set of vectors or points
in Jspace, {θ_{i}}.
Assume these points form a lattice.
Then
q(θ_{i}) = q_{i}
is the same for all i,
and for all i the region X_{i} within which
m(x) = θ_{i}
is the set of x values
closer to θ_{i} than to any other
member of Θ*, i.e., the Voronoi region of θ_{i}.
For Θ* a lattice, the Voronoi regions of all
its members have the same size and shape. Let s be the
volume of X_{i} for all i.
Then q_{i} = sa^{J} for all i.
The expectation of L(x) can be taken over
any Voronoi region, giving
 I_{1} − I_{0} =
− log (sa^{J})
+ (J/2) log (2 π)
+ J log σ
+ E(x − θ_{i})^{2} / (2 σ^{2})
+ J log a.
The expectation
E(x − θ_{i})^{2}
over the Voronoi region of θ_{i} depends on the volume
s and the shape of the region.
To separate these effects, write
E(x − θ_{i})^{2}
= J k_{J} s^{2/J},
where k_{J} depends only on the shape of X_{i},
i.e. the geometry of the lattice.
Then
 I_{1} − I_{0}
= − log s + (J/2) log (2 π)
+ J log σ + J k_{J} s^{2/J} / (2 σ^{2})
 Minimization with respect to s gives
 s^{2/J}
= σ^{2} / k_{J},
 I_{1} − I_{0}
= (J/2) log (2 π e k_{J})
Minimization with respect to the geometry of the lattice
requires choosing a lattice which, for a given density of
points, minimizes the mean squared distance of all points
from the nearest lattice point.
Lattices chosen in this way are called ‘quantizing lattices’.
Use of the single-problem SMML estimator separately for each
problem is equivalent to choice of a rectangular lattice, for which
k_{J} = 1/12 for all J.
However, for J > 1,
lattices exist with k_{J} < 1/12.
For instance, for J = 2, a lattice with hexagonal rather
than square Voronoi regions has
k_{J} = 0.080175.. < 1/12
and hence leads to a smaller value for
I_{1} − I_{0}.
Note that in the resulting SMML estimator, θ_{1}
is a function of both x_{1} and x_{2}.
If x_{1} is held constant and x_{2} is changed,
θ_{1} may assume in turn two or more different values,
all however close to x_{1}.
Similarly, θ_{2}
is a function of both x_{1} and x_{2},
but is always close to x_{2}.
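The square-versus-hexagonal comparison can be verified by Monte Carlo; the sketch below estimates k_{2} for a lattice given by two basis vectors, with sample size and the neighbourhood searched for the nearest lattice point being pragmatic choices of ours.

```python
import random

random.seed(4)

# Monte Carlo sketch of the J = 2 lattice comparison: estimate
# k_2 = E|x - nearest lattice point|^2 / (2 s), with s the cell area,
# by sampling points uniformly in the fundamental cell.
def k2(basis, n=60000):
    (ax, ay), (bx, by) = basis
    area = abs(ax * by - ay * bx)                 # cell volume s
    total = 0.0
    for _ in range(n):
        u, v = random.random(), random.random()   # point in fundamental cell
        px, py = u * ax + v * bx, u * ay + v * by
        # nearest lattice point, searched over a small neighbourhood of cells
        total += min((px - (i * ax + j * bx)) ** 2 + (py - (i * ay + j * by)) ** 2
                     for i in range(-2, 3) for j in range(-2, 3))
    return total / n / (2 * area)
```

The square lattice gives k_{2} ≈ 1/12 ≈ 0.0833 and the hexagonal lattice k_{2} ≈ 0.0802, reproducing the advantage quoted in the text.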
 For large J, the optimum quantizing lattice is not known.
However, Zador (1982) has shown that
 Γ(J/2+1)^{2/J} Γ(2/J+1) / (J π)
> k_{J}
> Γ(J/2+1)^{2/J} / ((J+2) π)
The upper bound is the value for a random set of points
(not a lattice) and the lower bound would be achieved only if the
Voronoi region were a Jsphere.
 Using
Stirling’s approximation
in the form
 log Γ(y+1)
= (y + 1/2) log y − y
+ (1/2) log(2 π)
+ O(1/y),

 log k_{J}
> (2/J) [(J/2 + 1/2) log(J/2) − J/2 + (1/2) log(2 π)]
− log(J+2)
− log π
+ O(1/J^{2})
 > − log(2 π e)
+ (log J + log π − 2) / J
+ O(1 / J^{2})
 Similarly, and using
log Γ(1+y) = − γ y + O(y^{2}),
where γ is Euler's constant,
 log k_{J} <
− log(2 π e) + (log J + log π − 2 γ) / J
+ O(1/J^{2})
Hence
(1/2) log(J π) − γ
> I_{1} − I_{0}
> (1/2) log(J π) − 1,
to order 1/J.
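Zador's bounds can also be evaluated exactly with gamma functions; the following sketch (our own code, not the paper's) checks that the bounds bracket the hexagonal value at J = 2 and close up as J grows.

```python
import math

# Numeric sketch of Zador's bounds on k_J, evaluated via math.lgamma to
# avoid overflow for large J.
def lower(J):   # sphere bound: Gamma(J/2+1)^(2/J) / ((J+2) pi)
    return math.exp((2.0 / J) * math.lgamma(J / 2.0 + 1)) / ((J + 2) * math.pi)

def upper(J):   # random-points bound: Gamma(J/2+1)^(2/J) Gamma(2/J+1) / (J pi)
    return (math.exp((2.0 / J) * math.lgamma(J / 2.0 + 1))
            * math.gamma(2.0 / J + 1) / (J * math.pi))
```

At J = 2 the hexagonal value 0.080188 lies between lower(2) = 1/(4π) ≈ 0.0796 and upper(2) = 1/(2π) ≈ 0.159, and the ratio of the bounds tends to 1 with increasing J.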
We see that, whereas use of a rectangular lattice, i.e.
repeated use of the one-problem SMML estimator, gives
I_{1} − I_{0} = (1/2) J log (2 π e / 12),
proportional to J,
the optimum SMML estimator has at worst
I_{1} − I_{0}
= (1/2) log (J π) − γ.
Although the shape of the optimum Voronoi region for large
J is unknown, the close approximation of the optimum
lattice to the behaviour of spherical regions strongly
suggests that the optimum Voronoi region is very nearly
spherical for large J.
 Making this assumption, the equation
s^{2/J} = σ^{2} / k_{J}
for its volume s gives for its radius ρ
 ρ^{J} π^{J/2} / Γ(J/2 + 1)
= s
= σ^{J} / k_{J}^{J/2}
Using the lower bound as an approximation for k_{J},
ρ = σ √(J+2)
If we consider one component of θ and x, i.e.
one estimation problem of the sequence,
the estimate θ_{j} is a function of
all components of x, not just
of x_{j}. If the other components are thought of as
random variables, all that is known of x is that it lies
in the subspace defined by x_{j}, and all that is known
of θ is that it lies somewhere in a sphere of
radius ρ centred on x. Thus, θ_{j} can be
regarded as a random variable generated by the projection of
a random point in this sphere onto the j axis. The
density of θ_{j} is therefore proportional to
 {ρ^{2}
− (θ_{j} − x_{j})^{2}}^{(J−1)/2}
= {(J+2) σ^{2}
− (θ_{j} − x_{j})^{2}}^{(J−1)/2}
For any J, this distribution has mean x_{j} and
standard deviation σ. For large J, it closely approximates
N(x_{j}, σ^{2}),
which is the posterior density of θ_{j} given x_{j}.
The SMML estimator thus produces estimates for the
individual problems of the sequence which have approximately
the behaviour of values chosen at random from the individual
posterior densities. That is, it approximates the false oracle T_{p}.
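This projection argument is easy to check by simulation; the sketch below (sample sizes and dimension are our choices) draws points uniformly from a J-ball of radius ρ = σ√(J+2) and projects them onto one axis.

```python
import math, random

random.seed(5)

# Simulation sketch of the closing argument of section 6: a point uniform in
# a J-ball of radius rho = sigma*sqrt(J+2), projected onto one axis, has
# mean 0 and standard deviation sigma, approaching N(0, sigma^2) for large J.
def projected_samples(J=20, sigma=1.0, n=50000):
    rho = sigma * math.sqrt(J + 2)
    out = []
    for _ in range(n):
        z = [random.gauss(0, 1) for _ in range(J)]       # random direction
        norm = math.sqrt(sum(t * t for t in z))
        radius = rho * random.random() ** (1.0 / J)      # uniform in the ball
        out.append(radius * z[0] / norm)                 # projection on axis 1
    return out
```

The sample mean is near 0 and the sample variance near σ² = 1, independent of J, as claimed.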
The results for this simple case are readily generalized to
any sequence of estimation problems in which the log-likelihood for
the sequence of data has an approximately quadratic
behaviour near its maximum, and where the prior for the
parameter sequence is slowlyvarying. It is not necessary
that all basic estimation problems in the sequence be
instances of the same problem, or even that they require
estimation of the same number of parameters. The results
depend only on the assumptions of slowly-varying priors
and approximately quadratic log-likelihoods.
7. A More General Case
We now describe a construction for a near-optimal encoding
of a data sequence which does not require smooth or
quadratic behaviour in the log prior and log likelihood
functions. As before, we consider a sequence of J
independent instances of the same basic problem described by
Θ, X, h(θ) and f(x|θ).
We assume x is a discrete variable.
In this construction, each estimate θ_{j} will
be selected from a set Θ* of equallyspaced values
{θ_{i}, i=1, 2 ...} whose spacing δ is
small compared with the expected estimation error of each θ_{j}.
Index i is used for indexing members of Θ*, and
index j for indexing problems of the sequence.
We assume
Σ_{i} δ h(θ_{i}) = 1.
The message is constructed as J segments.
Basically, each segment comprises two parts.
The first states an estimate θ_{j} of θ_{j},
selecting θ_{j} from Θ*,
and has length
 − log(δ h(θ_{j})).
The second encodes x_{j} using a code optimal if
θ_{j} = θ_{j}, and has length
− log f(x_{j}|θ_{j}).
The total length l_{j} of segment j is
 l_{j} =
− log(δ h(θ_{j}) f(x_{j}|θ_{j}))
= − log(δ p(θ_{j}, x_{j}))
However, the segments are combined into a complete message
in such a way that the length of the message is less than
Σ_{j} l_{j}.
As δ is small compared with the expected estimation
error of θ_{j}, there will be several members of Θ*,
separated by multiples of δ, any of which
would give a near-minimal value of l_{j}. The basis of
the construction is use of the freedom to choose any one of
several more-or-less equally good values for θ_{j}
so as to convey information about the following segment, j+1.
For definiteness, assume binary coding, so logarithms are to
base 2.
The construction proceeds backwards, in decreasing order of j.
Let M_{J+1} be the empty string,
of length L_{J+1} = 0.
Suppose that a string M_{j+1} of length L_{j+1}
has been constructed encoding estimates and data
for problems j+1, ..., J.
To form a string M_{j} covering
in addition problem j, proceed as follows:
 (a) Compute a distribution over Θ* given by
u_{ij}
= δ h(θ_{i})
f(x_{j}|θ_{i})
/ r*(x_{j}),
where
r*(x)
= δ Σ_{i}
h(θ_{i}) f(x|θ_{i}).
 (b) Construct an optimum prefix code C_{j} (e.g. a Huffman
code) for this distribution. C_{j} is a set of binary
strings, or words, {w_{ij}, i=1, 2,...}
in onetoone correspondence with the members of Θ*, such that no
word is a prefix of another, any sufficiently long binary
string has a word as prefix, and the length of w_{ij} is
approximately − log u_{ij}.
The approximation, which
arises because the lengths of words are necessarily
integral, will be neglected.
 (c) Find the value k of index i for which w_{kj} is a
prefix of M_{j+1}. If no such k exists, append
randomly-chosen binary digits to the end of M_{j+1}
until a match occurs. Let t_{j} be the number of digits appended.
 (d) Delete w_{kj} from the front of M_{j+1},
leaving a string M^{−}_{j+1}
of length
L^{−}_{j+1}
= L_{j+1} + log u_{kj} + t_{j}.
Note that if
t_{j} ≠ 0,
L^{−}_{j+1} = 0.
 (e) Construct a string S_{j} encoding datum x_{j} using
estimate θ_{j} = θ_{k}.
Its length is
l_{j}
= − log(δ p(θ_{k}, x_{j})).
 (f) Construct M_{j} by prepending S_{j}
to the front of M^{−}_{j+1}.
Steps (a) to (f) are repeated until the final complete
message M_{1} has been formed.
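Step (b) relies only on standard properties of Huffman codes; the toy sketch below (implementation details are ours) builds such a code and relies on the two properties the construction uses: the code is prefix-free and word lengths are close to −log₂ u_{i}.

```python
import heapq

# Toy sketch of step (b): a Huffman code over a small Theta* for a given
# distribution. Each heap entry holds (subtree weight, tiebreak, leaf ids).
def huffman(probs):
    heap = [(p, i, (i,)) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    codes = {i: "" for i in range(len(probs))}
    while len(heap) > 1:
        p1, _, leaves1 = heapq.heappop(heap)      # two lightest subtrees
        p2, t2, leaves2 = heapq.heappop(heap)
        for leaf in leaves1:
            codes[leaf] = "0" + codes[leaf]       # grow each word from the root
        for leaf in leaves2:
            codes[leaf] = "1" + codes[leaf]
        heapq.heappush(heap, (p1 + p2, t2, leaves1 + leaves2))
    return codes
```

For the dyadic distribution (1/2, 1/4, 1/8, 1/8) the word lengths are exactly −log₂ u_{i}, so the expected length equals the entropy, 1.75 bits; for general distributions the integral-length rounding the text neglects appears.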
We must show that M_{1} can be decoded using knowledge
only of h and f, and of the construction algorithm.
Decoding proceeds in order of increasing j.
The segment S_{1} for problem 1 appears in full as the
prefix of M_{1}. It is easily decoded. Its first part
is a Huffman or similar coding of θ_{1} based on the
known distribution δ h(θ_{i}) over Θ*.
Having decoded θ_{1}, the second
part can be decoded, as it is a Huffman or similar coding of x_{1}
based on the now-known distribution
f(x|θ_{1}) over X.
Having decoded S_{1}, we can delete it from the front of
M_{1}, leaving M^{−}_{2}.
Now, knowing x_{1}, we can compute the distribution
u_{i1} over Θ*, and hence reconstruct the code C_{1}.
Suppose the decoded estimate θ_{1} has
index k in Θ*. Then we can select w_{k1} from
C_{1}, and prepend it to M^{−}_{2}
giving M_{2}.
Since S_{2} appears in full as the prefix of M_{2},
it too can now be decoded, and so on until the entire
sequence of data (and estimates) has been recovered.
At the end of the decoding process, we will be left with a
string M^{+}_{J+1}
comprising all the Σ_{j} t_{j}
random digits appended in steps (c) of the coding algorithm.
It will be useful in what follows to imagine that the coding
process began with these randomly-chosen digits already in
place, i.e. with
M_{J+1} = M^{+}_{J+1},
L_{J+1} = Σ_{j} t_{j}.
The result of the coding is then
unchanged, but in step (c) it would never be necessary to
append additional digits, and at every stage of the coding
process we have
L^{−}_{j+1} = L_{j+1} + log u_{kj}.
 For all j,
 L_{j} = L^{−}_{j+1} + l_{j}
 = L_{j+1} + log u_{kj} − log(δ p(θ_{k}, x_{j}))
 where, of course, k is a function of j.
Now,
u_{kj} =
δ h(θ_{k}) f(x_{j}|θ_{k}) / r*(x_{j})
= δ p(θ_{k}, x_{j}) / r*(x_{j})
 Hence,
 L_{j} = L_{j+1} − log r*(x_{j})
 L_{1} = L_{J+1} − Σ_{j} log r*(x_{j})
 We now aim to show that the sequence of estimates yielded by
this construction is close to an SMML estimate for the sequence.
To do so, we will try to show that I_{1}, the
expected value of L_{1} over the marginal distribution
of x, is not much greater than the minimum possible value
 I_{0} = − J Σ_{X} r(x) log r(x).
 In a long sequence, datum value x may be expected to occur
with relative frequency r(x). Hence the expected value of
the second term in L_{1} is
 − J Σ_{X} r(x) log r*(x)
Now, r*(x)
= δ Σ_{i} p(θ_{i}, x)
= r(x) δ Σ_{i} g(θ_{i}|x)
where the sum is over a set Θ* of points equally
spaced in Θ at intervals of δ.
For sufficiently regular g( ),
δ Σ_{i} g(θ_{i}|x)
approximates the integral
∫ dθ g(θ|x) = 1
with an error which depends on the function g and the size of δ.
Further,
r*(x) =
δ Σ_{i} h(θ_{i}) f(x|θ_{i})
 For sufficiently regular distributions,
 Σ_{X} r*(x)
= δ Σ_{i} h(θ_{i}) Σ_{X} f(x|θ_{i})
= δ Σ_{i} h(θ_{i})
= 1
 so r*(x) is a proper distribution. Hence,
 − Σ_{X} r(x) log r*(x)
≥ − Σ_{X} r(x) log r(x)
 and
 I_{1} − I_{0}
= E(L_{J+1})
 − J Σ_{X} r(x) log (r*(x) / r(x))
 = E(L_{J+1})
 − J Σ_{X} r(x) log Σ_{i} δ g(θ_{i}|x)
The amount by which the sum
Σ_{i} δ g(θ_{i}|x)
departs from one may, for regular g( ), be
expected to be proportional to some small power of δ.
Then we can write
 I_{1} − I_{0}
= E(L_{J+1}) + J A δ^{n}
to order δ^{n}
where A is a constant depending on r( ) and g( ) but not on J.
Now consider
E(L_{J+1}) = E(Σ_{j} t_{j}).
In the first stage of the message construction, forming M_{J},
as many random digits will be required in step (c)
as there are digits in the word w_{kJ}.
In effect, w_{kJ}
is selected from C_{J} by finding a match
between a word of C_{J} and the prefix of a long random
binary string.
The probability that a word of length l will match is just 2^{−l}.
The length of w_{iJ} is
− log u_{iJ}.
Hence the expected value of t_{J} is
the expected value of
 − Σ_{i} u_{iJ} log u_{iJ}.
Now u_{iJ} is
δ h(θ_{i}) f(x_{J}|θ_{i})
/ r*(x_{J}),
or approximately
δ g(θ_{i}|x_{J}).
Thus the expected value of t_{J} is approximately
− log δ + B,
where B is a constant approximating the expected entropy of the
posterior distribution g( ).
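The approximation E(t_{J}) ≈ − log δ + B can be checked numerically. In the sketch below the posterior g is assumed, purely for illustration, to be a standard normal, so that B is its differential entropy (1/2) log(2 π e); the quantized u_{i} = δ g(θ_{i}) then have discrete entropy close to − log δ + B.

```python
import math

def g(theta):
    # assumed posterior g(theta | x): standard normal density
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

delta = 0.01                                   # spacing of Theta*
thetas = [delta * i for i in range(-1000, 1001)]
u = [delta * g(t) for t in thetas]             # u_i ~ delta * g(theta_i | x)

# expected number of appended digits: entropy of the discrete u_i
expected_t = -sum(p * math.log(p) for p in u if p > 0.0)

B = 0.5 * math.log(2.0 * math.pi * math.e)     # differential entropy of N(0,1)
approx = -math.log(delta) + B                  # the paper's approximation
```

For this g and δ the two quantities agree to well within one per cent.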
It is unlikely that t_{j} ≠ 0 (j<J).
For t_{J−1} ≠ 0,
we require that M_{J}, of length
L_{J} = − log(δ p(θ_{J}, x_{J})),
be so short that we cannot find in it a prefix matching a word of
C_{J−1}.
The length of a word w_{i,J−1} of C_{J−1} is approximately
− log(δ g(θ_{i}|x_{J−1}))
so we will fail to find a prefix when
g(θ_{i}|x_{J−1}) <
p(θ_{J}, x_{J}).
Since g(θ|x) = p(θ, x) / r(x),
we expect the ratio
g(θ_{i}|x_{J−1}) /
p(θ_{J}, x_{J})
to be of order 1/r(x), which is typically much greater than 1.
We therefore approximate
E(Σ_{j} t_{j}) by E(t_{J}).
 Thus, to order δ^{n},
 I_{1} − I_{0}
= B − log δ + J A δ^{n}
 Choosing δ to minimise
I_{1} gives
δ^{n} = 1/(nJA),
 I_{1} − I_{0}
= B + (1/n) + (1/n) log (nJA).
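The minimisation above is elementary but easy to check by brute force; the constants B, J, A and n below are arbitrary illustrative values, not quantities from the paper.

```python
import math

def excess(delta, B, J, A, n):
    # I_1 - I_0 = B - log(delta) + J * A * delta^n
    return B - math.log(delta) + J * A * delta ** n

B, J, A, n = 1.3, 1000, 0.7, 2

delta_opt = (1.0 / (n * J * A)) ** (1.0 / n)           # delta^n = 1/(nJA)
analytic = B + 1.0 / n + (1.0 / n) * math.log(n * J * A)

# scan a grid around the analytic optimum
grid = [delta_opt * (0.5 + 0.001 * i) for i in range(1, 2000)]
numeric = min(excess(d, B, J, A, n) for d in grid)
```

The grid minimum coincides with the analytic minimum, confirming the stated optimum.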
Thus we find that this construction, like the more exact but
less general construction of section 6, gives a value for
I_{1} − I_{0}
of order log J
for sufficiently regular h( ) and f( ).
As this value for I_{1} − I_{0}
is less than is achieved by the use of a single-problem SMML estimator on
each basic problem of the sequence,
which gives I_{1} − I_{0}
of order J, the construction must be close to
achieving the shortest possible expected length for a
message using estimates, and so its estimate sequence must
be similar to that given by the SMML sequence estimator.
It is worth remarking that this construction, like that of section 6,
does not require that every problem of the sequence have the same form.
We now show that the construction imitates a false oracle.
M_{j+1} is a near-optimal binary coding of the data
sequence {x_{j+1}, ..., x_{J}}.
When random variates from some distribution are encoded as a binary
string using an optimal code for that distribution, the
resulting binary string is a realization of a Bernoulli
sequence with p=0.5.
Thus M_{j+1} (extended if
necessary by appended random digits) is such a realization.
The probability that a given binary string of length l is
a prefix of M_{j+1} is 2^{−l}.
So the probability that step (c) will select
θ_{j} = θ_{i} is
2^{−length(w_{ij})} = u_{ij}.
But
u_{ij} =
δ p(θ_{i}, x_{j}) / r*(x_{j})
and r*(x_{j}) approximates r(x_{j}).
Hence u_{ij} approximates
δ g(θ_{i}|x_{j}).
If we consider the data sequence
{x_{j+1}, ..., x_{J}}
to be a realization of their joint marginal
distribution, the probability that this realization will,
for some x_{j}, cause the selection of the value
θ_{i} as the estimate θ_{j} of
θ_{j},
is just the posterior probability of θ_{i}
given x_{j}. In this sense, θ_{j} is a
random selection from the (quantized) posterior distribution
of θ_{j} given x_{j}, as is made by the false
oracle T_{p}.
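The claim that prefix-matching against a Bernoulli(0.5) string selects word i with probability 2^{−length(w_{ij})} is easy to verify empirically. The code below is a toy check with an assumed complete prefix code standing in for C_{j}, and a pseudo-random generator standing in for the coded later data.

```python
import random

# complete prefix code: word lengths imply u = (1/2, 1/4, 1/4)
CODE = {"t0": "0", "t1": "10", "t2": "11"}

rng = random.Random(42)
trials = 40000
counts = {t: 0 for t in CODE}

for _ in range(trials):
    s = ""
    chosen = None
    while chosen is None:
        s += str(rng.randint(0, 1))      # fair "message" bits
        for t, w in CODE.items():
            if s.startswith(w):          # first codeword that is a prefix
                chosen = t
                break
    counts[chosen] += 1

freqs = {t: counts[t] / trials for t in CODE}
```

The observed frequencies settle near (1/2, 1/4, 1/4), i.e. near 2^{−length}, as the argument requires.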
8. Discussion
We have defined a class of ‘empirical’ criteria for the
comparison of estimators based on their performance on data
or sequences of data. For sequences, there are empirical
criteria which estimate the bias and variance of estimators,
and any other conventional measure of performance
expressible as the expectation of some function of the true
parameter and the estimated parameter. We have shown the
existence of estimators called false oracles which no
empirical criterion can distinguish from the oracle
(whose ‘estimates’ are the true values). A false oracle can be
constructed by randomly selecting an estimate from its
posterior distribution.
The SMML estimator for a sequence has, in two cases, been
shown to imitate a false oracle. In the first case,
applicable whenever the parameters have slowly-varying
priors and approximately quadratic log likelihoods, the SMML
estimator is based on a quantizing lattice. The estimate
θ_{j} of the jth parameter is a
(discontinuous) function of all the data
{x_{1}, x_{2}, ..., x_{j},
..., x_{J}}
but, if one regards the data values other than x_{j}
as being randomly drawn from their respective marginal distributions,
θ_{j} is effectively sampled from a
distribution approximating its posterior given x_{j}.
In the second, more general case, we have shown that a
near-optimal coding of a data sequence (and so a near-SMML estimator)
is obtained when each estimate θ_{j}
is sampled from its posterior distribution given x_{j}
by using the coded data x_{j+1}...x_{J} as a
pseudorandom source of binary digits.
In both cases, a pseudorandom selection of θ_{j}
is made using (some of) the data other than x_{j} as
a source of “random” information. Thus the SMML estimator
is not truly a false oracle. A criterion based on
knowledge of the “pseudorandom number generator” inherent
in both coding algorithms could empirically distinguish the
SMML estimator from an oracle. However, since in both cases
the functional dependence of θ_{j} on data other
than x_{j} is quite obscure, for all practical purposes
one may expect the statistical behaviour of the SMML
estimator for a long sequence to be indistinguishable from
that of a false oracle, and hence from that of an oracle.
9. Applications
The analysis of the behaviour of minimum-message-length
estimators on long sequences of identical basic problems has
practical application. There are practical estimation
problems where the model includes not only a fixed set α
of general parameters but also a set of ‘nuisance’
parameters {θ_{n}, n=1...N} as numerous as the
sample size N. For instance, in a
factor analysis model
of a multivariate Gaussian population, once the means,
specific variances and factor loadings are known or
estimated, the estimation of the factor scores becomes a
sequence of N identical independent estimation problems.
The problem fits the assumptions of section 6, so the
analysis allows us to determine the length of the coded
message. In particular, the analysis shows that the message
length L contains a term (1/2) log(π N) − γ
rather than the larger term (N/2) log(2 π e / 12)
which is given by an analysis treating the estimation of each factor score
separately. The consequent reduction in message length
affects the comparison of the factor model against
alternative models of the same data, since the
minimum-message-length approach prefers the model with the
smaller (minimized) value of L (Wallace & Freeman 1992).
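The size of this reduction is easy to see numerically. The sketch below simply compares the two terms for an illustrative N, taking γ to be Euler's constant (an assumption on my part, as the paper does not define γ here).

```python
import math

N = 1000                       # number of factor scores (illustrative)
gamma = 0.5772156649           # Euler's constant (assumed meaning of the term)

# term from the sequence analysis of section 6
joint = 0.5 * math.log(math.pi * N) - gamma

# term from treating each factor score as a separate estimation problem
separate = (N / 2.0) * math.log(2.0 * math.pi * math.e / 12.0)
```

For N = 1000 the sequence term is a few nats while the per-score term is well over a hundred, which is the reduction the text describes.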
A second example is a
mixture model,
in which each datum x_{n}
is thought to be sampled from an unknown member of a set of
different distributions
{f_{k}(x|α_{k}), k=1...K}.
Here, once the distribution parameters
α_{k} and relative abundances h_{k} are known or
estimated, the estimation of the class (i.e. distribution)
k = θ_{n} to which each datum belongs is a sequence
of identical basic problems each requiring the estimation of
a categorical parameter θ_{n}.
If each basic problem is considered separately, the shortest
encoding is obtained by estimating θ_{n} to be
that value of k for which
h_{k} f_{k}(x_{n}|α_{k})
is a maximum.
That is, datum x_{n} would be assigned to that component
of the mixture most likely to contain it. The resulting
message length L would then have the form
 L =
Q{α_{k}, h_{k}, k=1...K}
 − Σ_{n} log [ max_{k} (h_{k} f_{k}(x_{n}|α_{k})) ]
Estimation of the distribution parameters and abundances by
minimization of the above form has been found to give
inconsistent estimates. However, the analysis of section 7
shows that a shorter coding is possible, and has a message
length for large N of the form
 L =
Q{α_{k}, h_{k}}
 − Σ_{n} log Σ_{k} h_{k} f_{k}(x_{n}|α_{k})
+ O(1)
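The gap between the two codings can be illustrated with a toy mixture (an assumed example, not one from the paper): since max_{k} ≤ Σ_{k}, the mixture form is shorter term by term.

```python
import math
import random

def normal_pdf(x, m):
    # unit-variance Gaussian density
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2.0 * math.pi)

h = [0.6, 0.4]                 # abundances h_k (illustrative)
mu = [0.0, 3.0]                # component means (illustrative)

rng = random.Random(1)
data = [rng.gauss(mu[0] if rng.random() < h[0] else mu[1], 1.0)
        for _ in range(500)]

# data term when each datum is assigned to its most probable class
L_max = -sum(math.log(max(h[k] * normal_pdf(x, mu[k]) for k in range(2)))
             for x in data)

# data term under the mixture coding of section 7
L_mix = -sum(math.log(sum(h[k] * normal_pdf(x, mu[k]) for k in range(2)))
             for x in data)
```

On such overlapping components L_mix is strictly smaller than L_max, which is the shorter coding the analysis promises.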
In the mixture problems so far studied, minimization of the
latter form yields consistent estimates for {α_{k}}
and {h_{k}}. The smaller value of the latter
form also affects the comparison between the lengths for
models with different numbers of component distributions,
and so affects the estimation of the number of components.
It is perhaps worth mentioning that the coding device
described in section 7 has been used successfully in a
computer program for file compression.
Acknowledgements: This work was done with the assistance of
a study grant from Monash University. The work owes much to
the informed comment and assistance of P.R.Freeman.
References
 Barron, A.R. & Cover, T.M.,
Minimum Complexity Density Estimation,
IEEE Trans. Inf. Thy., 37(4), pp.1034-1054, (1991).
 Rissanen, J.,
Stochastic Complexity,
J. Royal Stat. Soc. B, 49, 3, pp.228-239, (1987).
 Wallace, C.S. & Freeman, P.R.,
Estimation & Inference by Compact Coding,
J. Royal Stat. Soc. B, 49, 3, pp.240-252, (1987).
 Wallace, C.S. & Freeman, P.R.,
Single Factor Analysis by MML Estimation,
J. Royal Stat. Soc. B, 54, 1, pp.195-209, (1992).
 Zador, P.,
Asymptotic Quantization Error of Continuous Signals and the Quantization Dimension,
IEEE Trans. Inf. Thy., IT-28, pp.139-149, (1982).
© C. S. Wallace.