
Proc. Int. Conf. on Information, Statistics and Induction in Science
(ISIS), pp. 304–316, 1996
(based on TR 128, Dept. Computer Science, Monash U., 1989)
1. Introduction
Recently there has been interest in a general method of
statistical estimation and inference variously known as
Minimum Message Length
(MML)
(Wallace & Freeman, 1987) or
Minimum Description Length (MDL) (Rissanen 1987). The
method is an application of a general approach to inductive
inference first proposed by Solomonoff (1964), and closely
related to the notion of Algorithmic Complexity due to
Kolmogorov (1965). The method is Bayesian,
and is based on the idea of choosing an
hypothesis and estimates of its parameters which together
allow a concise encoding of the data. The method is of
great generality, and in those specific estimation problems
which have so far been addressed using MML, gives estimates
similar, but often superior in some respect such as bias, to
those obtained by the Maximum Likelihood (ML) method. In
particular, MML handles more gracefully problems where the
number of unknown parameters increases with sample size, and
can automatically select an appropriate hypothesis from a
family of hypotheses of increasing complexity.
[This page was created from what I believe to be the 1996 source file,
hopefully without adding too many errors, to make this MML paper more
readily accessible. LA, 2008.
NB. I have used θ for θ-hat and φ for φ-hat.]

The consistency of MDL for a wide range of problems has been
proved by Barron & Cover (1991), who also derive the ultimate
rate of convergence of the estimates with increasing sample
size for a range of problem types. Unfortunately, little is
known in general about the behaviour of MML estimates for
modest sample sizes. For instance, there have been no
general results on the bias or variance of estimates, or on
the error rates of the inbuilt hypothesis selection.
Wallace and Freeman (op cit) and Rissanen (op cit) have
argued that the estimates are close to sufficient, in the
sense of capturing all the relevant information available in
the data, but the arguments are outside conventional
statistical theory.
In this paper, we attempt to show that MML can be expected
to have good behaviour for modest samples. Consistency
arguments prove good behaviour in the limit of large
samples. We have not entirely avoided the need to appeal to
a limit, but rather than considering the behaviour as an
unbounded volume of data is used in a single estimation, we
consider the behaviour as the method is applied to an
unbounded set of (possibly unrelated) estimation problems,
each using only limited data. That is, we consider the
expected consequences of adopting the strategy of using MML
for all desired inferences. Sections 2 and 3 discuss a
proposed criterion of estimate quality for a single
estimation problem. Section 4 extends the criterion to
cover the behaviour of a method in a sequence of problems,
and relates it to conventional criteria such as bias. These
sections also define an “optimal” method according to this
criterion. Sections 6, 7 and 8 then show that the MML
method approaches this optimum.
Of the several criteria which are commonly used for judging
the quality of statistical estimators, none is entirely
satisfactory, at least by itself. Some, such as
consistency, are applicable only to families of estimators
for an unbounded range of sample sizes. Others, such as
bias and expected variance, are tied to particular
parameterizations of the problem, and may give different
judgements if the parameters to be estimated are subjected
to functional transformation. Others again, such as
sufficiency, can be satisfied by estimators with severe
deficiencies. In this paper, we propose yet another
criterion and investigate some of the properties of
estimators satisfying it. We give a constructive proof that
such estimators generally exist, and also show that Strict
Minimum Message Length (SMML) estimators (Wallace & Freeman
1987) in some circumstances closely approach satisfaction of
the criterion.
The framework of the discussion is Bayesian.
We wish to estimate the value of an unknown parameter θ
within a space Θ, given a prior density h(θ) on
Θ, an observed datum x within a space X, and a
probability density f(x|θ) (x∈X, θ∈Θ).
At times we use the joint density
p(θ,x) = h(θ)f(x|θ),
the marginal density of x,
r(x) = ∫ dθ p(θ, x), and
the posterior density g(θ|x) = p(θ, x) / r(x).
In some parts of the discussion we prefer
to treat x as a discrete datum within a countable set X.
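The quantities defined above can be sketched concretely on a discrete grid; the following is an illustrative sketch in which the particular prior, likelihood and grids are our own choices, not the paper's:

```python
# Illustrative sketch of the Bayesian framework on a small discrete grid:
# a uniform prior h over three parameter values and a Bernoulli likelihood f.
# The specific numbers are ours, purely for illustration.
thetas = [0.2, 0.5, 0.8]            # parameter space Theta (discretized)
xs = [0, 1]                         # data space X (one Bernoulli observation)

def h(theta):                       # prior density h(theta) (uniform)
    return 1.0 / len(thetas)

def f(x, theta):                    # likelihood f(x|theta)
    return theta if x == 1 else 1.0 - theta

def p(theta, x):                    # joint density p(theta, x) = h(theta) f(x|theta)
    return h(theta) * f(x, theta)

def r(x):                           # marginal r(x) = sum over theta of p(theta, x)
    return sum(p(t, x) for t in thetas)

def g(theta, x):                    # posterior g(theta|x) = p(theta, x) / r(x)
    return p(theta, x) / r(x)
```

Observing x = 1, for instance, shifts the posterior towards the larger parameter values.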
2. Empirical Criteria
Definitions
 An estimator is a function from X into Θ,
which we shall write as θ = T(x).
 A stochastic estimator is an estimator whose value
θ = T_{s}(x) for
any datum x is a realization of a random variable with
probability density G_{s}(θ|x).
 A deterministic estimator is not stochastic.
 An honest estimator is an estimator whose value for
given x is independent of the true parameter value θ.
 The oracle is a distinguished estimator T_{t} such
that T_{t}(x) = θ,
the true parameter value. It is
thus a ‘perfect’ estimator, but is of course neither
realizable nor honest. Apart from the oracle, all other
estimators considered will be honest.
 A trial is an instance of an estimation problem described by
Θ, X, h(θ), f(x|θ) yielding a
datum x which is then supplied to two estimators
T_{0} and T_{1} to give two estimates
θ = T_{0}(x),
φ = T_{1}(x).
In a trial, the estimators are treated as black boxes.
That is, the internal mechanisms of the
estimators are invisible, and any comparison of the
estimators can be based only on the datum x and the two
estimates θ and φ.
The comparison may involve use
of the prior information Θ, X, h(θ) and f(x|θ).
Formalizing the notion of a comparison:
 A criterion c is a function from
(X, Θ, Θ) to the real line.
The outcome of a trial using c is
written c(x, θ, φ).
For some criteria, negative
values of c represent a preference for T_{0} and
positive values a preference for T_{1}. That is, for the
observed value of x, c(x, θ, φ) < 0
indicates that the criterion c judges θ to be a
better estimate than φ, for instance because
θ has the higher likelihood. All criteria
are assumed to be independent of the true parameter θ.
 A fair criterion is a criterion such that
for all a, b ∈ Θ and x ∈ X,
c(x,a,b) = −c(x,b,a).
This definition is intended to exclude criteria which have
an inbuilt preference for one of the two estimators. As the
internal workings of the estimators are invisible, no such
preference would be rational.
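As a concrete illustration, the following sketch defines a likelihood-based criterion for a normal model and checks the fairness condition c(x,a,b) = −c(x,b,a); the model and the criterion are our own illustrative choices, not the paper's.

```python
import math

# A fair criterion for x ~ N(theta, 1): the log-likelihood advantage of
# estimate a over estimate b. Negative values prefer the first estimate,
# matching the text's sign convention. The normal model is our choice.
def f(x, theta):                     # likelihood f(x|theta)
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2 * math.pi)

def c(x, a, b):
    # c < 0 exactly when a has the higher likelihood, i.e. a is judged better
    return math.log(f(x, b)) - math.log(f(x, a))
```

Swapping the two estimates reverses the sign, so no inbuilt preference for either black box is possible.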
The above definitions are intended to provide a framework
for the empirical comparison of estimators, i.e. comparison
based on the observable behaviour of estimators rather than
on their structure. Of course, the outcome of a single
trial using some criterion is a function of the value of x
realized in the trial, and so the preference for one
estimator over the other will be subject to the accidents of
the particular trial. Later, we will discuss a class of
estimation problems for which a single trial might be
sufficient to determine a preference reliably. However,
we now retreat somewhat from the ideal of a purely empirical
comparison, and define the expected outcome of a trial.
The expected value of a trial of estimators T_{0}
and T_{1} using criterion c is the expected value of
c(x, T_{0}(x), T_{1}(x)),
written c(T_{0}, T_{1}),
where the expectation is taken over the joint
distribution of x and θ. Where one or both of the
estimators are stochastic, the expectation is also taken
over their value distributions.
 For T_{0}, T_{1} honest and deterministic, the
estimators are independent of θ and the expectation is
simply over the marginal distribution of x.
 c(T_{0}, T_{1})
= ∫ dx
r(x) c(x, T_{0}(x), T_{1}(x))
 For T_{0}, T_{1} honest, T_{1} stochastic,
T_{0} deterministic,
 c(T_{0}, T_{1})
= ∫ dx
∫ dφ
r(x) G_{1}(φ|x)
c(x, T_{0}(x), φ)
 and for T_{0}, T_{1} honest and stochastic
 c(T_{0}, T_{1})
= ∫ dx
∫ dθ
∫ dφ
r(x) G_{0}(θ|x)
G_{1}(φ|x)
c(x, θ, φ)
 For the special case that one estimator, T_{0} say,
is the oracle T_{t} and T_{1} is deterministic,
 c(T_{t}, T_{1})
= ∫ dθ
∫ dx
h(θ) f(x|θ) c(x, θ, T_{1}(x))
 and for T_{0} the oracle, T_{1} stochastic
 c(T_{t}, T_{1})
= ∫ dθ
∫ dx
∫ dφ
h(θ) f(x|θ) G_{1}(φ|x)
c(x,θ, φ)
 Note that in all cases, for c fair,
c(T_{0}, T_{1})
= −c(T_{1}, T_{0}).
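The expectation in the deterministic honest case can be checked by simulation; the following Monte Carlo sketch uses a normal prior and likelihood with two simple estimators and a fair quadratic criterion, all of which are our own illustrative choices.

```python
import math, random

random.seed(0)

# Monte Carlo sketch of the expected trial value c-bar(T0, T1) for two honest
# deterministic estimators, with theta ~ N(0,1) and x ~ N(theta,1).
def T0(x):
    return x            # maximum-likelihood estimate

def T1(x):
    return x / 2.0      # posterior mean for this prior

def c(x, a, b):         # a fair criterion: squared-error-to-x advantage
    return (x - a) ** 2 - (x - b) ** 2

def expected_trial_value(Ta, Tb, n=100000):
    total = 0.0
    for _ in range(n):
        theta = random.gauss(0, 1)      # theta drawn from the prior h
        x = random.gauss(theta, 1)      # x drawn from f(.|theta)
        total += c(x, Ta(x), Tb(x))
    return total / n
```

Here c(x, T0(x), T1(x)) = −x²/4, so the expected trial value is −E[x²]/4 = −0.5, and by fairness the reversed trial gives +0.5.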
3. False Oracles
A false oracle is an estimator T_{f} such that for
all fair criteria c, and for T_{t} the oracle,
c(T_{f}, T_{t}) = 0.
That is, no fair criterion will be expected to prefer the
oracle to a false oracle, or vice versa.
3.1 Theorem:
 For T_{f} a false oracle, T_{t} the oracle,
T_{1} honest and c a criterion not necessarily fair
c(T_{f}, T_{1}) = c(T_{t}, T_{1}).

 Proof: Consider the function
 c*(x,a,b) = c(x,a,T_{1}(x)) − c(x,b,T_{1}(x)),
a,b ∈ Θ
 where, if T_{1} is stochastic, we have the same
realization of T_{1}(x) in both places.
Since T_{1} is honest, c* is not a function of θ.
Also, c*(x,a,b) = −c*(x,b,a).
Hence c* is a fair criterion, so by definition of T_{f},
 c*(T_{t}, T_{f}) = 0
 But
 c*(T_{t}, T_{f})
= E{c(x, T_{t}(x), T_{1}(x))
− c(x, T_{f}(x), T_{1}(x))}
 = E{c(x, T_{t}(x), T_{1}(x))}
− E{c(x, T_{f}(x), T_{1}(x))}
 = c(T_{t}, T_{1})
− c(T_{f}, T_{1})
Q.E.D.
This theorem shows that, while we cannot in practice compare
an estimator with the oracle, since the oracle is
unavailable, we can instead compare the estimator with a
false oracle and get the same result in expectation.
 3.2 Corollary:
With the same conditions as for the theorem,
 c(T_{1}, T_{f})
= c(T_{1}, T_{t})
 Also,
 c(T_{f}, T_{t})
= c(T_{f}, T_{f})
= c(T_{t}, T_{f})
3.3 Corollary:
With the same conditions as for the theorem, the value on a trial of
c(x, T_{f}(x), T_{1}(x)) is as good an estimate of
c(T_{t}, T_{1}) as is the value on a trial of
c(x, T_{t}(x), T_{1}(x))
= c(x, θ, T_{1}(x)).
 Proof:
Define
c_{k}(x, θ, φ)
= (c(x, θ, φ))^{k}
 Then c_{k} is a criterion, so
c_{k}(T_{f}, T_{1})
= c_{k}(T_{t}, T_{1})
 But c_{k}(T_{0}, T_{1})
is the k^{th} moment of the sampling distribution of
c(x, T_{0}(x), T_{1}(x)).
 Hence c(x, T_{f}(x), T_{1}(x)) and
c(x, T_{t}(x), T_{1}(x))
have the same sampling distribution.
3.4 Theorem:
 False oracles exist in sufficiently regular cases.

 The proof is constructive:
 Let T_{p} be the stochastic estimator described by the density
 G_{p}(θ|x)
= g(θ|x) = h(θ) f(x|θ) / r(x)
= p(θ,x) / r(x).
 For any criterion c,
 c(T_{p}, T_{t})
= ∫ dθ
∫ dx
∫ dφ
h(θ) f(x|θ) g(φ|x) c(x,φ,θ)
 = ∫ da
∫ dx
∫ db p(a,x) p(b,x) c(x,b,a) / r(x),
(a = θ ∈ Θ, b = φ ∈ Θ)

 c(T_{t}, T_{p})
= ∫ dθ
∫ dx
∫ dφ
h(θ) f(x|θ) g(φ|x) c(x,θ,φ)
 = ∫ db
∫ dx
∫ da
p(b,x) p(a,x) c(x,b,a) / r(x),
(b = θ, a = φ)

 The expressions for c(T_{p}, T_{t}) and
c(T_{t}, T_{p}) differ only in the order of integration.
Hence, for sufficiently regular h, f and c, they are equal,
 c(T_{t}, T_{p}) = c(T_{p}, T_{t})
 But, if c is fair,
c(T_{t}, T_{p}) = −c(T_{p}, T_{t})
 Hence c(T_{p}, T_{t}) = 0 for all fair c,
so T_{p} is a false oracle.
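Theorem 3.4 can be illustrated by simulation. In the conjugate normal case below (our choice, not the paper's) the posterior is available in closed form, so T_{p} is easy to realize, and a fair criterion's expected trial value against the oracle should vanish.

```python
import math, random

random.seed(1)

# Simulation sketch of theorem 3.4. With prior theta ~ N(0,1) and
# x ~ N(theta,1), the posterior is N(x/2, 1/2); the false oracle T_p draws
# its estimate from that posterior. Model and criterion are our choices.
def c(x, a, b):                      # a fair criterion
    return (x - a) ** 2 - (x - b) ** 2

def mean_trial_value(n=200000):
    total = 0.0
    for _ in range(n):
        theta = random.gauss(0, 1)                    # true parameter
        x = random.gauss(theta, 1)                    # datum
        phi = random.gauss(x / 2.0, math.sqrt(0.5))   # T_p: posterior draw
        total += c(x, phi, theta)                     # T_p versus the oracle
    return total / n
```

The sample mean of the criterion settles near zero, as the order-of-integration argument predicts.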
3.5 The Merit of False Oracles
False oracles deserve to be considered good estimators in
that no fair criterion is expected to show a preference for
the true parameter value over the estimate of a false
oracle. Indeed, the above results show that no test based
only on the information available to the estimator can
reliably distinguish an oracle from a false oracle. With
the more conventional criteria applied to estimators (some
of which will be further discussed below) it is often the
case that an estimator satisfying one criterion (e.g. lack
of bias) will not satisfy others (e.g. maximum likelihood).
However, for the empirical criteria discussed here, we have
shown that there exists a class of estimators which give
results as good as the true values by all empirical tests, since
c(T_{f}, T_{t}) = c(T_{t}, T_{f})
for all c, fair and otherwise.
Note that it is not true that all criteria, or even all fair
criteria, will prefer a false oracle to an honest estimator
which is not a false oracle. Suppose x and θ are
commensurable, and that x is distributed as N(θ,1).
The fair criterion
 c(x, θ, φ)
= (x − θ)^{2}
− (x − φ)^{2}
will always prefer the estimator θ = x
to a false oracle.
It will also, of course, always prefer θ = x to the
oracle's estimate, the true parameter value,
so the import of its preference is dubious.
The construction of the false oracle T_{p} amounts to a
rule for characterizing the posterior distribution of
θ by a value (estimate) drawn randomly from the posterior.
It may seem a strange rule compared with, say, choosing the
mean, median or mode of the posterior, but it has the
advantage of being invariant under arbitrary
measure-preserving transformations of the parameter space.
4. Sequence Trials
We now attempt to define a class of estimation problems, and
associated trials, which allow a comparison between
our ‘empirical’ criteria and some more conventional ones.
A sequence problem is an ordered sequence of
independent instances of the same basic problem. For a
basic problem described by
Θ, X, h(θ), f(x|θ),
a sequence problem of length J has parameter space
Θ^{J} and data space X^{J}.
The data are an ordered sequence of values
{x_{j}, j=1 ... J} from which
an estimator must form an estimate which is an ordered
sequence {θ_{j}, j=1 ... J},
where θ_{j}
is an estimate of the true value θ_{j} of the parameter
in the j^{th} instance of the basic problem.
The prior density for the true sequence parameter
{θ_{j}, j=1 ... J} is
∏_{j} h(θ_{j})
and the conditional data density is
∏_{j} f(x_{j}|θ_{j}).
Note that we assume the instances are wholly
independent: data from early instances do not lead to
revision of the prior or conditional data density in later
instances.
An estimator T for a sequence is a function from X^{J}
into Θ^{J}.
We do not require that the estimator
have the form of J instances of the same function applied
in turn to {x_{j}}. The ‘black box’ model of an
estimator is not required to produce any output until it has
the entire data sequence, so θ_{j}
may depend on all components of
{x_{k}, k=1 ... J}, not just x_{j}.
In a sequence trial of estimators
T_{0}, T_{1} using criterion c,
c is a function
c({x_{j}, θ_{j}, φ_{j}, j=1...J})
which operates on the data sequence and the two estimate sequences to yield a
single value, which need not be expressible as a combination
of J applications of a simpler function.
Since a sequence problem and sequence trial are just special
cases of the problems and trials defined earlier, all
definitions and results carry over, with only the notational
difference that now the symbols Θ, X, h( ) and f( )
refer to the component basic problems of the sequence rather
than to the sequence problem as a whole. Note that, since
the basic problems are independent, the posterior density of
{θ_{j}, j=1...J} given
{x_{j}, j=1...J} is
∏_{j} g(θ_{j}|x_{j}).
Thus the false
oracle T_{p} of theorem 3.4,
which picks an estimate sequence randomly
from this posterior density, may be considered to choose
randomly and independently a value from each basic posterior
g(θ_{j}|x_{j}).
4.1 Sequence Criteria:
Using sequence trials, we can define empirical criteria
corresponding to some conventional criteria. For D a
function from Θ to the real line, v∈Θ, δ
a small constant and J large, define
 D_{v}{x_{j}, θ_{j}, φ_{j}, j=1...J}
=
Σ_{j∈S} D(θ_{j}) / N_{S}
where S is the set
{j : |φ_{j} − v| < δ} and
N_{S} is the cardinality of S.
D_{v} is the
average value of D(θ_{j}) over those
components of the sequence trial in which φ_{j}
approximately equalled v.
Its expected value,
D_{v}(T_{0}, T_{1})
approximates the expected value of D(θ)
given φ = v.
Letting D_{v} be the function
B_{v}(θ) = θ − v,
call
B_{v}(T_{0}, T_{1})
the bias of T_{0} relative to T_{1}.
It is, of course, a function of v.
For T_{1} the oracle,
B_{v}(T_{0}, T_{t})
is simply the expected value of the estimate
θ minus the true value,
given that the true value is v.
Hence B_{v}(T_{0}, T_{t})
corresponds to the usual definition of the bias of T_{0}.
While T_{t} is unavailable,
we have
B_{v}(T_{0}, T_{f})
= B_{v}(T_{0}, T_{t})
for T_{f} a false oracle.
Hence the bias of T_{0} can be
estimated by measuring the bias of T_{0} relative to T_{f}
in a sequence trial.
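A sequence trial of this kind is easy to simulate. In the sketch below the model is a conjugate normal one of our own choosing, and T_{0} is given a deliberate bias of 0.3 so that the measurement has something to detect; the false oracle is the posterior-sampling estimator T_{p}.

```python
import random

random.seed(2)

# Sequence-trial sketch of section 4.1: estimate the bias of T0 relative to
# the false oracle T_p (a posterior draw). Prior N(0,1), x ~ N(theta,1), so
# the posterior is N(x/2, 1/2). All specific choices here are ours.
def relative_bias(v=0.0, delta=0.1, J=200000):
    total, count = 0.0, 0
    for _ in range(J):
        theta = random.gauss(0, 1)
        x = random.gauss(theta, 1)
        est0 = x + 0.3                              # T0: ML estimate plus a bias
        phi = random.gauss(x / 2.0, 0.5 ** 0.5)     # false-oracle estimate
        if abs(phi - v) < delta:                    # component j falls in S
            total += est0 - v
            count += 1
    return total / count
```

The relative bias measured at v = 0 recovers T_{0}'s built-in bias of 0.3, with no access to the true parameter values.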
Letting T_{0} be the false oracle T_{p},
 B_{v}(T_{p}, T_{t})
= B_{v}(T_{p}, T_{p})
= ∫ dx
∫ dθ
f(x|v) g(θ|x) (θ − v)
This is not, in general, zero. However, note that
B_{v}(T_{t}, T_{p})
= B_{v}(T_{p}, T_{t}).
The ‘bias’ of the oracle relative to a false oracle T_{f}
equals the bias of T_{f}, so empirical measurement of
relative bias does not serve to distinguish false and true oracles.
Also, note that
B_{v}(T_{t}, T_{t})
≠ B_{v}(T_{t}, T_{f}),
since T_{t} is not honest.
Of course,
B_{v}(T_{t}, T_{t}) = 0
for all v.
Similarly, letting D_{v} be the function
V_{v}(θ) = (θ − v)^{2},
the usually-defined variance of T_{0}
is V_{v}(T_{0}, T_{t}),
and can be empirically measured by the value of V_{v} in a
sequence trial of T_{0} and a false oracle T_{f}.
For the false oracle T_{p},
 V_{v}(T_{p}, T_{t})
= V_{v}(T_{p}, T_{p})
= ∫ dx
∫ dθ
f(x|v) g(θ|x) (θ − v)^{2}
This expression
shows that the variance of a false oracle is
the variance of the convolution over x of f and g.
For h(θ) slowly varying, it is about twice the
variance of a conventional Bayes estimator with quadratic loss function.
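The factor of two is visible in a simulation. The sketch below uses a nearly flat normal prior of our own choosing, so the posterior variance is about 1 while a posterior-draw estimate has variance about 2 around the true value.

```python
import random

random.seed(3)

# Sketch of the closing claim of section 4.1. With a slowly-varying prior
# (here N(0, 10000), effectively flat over the region of interest) and
# x ~ N(theta, 1), the posterior is N(wx, w) with w = 10000/10001 ~ 1.
# A posterior-draw estimate then varies about the true value v with
# variance ~2: twice the ~1 a quadratic-loss Bayes estimator attains.
def false_oracle_variance(v=0.0, n=200000):
    w = 10000.0 / 10001.0                    # posterior shrinkage and variance
    total = 0.0
    for _ in range(n):
        x = random.gauss(v, 1)               # x ~ f(.|v)
        est = random.gauss(w * x, w ** 0.5)  # draw from posterior N(wx, w)
        total += (est - v) ** 2
    return total / n
```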
5. SMML Estimators
Strict Minimum Message Length (SMML)
estimation has been described elsewhere (Wallace & Freeman 1987).
We use the same Bayesian framework defined by
Θ, X, h(θ), f(x|θ) etc. as before,
but initially it is convenient to consider x to be a discrete variable,
so X is a countable set, and
f(x|θ) and r(x) are probabilities rather than densities.
The SMML estimator θ = m(x) has a range Θ*
= {θ_{i}, i=1, 2, ...}
= {θ : θ = m(x), x∈X}
which is a countable subset of Θ.
Define X_{i} = {x : m(x) = θ_{i}}
and q_{i} = q(θ_{i})
= Σ_{Xi} r(x) for all i.
Then m( ) minimizes
 I_{1} = − Σ_{X} r(x) log[ q(m(x)) f(x|m(x)) ]
 = − Σ_{i} [ q_{i} log q_{i}
+
Σ_{Xi} r(x) log f(x|θ_{i}) ]
The estimator is derived by considering the expected length
of a message encoding x and decodeable given knowledge of
X, Θ, f(x|θ) and h(θ).
By standard coding theory, the expected length cannot be less than
I_{0} = − Σ_{X} r(x) log r(x),
where the base of logarithms depends on the coding alphabet,
e.g., 2 for binary coding.
However, we consider a message having the form of a
statement of an estimate θ, followed by an
encoding of x using a code which would be optimal
(i.e. have least expected length)
were the true parameter θ equal to the stated estimate.
Using optimal coding for the statement of θ, the
length of the message encoding x is
L(x) = − log q(m(x)) − log f(x|m(x)).
(Actually, the length of a
real message might exceed L(x) by as much as one symbol,
since real messages must have integral lengths. However, we
neglect this roundingup effect.) The first term of L(x)
gives the length of the encoding of the estimate
θ = m(x).
Note that q(θ_{i}) is the
unconditional probability that θ_{i} will be the
estimate. The second term is the length of the encoding of x
using a code optimal if θ = m(x).
I_{1} is the expected value of L(x).
Since I_{0} is independent of the estimator, the SMML
estimator minimizes
 I_{1} − I_{0}
= − Σ_{X} r(x) log [q(m(x)) f(x|m(x)) / r(x)]
If the discretization of the data is so fine that we can
consider X as a continuum, we can revert to the framework
of the earlier sections, and the SMML estimator minimizes
 I_{1} − I_{0}
= − ∫_{X}
r(x) log[q(m(x)) f(x|m(x)) / r(x)] dx,
 where
 q(θ_{i})
= ∫_{Xi} r(x)dx.
Θ* remains a countable subset of Θ,
but I_{1} and I_{0} are not individually definable.
Minimization of I_{1} − I_{0}
requires a compromise in the size of the region X_{i}
within which all datum values lead to the same estimate θ_{i}.
Large regions have large values of q_{i}, leading to a short
statement of θ_{i}, but if the region is too large,
it will contain values of x for which
f(x|θ_{i}) is much less than
f(x|θ_{m}) where
θ_{m} is the maximum-likelihood estimate for x,
so the expected length of the statement of x will
be greater than for smaller regions.
We now show that in a sequence of basic estimation problems
such as was described in section 4, the SMML estimator
approximates a false oracle. Two cases are considered.
In the first, which applies to basic problems with
approximately quadratic log-likelihoods, a fairly exact
analysis is possible.
treatment of a more general sequence problem. Throughout,
j indexes the component problems of a sequence of J
basic problems, and i indexes the members of Θ*,
the range of an SMML estimator.
6. A Simple Case
Assume that for all j,
f(x_{j}|θ_{j}) =
(1 / (√(2 π) σ))
e^{−(x_{j}−θ_{j})^{2} / 2σ^{2}},
σ known, and
h(θ_{j}) = a,
a constant over a large range.
If SMML estimation were applied separately to each problem
of the sequence, we would obtain for each j an estimator
θ_{j} = m(x_{j})
with the same form for every j.
We drop the j suffix and consider one estimator m(x).
Its range is simply a set of
equallyspaced values of θ, with spacing s say.
The SMML estimator maps x into the member of the set most
nearly equal to x.
Since, over the range of interest,
r(x) = h(θ) = a,
we have that for every
θ_{i} ∈ Θ*,
X_{i} is the interval
{x: θ_{i} − s/2 <
x < θ_{i} + s/2}
and q_{i} = as.
The estimator minimizes the expected value of
 − log q(m(x)) − log f(x|m(x)) + log r(x).
Since all of the regions X_{i} have the same pattern,
the expectation can be taken over any one of them, giving
 I_{1} − I_{0}
 = − log(as)
+ (1/2) log(2 π)
+ log σ
+ (1/s) ∫_{θ_{i}−s/2}^{θ_{i}+s/2} dx
(x − θ_{i})^{2}
/ (2 σ^{2})
+ log a
 = − log s
+ (1/2) log(2 π)
+ log σ
+ s^{2} / (24 σ^{2})
 Minimization with respect to s gives
 s^{2} = 12 σ^{2},
 I_{1} − I_{0} = (1/2) log (2 π e / 12).
 The result of applying this estimator to all J problems of
the sequence is therefore an expected message length
increment for the whole sequence given by
 I_{1} − I_{0} = (1/2) J log(2 π e / 12).
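The single-problem optimization can be checked numerically; the sketch below minimizes the per-problem increment by grid search, with σ = 1 and a grid resolution that is our own pragmatic choice.

```python
import math

# Numerical check of the single-problem result of section 6, for sigma = 1:
# minimize -log s + (1/2) log(2 pi) + s^2/24 over the spacing s.
def increment(s, sigma=1.0):
    return (-math.log(s) + 0.5 * math.log(2 * math.pi)
            + math.log(sigma) + s * s / (24 * sigma * sigma))

# simple grid search over s in (0, 10); the resolution is an arbitrary choice
best_s = min((k / 1000.0 for k in range(1, 10000)), key=increment)
```

The minimizing spacing satisfies best_s² ≈ 12, and the minimized increment is ½ log(2πe/12) ≈ 0.1765 nats per problem, as derived above.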
This result is not the best that can be achieved.
 An SMML estimator for the whole sequence problem minimizes
the expected value of
 − log q(m(x))
− Σ_{j} log f(x_{j}|m_{j}(x))
+ log r(x)
 where now the estimator m(x)
maps the sequence or vector of data
x = (x_{1}, x_{2} ... x_{j} ...)
into a vector
 θ
= (θ_{1}, θ_{2} ... θ_{j} ...)
= (m_{1}(x), m_{2}(x), ...,
m_{j}(x), ...),
 and r(x) = a^{J}.
 Thus the SMML estimator minimizes the expected value of
 L(x) =
− log q(m(x))
+ (J/2) log (2 π)
+ J log σ
+ (m(x)
− x)^{2} / (2 σ^{2})
+ J log a
Its range Θ* is a countable set of vectors or points
in Jspace, {θ_{i}}.
Assume these points form a lattice.
Then
q(θ_{i}) = q_{i}
is the same for all i,
and for all i the region X_{i} within which
m(x) = θ_{i}
is the set of x values
closer to θ_{i} than to any other
member of Θ*, i.e., the Voronoi region of θ_{i}.
For Θ* a lattice, the Voronoi regions of all
its members have the same size and shape. Let s be the
volume of X_{i} for all i.
Then q_{i} = sa^{J} for all i.
The expectation of L(x) can be taken over
any Voronoi region, giving
 I_{1} − I_{0} =
− log (sa^{J})
+ (J/2) log (2 π)
+ J log σ
+ E(x − θ_{i})^{2} / (2 σ^{2})
+ J log a.
The expectation
E(x − θ_{i})^{2}
over the Voronoi region of θ_{i} depends on the volume
s and the shape of the region.
To separate these effects, write
E(x − θ_{i})^{2}
= J k_{J} s^{2/J},
where k_{J} depends only on the shape of X_{i},
i.e. the geometry of the lattice.
Then
 I_{1} − I_{0}
= − log s + (J/2) log (2 π)
+ J log σ + J k_{J} s^{2/J} / (2 σ^{2})
 Minimization with respect to s gives
 s^{2/J}
= σ^{2} / k_{J},
 I_{1} − I_{0}
= (J/2) log (2 π e k_{J})
Minimization with respect to the geometry of the lattice
requires choosing a lattice which, for a given density of
points, minimizes the mean squared distance of all points
from the nearest lattice point.
Lattices chosen in this way are called ‘quantizing lattices’.
Use of the single-problem SMML estimator separately for each
problem is equivalent to choice of a rectangular lattice, for which
k_{J} = 1/12 for all J.
However, for J > 1,
lattices exist with k_{J} < 1/12.
For instance, for J = 2, a lattice with hexagonal rather
than square Voronoi regions has
k_{J} = 0.080175.. < 1/12
and hence leads to a smaller value for
I_{1} − I_{0}.
Note that in the resulting SMML estimator, θ_{1}
is a function of both x_{1} and x_{2}.
If x_{1} is held constant and x_{2} is changed,
θ_{1} may assume in turn two or more different values,
all however close to x_{1}.
Similarly, θ_{2}
is a function of both x_{1} and x_{2},
but is always close to x_{2}.
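The square-versus-hexagonal comparison can be verified by Monte Carlo; the sketch below estimates k_{2} for a lattice given by two basis vectors, with sample size and the neighbourhood searched for the nearest lattice point being pragmatic choices of ours.

```python
import random

random.seed(4)

# Monte Carlo sketch of the J = 2 lattice comparison: estimate
# k_2 = E|x - nearest lattice point|^2 / (2 s), with s the cell area,
# by sampling points uniformly in the fundamental cell.
def k2(basis, n=60000):
    (ax, ay), (bx, by) = basis
    area = abs(ax * by - ay * bx)                 # cell volume s
    total = 0.0
    for _ in range(n):
        u, v = random.random(), random.random()   # point in fundamental cell
        px, py = u * ax + v * bx, u * ay + v * by
        # nearest lattice point, searched over a small neighbourhood of cells
        total += min((px - (i * ax + j * bx)) ** 2 + (py - (i * ay + j * by)) ** 2
                     for i in range(-2, 3) for j in range(-2, 3))
    return total / n / (2 * area)
```

The square lattice gives k_{2} ≈ 1/12 ≈ 0.0833 and the hexagonal lattice k_{2} ≈ 0.0802, reproducing the advantage quoted in the text.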
 For large J, the optimum quantizing lattice is not known.
However, Zador (1982) has shown that
 Γ(J/2+1)^{2/J} Γ(2/J+1) / (J π)
> k_{J}
> Γ(J/2+1)^{2/J} / ((J+2) π)
The upper bound is the value for a random set of points
(not a lattice) and the lower bound would be achieved only if the
Voronoi region were a Jsphere.
 Using
Stirling’s approximation
in the form
 log Γ(y+1)
= (y + 1/2) log y − y
+ (1/2) log(2 π)
+ O(1/y),

 log k_{J}
> (2/J) [(J/2 + 1/2) log(J/2) − J/2 + (1/2) log(2 π)]
− log(J+2)
− log π
+ O(1/J^{2})
 > − log(2 π e)
+ (log J + log π − 2) / J
+ O(1 / J^{2})
 Similarly, and using
log Γ(1+y) = − γ y + O(y^{2}),
where γ is Euler's constant,
 log k_{J} <
− log(2 π e) + (log J + log π − 2 γ) / J
+ O(1/J^{2})
Hence
(1/2) log(J π) − γ
> I_{1} − I_{0}
> (1/2) log(J π) − 1,
to order 1/J.
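Zador's bounds can also be evaluated exactly with gamma functions; the following sketch (our own code, not the paper's) checks that the bounds bracket the hexagonal value at J = 2 and close up as J grows.

```python
import math

# Numeric sketch of Zador's bounds on k_J, evaluated via math.lgamma to
# avoid overflow for large J.
def lower(J):   # sphere bound: Gamma(J/2+1)^(2/J) / ((J+2) pi)
    return math.exp((2.0 / J) * math.lgamma(J / 2.0 + 1)) / ((J + 2) * math.pi)

def upper(J):   # random-points bound: Gamma(J/2+1)^(2/J) Gamma(2/J+1) / (J pi)
    return (math.exp((2.0 / J) * math.lgamma(J / 2.0 + 1))
            * math.gamma(2.0 / J + 1) / (J * math.pi))
```

At J = 2 the hexagonal value 0.080188 lies between lower(2) = 1/(4π) ≈ 0.0796 and upper(2) = 1/(2π) ≈ 0.159, and the ratio of the bounds tends to 1 with increasing J.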
We see that, whereas use of a rectangular lattice, i.e.
repeated use of the one-problem SMML estimator, gives
I_{1} − I_{0} = (1/2) J log (2 π e / 12),
proportional to J,
the optimum SMML estimator has at worst
I_{1} − I_{0}
= (1/2) log (J π) − γ.
Although the shape of the optimum Voronoi region for large
J is unknown, the close approximation of the optimum
lattice to the behaviour of spherical regions strongly
suggests that the optimum Voronoi region is very nearly
spherical for large J.
 Making this assumption, the equation
s^{2/J} = σ^{2} / k_{J}
for its volume s gives for its radius ρ
 ρ^{J} π^{J/2} / Γ(J/2 + 1)
= s
= σ^{J} / k_{J}^{J/2}
Using the lower bound as an approximation for k_{J},
ρ = σ √(J+2)
If we consider one component of θ and x, i.e.
one estimation problem of the sequence,
the estimate θ_{j} is a function of
all components of x, not just
of x_{j}. If the other components are thought of as
random variables, all that is known of x is that it lies
in the subspace defined by x_{j}, and all that is known
of θ is that it lies somewhere in a sphere of
radius ρ centred on x. Thus, θ_{j} can be
regarded as a random variable generated by the projection of
a random point in this sphere onto the j axis. The
density of θ_{j} is therefore proportional to
 {ρ^{2}
− (θ_{j} − x_{j})^{2}}^{(J−1)/2}
= {(J+2) σ^{2}
− (θ_{j} − x_{j})^{2}}^{(J−1)/2}
For any J, this distribution has mean x_{j} and
standard deviation σ. For large J, it closely approximates
N(x_{j}, σ^{2}),
which is the posterior density of θ_{j} given x_{j}.
The SMML estimator thus produces estimates for the
individual problems of the sequence which have approximately
the behaviour of values chosen at random from the individual
posterior densities. That is, it approximates the false oracle T_{p}.
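This projection argument is easy to check by simulation; the sketch below (sample sizes and dimension are our choices) draws points uniformly from a J-ball of radius ρ = σ√(J+2) and projects them onto one axis.

```python
import math, random

random.seed(5)

# Simulation sketch of the closing argument of section 6: a point uniform in
# a J-ball of radius rho = sigma*sqrt(J+2), projected onto one axis, has
# mean 0 and standard deviation sigma, approaching N(0, sigma^2) for large J.
def projected_samples(J=20, sigma=1.0, n=50000):
    rho = sigma * math.sqrt(J + 2)
    out = []
    for _ in range(n):
        z = [random.gauss(0, 1) for _ in range(J)]       # random direction
        norm = math.sqrt(sum(t * t for t in z))
        radius = rho * random.random() ** (1.0 / J)      # uniform in the ball
        out.append(radius * z[0] / norm)                 # projection on axis 1
    return out
```

The sample mean is near 0 and the sample variance near σ² = 1, independent of J, as claimed.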
The results for this simple case are readily generalized to
any sequence of estimation problems in which the log-likelihood for
the sequence of data has an approximately quadratic
behaviour near its maximum, and where the prior for the
parameter sequence is slowlyvarying. It is not necessary
that all basic estimation problems in the sequence be
instances of the same problem, or even that they require
estimation of the same number of parameters. The results
depend only on the assumptions of slowly-varying priors
and approximately quadratic log-likelihoods.
7. A More General Case
We now describe a construction for a near-optimal encoding
of a data sequence which does not require smooth or
quadratic behaviour in the log prior and log likelihood
functions. As before, we consider a sequence of J
independent instances of the same basic problem described by
Θ, X, h(θ) and f(x|θ).
We assume x is a discrete variable.
In this construction, each estimate θ_{j} will
be selected from a set Θ* of equallyspaced values
{θ_{i}, i=1, 2 ...} whose spacing δ is
small compared with the expected estimation error of each θ_{j}.
Index i is used for indexing members of Θ*, and
index j for indexing problems of the sequence.
We assume
Σ_{i} δ h(θ_{i}) = 1.
The message is constructed as J segments.
Basically, each segment comprises two parts.
The first states an estimate θ_{j} of θ_{j},
selecting θ_{j} from Θ*,
and has length
 − log(δ h(θ_{j})).
The second encodes x_{j} using a code optimal if
θ_{j} = θ_{j}, and has length
− log f(x_{j}|θ_{j}).
The total length l_{j} of segment j is
 l_{j} =
− log(δ h(θ_{j}) f(x_{j}|θ_{j}))
= − log(δ p(θ_{j}, x_{j}))
However, the segments are combined into a complete message
in such a way that the length of the message is less than
Σ_{j} l_{j}.
As δ is small compared with the expected estimation
error of θ_{j}, there will be several members of Θ*,
separated by multiples of δ, any of which
would give a near-minimal value of l_{j}. The basis of
the construction is use of the freedom to choose any one of
several more-or-less equally good values for θ_{j}
so as to convey information about the following segment, j+1.
For definiteness, assume binary coding, so logarithms are to
base 2.
The construction proceeds backwards, in decreasing order of j.
Let M_{J+1} be the empty string,
of length L_{J+1} = 0.
Suppose that a string M_{j+1} of length L_{j+1}
has been constructed encoding estimates and data
for problems j+1, ..., J.
To form a string M_{j} covering
in addition problem j, proceed as follows:
 (a) Compute a distribution over Θ* given by
u_{ij}
= δ h(θ_{i})
f(x_{j}|θ_{i})
/ r*(x_{j}),
where
r*(x)
= δ Σ_{i}
h(θ_{i}) f(x|θ_{i}).
 (b) Construct an optimum prefix code C_{j} (e.g. a Huffman
code) for this distribution. C_{j} is a set of binary
strings, or words, {w_{ij}, i=1, 2,...}
in onetoone correspondence with the members of Θ*, such that no
word is a prefix of another, any sufficiently long binary
string has a word as prefix, and the length of w_{ij} is
approximately − log u_{ij}.
The approximation, which
arises because the lengths of words are necessarily
integral, will be neglected.
 (c) Find the value k of index i for which w_{kj} is a
prefix of M_{j+1}. If no such k exists, append
randomly-chosen binary digits to the end of M_{j+1}
until a match occurs. Let t_{j} be the number of digits appended.
 (d) Delete w_{kj} from the front of M_{j+1},
leaving a string M^{−}_{j+1}
of length
L^{−}_{j+1}
= L_{j+1} + log u_{kj} + t_{j}.
Note that if
t_{j} ≠ 0,
L^{−}_{j+1} = 0.
 (e) Construct a string S_{j} encoding datum x_{j} using
estimate θ_{j} = θ_{k}.
Its length is
l_{j}
= − log(δ p(θ_{k}, x_{j})).
 (f) Construct M_{j} by prepending S_{j}
to the front of M^{−}_{j+1}.
Steps (a) to (f) are repeated until the final complete
message M_{1} has been formed.
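Step (b) relies only on standard properties of Huffman codes; the toy sketch below (implementation details are ours) builds such a code and relies on the two properties the construction uses: the code is prefix-free and word lengths are close to −log₂ u_{i}.

```python
import heapq

# Toy sketch of step (b): a Huffman code over a small Theta* for a given
# distribution. Each heap entry holds (subtree weight, tiebreak, leaf ids).
def huffman(probs):
    heap = [(p, i, (i,)) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    codes = {i: "" for i in range(len(probs))}
    while len(heap) > 1:
        p1, _, leaves1 = heapq.heappop(heap)      # two lightest subtrees
        p2, t2, leaves2 = heapq.heappop(heap)
        for leaf in leaves1:
            codes[leaf] = "0" + codes[leaf]       # grow each word from the root
        for leaf in leaves2:
            codes[leaf] = "1" + codes[leaf]
        heapq.heappush(heap, (p1 + p2, t2, leaves1 + leaves2))
    return codes
```

For the dyadic distribution (1/2, 1/4, 1/8, 1/8) the word lengths are exactly −log₂ u_{i}, so the expected length equals the entropy, 1.75 bits; for general distributions the integral-length rounding the text neglects appears.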
We must show that M_{1} can be decoded using knowledge
only of h and f, and of the construction algorithm.
Decoding proceeds in order of increasing j.
The segment S_{1} for problem 1 appears in full as the
prefix of M_{1}. It is easily decoded. Its first part
is a Huffman or similar coding of θ_{1} based on the
known distribution δ h(θ_{i}) over Θ*.
Having decoded θ_{1}, the second
part can be decoded, as it is a Huffman or similar coding of x_{1}
based on the now-known distribution
f(x|θ_{1}) over X.
Having decoded S_{1}, we can delete it from the front of
M_{1}, leaving M^{−}_{2}.
Now, knowing x_{1}, we can compute the distribution
u_{i1} over Θ*, and hence reconstruct the code C_{1}.
Suppose the decoded estimate θ_{1} has
index k in Θ*. Then we can select w_{k1} from
C_{1}, and prepend it to M^{−}_{2}
giving M_{2}.
Since S_{2} appears in full as the prefix of M_{2},
it too can now be decoded, and so on until the entire
sequence of data (and estimates) has been recovered.
At the end of the decoding process, we will be left with a
string M^{+}_{J+1}
comprising all the Σ_{j} t_{j}
random digits appended in steps (c) of the coding algorithm.
It will be useful in what follows to imagine that the coding
process began with these randomly-chosen digits already in
place, i.e. with
M_{J+1} = M^{+}_{J+1},
L_{J+1} = Σ_{j} t_{j}.
The result of the coding is then
unchanged, but in step (c) it would never be necessary to
append additional digits, and at every stage of the coding
process we have
L^{−}_{j+1} = L_{j+1} + log u_{kj}.
 For all j,
 L_{j} = L^{−}_{j+1} + l_{j}
 = L_{j+1} + log u_{kj} − log(δ p(θ_{k}, x_{j}))
 where, of course, k is a function of j.
Now,
u_{kj} =
δ h(θ_{k}) f(x_{j}|θ_{k}) / r*(x_{j})
= δ p(θ_{k}, x_{j}) / r*(x_{j})
 Hence,
 L_{j} = L_{j+1} − log r*(x_{j})
 L_{1} = L_{J+1} − Σ_{j} log r*(x_{j})
 We now aim to show that the sequence of estimates yielded by
this construction is close to an SMML estimate for the sequence.
To do so, we will try to show that I_{1}, the
expected value of L_{1} over the marginal distribution
of x, is not much greater than the minimum possible value
 I_{0} = − J Σ_{X} r(x) log r(x).
 In a long sequence, datum value x may be expected to occur
with relative frequency r(x). Hence the expected value of
the second term in L_{1} is
 − J Σ_{X} r(x) log r*(x)
Now, r*(x)
= δ Σ_{i} p(θ_{i}, x)
= r(x) δ Σ_{i} g(θ_{i}|x)
where the sum is over a set Θ* of points equally
spaced in Θ at intervals of δ.
For sufficiently regular g( ),
δ Σ_{i} g(θ_{i}|x)
approximates the integral
∫ dθ g(θ|x) = 1
with an error which depends on the function g and the size of δ.
Further,
r*(x) =
δ Σ_{i} h(θ_{i}) f(x|θ_{i})
 For sufficiently regular distributions,
 Σ_{X} r*(x)
= δ Σ_{i} h(θ_{i}) Σ_{X} f(x|θ_{i})
= δ Σ_{i} h(θ_{i})
= 1
 so r*(x) is a proper distribution. Hence,
 − Σ_{X} r(x) log r*(x)
≥ − Σ_{X} r(x) log r(x)
 and
 I_{1} − I_{0}
= E(L_{J+1})
 − J Σ_{X} r(x) log (r*(x) / r(x))
 = E(L_{J+1})
 − J Σ_{X} r(x) log Σ_{i} δ g(θ_{i}|x)
The amount by which the sum
Σ_{i} δ g(θ_{i}|x)
departs from one may, for regular g( ), be
expected to be proportional to some small power of δ.
Then we can write
 I_{1} − I_{0}
= E(L_{J+1}) + J A δ^{n}
to order δ^{n}
where A is a constant depending on r( ) and g( ) but not on J.
Now consider
E(L_{J+1}) = E(Σ_{j} t_{j}).
In the first stage of the message construction, forming M_{J},
as many random digits will be required in step (c)
as there are digits in the word w_{kJ}.
In effect, w_{kJ}
is selected from C_{J} by finding a match
between a word of C_{J} and the prefix of a long random
binary string.
The probability that a word of length l will match is just 2^{−l}.
The length of w_{iJ} is
− log u_{iJ}.
Hence the expected value of t_{J} is
the expected value of
 − Σ_{i} u_{iJ} log u_{iJ}.
Now u_{iJ} is
δ h(θ_{i}) f(x_{J}|θ_{i})
/ r*(x_{J}),
or approximately
δ g(θ_{i}|x_{J}).
Thus the expected value of t_{J} is approximately
− log δ + B,
where B is a constant approximating the expected entropy of the
posterior distribution g( ).
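The approximation E(t_{J}) ≈ − log δ + B can be checked numerically. In the sketch below the posterior g is assumed, purely for illustration, to be a standard normal, so that B is its differential entropy (1/2) log(2 π e); the quantized u_{i} = δ g(θ_{i}) then have discrete entropy close to − log δ + B.

```python
import math

def g(theta):
    # assumed posterior g(theta | x): standard normal density
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

delta = 0.01                                   # spacing of Theta*
thetas = [delta * i for i in range(-1000, 1001)]
u = [delta * g(t) for t in thetas]             # u_i ~ delta * g(theta_i | x)

# expected number of appended digits: entropy of the discrete u_i
expected_t = -sum(p * math.log(p) for p in u if p > 0.0)

B = 0.5 * math.log(2.0 * math.pi * math.e)     # differential entropy of N(0,1)
approx = -math.log(delta) + B                  # the paper's approximation
```

For this g and δ the two quantities agree to well within one per cent.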
It is unlikely that t_{j} ≠ 0 (j<J).
For t_{J−1} ≠ 0,
we require that M_{J}, of length
L_{J} = − log(δ p(θ_{J}, x_{J})),
be so short that we cannot find in it a prefix matching a word of
C_{J−1}.
The length of a word w_{i,J−1} of C_{J−1} is approximately
− log(δ g(θ_{i}|x_{J−1}))
so we will fail to find a prefix when
g(θ_{i}|x_{J−1}) <
p(θ_{J}, x_{J}).
Since g(θ|x) = p(θ, x) / r(x),
we expect the ratio
g(θ_{i}|x_{J−1}) /
p(θ_{J}, x_{J})
to be of order 1/r(x), which is typically much greater than 1.
We therefore approximate
E(Σ_{j} t_{j}) by E(t_{J}).
 Thus, to order δ^{n},
 I_{1} − I_{0}
= B − log δ + J A δ^{n}
 Choosing δ to minimise
I_{1} gives
δ^{n} = 1/(nJA),
 I_{1} − I_{0}
= B + (1/n) + (1/n) log (nJA).
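The minimisation above is elementary but easy to check by brute force; the constants B, J, A and n below are arbitrary illustrative values, not quantities from the paper.

```python
import math

def excess(delta, B, J, A, n):
    # I_1 - I_0 = B - log(delta) + J * A * delta^n
    return B - math.log(delta) + J * A * delta ** n

B, J, A, n = 1.3, 1000, 0.7, 2

delta_opt = (1.0 / (n * J * A)) ** (1.0 / n)           # delta^n = 1/(nJA)
analytic = B + 1.0 / n + (1.0 / n) * math.log(n * J * A)

# scan a grid around the analytic optimum
grid = [delta_opt * (0.5 + 0.001 * i) for i in range(1, 2000)]
numeric = min(excess(d, B, J, A, n) for d in grid)
```

The grid minimum coincides with the analytic minimum, confirming the stated optimum.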
Thus we find that this construction, like the more exact but
less general construction of section 6, gives a value for
I_{1} − I_{0}
of order log J
for sufficiently regular h( ) and f( ).
As this value for I_{1} − I_{0}
is less than is achieved by the use of a single-problem SMML estimator on
each basic problem of the sequence,
which gives I_{1} − I_{0}
of order J, the construction must be close to
achieving the shortest possible expected length for a
message using estimates, and so its estimate sequence must
be similar to that given by the SMML sequence estimator.
It is worth remarking that this construction, like that of section 6,
does not require that every problem of the sequence have the same form.
We now show that the construction imitates a false oracle.
M_{j+1} is a near-optimal binary coding of the data
sequence {x_{j+1}, ..., x_{J}}.
When random variates from some distribution are encoded as a binary
string using an optimal code for that distribution, the
resulting binary string is a realization of a Bernoulli
sequence with p=0.5.
Thus M_{j+1} (extended if
necessary by appended random digits) is such a realization.
The probability that a given binary string of length l is
a prefix of M_{j+1} is 2^{−l}.
So the probability that step (c) will select
θ_{j} = θ_{i} is
2^{−length(w_{ij})} = u_{ij}.
But
u_{ij} =
δ p(θ_{i}, x_{j}) / r*(x_{j})
and r*(x_{j}) approximates r(x_{j}).
Hence u_{ij} approximates
δ g(θ_{i}|x_{j}).
If we consider the data sequence
{x_{j+1}, ..., x_{J}}
to be a realization of their joint marginal
distribution, the probability that this realization will,
for some x_{j}, cause the selection of the value
θ_{i} as the estimate θ_{j} of
θ_{j},
is just the posterior probability of θ_{i}
given x_{j}. In this sense, θ_{j} is a
random selection from the (quantized) posterior distribution
of θ_{j} given x_{j}, as is made by the false
oracle T_{p}.
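The claim that prefix-matching against a Bernoulli(0.5) string selects word i with probability 2^{−length(w_{ij})} is easy to verify empirically. The code below is a toy check with an assumed complete prefix code standing in for C_{j}, and a pseudo-random generator standing in for the coded later data.

```python
import random

# complete prefix code: word lengths imply u = (1/2, 1/4, 1/4)
CODE = {"t0": "0", "t1": "10", "t2": "11"}

rng = random.Random(42)
trials = 40000
counts = {t: 0 for t in CODE}

for _ in range(trials):
    s = ""
    chosen = None
    while chosen is None:
        s += str(rng.randint(0, 1))      # fair "message" bits
        for t, w in CODE.items():
            if s.startswith(w):          # first codeword that is a prefix
                chosen = t
                break
    counts[chosen] += 1

freqs = {t: counts[t] / trials for t in CODE}
```

The observed frequencies settle near (1/2, 1/4, 1/4), i.e. near 2^{−length}, as the argument requires.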
8. Discussion
We have defined a class of ‘empirical’ criteria for the
comparison of estimators based on their performance on data
or sequences of data. For sequences, there are empirical
criteria which estimate the bias and variance of estimators,
and any other conventional measure of performance
expressible as the expectation of some function of the true
parameter and the estimated parameter. We have shown the
existence of estimators called false oracles which no
empirical criterion can distinguish from the oracle
(whose ‘estimates’ are the true values). A false oracle can be
constructed by randomly selecting an estimate from its
posterior distribution.
The SMML estimator for a sequence has, in two cases, been
shown to imitate a false oracle. In the first case,
applicable whenever the parameters have slowly-varying
priors and approximately quadratic log likelihoods, the SMML
estimator is based on a quantizing lattice. The estimate
θ_{j} of the jth parameter is a
(discontinuous) function of all the data
{x_{1}, x_{2}, ..., x_{j},
..., x_{J}}
but, if one regards the data values other than x_{j}
as being randomly drawn from their respective marginal distributions,
θ_{j} is effectively sampled from a
distribution approximating its posterior given x_{j}.
In the second, more general case, we have shown that a
near-optimal coding of a data sequence (and so a near-SMML estimator)
is obtained when each estimate θ_{j}
is sampled from its posterior distribution given x_{j}
by using the coded data x_{j+1}...x_{J} as a
pseudorandom source of binary digits.
In both cases, a pseudorandom selection of θ_{j}
is made using (some of) the data other than x_{j} as
a source of “random” information. Thus the SMML estimator
is not truly a false oracle. A criterion based on
knowledge of the “pseudorandom number generator” inherent
in both coding algorithms could empirically distinguish the
SMML estimator from an oracle. However, since in both cases
the functional dependence of θ_{j} on data other
than x_{j} is quite obscure, for all practical purposes
one may expect the statistical behaviour of the SMML
estimator for a long sequence to be indistinguishable from
that of a false oracle, and hence from that of an oracle.
9. Applications
The analysis of the behaviour of minimum-message-length
estimators on long sequences of identical basic problems has
practical application. There are practical estimation
problems where the model includes not only a fixed set α
of general parameters but also a set of ‘nuisance’
parameters {θ_{n}, n=1...N} as numerous as the
sample size N. For instance, in a
factor analysis model
of a multivariate Gaussian population, once the means,
specific variances and factor loadings are known or
estimated, the estimation of the factor scores becomes a
sequence of N identical independent estimation problems.
The problem fits the assumptions of section 6, so the
analysis allows us to determine the length of the coded
message. In particular, the analysis shows that the message
length L contains a term (1/2) log(π N) − γ
rather than the larger term (N/2) log(2 π e / 12)
which is given by an analysis treating the estimation of each factor score
separately. The consequent reduction in message length
affects the comparison of the factor model against
alternative models of the same data, since the
minimum-message-length approach prefers the model with the
smaller (minimized) value of L (Wallace & Freeman 1992).
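The size of this reduction is easy to see numerically. The sketch below simply compares the two terms for an illustrative N, taking γ to be Euler's constant (an assumption on my part, as the paper does not define γ here).

```python
import math

N = 1000                       # number of factor scores (illustrative)
gamma = 0.5772156649           # Euler's constant (assumed meaning of the term)

# term from the sequence analysis of section 6
joint = 0.5 * math.log(math.pi * N) - gamma

# term from treating each factor score as a separate estimation problem
separate = (N / 2.0) * math.log(2.0 * math.pi * math.e / 12.0)
```

For N = 1000 the sequence term is a few nats while the per-score term is well over a hundred, which is the reduction the text describes.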
A second example is a
mixture model,
in which each datum x_{n}
is thought to be sampled from an unknown member of a set of
different distributions
{f_{k}(x|α_{k}), k=1...K}.
Here, once the distribution parameters
α_{k} and relative abundances h_{k} are known or
estimated, the estimation of the class (i.e. distribution)
k = θ_{n} to which each datum belongs is a sequence
of identical basic problems each requiring the estimation of
a categorical parameter θ_{n}.
If each basic problem is considered separately, the shortest
encoding is obtained by estimating θ_{n} to be
that value of k for which
h_{k} f_{k}(x_{n}|α_{k})
is a maximum.
That is, datum x_{n} would be assigned to that component
of the mixture most likely to contain it. The resulting
message length L would then have the form
 L =
Q{α_{k}, h_{k}, k=1...K}
 − Σ_{n} log [ max_{k} (h_{k} f_{k}(x_{n}|α_{k})) ]
Estimation of the distribution parameters and abundances by
minimization of the above form has been found to give
inconsistent estimates. However, the analysis of section 7
shows that a shorter coding is possible, and has a message
length for large N of the form
 L =
Q{α_{k}, h_{k}}
 − Σ_{n} log Σ_{k} h_{k} f_{k}(x_{n}|α_{k})
+ O(1)
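The gap between the two codings can be illustrated with a toy mixture (an assumed example, not one from the paper): since max_{k} ≤ Σ_{k}, the mixture form is shorter term by term.

```python
import math
import random

def normal_pdf(x, m):
    # unit-variance Gaussian density
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2.0 * math.pi)

h = [0.6, 0.4]                 # abundances h_k (illustrative)
mu = [0.0, 3.0]                # component means (illustrative)

rng = random.Random(1)
data = [rng.gauss(mu[0] if rng.random() < h[0] else mu[1], 1.0)
        for _ in range(500)]

# data term when each datum is assigned to its most probable class
L_max = -sum(math.log(max(h[k] * normal_pdf(x, mu[k]) for k in range(2)))
             for x in data)

# data term under the mixture coding of section 7
L_mix = -sum(math.log(sum(h[k] * normal_pdf(x, mu[k]) for k in range(2)))
             for x in data)
```

On such overlapping components L_mix is strictly smaller than L_max, which is the shorter coding the analysis promises.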
In the mixture problems so far studied, minimization of the
latter form yields consistent estimates for {α_{k}}
and {h_{k}}. The smaller value of the latter
form also affects the comparison between the lengths for
models with different numbers of component distributions,
and so affects the estimation of the number of components.
It is perhaps worth mentioning that the coding device
described in section 7 has been used successfully in a
computer program for file compression.
Acknowledgements: This work was done with the assistance of
a study grant from Monash University. The work owes much to
the informed comment and assistance of P.R.Freeman.
References
 Barron, A.R. & Cover, T.M.,
Minimum Complexity Density Estimation,
IEEE Trans. Inf. Thy., 37(4), pp.1034-1054, (1991).
 Rissanen, J.,
Stochastic Complexity,
J. Royal Stat. Soc. B, 49, 3, pp.228-239, (1987).
 Wallace, C.S. & Freeman, P.R.,
Estimation & Inference by Compact Coding,
J. Royal Stat. Soc. B, 49, 3, pp.240-252, (1987).
 Wallace, C.S. & Freeman, P.R.,
Single Factor Analysis by MML Estimation,
J. Royal Stat. Soc. B, 54, 1, pp.195-209, (1992).
 Zador, P.,
Asymptotic Quantization Error of Continuous Signals and the Quantization Dimension,
IEEE Trans. Inf. Thy., IT-28, pp.139-149, (1982).
© C. S. Wallace.