
Proc. Int. Conf. on Information, Statistics and Induction in Science
(ISIS), pp. 304–316, 1996
(based on TR 128, Dept. Computer Science, Monash U., 1989)
1. Introduction
Recently there has been interest in a general method of
statistical estimation and inference variously known as
Minimum Message Length
(MML)
(Wallace & Freeman, 1987) or
Minimum Description Length (MDL) (Rissanen 1987). The
method is an application of a general approach to inductive
inference first proposed by Solomonoff (1964), and closely
related to the notion of Algorithmic Complexity due to
Kolmogorov (1965). The method is Bayesian,
and is based on the idea of choosing an
hypothesis and estimates of its parameters which together
allow a concise encoding of the data. The method is of
great generality, and in those specific estimation problems
which have so far been addressed using MML, gives estimates
similar, but often superior in some respect such as bias, to
those obtained by the Maximum Likelihood (ML) method. In
particular, MML handles more gracefully problems where the
number of unknown parameters increases with sample size, and
can automatically select an appropriate hypothesis from a
family of hypotheses of increasing complexity.
[This page was created from what I believe to be the 1996 source file,
hopefully without adding too many errors, to make this MML paper more
readily accessible. LA, 2008.
NB. I have used θ for θ-hat and φ for φ-hat.]

The consistency of MDL for a wide range of problems has been
proved by Barron & Cover (1991), who also derive the ultimate
rate of convergence of the estimates with increasing sample
size for a range of problem types. Unfortunately, little is
known in general about the behaviour of MML estimates for
modest sample sizes. For instance, there have been no
general results on the bias or variance of estimates, or on
the error rates of the inbuilt hypothesis selection.
Wallace and Freeman (op cit) and Rissanen (op cit) have
argued that the estimates are close to sufficient, in the
sense of capturing all the relevant information available in
the data, but the arguments are outside conventional
statistical theory.
In this paper, we attempt to show that MML can be expected
to have good behaviour for modest samples. Consistency
arguments prove good behaviour in the limit of large
samples. We have not entirely avoided the need to appeal to
a limit, but rather than considering the behaviour as an
unbounded volume of data is used in a single estimation, we
consider the behaviour as the method is applied to an
unbounded set of (possibly unrelated) estimation problems,
each using only limited data. That is, we consider the
expected consequences of adopting the strategy of using MML
for all desired inferences. Sections 2 and 3 discuss a
proposed criterion of estimate quality for a single
estimation problem. Section 4 extends the criterion to
cover the behaviour of a method in a sequence of problems,
and relates it to conventional criteria such as bias. These
sections also define an “optimal” method according to this
criterion. Sections 6, 7 and 8 then show that the MML
method approaches this optimum.
Of the several criteria which are commonly used for judging
the quality of statistical estimators, none is entirely
satisfactory, at least by itself. Some, such as
consistency, are applicable only to families of estimators
for an unbounded range of sample sizes. Others, such as
bias and expected variance, are tied to particular
parameterizations of the problem, and may give different
judgements if the parameters to be estimated are subjected
to functional transformation. Others again, such as
sufficiency, can be satisfied by estimators with severe
deficiencies. In this paper, we propose yet another
criterion and investigate some of the properties of
estimators satisfying it. We give a constructive proof that
such estimators generally exist, and also show that Strict
Minimum Message Length (SMML) estimators (Wallace & Freeman
1987) in some circumstances closely approach satisfaction of
the criterion.
The framework of the discussion is Bayesian.
We wish to estimate the value of an unknown parameter θ
within a space Θ, given a prior density h(θ) on
Θ, an observed datum x within a space X, and a
probability density f(x|θ) (x∈X, θ∈Θ).
At times we use the joint density
p(θ,x) = h(θ)f(x|θ),
the marginal density of x,
r(x) = ∫ dθ p(θ, x), and
the posterior density g(θ|x) = p(θ, x) / r(x).
In some parts of the discussion we prefer
to treat x as a discrete datum within a countable set X.
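The quantities defined above can be sketched concretely on a discrete grid; the following is an illustrative sketch in which the particular prior, likelihood and grids are our own choices, not the paper's:

```python
# Illustrative sketch of the Bayesian framework on a small discrete grid:
# a uniform prior h over three parameter values and a Bernoulli likelihood f.
# The specific numbers are ours, purely for illustration.
thetas = [0.2, 0.5, 0.8]            # parameter space Theta (discretized)
xs = [0, 1]                         # data space X (one Bernoulli observation)

def h(theta):                       # prior density h(theta) (uniform)
    return 1.0 / len(thetas)

def f(x, theta):                    # likelihood f(x|theta)
    return theta if x == 1 else 1.0 - theta

def p(theta, x):                    # joint density p(theta, x) = h(theta) f(x|theta)
    return h(theta) * f(x, theta)

def r(x):                           # marginal r(x) = sum over theta of p(theta, x)
    return sum(p(t, x) for t in thetas)

def g(theta, x):                    # posterior g(theta|x) = p(theta, x) / r(x)
    return p(theta, x) / r(x)
```

Observing x = 1, for instance, shifts the posterior towards the larger parameter values.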
2. Empirical Criteria
Definitions
 An estimator is a function from X into Θ,
which we shall write as θ = T(x).
 A stochastic estimator is an estimator whose value
θ = T_{s}(x) for
any datum x is a realization of a random variable with
probability density G_{s}(θ|x).
 A deterministic estimator is not stochastic.
 An honest estimator is an estimator whose value for
given x is independent of the true parameter value θ.
 The oracle is a distinguished estimator T_{t} such
that T_{t}(x) = θ,
the true parameter value. It is
thus a ‘perfect’ estimator, but is of course neither
realizable nor honest. Apart from the oracle, all other
estimators considered will be honest.
 A trial is an instance of an estimation problem described by
Θ, X, h(θ), f(x|θ) yielding a
datum x which is then supplied to two estimators
T_{0} and T_{1} to give two estimates
θ = T_{0}(x),
φ = T_{1}(x).
In a trial, the estimators are treated as black boxes.
That is, the internal mechanisms of the
estimators are invisible, and any comparison of the
estimators can be based only on the datum x and the two
estimates θ and φ.
The comparison may involve use
of the prior information Θ, X, h(θ) and f(x|θ).
Formalizing the notion of a comparison:
 A criterion c is a function from
(X, Θ, Θ) to the real line.
The outcome of a trial using c is
written c(x, θ, φ).
For some criteria, negative
values of c represent a preference for T_{0} and
positive values a preference for T_{1}. That is, for the
observed value of x, c(x, θ, φ) < 0
indicates that the criterion c judges θ to be a
better estimate than φ, for instance because
θ has the higher likelihood. All criteria
are assumed to be independent of the true parameter θ.
 A fair criterion is a criterion such that
for all a, b ∈ Θ and x ∈ X,
c(x,a,b) = −c(x,b,a).
This definition is intended to exclude criteria which have
an inbuilt preference for one of the two estimators. As the
internal workings of the estimators are invisible, no such
preference would be rational.
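As a concrete illustration, the following sketch defines a likelihood-based criterion for a normal model and checks the fairness condition c(x,a,b) = −c(x,b,a); the model and the criterion are our own illustrative choices, not the paper's.

```python
import math

# A fair criterion for x ~ N(theta, 1): the log-likelihood advantage of
# estimate a over estimate b. Negative values prefer the first estimate,
# matching the text's sign convention. The normal model is our choice.
def f(x, theta):                     # likelihood f(x|theta)
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2 * math.pi)

def c(x, a, b):
    # c < 0 exactly when a has the higher likelihood, i.e. a is judged better
    return math.log(f(x, b)) - math.log(f(x, a))
```

Swapping the two estimates reverses the sign, so no inbuilt preference for either black box is possible.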
The above definitions are intended to provide a framework
for the empirical comparison of estimators, i.e. comparison
based on the observable behaviour of estimators rather than
on their structure. Of course, the outcome of a single
trial using some criterion is a function of the value of x
realized in the trial, and so the preference for one
estimator over the other will be subject to the accidents of
the particular trial. Later, we will discuss a class of
estimation problems for which a single trial might be
sufficient to determine a preference reliably. However,
we now retreat somewhat from the ideal of a purely empirical
comparison, and define the expected outcome of a trial.
The expected value of a trial of estimators T_{0}
and T_{1} using criterion c is the expected value of
c(x, T_{0}(x), T_{1}(x)),
written c(T_{0}, T_{1}),
where the expectation is taken over the joint
distribution of x and θ. Where one or both of the
estimators are stochastic, the expectation is also taken
over their value distributions.
 For T_{0}, T_{1} honest and deterministic, the
estimators are independent of θ and the expectation is
simply over the marginal distribution of x.
 c(T_{0}, T_{1})
= ∫ dx
r(x) c(x, T_{0}(x), T_{1}(x))
 For T_{0}, T_{1} honest, T_{1} stochastic,
T_{0} deterministic,
 c(T_{0}, T_{1})
= ∫ dx
∫ dφ
r(x) G_{1}(φ|x)
c(x, T_{0}(x), φ)
 and for T_{0}, T_{1} honest and stochastic
 c(T_{0}, T_{1})
= ∫ dx
∫ dθ
∫ dφ
r(x) G_{0}(θ|x)
G_{1}(φ|x)
c(x, θ, φ)
 For the special case that one estimator, T_{0} say,
is the oracle T_{t} and T_{1} is deterministic,
 c(T_{t}, T_{1})
= ∫ dθ
∫ dx
h(θ) f(x|θ) c(x, θ, T_{1}(x))
 and for T_{0} the oracle, T_{1} stochastic
 c(T_{t}, T_{1})
= ∫ dθ
∫ dx
∫ dφ
h(θ) f(x|θ) G_{1}(φ|x)
c(x,θ, φ)
 Note that in all cases, for c fair,
c(T_{0}, T_{1})
= −c(T_{1}, T_{0}).
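The expectation in the deterministic honest case can be checked by simulation; the following Monte Carlo sketch uses a normal prior and likelihood with two simple estimators and a fair quadratic criterion, all of which are our own illustrative choices.

```python
import math, random

random.seed(0)

# Monte Carlo sketch of the expected trial value c-bar(T0, T1) for two honest
# deterministic estimators, with theta ~ N(0,1) and x ~ N(theta,1).
def T0(x):
    return x            # maximum-likelihood estimate

def T1(x):
    return x / 2.0      # posterior mean for this prior

def c(x, a, b):         # a fair criterion: squared-error-to-x advantage
    return (x - a) ** 2 - (x - b) ** 2

def expected_trial_value(Ta, Tb, n=100000):
    total = 0.0
    for _ in range(n):
        theta = random.gauss(0, 1)      # theta drawn from the prior h
        x = random.gauss(theta, 1)      # x drawn from f(.|theta)
        total += c(x, Ta(x), Tb(x))
    return total / n
```

Here c(x, T0(x), T1(x)) = −x²/4, so the expected trial value is −E[x²]/4 = −0.5, and by fairness the reversed trial gives +0.5.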
3. False Oracles
A false oracle is an estimator T_{f} such that for
all fair criteria c, and for T_{t} the oracle,
c(T_{f}, T_{t}) = 0.
That is, no fair criterion will be expected to prefer the
oracle to a false oracle, or vice versa.
3.1 Theorem:
 For T_{f} a false oracle, T_{t} the oracle,
T_{1} honest and c a criterion not necessarily fair
c(T_{f}, T_{1}) = c(T_{t}, T_{1}).

 Proof: Consider the function
 c*(x,a,b) = c(x,a,T_{1}(x)) − c(x,b,T_{1}(x)),
a,b ∈ Θ
 where, if T_{1} is stochastic, we have the same
realization of T_{1}(x) in both places.
Since T_{1} is honest, c* is not a function of θ.
Also, c*(x,a,b) = −c*(x,b,a).
Hence c* is a fair criterion, so by definition of T_{f},
 c*(T_{t}, T_{f}) = 0
 But
 c*(T_{t}, T_{f})
= E{c(x, T_{t}(x), T_{1}(x))
− c(x, T_{f}(x), T_{1}(x))}
 = E{c(x, T_{t}(x), T_{1}(x))}
− E{c(x, T_{f}(x), T_{1}(x))}
 = c(T_{t}, T_{1})
− c(T_{f}, T_{1})
Q.E.D.
This theorem shows that, while we cannot in practice compare
an estimator with the oracle, since the oracle is
unavailable, we can instead compare the estimator with a
false oracle and get the same result in expectation.
 3.2 Corollary:
With the same conditions as for the theorem,
 c(T_{1}, T_{f})
= c(T_{1}, T_{t})
 Also,
 c(T_{f}, T_{t})
= c(T_{f}, T_{f})
= c(T_{t}, T_{f})
3.3 Corollary:
With the same conditions as for the theorem, the value on a trial of
c(x, T_{f}(x), T_{1}(x)) is as good an estimate of
c(T_{t}, T_{1}) as is the value on a trial of
c(x, T_{t}(x), T_{1}(x))
= c(x, θ, T_{1}(x)).
 Proof:
Define
c_{k}(x, θ, φ)
= (c(x, θ, φ))^{k}
 Then c_{k} is a criterion, so
c_{k}(T_{f}, T_{1})
= c_{k}(T_{t}, T_{1})
 But c_{k}(T_{0}, T_{1})
is the k^{th} moment of the sampling distribution of
c(x, T_{0}(x), T_{1}(x)).
 Hence c(x, T_{f}(x), T_{1}(x)) and
c(x, T_{t}(x), T_{1}(x))
have the same sampling distribution.
3.4 Theorem:
 False oracles exist in sufficiently regular cases.

 The proof is constructive:
 Let T_{p} be the stochastic estimator described by the density
 G_{p}(θ|x)
= g(θ|x) = h(θ) f(x|θ) / r(x)
= p(θ,x) / r(x).
 For any criterion c,
 c(T_{p}, T_{t})
= ∫ dθ
∫ dx
∫ dφ
h(θ) f(x|θ) g(φ|x) c(x,φ,θ)
 = ∫ da
∫ dx
∫ db p(a,x) p(b,x) c(x,b,a) / r(x),
(a = θ ∈ Θ, b = φ ∈ Θ)

 c(T_{t}, T_{p})
= ∫ dθ
∫ dx
∫ dφ
h(θ) f(x|θ) g(φ|x) c(x,θ,φ)
 = ∫ db
∫ dx
∫ da
p(b,x) p(a,x) c(x,b,a) / r(x),
(b = θ, a = φ)

 The expressions for c(T_{p}, T_{t}) and
c(T_{t}, T_{p}) differ only in the order of integration.
Hence, for sufficiently regular h, f and c, they are equal,
 c(T_{t}, T_{p}) = c(T_{p}, T_{t})
 But, if c is fair,
c(T_{t}, T_{p}) = −c(T_{p}, T_{t})
 Hence c(T_{p}, T_{t}) = 0 for all fair c,
so T_{p} is a false oracle.
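Theorem 3.4 can be illustrated by simulation. In the conjugate normal case below (our choice, not the paper's) the posterior is available in closed form, so T_{p} is easy to realize, and a fair criterion's expected trial value against the oracle should vanish.

```python
import math, random

random.seed(1)

# Simulation sketch of theorem 3.4. With prior theta ~ N(0,1) and
# x ~ N(theta,1), the posterior is N(x/2, 1/2); the false oracle T_p draws
# its estimate from that posterior. Model and criterion are our choices.
def c(x, a, b):                      # a fair criterion
    return (x - a) ** 2 - (x - b) ** 2

def mean_trial_value(n=200000):
    total = 0.0
    for _ in range(n):
        theta = random.gauss(0, 1)                    # true parameter
        x = random.gauss(theta, 1)                    # datum
        phi = random.gauss(x / 2.0, math.sqrt(0.5))   # T_p: posterior draw
        total += c(x, phi, theta)                     # T_p versus the oracle
    return total / n
```

The sample mean of the criterion settles near zero, as the order-of-integration argument predicts.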
3.5 The Merit of False Oracles
False oracles deserve to be considered good estimators in
that no fair criterion is expected to show a preference for
the true parameter value over the estimate of a false
oracle. Indeed, the above results show that no test based
only on the information available to the estimator can
reliably distinguish an oracle from a false oracle. With
the more conventional criteria applied to estimators (some
of which will be further discussed below) it is often the
case that an estimator satisfying one criterion (e.g. lack
of bias) will not satisfy others (e.g. maximum likelihood).
However, for the empirical criteria discussed here, we have
shown that there exists a class of estimators which give
results as good as the true values by all empirical tests, since
c(T_{f}, T_{t}) = c(T_{t}, T_{f})
for all c, fair and otherwise.
Note that it is not true that all criteria, or even all fair
criteria, will prefer a false oracle to an honest estimator
which is not a false oracle. Suppose x and θ are
commensurable, and that x is distributed as N(θ,1).
The fair criterion
 c(x, θ, φ)
= (x − θ)^{2}
− (x − φ)^{2}
will always prefer the estimator θ = x
to a false oracle.
It will also, of course, always prefer θ = x to the
oracle's estimate, the true parameter value,
so the import of its preference is dubious.
The construction of the false oracle T_{p} amounts to a
rule for characterizing the posterior distribution of
θ by a value (estimate) drawn randomly from the posterior.
It may seem a strange rule compared with, say, choosing the
mean, median or mode of the posterior, but it has the
advantage of being invariant under arbitrary
measure-preserving transformations of the parameter space.
4. Sequence Trials
We now attempt to define a class of estimation problems, and
associated trials, which allow a comparison between
our ‘empirical’ criteria and some more conventional ones.
A sequence problem is an ordered sequence of
independent instances of the same basic problem. For a
basic problem described by
Θ, X, h(θ), f(x|θ),
a sequence problem of length J has parameter space
Θ^{J} and data space X^{J}.
The data are an ordered sequence of values
{x_{j}, j=1 ... J} from which
an estimator must form an estimate which is an ordered
sequence {θ_{j}, j=1 ... J},
where θ_{j}
is an estimate of the true value θ_{j} of the parameter
in the j^{th} instance of the basic problem.
The prior density for the true sequence parameter
{θ_{j}, j=1 ... J} is
∏_{j} h(θ_{j})
and the conditional data density is
∏_{j} f(x_{j}|θ_{j}).
Note that we assume the instances are wholly
independent: data from early instances do not lead to
revision of the prior or conditional data density in later
instances.
An estimator T for a sequence is a function from X^{J}
into Θ^{J}.
We do not require that the estimator
have the form of J instances of the same function applied
in turn to {x_{j}}. The ‘black box’ model of an
estimator is not required to produce any output until it has
the entire data sequence, so θ_{j}
may depend on all components of
{x_{k}, k=1 ... J}, not just x_{j}.
In a sequence trial of estimators
T_{0}, T_{1} using criterion c,
c is a function
c({x_{j}, θ_{j}, φ_{j}, j=1...J})
which operates on the data sequence and the two estimate sequences to yield a
single value, which need not be expressible as a combination
of J applications of a simpler function.
Since a sequence problem and sequence trial are just special
cases of the problems and trials defined earlier, all
definitions and results carry over, with only the notational
difference that now the symbols Θ, X, h( ) and f( )
refer to the component basic problems of the sequence rather
than to the sequence problem as a whole. Note that, since
the basic problems are independent, the posterior density of
{θ_{j}, j=1...J} given
{x_{j}, j=1...J} is
∏_{j} g(θ_{j}|x_{j}).
Thus the false
oracle T_{p} of theorem 3.4,
which picks an estimate sequence randomly
from this posterior density, may be considered to choose
randomly and independently a value from each basic posterior
g(θ_{j}|x_{j}).
4.1 Sequence Criteria:
Using sequence trials, we can define empirical criteria
corresponding to some conventional criteria. For D a
function from Θ to the real line, v∈Θ, δ
a small constant and J large, define
 D_{v}{x_{j}, θ_{j}, φ_{j}, j=1...J}
=
Σ_{j∈S} D(θ_{j}) / N_{S}
where S is the set
{j : |φ_{j} − v| < δ} and
N_{S} is the cardinality of S.
D_{v} is the
average value of D(θ_{j}) over those
components of the sequence trial in which φ_{j}
approximately equalled v.
Its expected value,
D_{v}(T_{0}, T_{1})
approximates the expected value of D(θ)
given φ = v.
Letting D_{v} be the function
B_{v}(θ) = θ − v,
call
B_{v}(T_{0}, T_{1})
the bias of T_{0} relative to T_{1}.
It is, of course, a function of v.
For T_{1} the oracle,
B_{v}(T_{0}, T_{t})
is simply the expected value of the estimate
θ minus the true value,
given that the true value is v.
Hence B_{v}(T_{0}, T_{t})
corresponds to the usual definition of the bias of T_{0}.
While T_{t} is unavailable,
we have
B_{v}(T_{0}, T_{f})
= B_{v}(T_{0}, T_{t})
for T_{f} a false oracle.
Hence the bias of T_{0} can be
estimated by measuring the bias of T_{0} relative to T_{f}
in a sequence trial.
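A sequence trial of this kind is easy to simulate. In the sketch below the model is a conjugate normal one of our own choosing, and T_{0} is given a deliberate bias of 0.3 so that the measurement has something to detect; the false oracle is the posterior-sampling estimator T_{p}.

```python
import random

random.seed(2)

# Sequence-trial sketch of section 4.1: estimate the bias of T0 relative to
# the false oracle T_p (a posterior draw). Prior N(0,1), x ~ N(theta,1), so
# the posterior is N(x/2, 1/2). All specific choices here are ours.
def relative_bias(v=0.0, delta=0.1, J=200000):
    total, count = 0.0, 0
    for _ in range(J):
        theta = random.gauss(0, 1)
        x = random.gauss(theta, 1)
        est0 = x + 0.3                              # T0: ML estimate plus a bias
        phi = random.gauss(x / 2.0, 0.5 ** 0.5)     # false-oracle estimate
        if abs(phi - v) < delta:                    # component j falls in S
            total += est0 - v
            count += 1
    return total / count
```

The relative bias measured at v = 0 recovers T_{0}'s built-in bias of 0.3, with no access to the true parameter values.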
Letting T_{0} be the false oracle T_{p},
 B_{v}(T_{p}, T_{t})
= B_{v}(T_{p}, T_{p})
= ∫ dx
∫ dθ
f(x|v) g(θ|x) (θ − v)
This is not, in general, zero. However, note that
B_{v}(T_{t}, T_{p})
= B_{v}(T_{p}, T_{t}).
The ‘bias’ of the oracle relative to a false oracle T_{f}
equals the bias of T_{f}, so empirical measurement of
relative bias does not serve to distinguish false and true oracles.
Also, note that
B_{v}(T_{t}, T_{t})
≠ B_{v}(T_{t}, T_{f}),
since T_{t} is not honest.
Of course,
B_{v}(T_{t}, T_{t}) = 0
for all v.
Similarly, letting D_{v} be the function
V_{v}(θ) = (θ − v)^{2},
the usually-defined variance of T_{0}
is V_{v}(T_{0}, T_{t}),
and can be empirically measured by the value of V_{v} in a
sequence trial of T_{0} and a false oracle T_{f}.
For the false oracle T_{p},
 V_{v}(T_{p}, T_{t})
= V_{v}(T_{p}, T_{p})
= ∫ dx
∫ dθ
f(x|v) g(θ|x) (θ − v)^{2}
This expression
shows that the variance of a false oracle is
the variance of the convolution over x of f and g.
For h(θ) slowly varying, it is about twice the
variance of a conventional Bayes estimator with quadratic loss function.
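The factor of two is visible in a simulation. The sketch below uses a nearly flat normal prior of our own choosing, so the posterior variance is about 1 while a posterior-draw estimate has variance about 2 around the true value.

```python
import random

random.seed(3)

# Sketch of the closing claim of section 4.1. With a slowly-varying prior
# (here N(0, 10000), effectively flat over the region of interest) and
# x ~ N(theta, 1), the posterior is N(wx, w) with w = 10000/10001 ~ 1.
# A posterior-draw estimate then varies about the true value v with
# variance ~2: twice the ~1 a quadratic-loss Bayes estimator attains.
def false_oracle_variance(v=0.0, n=200000):
    w = 10000.0 / 10001.0                    # posterior shrinkage and variance
    total = 0.0
    for _ in range(n):
        x = random.gauss(v, 1)               # x ~ f(.|v)
        est = random.gauss(w * x, w ** 0.5)  # draw from posterior N(wx, w)
        total += (est - v) ** 2
    return total / n
```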
5. SMML Estimators
Strict Minimum Message Length (SMML)
estimation has been described elsewhere (Wallace & Freeman 1987).
We use the same Bayesian framework defined by
Θ, X, h(θ), f(x|θ) etc. as before,
but initially it is convenient to consider x to be a discrete variable,
so X is a countable set, and
f(x|θ) and r(x) are probabilities rather than densities.
The SMML estimator θ = m(x) has a range Θ*
= {θ_{i}, i=1, 2, ...}
= {θ : θ = m(x), x∈X}
which is a countable subset of Θ.
Define X_{i} = {x : m(x) = θ_{i}}
and q_{i} = q(θ_{i})
= Σ_{Xi} r(x) for all i.
Then m( ) minimizes
 I_{1} = − Σ_{X} r(x) log[ q(m(x)) f(x|m(x)) ]
 = − Σ_{i} [ q_{i} log q_{i}
+
Σ_{Xi} r(x) log f(x|θ_{i}) ]
The estimator is derived by considering the expected length
of a message encoding x and decodeable given knowledge of
X, Θ, f(x|θ) and h(θ).
By standard coding theory, the expected length cannot be less than
I_{0} = − Σ_{X} r(x) log r(x),
where the base of logarithms depends on the coding alphabet,
e.g., 2 for binary coding.
However, we consider a message having the form of a
statement of an estimate θ, followed by an
encoding of x using a code which would be optimal
(i.e. have least expected length)
were the true parameter θ equal to the stated estimate.
Using optimal coding for the statement of θ, the
length of the message encoding x is
L(x) = − log q(m(x)) − log f(x|m(x)).
(Actually, the length of a
real message might exceed L(x) by as much as one symbol,
since real messages must have integral lengths. However, we
neglect this roundingup effect.) The first term of L(x)
gives the length of the encoding of the estimate
θ = m(x).
Note that q(θ_{i}) is the
unconditional probability that θ_{i} will be the
estimate. The second term is the length of the encoding of x
using a code optimal if θ = m(x).
I_{1} is the expected value of L(x).
Since I_{0} is independent of the estimator, the SMML
estimator minimizes
 I_{1} − I_{0}
= − Σ_{X} r(x) log [q(m(x)) f(x|m(x)) / r(x)]
If the discretization of the data is so fine that we can
consider X as a continuum, we can revert to the framework
of the earlier sections, and the SMML estimator minimizes
 I_{1} − I_{0}
= − ∫_{X}
r(x) log[q(m(x)) f(x|m(x)) / r(x)] dx,
 where
 q(θ_{i})
= ∫_{Xi} r(x)dx.
Θ* remains a countable subset of Θ,
but I_{1} and I_{0} are not individually definable.
Minimization of I_{1} − I_{0}
requires a compromise in the size of the region X_{i}
within which all datum values lead to the same estimate θ_{i}.
Large regions have large values of q_{i}, leading to a short
statement of θ_{i}, but if the region is too large,
it will contain values of x for which
f(x|θ_{i}) is much less than
f(x|θ_{m}) where
θ_{m} is the maximum-likelihood estimate for x,
so the expected length of the statement of x will
be greater than for smaller regions.
We now show that in a sequence of basic estimation problems
such as was described in section 4, the SMML estimator
approximates a false oracle. Two cases are considered.
In the first, which applies to basic problems with
approximately quadratic log-likelihoods, a fairly exact
analysis is possible.
treatment of a more general sequence problem. Throughout,
j indexes the component problems of a sequence of J
basic problems, and i indexes the members of Θ*,
the range of an SMML estimator.
6. A Simple Case
Assume that for all j,
f(x_{j}|θ_{j}) =
(1 / (√(2 π) σ))
e^{−(x_{j}−θ_{j})^{2} / 2σ^{2}},
σ known, and
h(θ_{j}) = a,
a constant over a large range.
If SMML estimation were applied separately to each problem
of the sequence, we would obtain for each j an estimator
θ_{j} = m(x_{j})
with the same form for every j.
We drop the j suffix and consider one estimator m(x).
Its range is simply a set of
equallyspaced values of θ, with spacing s say.
The SMML estimator maps x into the member of the set most
nearly equal to x.
Since, over the range of interest,
r(x) = h(θ) = a,
we have that for every
θ_{i} ∈ Θ*,
X_{i} is the interval
{x: θ_{i} − s/2 <
x < θ_{i} + s/2}
and q_{i} = as.
The estimator minimizes the expected value of
 − log q(m(x)) − log f(x|m(x)) + log r(x).
Since all of the regions X_{i} have the same pattern,
the expectation can be taken over any one of them, giving
 I_{1} − I_{0}
 = − log(as)
+ (1/2) log(2 π)
+ log σ
+ (1/s) ∫_{θ_{i}−s/2}^{θ_{i}+s/2} dx
(x − θ_{i})^{2}
/ (2 σ^{2})
+ log a
 = − log s
+ (1/2) log(2 π)
+ log σ
+ s^{2} / (24 σ^{2})
 Minimization with respect to s gives
 s^{2} = 12 σ^{2},
 I_{1} − I_{0} = (1/2) log (2 π e / 12).
 The result of applying this estimator to all J problems of
the sequence is therefore an expected message length
increment for the whole sequence given by
 I_{1} − I_{0} = (1/2) J log(2 π e / 12).
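The single-problem optimization can be checked numerically; the sketch below minimizes the per-problem increment by grid search, with σ = 1 and a grid resolution that is our own pragmatic choice.

```python
import math

# Numerical check of the single-problem result of section 6, for sigma = 1:
# minimize -log s + (1/2) log(2 pi) + s^2/24 over the spacing s.
def increment(s, sigma=1.0):
    return (-math.log(s) + 0.5 * math.log(2 * math.pi)
            + math.log(sigma) + s * s / (24 * sigma * sigma))

# simple grid search over s in (0, 10); the resolution is an arbitrary choice
best_s = min((k / 1000.0 for k in range(1, 10000)), key=increment)
```

The minimizing spacing satisfies best_s² ≈ 12, and the minimized increment is ½ log(2πe/12) ≈ 0.1765 nats per problem, as derived above.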
This result is not the best that can be achieved.
 An SMML estimator for the whole sequence problem minimizes
the expected value of
 − log q(m(x))
− Σ_{j} log f(x_{j}|m_{j}(x))
+ log r(x)
 where now the estimator m(x)
maps the sequence or vector of data
x = (x_{1}, x_{2} ... x_{j} ...)
into a vector
 θ
= (θ_{1}, θ_{2} ... θ_{j} ...)
= (m_{1}(x), m_{2}(x), ...,
m_{j}(x), ...),
 and r(x) = a^{J}.
 Thus the SMML estimator minimizes the expected value of
 L(x) =
− log q(m(x))
+ (J/2) log (2 π)
+ J log σ
+ (m(x)
− x)^{2} / (2 σ^{2})
+ J log a
Its range Θ* is a countable set of vectors or points
in Jspace, {θ_{i}}.
Assume these points form a lattice.
Then
q(θ_{i}) = q_{i}
is the same for all i,
and for all i the region X_{i} within which
m(x) = θ_{i}
is the set of x values
closer to θ_{i} than to any other
member of Θ*, i.e., the Voronoi region of θ_{i}.
For Θ* a lattice, the Voronoi regions of all
its members have the same size and shape. Let s be the
volume of X_{i} for all i.
Then q_{i} = sa^{J} for all i.
The expectation of L(x) can be taken over
any Voronoi region, giving
 I_{1} − I_{0} =
− log (sa^{J})
+ (J/2) log (2 π)
+ J log σ
+ E(x − θ_{i})^{2} / (2 σ^{2})
+ J log a.
The expectation
E(x − θ_{i})^{2}
over the Voronoi region of θ_{i} depends on the volume
s and the shape of the region.
To separate these effects, write
E(x − θ_{i})^{2}
= J k_{J} s^{2/J},
where k_{J} depends only on the shape of X_{i},
i.e. the geometry of the lattice.
Then
 I_{1} − I_{0}
= − log s + (J/2) log (2 π)
+ J log σ + J k_{J} s^{2/J} / (2 σ^{2})
 Minimization with respect to s gives
 s^{2/J}
= σ^{2} / k_{J},
 I_{1} − I_{0}
= (J/2) log (2 π e k_{J})
Minimization with respect to the geometry of the lattice
requires choosing a lattice which, for a given density of
points, minimizes the mean squared distance of all points
from the nearest lattice point.
Lattices chosen in this way are called ‘quantizing lattices’.
Use of the single-problem SMML estimator separately for each
problem is equivalent to choice of a rectangular lattice, for which
k_{J} = 1/12 for all J.
However, for J > 1,
lattices exist with k_{J} < 1/12.
For instance, for J = 2, a lattice with hexagonal rather
than square Voronoi regions has
k_{J} = 0.080175.. < 1/12
and hence leads to a smaller value for
I_{1} − I_{0}.
Note that in the resulting SMML estimator, θ_{1}
is a function of both x_{1} and x_{2}.
If x_{1} is held constant and x_{2} is changed,
θ_{1} may assume in turn two or more different values,
all however close to x_{1}.
Similarly, θ_{2}
is a function of both x_{1} and x_{2},
but is always close to x_{2}.
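The square-versus-hexagonal comparison can be verified by Monte Carlo; the sketch below estimates k_{2} for a lattice given by two basis vectors, with sample size and the neighbourhood searched for the nearest lattice point being pragmatic choices of ours.

```python
import random

random.seed(4)

# Monte Carlo sketch of the J = 2 lattice comparison: estimate
# k_2 = E|x - nearest lattice point|^2 / (2 s), with s the cell area,
# by sampling points uniformly in the fundamental cell.
def k2(basis, n=60000):
    (ax, ay), (bx, by) = basis
    area = abs(ax * by - ay * bx)                 # cell volume s
    total = 0.0
    for _ in range(n):
        u, v = random.random(), random.random()   # point in fundamental cell
        px, py = u * ax + v * bx, u * ay + v * by
        # nearest lattice point, searched over a small neighbourhood of cells
        total += min((px - (i * ax + j * bx)) ** 2 + (py - (i * ay + j * by)) ** 2
                     for i in range(-2, 3) for j in range(-2, 3))
    return total / n / (2 * area)
```

The square lattice gives k_{2} ≈ 1/12 ≈ 0.0833 and the hexagonal lattice k_{2} ≈ 0.0802, reproducing the advantage quoted in the text.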
 For large J, the optimum quantizing lattice is not known.
However, Zador (1982) has shown that
 Γ(J/2+1)^{2/J} Γ(2/J+1) / (J π)
> k_{J}
> Γ(J/2+1)^{2/J} / ((J+2) π)
The upper bound is the value for a random set of points
(not a lattice) and the lower bound would be achieved only if the
Voronoi region were a Jsphere.
 Using
Stirling’s approximation
in the form
 log Γ(y+1)
= (y + 1/2) log y − y
+ (1/2) log(2 π)
+ O(1/y),

 log k_{J}
> (2/J) [(J/2 + 1/2) log(J/2) − J/2 + (1/2) log(2 π)]
− log(J+2)
− log π
+ O(1/J^{2})
 > − log(2 π e)
+ (log J + log π − 2) / J
+ O(1 / J^{2})
 Similarly, and using
log Γ(1+y) = − γ y + O(y^{2}),
where γ is Euler's constant,
 log k_{J} <
− log(2 π e) + (log J + log π − 2 γ) / J
+ O(1/J^{2})
Hence
(1/2) log(J π) − γ
> I_{1} − I_{0}
> (1/2) log(J π) − 1,
to order 1/J.
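Zador's bounds can also be evaluated exactly with gamma functions; the following sketch (our own code, not the paper's) checks that the bounds bracket the hexagonal value at J = 2 and close up as J grows.

```python
import math

# Numeric sketch of Zador's bounds on k_J, evaluated via math.lgamma to
# avoid overflow for large J.
def lower(J):   # sphere bound: Gamma(J/2+1)^(2/J) / ((J+2) pi)
    return math.exp((2.0 / J) * math.lgamma(J / 2.0 + 1)) / ((J + 2) * math.pi)

def upper(J):   # random-points bound: Gamma(J/2+1)^(2/J) Gamma(2/J+1) / (J pi)
    return (math.exp((2.0 / J) * math.lgamma(J / 2.0 + 1))
            * math.gamma(2.0 / J + 1) / (J * math.pi))
```

At J = 2 the hexagonal value 0.080188 lies between lower(2) = 1/(4π) ≈ 0.0796 and upper(2) = 1/(2π) ≈ 0.159, and the ratio of the bounds tends to 1 with increasing J.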
We see that, whereas use of a rectangular lattice, i.e.
repeated use of the one-problem SMML estimator, gives
I_{1} − I_{0} = (1/2) J log (2 π e / 12),
proportional to J,
the optimum SMML estimator has at worst
I_{1} − I_{0}
= (1/2) log (J π) − γ.
Although the shape of the optimum Voronoi region for large
J is unknown, the close approximation of the optimum
lattice to the behaviour of spherical regions strongly
suggests that the optimum Voronoi region is very nearly
spherical for large J.
 Making this assumption, the equation
s^{2/J} = σ^{2} / k_{J}
for its volume s gives for its radius ρ
 ρ^{J} π^{J/2} / Γ(J/2 + 1)
= s
= σ^{J} / k_{J}^{J/2}
Using the lower bound as an approximation for k_{J},
ρ = σ √(J+2)
If we consider one component of θ and x, i.e.
one estimation problem of the sequence,
the estimate θ_{j} is a function of
all components of x, not just
of x_{j}. If the other components are thought of as
random variables, all that is known of x is that it lies
in the subspace defined by x_{j}, and all that is known
of θ is that it lies somewhere in a sphere of
radius ρ centred on x. Thus, θ_{j} can be
regarded as a random variable generated by the projection of
a random point in this sphere onto the j axis. The
density of θ_{j} is therefore proportional to
 {ρ^{2}
− (θ_{j} − x_{j})^{2}}^{(J−1)/2}
= {(J+2) σ^{2}
− (θ_{j} − x_{j})^{2}}^{(J−1)/2}
For any J, this distribution has mean x_{j} and
standard deviation σ. For large J, it closely approximates
N(x_{j}, σ^{2}),
which is the posterior density of θ_{j} given x_{j}.
The SMML estimator thus produces estimates for the
individual problems of the sequence which have approximately
the behaviour of values chosen at random from the individual
posterior densities. That is, it approximates the false oracle T_{p}.
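This projection argument is easy to check by simulation; the sketch below (sample sizes and dimension are our choices) draws points uniformly from a J-ball of radius ρ = σ√(J+2) and projects them onto one axis.

```python
import math, random

random.seed(5)

# Simulation sketch of the closing argument of section 6: a point uniform in
# a J-ball of radius rho = sigma*sqrt(J+2), projected onto one axis, has
# mean 0 and standard deviation sigma, approaching N(0, sigma^2) for large J.
def projected_samples(J=20, sigma=1.0, n=50000):
    rho = sigma * math.sqrt(J + 2)
    out = []
    for _ in range(n):
        z = [random.gauss(0, 1) for _ in range(J)]       # random direction
        norm = math.sqrt(sum(t * t for t in z))
        radius = rho * random.random() ** (1.0 / J)      # uniform in the ball
        out.append(radius * z[0] / norm)                 # projection on axis 1
    return out
```

The sample mean is near 0 and the sample variance near σ² = 1, independent of J, as claimed.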
The results for this simple case are readily generalized to
any sequence of estimation problems in which the log-likelihood for
the sequence of data has an approximately quadratic
behaviour near its maximum, and where the prior for the
parameter sequence is slowlyvarying. It is not necessary
that all basic estimation problems in the sequence be
instances of the same problem, or even that they require
estimation of the same number of parameters. The results
depend only on the assumptions of slowly-varying priors
and approximately quadratic log-likelihoods.
7. A More General Case
We now describe a construction for a near-optimal encoding
of a data sequence which does not require smooth or
quadratic behaviour in the log prior and log likelihood
functions. As before, we consider a sequence of J
independent instances of the same basic problem described by
Θ, X, h(θ) and f(x|θ).
We assume x is a discrete variable.
In this construction, each estimate θ_{j} will
be selected from a set Θ* of equallyspaced values
{θ_{i}, i=1, 2 ...} whose spacing δ is
small compared with the expected estimation error of each θ_{j}.
Index i is used for indexing members of Θ*, and
index j for indexing problems of the sequence.
We assume
Σ_{i} δ h(θ_{i}) = 1.
The message is constructed as J segments.
Basically, each segment comprises two parts.
The first states an estimate θ_{j} of θ_{j},
selecting θ_{j} from Θ*,
and has length
 − log(δ h(θ_{j})).
The second encodes x_{j} using a code optimal if
θ_{j} = θ_{j}, and has length
− log f(x_{j}|θ_{j}).
The total length l_{j} of segment j is
 l_{j} =
− log(δ h(θ_{j}) f(x_{j}|θ_{j}))
= − log(δ p(θ_{j}, x_{j}))
However, the segments are combined into a complete message
in such a way that the length of the message is less than
Σ_{j} l_{j}.
As δ is small compared with the expected estimation
error of θ_{j}, there will be several members of Θ*,
separated by multiples of δ, any of which
would give a near-minimal value of l_{j}. The basis of
the construction is use of the freedom to choose any one of
several more-or-less equally good values for θ_{j}
so as to convey information about the following segment, j+1.
For definiteness, assume binary coding, so logarithms are to
base 2.
The construction proceeds backwards, in decreasing order of j.
Let M_{J+1} be the empty string,
of length L_{J+1} = 0.
Suppose that a string M_{j+1} of length L_{j+1}
has been constructed encoding estimates and data
for problems j+1, ..., J.
To form a string M_{j} covering
in addition problem j, proceed as follows:
 (a) Compute a distribution over Θ* given by
u_{ij}
= δ h(θ_{i})
f(x_{j}|θ_{i})
/ r*(x_{j}),
where
r*(x)
= δ Σ_{i}
h(θ_{i}) f(x|θ_{i}).
 (b) Construct an optimum prefix code C_{j} (e.g. a Huffman
code) for this distribution. C_{j} is a set of binary
strings, or words, {w_{ij}, i=1, 2,...}
in onetoone correspondence with the members of Θ*, such that no
word is a prefix of another, any sufficiently long binary
string has a word as prefix, and the length of w_{ij} is
approximately − log u_{ij}.
The approximation, which
arises because the lengths of words are necessarily
integral, will be neglected.
 (c) Find the value k of index i for which w_{kj} is a
prefix of M_{j+1}. If no such k exists, append
randomly-chosen binary digits to the end of M_{j+1}
until a match occurs. Let t_{j} be the number of digits appended.
 (d) Delete w_{kj} from the front of M_{j+1},
leaving a string M^{−}_{j+1}
of length
L^{−}_{j+1}
= L_{j+1} + log u_{kj} + t_{j}.
Note that if
t_{j} ≠ 0,
L^{−}_{j+1} = 0.
 (e) Construct a string S_{j} encoding datum x_{j} using
estimate θ_{j} = θ_{k}.
Its length is
l_{j}
= − log(δ p(θ_{k}, x_{j})).
 (f) Construct M_{j} by prepending S_{j}
to the front of M^{−}_{j+1}.
Steps (a) to (f) are repeated until the final complete
message M_{1} has been formed.
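Step (b) relies only on standard properties of Huffman codes; the toy sketch below (implementation details are ours) builds such a code and relies on the two properties the construction uses: the code is prefix-free and word lengths are close to −log₂ u_{i}.

```python
import heapq

# Toy sketch of step (b): a Huffman code over a small Theta* for a given
# distribution. Each heap entry holds (subtree weight, tiebreak, leaf ids).
def huffman(probs):
    heap = [(p, i, (i,)) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    codes = {i: "" for i in range(len(probs))}
    while len(heap) > 1:
        p1, _, leaves1 = heapq.heappop(heap)      # two lightest subtrees
        p2, t2, leaves2 = heapq.heappop(heap)
        for leaf in leaves1:
            codes[leaf] = "0" + codes[leaf]       # grow each word from the root
        for leaf in leaves2:
            codes[leaf] = "1" + codes[leaf]
        heapq.heappush(heap, (p1 + p2, t2, leaves1 + leaves2))
    return codes
```

For the dyadic distribution (1/2, 1/4, 1/8, 1/8) the word lengths are exactly −log₂ u_{i}, so the expected length equals the entropy, 1.75 bits; for general distributions the integral-length rounding the text neglects appears.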
We must show that M_{1} can be decoded using knowledge
only of h and f, and of the construction algorithm.
Decoding proceeds in order of increasing j.
The segment S_{1} for problem 1 appears in full as the
prefix of M_{1}. It is easily decoded. Its first part
is a Huffman or similar coding of θ_{1} based on the
known distribution δ h(θ_{i}) over Θ*.
Having decoded θ_{1}, the second
part can be decoded, as it is a Huffman or similar coding of x_{1}
based on the now-known distribution
f(x|θ_{1}) over X.
Having decoded S_{1}, we can delete it from the front of
M_{1}, leaving M^{−}_{2}.
Now, knowing x_{1}, we can compute the distribution
u_{i1} over Θ*, and hence reconstruct the code C_{1}.
Suppose the decoded estimate θ_{1} has
index k in Θ*. Then we can select w_{k1} from
C_{1}, and prepend it to M^{−}_{2}
giving M_{2}.
Since S_{2} appears in full as the prefix of M_{2},
it too can now be decoded, and so on until the entire
sequence of data (and estimates) has been recovered.
At the end of the decoding process, we will be left with a
string M^{+}_{J+1}
comprising all the Σ_{j} t_{j}
random digits appended in steps (c) of the coding algorithm.
It will be useful in what follows to imagine that the coding
process began with these randomly-chosen digits already in
place, i.e. with
M_{J+1} = M^{+}_{J+1},
L_{J+1} = Σ_{j} t_{j}.
The result of the coding is then
unchanged, but in step (c) it would never be necessary to
append additional digits, and at every stage of the coding
process we have
L^{−}_{j+1} = L_{j+1} + log u_{kj}.
 For all j,
 L_{j} = L^{−}_{j+1} + l_{j}
 = L_{j+1} + log u_{kj} − log(δ p(θ_{k}, x_{j}))
 where, of course, k is a function of j.
Now,
u_{kj} =
δ h(θ_{k}) f(x_{j}|θ_{k}) / r*(x_{j})
= δ p(θ_{k}, x_{j}) / r*(x_{j})
 Hence,
 L_{j} = L_{j+1} − log r*(x_{j})
 L_{1} = L_{J+1} − Σ_{j} log r*(x_{j})
 We now aim to show that the sequence of estimates yielded by
this construction is close to an SMML estimate for the sequence.
To do so, we will try to show that I_{1}, the
expected value of L_{1} over the marginal distribution
of x, is not much greater than the minimum possible value
 I_{0} = − J Σ_{X} r(x) log r(x).
 In a long sequence, datum value x may be expected to occur
with relative frequency r(x). Hence the expected value of
the second term in L_{1} is
 − J Σ_{X} r(x) log r*(x)
Now, r*(x)
= δ Σ_{i} p(θ_{i}, x)
= r(x) δ Σ_{i} g(θ_{i}|x)
where the sum is over a set Θ* of points equally
spaced in Θ at intervals of δ.
For sufficiently regular g( ),
δ Σ_{i} g(θ_{i}|x)
approximates the integral
∫ dθ g(θ|x) = 1
with an error which depends on the function g and the size of δ.
Further,
r*(x) =
δ Σ_{i} h(θ_{i}) f(x|θ_{i})
 For sufficiently regular distributions,
 Σ_{X} r*(x)
= δ Σ_{i} h(θ_{i}) Σ_{X} f(x|θ_{i})
= δ Σ_{i} h(θ_{i})
= 1
 so r*(x) is a proper distribution. Hence,
 − Σ_{X} r(x) log r*(x)
≥ − Σ_{X} r(x) log r(x)
 and
 I_{1} − I_{0}
= E(L_{J+1})
 − J Σ_{X} r(x) log (r*(x) / r(x))
 = E(L_{J+1})
 − J Σ_{X} r(x) log Σ_{i} δ g(θ_{i}|x)
The amount by which the sum
Σ_{i} δ g(θ_{i}|x)
departs from one may, for regular g( ), be
expected to be proportional to some small power of δ.
Then we can write
 I_{1} − I_{0}
= E(L_{J+1}) + J A δ^{n}
to order δ^{n}
where A is a constant depending on r( ) and g( ) but not on J.
Now consider
E(L_{J+1}) = E(Σ_{j} t_{j}).
In the first stage of the message construction, forming M_{J},
as many random digits will be required in step (c)
as there are digits in the word w_{kJ}.
In effect, w_{kJ}
is selected from C_{J} by finding a match
between a word of C_{J} and the prefix of a long random
binary string.
The probability that a word of length l will match is just 2^{−l}.
The length of w_{iJ} is
− log u_{iJ}.
Hence the expected value of t_{J} is
the expected value of
 − Σ_{i} u_{iJ} log u_{iJ}.
Now u_{iJ} is
δ h(θ_{i}) f(x_{J}|θ_{i})
/ r*(x_{J}),
or approximately
δ g(θ_{i}|x_{J}).
Thus the expected value of t_{J} is approximately
− log δ + B,
where B is a constant approximating the expected entropy of the
posterior distribution g( ).
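The approximation E(t_{J}) ≈ − log δ + B can be checked numerically. In the sketch below the posterior g is assumed, purely for illustration, to be a standard normal, so that B is its differential entropy (1/2) log(2 π e); the quantized u_{i} = δ g(θ_{i}) then have discrete entropy close to − log δ + B.

```python
import math

def g(theta):
    # assumed posterior g(theta | x): standard normal density
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

delta = 0.01                                   # spacing of Theta*
thetas = [delta * i for i in range(-1000, 1001)]
u = [delta * g(t) for t in thetas]             # u_i ~ delta * g(theta_i | x)

# expected number of appended digits: entropy of the discrete u_i
expected_t = -sum(p * math.log(p) for p in u if p > 0.0)

B = 0.5 * math.log(2.0 * math.pi * math.e)     # differential entropy of N(0,1)
approx = -math.log(delta) + B                  # the paper's approximation
```

For this g and δ the two quantities agree to well within one per cent.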
It is unlikely that t_{j} ≠ 0 (j<J).
For t_{J−1} ≠ 0,
we require that M_{J}, of length
L_{J} = − log(δ p(θ_{J}, x_{J})),
be so short that we cannot find in it a prefix matching a word of
C_{J−1}.
The length of a word w_{i,J−1} of C_{J−1} is approximately
− log(δ g(θ_{i}|x_{J−1}))
so we will fail to find a prefix when
g(θ_{i}|x_{J−1}) <
p(θ_{J}, x_{J}).
Since g(θ|x) = p(θ, x) / r(x),
we expect the ratio
g(θ_{i}|x_{J−1}) /
p(θ_{J}, x_{J})
to be of order 1/r(x), which is typically much greater than 1.
We therefore approximate
E(Σ_{j} t_{j}) by E(t_{J}).
 Thus, to order δ^{n},
 I_{1} − I_{0}
= B − log δ + J A δ^{n}
 Choosing δ to minimise
I_{1} gives
δ^{n} = 1/(nJA),
 I_{1} − I_{0}
= B + (1/n) + (1/n) log (nJA).
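The minimisation above is elementary but easy to check by brute force; the constants B, J, A and n below are arbitrary illustrative values, not quantities from the paper.

```python
import math

def excess(delta, B, J, A, n):
    # I_1 - I_0 = B - log(delta) + J * A * delta^n
    return B - math.log(delta) + J * A * delta ** n

B, J, A, n = 1.3, 1000, 0.7, 2

delta_opt = (1.0 / (n * J * A)) ** (1.0 / n)           # delta^n = 1/(nJA)
analytic = B + 1.0 / n + (1.0 / n) * math.log(n * J * A)

# scan a grid around the analytic optimum
grid = [delta_opt * (0.5 + 0.001 * i) for i in range(1, 2000)]
numeric = min(excess(d, B, J, A, n) for d in grid)
```

The grid minimum coincides with the analytic minimum, confirming the stated optimum.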
Thus we find that this construction, like the more exact but
less general construction of section 6, gives a value for
I_{1} − I_{0}
of order log J
for sufficiently regular h( ) and f( ).
As this value for I_{1} − I_{0}
is less than is achieved by the use of a single-problem SMML estimator on
each basic problem of the sequence,
which gives I_{1} − I_{0}
of order J, the construction must be close to
achieving the shortest possible expected length for a
message using estimates, and so its estimate sequence must
be similar to that given by the SMML sequence estimator.
It is worth remarking that this construction, like that of section 6,
does not require that every problem of the sequence have the same form.
We now show that the construction imitates a false oracle.
M_{j+1} is a near-optimal binary coding of the data
sequence {x_{j+1}, ..., x_{J}}.
When random variates from some distribution are encoded as a binary
string using an optimal code for that distribution, the
resulting binary string is a realization of a Bernoulli
sequence with p=0.5.
Thus M_{j+1} (extended if
necessary by appended random digits) is such a realization.
The probability that a given binary string of length l is
a prefix of M_{j+1} is 2^{−l}.
So the probability that step (c) will select
θ_{j} = θ_{i} is
2^{−length(w_{ij})} = u_{ij}.
But
u_{ij} =
δ p(θ_{i}, x_{j}) / r*(x_{j})
and r*(x_{j}) approximates r(x_{j}).
Hence u_{ij} approximates
δ g(θ_{i}|x_{j}).
If we consider the data sequence
{x_{j+1}, ..., x_{J}}
to be a realization of their joint marginal
distribution, the probability that this realization will,
for some x_{j}, cause the selection of the value
θ_{i} as the estimate θ_{j} of
θ_{j},
is just the posterior probability of θ_{i}
given x_{j}. In this sense, θ_{j} is a
random selection from the (quantized) posterior distribution
of θ_{j} given x_{j}, as is made by the false
oracle T_{p}.
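The claim that prefix-matching against a Bernoulli(0.5) string selects word i with probability 2^{−length(w_{ij})} is easy to verify empirically. The code below is a toy check with an assumed complete prefix code standing in for C_{j}, and a pseudo-random generator standing in for the coded later data.

```python
import random

# complete prefix code: word lengths imply u = (1/2, 1/4, 1/4)
CODE = {"t0": "0", "t1": "10", "t2": "11"}

rng = random.Random(42)
trials = 40000
counts = {t: 0 for t in CODE}

for _ in range(trials):
    s = ""
    chosen = None
    while chosen is None:
        s += str(rng.randint(0, 1))      # fair "message" bits
        for t, w in CODE.items():
            if s.startswith(w):          # first codeword that is a prefix
                chosen = t
                break
    counts[chosen] += 1

freqs = {t: counts[t] / trials for t in CODE}
```

The observed frequencies settle near (1/2, 1/4, 1/4), i.e. near 2^{−length}, as the argument requires.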
8. Discussion
We have defined a class of ‘empirical’ criteria for the
comparison of estimators based on their performance on data
or sequences of data. For sequences, there are empirical
criteria which estimate the bias and variance of estimators,
and any other conventional measure of performance
expressible as the expectation of some function of the true
parameter and the estimated parameter. We have shown the
existence of estimators called false oracles which no
empirical criterion can distinguish from the oracle
(whose ‘estimates’ are the true values). A false oracle can be
constructed by randomly selecting an estimate from its
posterior distribution.
The SMML estimator for a sequence has, in two cases, been
shown to imitate a false oracle. In the first case,
applicable whenever the parameters have slowly-varying
priors and approximately quadratic log likelihoods, the SMML
estimator is based on a quantizing lattice. The estimate
θ_{j} of the jth parameter is a
(discontinuous) function of all the data
{x_{1}, x_{2}, ..., x_{j},
..., x_{J}}
but, if one regards the data values other than x_{j}
as being randomly drawn from their respective marginal distributions,
θ_{j} is effectively sampled from a
distribution approximating its posterior given x_{j}.
In the second, more general case, we have shown that a
near-optimal coding of a data sequence (and so a near-SMML estimator)
is obtained when each estimate θ_{j}
is sampled from its posterior distribution given x_{j}
by using the coded data x_{j+1}...x_{J} as a
pseudorandom source of binary digits.
In both cases, a pseudorandom selection of θ_{j}
is made using (some of) the data other than x_{j} as
a source of “random” information. Thus the SMML estimator
is not truly a false oracle. A criterion based on
knowledge of the “pseudorandom number generator” inherent
in both coding algorithms could empirically distinguish the
SMML estimator from an oracle. However, since in both cases
the functional dependence of θ_{j} on data other
than x_{j} is quite obscure, for all practical purposes
one may expect the statistical behaviour of the SMML
estimator for a long sequence to be indistinguishable from
that of a false oracle, and hence from that of an oracle.
9. Applications
The analysis of the behaviour of minimum-message-length
estimators on long sequences of identical basic problems has
practical application. There are practical estimation
problems where the model includes not only a fixed set α
of general parameters but also a set of ‘nuisance’
parameters {θ_{n}, n=1...N} as numerous as the
sample size N. For instance, in a
factor analysis model
of a multivariate Gaussian population, once the means,
specific variances and factor loadings are known or
estimated, the estimation of the factor scores becomes a
sequence of N identical independent estimation problems.
The problem fits the assumptions of section 6, so the
analysis allows us to determine the length of the coded
message. In particular, the analysis shows that the message
length L contains a term (1/2) log(π N) − γ
rather than the larger term (N/2) log(2 π e / 12)
which is given by an analysis treating the estimation of each factor score
separately. The consequent reduction in message length
affects the comparison of the factor model against
alternative models of the same data, since the
minimum-message-length approach prefers the model with the
smaller (minimized) value of L (Wallace & Freeman 1992).
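The size of this reduction is easy to see numerically. The sketch below simply compares the two terms for an illustrative N, taking γ to be Euler's constant (an assumption on my part, as the paper does not define γ here).

```python
import math

N = 1000                       # number of factor scores (illustrative)
gamma = 0.5772156649           # Euler's constant (assumed meaning of the term)

# term from the sequence analysis of section 6
joint = 0.5 * math.log(math.pi * N) - gamma

# term from treating each factor score as a separate estimation problem
separate = (N / 2.0) * math.log(2.0 * math.pi * math.e / 12.0)
```

For N = 1000 the sequence term is a few nats while the per-score term is well over a hundred, which is the reduction the text describes.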
A second example is a
mixture model,
in which each datum x_{n}
is thought to be sampled from an unknown member of a set of
different distributions
{f_{k}(x|α_{k}), k=1...K}.
Here, once the distribution parameters
α_{k} and relative abundances h_{k} are known or
estimated, the estimation of the class (i.e. distribution)
k = θ_{n} to which each datum belongs is a sequence
of identical basic problems each requiring the estimation of
a categorical parameter θ_{n}.
If each basic problem is considered separately, the shortest
encoding is obtained by estimating θ_{n} to be
that value of k for which
h_{k} f_{k}(x_{n}|α_{k})
is a maximum.
That is, datum x_{n} would be assigned to that component
of the mixture most likely to contain it. The resulting
message length L would then have the form
 L =
Q{α_{k}, h_{k}, k=1...K}
 − Σ_{n} log [ max_{k} (h_{k} f_{k}(x_{n}|α_{k})) ]
Estimation of the distribution parameters and abundances by
minimization of the above form has been found to give
inconsistent estimates. However, the analysis of section 7
shows that a shorter coding is possible, and has a message
length for large N of the form
 L =
Q{α_{k}, h_{k}}
 − Σ_{n} log Σ_{k} h_{k} f_{k}(x_{n}|α_{k})
+ O(1)
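The gap between the two codings can be illustrated with a toy mixture (an assumed example, not one from the paper): since max_{k} ≤ Σ_{k}, the mixture form is shorter term by term.

```python
import math
import random

def normal_pdf(x, m):
    # unit-variance Gaussian density
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2.0 * math.pi)

h = [0.6, 0.4]                 # abundances h_k (illustrative)
mu = [0.0, 3.0]                # component means (illustrative)

rng = random.Random(1)
data = [rng.gauss(mu[0] if rng.random() < h[0] else mu[1], 1.0)
        for _ in range(500)]

# data term when each datum is assigned to its most probable class
L_max = -sum(math.log(max(h[k] * normal_pdf(x, mu[k]) for k in range(2)))
             for x in data)

# data term under the mixture coding of section 7
L_mix = -sum(math.log(sum(h[k] * normal_pdf(x, mu[k]) for k in range(2)))
             for x in data)
```

On such overlapping components L_mix is strictly smaller than L_max, which is the shorter coding the analysis promises.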
In the mixture problems so far studied, minimization of the
latter form yields consistent estimates for {α_{k}}
and {h_{k}}. The smaller value of the latter
form also affects the comparison between the lengths for
models with different numbers of component distributions,
and so affects the estimation of the number of components.
It is perhaps worth mentioning that the coding device
described in section 7 has been used successfully in a
computer program for file compression.
Acknowledgements: This work was done with the assistance of
a study grant from Monash University. The work owes much to
the informed comment and assistance of P.R.Freeman.
References
 Barron, A.R. & Cover, T.M.,
Minimum Complexity Density Estimation,
IEEE Trans. Inf. Thy., 37(4), pp.1034-1054, (1991).
 Rissanen, J.,
Stochastic Complexity,
J. Royal Stat. Soc. B, 49, 3, pp.228-239, (1987).
 Wallace, C.S. & Freeman, P.R.,
Estimation & Inference by Compact Coding,
J. Royal Stat. Soc. B, 49, 3, pp.240-252, (1987).
 Wallace, C.S. & Freeman, P.R.,
Single Factor Analysis by MML Estimation,
J. Royal Stat. Soc. B, 54, 1, pp.195-209, (1992).
 Zador, P.,
Asymptotic Quantization Error of Continuous Signals and the Quantization Dimension,
IEEE Trans. Inf. Thy., IT-28, pp.139-149, (1982).
© C. S. Wallace.