
Approximations to MML

The general form for MML ("MML87") depends on the determinant of the Fisher information matrix.

MML87: For model parameter(s) θ, prior h(θ), data-space X, data x, and likelihood function f(x|θ):

θ = <θ₁, ..., θ_n>,
F(x, θ)_ij = d²/dθ_i dθ_j { - ln f(x|θ) },
F(θ) = ∑_x:X { f(x|θ).F(x,θ) } -- i.e., expectation,

then:

msgLen = m_model + m_data, where
m_model = - ln h(θ) + (1/2) ln |F(θ)| + (n/2) ln k_n nits,
m_data = - ln f(x|θ) + n/2 nits.

Note, k₁ = 1/12 = 0.083333, k₂ = 0.080188, k₃ = 0.078543, k₄ = 0.076603, k₅ = 0.075625, k₆ = 0.074244, k₇ = 0.073116, k₈ = 0.071682, and k_n -> 1/(2πe) = 0.0585498 as n -> ∞ [Conway & Sloane '88].

(MML87 requires that f(x|θ) varies little over the data measurement accuracy region and that h(θ) varies little over the parameter uncertainty region.)
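As a worked illustration of the formula, the following is a minimal sketch for the simplest case, n = 1. It assumes a sequence of N Bernoulli trials with s successes, a uniform prior h(θ) = 1 on [0, 1], and the expected Fisher information F(θ) = N/(θ(1-θ)) for that model. The function names mml87_binomial and best_theta are hypothetical, chosen for this example only.

```python
import math

K1 = 1.0 / 12.0  # lattice constant k_1, for n = 1 parameter


def mml87_binomial(successes, trials, theta):
    """MML87 message length (nits) for a Bernoulli sequence.

    Assumptions of this sketch: uniform prior h(theta) = 1 on [0, 1],
    expected Fisher information F(theta) = trials / (theta * (1 - theta)).
    """
    n_params = 1
    failures = trials - successes
    neg_log_prior = -math.log(1.0)               # - ln h(theta)
    fisher = trials / (theta * (1.0 - theta))    # F(theta)
    m_model = (neg_log_prior
               + 0.5 * math.log(fisher)
               + 0.5 * n_params * math.log(K1))
    neg_log_lik = -(successes * math.log(theta)
                    + failures * math.log(1.0 - theta))
    m_data = neg_log_lik + 0.5 * n_params
    return m_model + m_data


def best_theta(successes, trials, steps=10000):
    """Grid search over (0, 1) for the theta with the shortest message."""
    grid = [(i + 0.5) / steps for i in range(steps)]
    return min(grid, key=lambda t: mml87_binomial(successes, trials, t))


theta_hat = best_theta(3, 10)
```

Under these assumptions, setting the derivative of msgLen to zero gives θ = (s + 1/2)/(N + 1), and the grid search for s = 3, N = 10 settles close to that value.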

Sometimes the mathematics of the Fisher information is not tractable. It may be possible to transform the problem so that it becomes easier (e.g., by using orthonormal basis functions for polynomial fitting), which is acceptable because MML is invariant under such transformations. Failing that, the remaining options include:

simplifying assumptions,
numerical approximations,
empirical Fisher.

Gradient²

CSW, csse tea room, 22/5/'01: We take the gradient (a vector), G, of the log-likelihood function for a single datum and form the matrix GG', i.e., the outer product. This will transform like the square of a density. Assuming that the data are i.i.d., we can then sum over all the observed data to get γ = ∑_k=1..N (G_k G_k'), where G_k is the gradient for the k-th datum. This, again, will transform like the square of a density. So, it can be used as an approximation to the expected Fisher information, as γ will be invariant.

The downside is that we need the amount of data, N, to be at least as large as the number of parameters to be estimated. If not, then the matrix γ will be singular.

This approximation has been used in some versions of SNOB.

(Present csw, dld, la, rdp, 22/5/'01.)
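The construction of γ can be sketched for a concrete model. The following assumes a normal distribution with parameters (μ, σ), for which the gradient of ln f(x|μ,σ) has a simple closed form; the helper names are hypothetical. It also illustrates the singularity when N is smaller than the number of parameters.

```python
def score_normal(x, mu, sigma):
    """Gradient of ln f(x | mu, sigma) with respect to (mu, sigma)."""
    d_mu = (x - mu) / sigma ** 2
    d_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma ** 3
    return [d_mu, d_sigma]


def gradient_squared(data, mu, sigma):
    """gamma = sum over the data of the outer product G G'."""
    gamma = [[0.0, 0.0], [0.0, 0.0]]
    for x in data:
        g = score_normal(x, mu, sigma)
        for i in range(2):
            for j in range(2):
                gamma[i][j] += g[i] * g[j]
    return gamma


def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# One datum, two parameters: gamma is a single outer product, so singular.
gamma_one = gradient_squared([1.5], 0.0, 1.0)

# Two data with linearly independent gradients: gamma is non-singular.
gamma_two = gradient_squared([1.5, -0.5], 0.0, 1.0)
```

Because γ is a sum of outer products it is symmetric and positive semi-definite by construction; it becomes positive definite once the per-datum gradients span the parameter space.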

Probability, pr(x|θ), nlpr(x|θ) = - log pr(x|θ).

Given data x₁, x₂, ..., x_N,

negative log likelihood, L,

L = ∑_i nlpr(x_i|θ) = ∑_i{ - log pr(x_i|θ) }

1st derivative of L wrt θ:

dL/dθ = ∑_i nlpr'(x_i|θ) = ∑_i{ - (d/dθ pr(x_i|θ)) / pr(x_i|θ) }

2nd derivative of L wrt θ:

d²L/dθ² = ∑_i{ - (d²/dθ² pr(x_i|θ)) / pr(x_i|θ) + {(d/dθ pr(x_i|θ)) / pr(x_i|θ)}² }

~ ∑_i{ (d/dθ pr(x_i|θ)) / pr(x_i|θ) }²

= ∑_i{nlpr'(x_i|θ)}²

assuming that ∑_i{ - (d²/dθ² pr(x_i|θ)) / pr(x_i|θ) } is small; note that the expected value is

E_x - (d²/dθ² pr(x|θ)) / pr(x|θ)

= ∫ - (d²/dθ² pr(x|θ)) / pr(x|θ) . pr(x|θ) dx

= ∫ - d²/dθ² pr(x|θ) dx

= - d²/dθ² ∫ pr(x|θ) dx --unless pr is pathological

= - d²/dθ² 1 -- !

= 0

If θ = <θ₁, ..., θ_k>, the 2nd derivative becomes the matrix of 2nd derivatives d²L/dθ_i dθ_j, nlpr' becomes grad nlpr (you may also see it called the Jacobian, J), and the { }² becomes the outer product.

-- 2007, LA
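The approximation d²L/dθ² ~ ∑_i{nlpr'(x_i|θ)}² can be checked numerically. The sketch below assumes an exponential distribution, pr(x|rate) = rate.exp(-rate.x), chosen only because both sides have closed forms: nlpr' = x - 1/rate, and d²L/d rate² = N/rate². The data are simulated and the derivatives are evaluated at the generating rate; all names are hypothetical.

```python
import random


def nlpr_prime(x, rate):
    """d/d(rate) of -ln(rate * exp(-rate * x)), i.e. x - 1/rate."""
    return x - 1.0 / rate


def sum_squared_gradients(data, rate):
    """The gradient-squared approximation: sum_i nlpr'(x_i | rate)^2."""
    return sum(nlpr_prime(x, rate) ** 2 for x in data)


def exact_second_derivative(data, rate):
    """d^2 L / d rate^2 = N / rate^2 for the exponential model."""
    return len(data) / rate ** 2


rng = random.Random(0)        # fixed seed so the run is repeatable
true_rate = 2.0
data = [rng.expovariate(true_rate) for _ in range(20000)]

approx = sum_squared_gradients(data, true_rate)
exact = exact_second_derivative(data, true_rate)
ratio = approx / exact        # expected to be close to 1 for large N
```

The two quantities agree only in expectation, so the ratio fluctuates around 1 and the agreement tightens as N grows.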

Empirical Fisher

The Fisher information matrix contains the expected 2nd derivatives of the negative log-likelihood function with respect to the model parameters. It is possible to estimate these 2nd derivatives, given the data, by perturbing the parameters, individually and in pairs, by small amounts and calculating the resulting changes in the negative log-likelihood. This computation is feasible for quite large numbers of parameters.

Unfortunately the resulting matrix is not guaranteed to be positive definite. The gradient² method described above does not have this (possible) problem.

The empirical Fisher is also not invariant.

(This has been discussed by csw since well before 1991.)
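The perturbation scheme can be sketched with central finite differences. This is one possible way to do it, not necessarily the one used in practice: the step size, the central-difference stencil, and the normal model with parameters (μ, σ) used as the example are all assumptions of this sketch, and the names are hypothetical.

```python
import math


def neg_log_lik(params, data):
    """Negative log-likelihood of i.i.d. normal data, params = (mu, sigma)."""
    mu, sigma = params
    return sum(math.log(sigma) + 0.5 * math.log(2.0 * math.pi)
               + (x - mu) ** 2 / (2.0 * sigma ** 2) for x in data)


def empirical_fisher(nll, params, data, step=1e-4):
    """Estimate the matrix of 2nd derivatives of nll by perturbing the
    parameters individually (i == j) and in pairs (i != j)."""
    n = len(params)

    def shifted(i, di, j, dj):
        p = list(params)
        p[i] += di
        p[j] += dj
        return nll(p, data)

    hessian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = (shifted(i, step, j, step) - shifted(i, step, j, -step)
                     - shifted(i, -step, j, step)
                     + shifted(i, -step, j, -step)) / (4.0 * step * step)
            hessian[i][j] = hessian[j][i] = value
    return hessian


data = [1.0, 2.0, 3.0, 4.0, 5.0]
mu_hat = sum(data) / len(data)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / len(data))
hessian = empirical_fisher(neg_log_lik, [mu_hat, sigma_hat], data)
```

At the maximum-likelihood estimates used here the result is close to the analytic matrix diag(N/σ², 2N/σ²). Away from such a point, nothing in the scheme forces the estimated matrix to be positive definite.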

MMLD

MsgLen ~ - log{ ∫_R h(θ) dθ } - { ∫_R h(θ) log f(x|θ) dθ } / { ∫_R h(θ) dθ }
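The MMLD expression is the negative log of the prior mass in a region R, plus the prior-weighted average of - log f(x|θ) over R. A minimal numerical sketch for a single parameter follows. It assumes R is an interval [lo, hi], a midpoint-rule integration, a uniform prior, and a Bernoulli sequence with 3 successes and 7 failures as the data; how R should be chosen is not addressed here, and all names are hypothetical.

```python
import math


def mmld_message_length(log_lik, prior, lo, hi, steps=2000):
    """Midpoint-rule estimate of the MMLD message length over R = [lo, hi]."""
    width = (hi - lo) / steps
    prior_mass = 0.0          # integral over R of h(theta)
    weighted_log_lik = 0.0    # integral over R of h(theta) * log f(x|theta)
    for i in range(steps):
        theta = lo + (i + 0.5) * width
        h = prior(theta)
        prior_mass += h * width
        weighted_log_lik += h * log_lik(theta) * width
    return -math.log(prior_mass) - weighted_log_lik / prior_mass


def log_lik(theta, successes=3, failures=7):
    """log f(x | theta) for a Bernoulli sequence (example data)."""
    return successes * math.log(theta) + failures * math.log(1.0 - theta)


def uniform_prior(theta):
    """h(theta) = 1 on [0, 1]."""
    return 1.0


narrow = mmld_message_length(log_lik, uniform_prior, 0.2999, 0.3001)
wide = mmld_message_length(log_lik, uniform_prior, 0.15, 0.5)
```

The two calls show the trade-off in the formula: a very narrow region costs many nits to specify, while a wider region is cheaper to specify but fits the data slightly worse on average.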

E. Lam, Improved approximations to MML, Honours thesis, 2000, CSSE, Monash University, Australia.
C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length, Springer Verlag, 2005.
