Fisher Information


For one continuous-valued parameter, θ, the Fisher information is defined to be:

F(θ) = E_x( d²/dθ² { - ln f(x|θ) } )

where f(x|θ) is the likelihood, i.e. P(x|θ) for data 'x' and parameter value (or hypothesis, ...) θ. E_x is the expectation, i.e. average over x in the data-space X.
(NB. The 'd's denote partial derivatives with respect to θ.)

The Fisher information shows how sensitive the likelihood is to the parameter θ. It turns out to be the key to how accurately parameter estimates, i.e. inferences, should be stated. We should infer a parameter estimate (usually) close to the maximum likelihood estimate, i.e. close to where d/dθ f(x|θ) = 0, and the second derivative, d²/dθ² { - ln f(x|θ) }, indicates the curvature of the (negative log) likelihood function there: the greater the curvature, the more sharply the likelihood falls away from its peak.
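As a minimal sketch of the definition, assume the data are the number of heads, k, in n tosses of a coin whose probability of heads is θ (an illustrative choice of model; the function names are hypothetical). The expectation over the data space X = {0, ..., n} can then be computed directly and compared with the standard closed form n / (θ(1-θ)).

```python
import math

def neg_log_lik_second_deriv(k, n, theta):
    """d²/dθ² of -ln f(k|θ) for k heads in n tosses."""
    return k / theta**2 + (n - k) / (1.0 - theta)**2

def binomial_pmf(k, n, theta):
    """f(k|θ): probability of k heads in n tosses."""
    return math.comb(n, k) * theta**k * (1.0 - theta)**(n - k)

def fisher_information(n, theta):
    """F(θ) = E_x( d²/dθ² { -ln f(x|θ) } ), summing over x in X = {0..n}."""
    return sum(binomial_pmf(k, n, theta) * neg_log_lik_second_deriv(k, n, theta)
               for k in range(n + 1))

if __name__ == "__main__":
    theta = 1.0 / 3.0
    for n in (3, 300):
        # numerical expectation next to the closed form n / (θ(1-θ))
        print(n, fisher_information(n, theta), n / (theta * (1.0 - theta)))
```

The Fisher information grows in proportion to the number of tosses, which is one way of seeing that more data pins the parameter down more tightly.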

The logarithms can be taken in any base, provided that we remember which one, because the base fixes the units (bits for base 2, nits for base e, ...). Differentiation favours natural logarithms, log_e = ln, and quantities can easily be converted to bits later.
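The conversion between units is a single division or multiplication by ln 2. A small sketch, with hypothetical helper names:

```python
import math

def nits_to_bits(nits):
    """Convert a message length from nits (base e) to bits (base 2)."""
    return nits / math.log(2.0)

def bits_to_nits(bits):
    """Convert a message length from bits (base 2) to nits (base e)."""
    return bits * math.log(2.0)

if __name__ == "__main__":
    # an event of probability 1/8 costs -ln(1/8) nits, i.e. 3 bits
    print(nits_to_bits(-math.log(1.0 / 8.0)))
```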

MML

A parameter estimate, θ, can only be stated to finite accuracy. How much accuracy is optimal? If a coin is tossed three times and comes up heads once we surely have much less information about any bias (θ) than if the coin is tossed 300 times and comes up heads 100 times. Finite accuracy amounts to stating that θ lies in an interval (θ-s/2, θ+s/2); note that the width, s, depends on θ in general.

First Part of Message

If h(θ) is the prior probability density function of θ, the probability, and message length, of the interval are approximated by

probability = h(θ) . s

msgLen = - ln( h(θ) . s ) nits

always assuming that h(θ) does not vary much over the interval.

Second Part of Message

The second part of the message transmits the data given the first part. The receiver has not seen the data, x, and does not know any estimate based on the data unless told by the transmitter, so we must use the average over the interval (θ-s/2, θ+s/2).

Letting θ' = θ + t, where -s/2<t<s/2, the message length of the second part is

- ln f(x|θ')

= - ln f(x|θ+t) where -s/2<t<s/2

= - ln f(x|θ) + t (d/dθ{ - ln f(x|θ) }) + (1/2) t² (d²/dθ²{ - ln f(x|θ) }) + ...

by the Taylor expansion, ignoring O(t³)-terms.

Noting that

the linear term in t averages to zero over (-s/2, s/2), and
the integral of t² over (-s/2, s/2) is [t³/3] evaluated from -s/2 to s/2, i.e. s³/12, so the average of t² over the interval is s²/12,

the average for t ranging over (-s/2, s/2) is

- ln f(x|θ) + (s²/24) d²/dθ²{ - ln f(x|θ) }
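The approximation can be checked numerically. The sketch below assumes a coin model (k heads in n tosses, written as a sequence so that f(x|θ) = θ^k (1-θ)^(n-k)) and compares a midpoint-rule average of -ln f(x|θ+t) over the interval with the two-term expression above. The model, the value of s and the function names are illustrative assumptions.

```python
import math

def neg_log_lik(theta, k, n):
    """-ln f(x|θ) for a sequence of n tosses containing k heads."""
    return -k * math.log(theta) - (n - k) * math.log(1.0 - theta)

def second_deriv(theta, k, n):
    """d²/dθ² { -ln f(x|θ) } for the same model."""
    return k / theta**2 + (n - k) / (1.0 - theta)**2

def interval_average(theta, s, k, n, steps=10000):
    """Average of -ln f(x|θ+t) for t in (-s/2, s/2), by the midpoint rule."""
    h = s / steps
    total = 0.0
    for i in range(steps):
        t = -s / 2.0 + (i + 0.5) * h
        total += neg_log_lik(theta + t, k, n)
    return total / steps

def quadratic_approximation(theta, s, k, n):
    """-ln f(x|θ) + (s²/24) d²/dθ² { -ln f(x|θ) }."""
    return neg_log_lik(theta, k, n) + (s**2 / 24.0) * second_deriv(theta, k, n)

if __name__ == "__main__":
    theta, s, k, n = 0.3, 0.05, 100, 300
    print(interval_average(theta, s, k, n))
    print(quadratic_approximation(theta, s, k, n))
```

Here θ = 0.3 is deliberately not the maximum likelihood estimate, so the first derivative is non-zero; the two values should still agree closely because the linear term cancels over the symmetric interval.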

Choosing 's'

Adding the message lengths for the two parts of the message:

- ln( h(θ).s ) - ln f(x|θ) + (s²/24) d²/dθ²{ - ln f(x|θ) }

To find the minimum, and thus the value for s, differentiate w.r.t. s and set to zero. Let F(x, θ) = d²/dθ²{ - ln f(x|θ) }, then

- 1/s + (s/12) F(x, θ) = 0

s² = 12 / F(x, θ)

This value of s depends on x, which the receiver does not know. We must instead use the expected quantity

s² = 12 / E_x F(x,θ) = 12 / ∑_x:X f(x|θ).F(x,θ) = 12/F(θ)

as x ranges over X, i.e. we use the Fisher information; both transmitter and receiver can evaluate this.
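Returning to the coin example, the width s can be computed for 3 and for 300 tosses. The sketch assumes the standard closed form F(θ) = n / (θ(1-θ)) for n tosses, which is not derived in the text above; the function names are hypothetical.

```python
import math

def coin_fisher_information(n, theta):
    """Assumed closed form for n tosses of a coin: F(θ) = n / (θ(1-θ))."""
    return n / (theta * (1.0 - theta))

def optimal_width(fisher):
    """Width s of the interval in which θ is stated: s² = 12 / F(θ)."""
    return math.sqrt(12.0 / fisher)

if __name__ == "__main__":
    theta = 1.0 / 3.0
    for n in (3, 300):
        print(n, optimal_width(coin_fisher_information(n, theta)))
```

With 100 times as much data the estimate is stated about 10 times as precisely. For 3 tosses the computed width is close to the whole range (0, 1) of θ, which suggests that the assumption that h(θ) and the likelihood vary little over the interval is strained for so little data.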

Substituting s² = 12/F(θ) into the total message length gives

msgLen = - ln h(θ) - ln f(x|θ) + (1/2) ln F(θ) - (1/2) ln 12 + (1/2) F(x,θ) / F(θ)

Finally, "what is usually done is to replace the last term [...] by 1/2" (Farr 1999 p.41) to give an approximation which is reasonable provided that F(x,θ)-F(θ) is small over (θ-s/2, θ+s/2).

msgLen ~ - ln h(θ) - ln f(x|θ) + (1/2) ln F(θ) - (1/2) ln 12 + 1/2
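A minimal sketch of this approximation for the coin example follows. It assumes a uniform prior h(θ) = 1 on (0, 1), a sequence of n tosses containing k heads, and the closed form F(θ) = n / (θ(1-θ)); none of these choices is fixed by the text above, and the function names are hypothetical. A simple grid search then finds the θ giving the shortest message.

```python
import math

def message_length(theta, k, n, prior_density=lambda t: 1.0):
    """Approximate two-part message length, in nits, for stating θ and the data."""
    neg_log_prior = -math.log(prior_density(theta))
    neg_log_lik = -k * math.log(theta) - (n - k) * math.log(1.0 - theta)
    fisher = n / (theta * (1.0 - theta))  # assumed closed form for the coin
    return (neg_log_prior + neg_log_lik
            + 0.5 * math.log(fisher) - 0.5 * math.log(12.0) + 0.5)

def best_estimate(k, n, grid=10000):
    """Grid search over (0, 1) for the θ that minimises the message length."""
    candidates = (i / grid for i in range(1, grid))
    return min(candidates, key=lambda t: message_length(t, k, n))

if __name__ == "__main__":
    for k, n in ((1, 3), (100, 300)):
        print(k, n, best_estimate(k, n))
```

Under these assumptions the terms in θ are -(k + 1/2) ln θ - (n - k + 1/2) ln(1-θ), so the minimum lies at (k + 1/2)/(n + 1): close to, but not exactly at, the maximum likelihood estimate k/n.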

A number of simplifying assumptions have been made along the way; beware if their preconditions do not hold! The simplifications lead to more tractable mathematics.

Multiple Parameters

With multiple parameters, or equivalently a vector of parameters θ = <θ₁, ..., θ_n>, the sensitivity of the likelihood is indicated by the second partial derivatives (Wallace & Freeman 1987).

θ = <θ₁, ..., θ_n>

F(x, θ)_ij = d²/(dθ_i dθ_j) { - ln f(x|θ) }

F(θ) = ∑_x:X f(x|θ).F(x,θ)

We have two n*n matrices, F(x,θ)_ij and F(θ)_ij. The Fisher information is now defined to be the determinant of F(θ).
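As an illustrative two-parameter case (an assumption, not taken from the text above), consider num_data independent values from a normal distribution with θ = <μ, σ>. The second partial derivatives of -ln f(x|θ) depend on the data only through ∑(x_i - μ) and ∑(x_i - μ)², whose expectations are 0 and num_data.σ², so the expected matrix can be written down without sampling. Function names are hypothetical.

```python
def observed_information(sum_dev, sum_sq_dev, num_data, sigma):
    """F(x, θ) for θ = <μ, σ>, given ∑(x_i - μ) and ∑(x_i - μ)²."""
    f_mu_mu = num_data / sigma**2
    f_mu_sigma = 2.0 * sum_dev / sigma**3
    f_sigma_sigma = -num_data / sigma**2 + 3.0 * sum_sq_dev / sigma**4
    return [[f_mu_mu, f_mu_sigma],
            [f_mu_sigma, f_sigma_sigma]]

def expected_information(num_data, sigma):
    """F(θ): substitute E[∑(x_i - μ)] = 0 and E[∑(x_i - μ)²] = num_data.σ²."""
    return observed_information(0.0, num_data * sigma**2, num_data, sigma)

def determinant_2x2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

if __name__ == "__main__":
    matrix = expected_information(10, 2.0)
    print(matrix, determinant_2x2(matrix))
```

The substitution works here because every entry of F(x, θ) is linear in the two sums; for other models the expectation generally has to be taken by summing or integrating over X.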

The message length is

msgLen = - ln h(θ) - ln f(x|θ) + (1/2) ln F(θ) + (n/2) (1 + ln k_n) nits

which splits into the two parts of the message:

model:        - ln h(θ) + (1/2) ln F(θ) + (n/2) ln k_n
data | model: - ln f(x|θ) + n/2

where the k_n are lattice constants to do with partitioning the n-dimensional parameter space, k₁ = 1/12 = 0.0833..., k₂ = 5 / (36 √3), k₃ = 19 / (192 . 2^(1/3)), and k_n → 1/(2 π e) = 0.0585498 as n → ∞ (Farr 1999 p.43).
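The two parts of the message can be sketched as a small function. The caller is assumed to supply - ln h(θ), - ln f(x|θ) and the determinant of the expected matrix F(θ); only the lattice constants quoted above are tabulated, and the names are hypothetical.

```python
import math

# lattice constants k_n quoted in the text, for n = 1, 2, 3
LATTICE_CONSTANTS = {
    1: 1.0 / 12.0,
    2: 5.0 / (36.0 * math.sqrt(3.0)),
    3: 19.0 / (192.0 * 2.0 ** (1.0 / 3.0)),
}
K_LIMIT = 1.0 / (2.0 * math.pi * math.e)  # limit of k_n as n grows

def message_length_parts(neg_log_prior, neg_log_lik, fisher_det, n):
    """Return (model part, data-given-model part) of the message, in nits."""
    k_n = LATTICE_CONSTANTS[n]
    model_part = neg_log_prior + 0.5 * math.log(fisher_det) + 0.5 * n * math.log(k_n)
    data_part = neg_log_lik + 0.5 * n
    return model_part, data_part

if __name__ == "__main__":
    for n, k_n in LATTICE_CONSTANTS.items():
        print(n, k_n)
    print(K_LIMIT)
```

For n = 1 the sum of the two parts reduces to the single-parameter formula, since (1/2)(1 + ln k₁) = 1/2 - (1/2) ln 12.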

Strict MML, SMML

Note that Strict MML (SMML) (Wallace & Boulton 1975, Farr 1999 p.49) does not make the simplifying approximations of MML; however, the mathematical and algorithmic consequences can be severe (Farr & Wallace 1997).

Notes

The MML derivations above generalise the particular forms for the binomial, multinomial and normal distributions, which were first given by Wallace and Boulton (1968), to other distributions such as Student's t-distribution.
— LA, 8/1999
This material is based on talks given by C. S. Wallace c1988, on Wallace & Freeman (1987), R. Baxter's PhD thesis (1996), and G. Farr (1999).

C. S. Wallace & D. M. Boulton. An Invariant Bayes Method for Point Estimation. Classification Soc. Bulletin, 3, pp.11-34, 1975.
C. S. Wallace & P. R. Freeman. Estimation and Inference by Compact Coding. J. Royal Stat. Soc., 49(3), pp.240-265, 1987.
R. Baxter. Minimum Message Length Inductive Inference - Theory and Application. PhD thesis, Dept. Computer Science, Monash University, Dec. 1996.
G. Farr & C. S. Wallace. The Complexity of Strict Minimum Message Length Inference. TR97/321, Department of Computer Science, Monash University, Aug 1997.
G. Farr. Information Theory and MML Inference. School of Computer Science and Software Engineering, 1999.