
Fisher information,
twopart message,
accuracy of parameter (inference),
multiple parameters

For one continuousvalued parameter, θ, the
Fisher
information is defined to be:
 F(θ) =
E_{x}( d^{2}/dθ^{2} {  ln f(xθ) } )
where f(xθ) is the likelihood,
i.e. P(xθ) for data 'x' and parameter value
(or hypothesis, ...) θ.
E_{x} is the expectation,
i.e. average over x in the
dataspace X.
(NB. The 'd's should be curly but this is HTML not XML.)
The Fisher information shows how sensitive
the likelihood is to the parameter θ.
It turns out to be the key to how accurately parameter estimates,
i.e. inferences, should be stated.
We should infer a parameter estimate (usually) close to
the maximum likelihood estimate,
i.e. close to where
d/dθ f(xθ) = 0,
and the second derivative,
d^{2}/dθ^{2} f(xθ),
is the inverse of the curvature of the likelihood function here.
The logs can be in any base, provided that we remember which one,
for the units (bits, nits, ...), but differentiation etc.
favour natural logs to base e, log_{e}=ln.
Quantities can easily be converted to bits later.
MML
A parameter estimate, θ, can only be stated to finite
accuracy. How much accuracy is optimal?
If a coin is tossed three times and comes up heads once
we surely have much less information about any bias (θ) than
if the coin is tossed 300 times and comes up heads 100 times.
Finite accuracy amounts to stating that θ lies in an interval
(θs/2, θ+s/2); note that the width, s,
depends on θ in general.
First Part of Message
If h(θ) is the prior probability density function of θ,
the probability, and message length, of the interval are approximated by
 probability = h(θ) . s
 msgLen =  ln( h(θ) . s ) nits
always assuming that h(θ) does not vary much over the interval.
Second Part of Message
The second part of the message transmits the data given the first part.
The receiver has not seen the data, x, and does not know any
estimate based on the data unless told by the transmitter, so we must use the
average over the interval (θs/2, θ+s/2).
Letting θ' = θ + t, where s/2<t<s/2,
the message length of the second part is
  ln f(xθ')
 =  ln f(xθ+t) where s/2<t<s/2
 =  ln f(xθ)
+ t (d/dθ{  ln f(xθ) })
+ (1/2) t^{2} (d^{2}/dθ^{2}{  ln f(xθ) })
+ ...
by the Taylor expansion, ignoring O(t^{3})terms.
Noting that
 the linear term in t averages to zero over (s/2, s/2), and
 the integral of t^{2} over (s/2, s/2)
is [t^{3}/3]_{s/2,s/2} = s^{3}/12,
the average for t ranging over (s/2, s/2) is
  ln f(xθ)
+ (s^{2}/24) d^{2}/dθ^{2}{
 ln f(xθ) }
Choosing 's'
Adding the message lengths for the two parts of the message:
  ln( h(θ).s )
 ln f(xθ)
+ (s^{2}/24) d^{2}/dθ^{2}{
 ln f(xθ) }
to find the minimum, and thus the value for s,
differentiate w.r.t. s and set to zero
 let F(x, θ) =
d^{2}/dθ^{2}{  ln f(xθ) }

 s^{2} = 12 / F(x, θ)
This value of s depends on x which the receiver does not know.
We must instead use the expected quantity
 s^{2}
= 12/(E_{x} f(xθ).F(x,θ))
= 12/F(θ)
as x ranges over X,
i.e the Fisher information;
both transmitter and receiver can evaluate this.
 msgLen
=  ln h(θ)  ln f(xθ)
+ (1/2) ln(F θ)
 (1/2) ln 12
+ (1/2) F(x,θ) / F(θ)
Finally,
"what is usually done is to replace the last term [...] by 1/2"
( Farr 1999 p.41) to give an approximation
which is reasonable provided that
F(x,θ)F(θ) is small over (θs/2, θ+s/2).

~  ln h(θ)  ln f(xθ)
+ (1/2) ln(F θ)
 (1/2) ln 12
+ 1/2
A number of simplifying assumptions have been made along the way;
beware if their preconditions do not hold!
The simplifications lead to more tractable mathematics.
Multiple Parameters
With multiple parameters, or equivalently a vector of parameters
θ = <θ_{1}, ..., θ_{n}>,
the sensitivity of the likelihood is indicated by
the second partial derivatives (Wallace & Freeman 1987).
 θ = <θ_{1}, ..., θ_{n}>
 F(x, θ)_{ij} =
d^{2}/d θ_{i} θ_{j} {  ln f(xθ) }
 F(θ) = ∑_{x:X} f(xθ).F(x,θ)
We have two n*n matrices,
F(x,θ)_{ij} and F(θ)_{ij}.
The Fisher information is now defined to be
the determinant of F(θ).
The message length is
 msgLen =
 ln(h θ)
 ln f(xθ)
+ (1/2) ln(F θ)
+ (n/2) (1 + ln k_{n}) nits

 model  datamodel 
=

 ln(h θ) + (1/2) ln(F θ) + (n/2) ln k_{n}

 ln f(xθ) + n/2

where the k_{n} are lattice constants
to do with partitioning the ndimensional parameter space,
k_{1} = 1 / 12 = 0.0833...,
k_{2} = 5 / (36.√3),
k_{3} = 19 / (192 . 2^{1/3}),
and
k_{n} → 1/(2 π e) = 0.0585498
as n → ∞ (Farr 1999 p.43).
Strict MML, SMML
Note that
[Strict MML] (SMML)
(Wallace & Boulton 1975, Farr 1999 p.49)
does not make the simplifying approximations of MML,
however the mathematical and algorithmic consequences can be severe
(Farr & Wallace 1997).
Notes
 The MML derivations above generalise the particular forms for the
binomial,
multinomial and
normal distributions,
which were first given by Wallace and Boulton (1968),
to other distributions such as
Student's tdistribution.

 This material is based on talks given by C. S. Wallace c1988,
on Wallace & Freeman (1987), R. Baxter's PhD thesis (1996),
and G. Farr (1999).
 C. S. Wallace & D. M. Boulton.
An Invariant Bayes Method for Point Estimation.
Classification Soc. Bulletin, 3, pp.1134, 1975.
 C. S. Wallace & P. R. Freeman.
Estimation and Inference by Compact Coding.
J. Royal Stat. Soc., 49(3), pp.240265, 1987,
[paper].
 R. Baxter.
Minimum Message Length Inductive Inference  Theory and Application.
PhD thesis, Dept. Computer Science, Monash University, Dec. 1996.
 G. Farr & C. S. Wallace.
The Complexity of Strict Minimum Message Length Inference.
TR97/321, Department of Computer Science, Monash University, Aug 1997.
 G. Farr. Information Theory and MML Inference.
School of Computer Science and Software Engineering, 1999.

