
In this page:
maximum likelihood (ML), ML estimators,
MML, Fisher information, MML estimators,
measurement accuracy

Maximum Likelihood
The negative log likelihood, L, for n observations assumed to come from
a normal distribution, N_{mu,sigma}, is:

L = -log{ PROD_{i=1..n} (1/(sqrt(2 pi).sigma)).exp(-(x_{i}-mu)^{2}/(2.sigma^{2})) }

  = (n/2).log(2 pi) + (n/2).log(sigma^{2}) + (1/(2.sigma^{2})).SUM_{i=1..n} (x_{i}-mu)^{2}
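As a quick numerical check, the product form and the expanded form of L can be compared on a small sample. The data and parameter values below are made up purely for illustration:

```python
import math

# Hypothetical sample and parameter values, chosen only for illustration.
xs = [1.2, 0.7, 1.9, 1.1, 0.4]
mu, sigma = 1.0, 0.8
n, v = len(xs), sigma ** 2

# L as the negative log of the product of normal densities.
L_direct = -sum(
    math.log((1.0 / (math.sqrt(2 * math.pi) * sigma))
             * math.exp(-(x - mu) ** 2 / (2 * v)))
    for x in xs)

# L via the expanded form: (n/2)log(2 pi) + (n/2)log(v) + SUM(x_i - mu)^2 / (2v).
L_expanded = ((n / 2) * math.log(2 * math.pi)
              + (n / 2) * math.log(v)
              + sum((x - mu) ** 2 for x in xs) / (2 * v))
```

The two values agree to within floating-point rounding.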

Maximum Likelihood estimator for mu
Differentiating with respect to mu:

d L / d mu = (1/(2.sigma^{2})).(d/d mu){ SUM_{i=1..n} (x_{i}-mu)^{2} }

           = (n.mu - (x_{1}+ ... +x_{n})) / sigma^{2}

Setting this to zero gives the
maximum likelihood estimator for mu
mu_{ML} = (x_{1}+ ... +x_{n})/n

i.e. the (sample) mean.
Maximum Likelihood estimator for the variance (& sigma)
Differentiating L w.r.t. v = sigma^{2}:

d L / d v = n/(2.v) - (1/(2.v^{2})).SUM_{i=1..n} (x_{i}-mu)^{2}

setting this to zero:

v_{ML} = {SUM_{i=1..n} (x_{i}-mu_{ML})^{2}} / n

the maximum likelihood estimate
for the variance v = sigma^{2}.
Note that if n=1, the estimate is zero, and
that if n=2 the estimate effectively assumes that the mean lies
midway between x_{1} and x_{2}, which is clearly not necessarily
the case, i.e. v_{ML} is
biased and underestimates the variance in general.
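The underestimation is easy to see in a small simulation: for samples of size n=2 from a standard normal, the ML variance estimate averages about (n-1)/n = 1/2 of the true variance. The sample size, trial count, and seed below are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(0)
mu_true, sigma_true = 0.0, 1.0   # true variance is 1.0
n, trials = 2, 100_000

# Average the ML variance estimate over many samples of size n.
v_ml_estimates = []
for _ in range(trials):
    xs = [random.gauss(mu_true, sigma_true) for _ in range(n)]
    mean = sum(xs) / n
    v_ml_estimates.append(sum((x - mean) ** 2 for x in xs) / n)

avg_v_ml = statistics.fmean(v_ml_estimates)
# Theory: E[v_ML] = ((n-1)/n) * sigma^2 = 0.5 here, well below the true 1.0.
```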
Minimum Message Length (MML)
Wallace and Boulton (1968) derived the uncertainty region for the
[normal distribution]
from first principles.
Later it was seen to be a special case of a general form using the
[Fisher]
information.
Fisher Information
The off-diagonal term of the
Fisher information is given by the expectation of:

d^{2}L / (d mu d v) = -(n.mu - (x_{1}+ ... +x_{n})) / v^{2}

and in expectation (i.e. on average),
this is zero.
The second derivative of L w.r.t. mu is:

d^{2}L / d mu^{2} = n/v = n/sigma^{2}

The second derivative of L w.r.t. v is:

d^{2}L / d v^{2} = -n/(2.v^{2}) + (1/v^{3}).SUM_{i=1..n} (x_{i}-mu)^{2}

and in expectation this is

-n/(2.v^{2}) + n.v/v^{3} = n/(2.v^{2}) = n/(2.sigma^{4})

The Fisher information is therefore

F = (n/v).(n/(2.v^{2})) = n^{2}/(2.v^{3}) = n^{2}/(2.sigma^{6})

(Note, the above is with respect to mu and v.
Now v = sigma^{2}, so
d v / d sigma = 2.sigma.
To calculate the Fisher information with respect to
mu and sigma, the above
must be multiplied by
(d v / d sigma)^{2} = 4.sigma^{2},
which gives

2.n^{2}/sigma^{4}

as can also be confirmed by forming
d L / d sigma and
d^{2} L / d sigma^{2}
directly. [L.A. 1/12/2003])
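The analytic second derivatives of L can be sanity-checked by central finite differences. The data and parameter values below are hypothetical, chosen only for the check:

```python
import math

def neg_log_lik(xs, mu, v):
    """Negative log likelihood L of N(mu, sqrt(v)) for data xs."""
    n = len(xs)
    return ((n / 2) * math.log(2 * math.pi)
            + (n / 2) * math.log(v)
            + sum((x - mu) ** 2 for x in xs) / (2 * v))

xs = [1.2, 0.7, 1.9, 1.1, 0.4]   # hypothetical data
mu, v = 1.0, 0.64
n, step = len(xs), 1e-4

# Central finite differences for the second partial derivatives of L.
d2_mu = (neg_log_lik(xs, mu + step, v) - 2 * neg_log_lik(xs, mu, v)
         + neg_log_lik(xs, mu - step, v)) / step ** 2
d2_v = (neg_log_lik(xs, mu, v + step) - 2 * neg_log_lik(xs, mu, v)
        + neg_log_lik(xs, mu, v - step)) / step ** 2

# Analytic forms from the text (before taking expectations).
d2_mu_exact = n / v
d2_v_exact = -n / (2 * v ** 2) + sum((x - mu) ** 2 for x in xs) / v ** 3
```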
Minimum Message Length Estimators
msgLen = -log(h(mu,v)) + L + (1/2).log(F) + constant

       = -log(h(mu,v))                                                         ...h
         + (n/2).log(2 pi) + (n/2).log(v) + (1/(2.v)).SUM_{i=1..n} (x_{i}-mu)^{2}  ...L
         + (1/2).log(n^{2}/2) - (3/2).log(v)                                   ...F
         + constant

Differentiate w.r.t. mu:

d msgLen / d mu = -(d/d mu)(log h(mu,v)) + (n/v).(mu - (x_{1}+ ... +x_{n})/n)

and w.r.t. v:

d msgLen / d v = -(d/d v)(log h(mu,v)) + (n-3)/(2.v) - (1/(2.v^{2})).SUM_{i=1..n} (x_{i}-mu)^{2}

If the prior is
h(mu,v) ~ 1/v (improper), then
d h / d mu = 0
and

mu_{MML} = (x_{1}+ ... +x_{n})/n = mu_{ML}

With such a prior,
log h(mu,v) = -log(v) + constant, so
-(d/d v)(log h) = 1/v,
and

d msgLen / d v = 1/v + (n-3)/(2.v) - (1/(2.v^{2})).SUM_{i=1..n} (x_{i}-mu)^{2}

               = (n-1)/(2.v) - (1/(2.v^{2})).SUM_{i=1..n} (x_{i}-mu)^{2}

Set to zero:

v_{MML} = {SUM_{i=1..n} (x_{i}-mu_{MML})^{2}} / (n-1)

This use of a divisor of (n-1), rather than n,
is also a "well known" but (there) ad hoc
correction for the bias in v_{ML};
here, however, it is derived in a justified way for MML.
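A minimal sketch comparing the two variance estimates on made-up data; v_{MML} coincides with the usual unbiased sample variance (Python's statistics.variance, which divides by n-1):

```python
import statistics

xs = [1.2, 0.7, 1.9, 1.1, 0.4]   # hypothetical observations
n = len(xs)
mu_mml = sum(xs) / n             # same as mu_ML: the sample mean

ss = sum((x - mu_mml) ** 2 for x in xs)
v_ml = ss / n                    # ML estimate: divisor n
v_mml = ss / (n - 1)             # MML estimate: divisor n-1

# v_MML equals the standard unbiased sample variance, and is
# always larger than v_ML (for n > 1).
```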
Measurement Accuracy
In the case of continuous distributions, such as N_{mu,sigma},
the likelihood function is a probability density function.
To turn it into a genuine probability, it must be
multiplied by the measurement accuracy.
e.g. If observations are measured to two decimal places, say,
then the probability of an observation
x = x_{0}.x_{1}x_{2} +/- 0.005
is N_{mu,sigma}(x) * 0.01.
Assuming sigma>>0.01,
it can be seen that, if it is included,
this measurement accuracy "passes through"
the calculations above untouched, not affecting the estimators.
It does however affect the overall message length.
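A small sketch of this point, with a hypothetical measurement accuracy of 0.01: the factor multiplies the likelihood of each observation by a constant, so it adds -n.log(0.01) to L without moving the estimators. The observation and parameter values are made up:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x."""
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
            / (math.sqrt(2 * math.pi) * sigma))

eps = 0.01                     # measurement accuracy: two decimal places
x, mu, sigma = 1.23, 1.0, 0.8  # hypothetical values; sigma >> eps

# Genuine probability of observing x +/- eps/2, for sigma >> eps.
prob = normal_pdf(x, mu, sigma) * eps

# Over n observations this multiplies the likelihood by eps^n,
# i.e. it adds the constant -n*log(eps) to L; the constant does not
# depend on mu or v, so the estimators are unchanged, but the
# overall message length is affected.
```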
MML v. SMML
MML is an approximation
to strict minimum message length (SMML) inference.
As cautioned elsewhere, if MML's simplifying assumptions
(i.e. h(params) nearly constant
over uncertainty region &
likelihood function nearly constant over uncertainty region and
over measurement accuracy)
do not hold then either more accurate approximations should be used or
the above equations must only be used with reservations.
This is simply a matter of common sense.
Notes
- C. S. Wallace & D. M. Boulton.
  An Information Measure for Classification.
  The Computer Journal 11(2), pp.185-194,
  August 1968.
- See also the Special Issue on Clustering and Classification,
  The Computer Journal,
  F. Murtagh (ed), 41(8), 1998.

