## Normal Distribution (2)

 In this page: maximum likelihood (ML), ML-estimators, MML, Fisher information, MML-estimators, measurement accuracy -- lloyd

### Maximum Likelihood

The negative log likelihood, $L$, for $n$ observations assumed to come from a normal distribution, $N_{\mu,\sigma}$, is:

$$L = -\log\left\{ \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) \right\}$$

$$= \frac{n}{2}\log(2\pi) + \frac{n}{2}\log(\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2$$
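As a quick numerical check, the closed form for L can be compared with a direct sum of the negative log densities. The following is a minimal sketch, not part of the original derivation: the function names and the sample data are made up for illustration, and logarithms are natural.

```python
import math

def neg_log_likelihood(xs, mu, sigma):
    """Closed form: (n/2)log(2 pi) + (n/2)log(sigma^2) + SUM (xi-mu)^2 / (2 sigma^2)."""
    n = len(xs)
    v = sigma * sigma
    ss = sum((x - mu) ** 2 for x in xs)
    return 0.5 * n * math.log(2 * math.pi) + 0.5 * n * math.log(v) + ss / (2 * v)

def neg_log_likelihood_direct(xs, mu, sigma):
    """Same quantity, summing -log of each normal density term."""
    total = 0.0
    for x in xs:
        density = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
        total -= math.log(density)
    return total

xs = [4.9, 5.3, 5.1, 4.7, 5.0]  # hypothetical observations
closed = neg_log_likelihood(xs, 5.0, 0.2)
direct = neg_log_likelihood_direct(xs, 5.0, 0.2)
print(closed, direct)  # the two values should agree to rounding error
```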

#### Maximum Likelihood estimator for mu

Differentiating with respect to $\mu$:

$$\frac{\partial L}{\partial \mu} = \frac{1}{2\sigma^2} \cdot \frac{\partial}{\partial \mu}\left\{ \sum_{i=1}^{n}(x_i-\mu)^2 \right\}$$

$$= \frac{n\mu - (x_1 + \dots + x_n)}{\sigma^2}$$

Setting this to zero gives the maximum likelihood estimator for $\mu$:

$$\mu_{ML} = \frac{x_1 + \dots + x_n}{n}$$

i.e. the (sample-) mean.

#### Maximum Likelihood estimator for the variance (& sigma)

Differentiating $L$ w.r.t. $v = \sigma^2$:

$$\frac{\partial L}{\partial v} = \frac{n}{2v} - \frac{1}{2v^2}\sum_{i=1}^{n}(x_i-\mu)^2$$

setting this to zero:

$$v_{ML} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu_{ML})^2$$

the maximum likelihood estimate for the variance $v = \sigma^2$.

Note that if n=1, the estimate is zero, and that if n=2 the estimate effectively assumes that the mean lies midway between $x_1$ and $x_2$, which is clearly not necessarily the case. That is, $v_{ML}$ is biased and underestimates the variance in general.
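The bias can be illustrated with a small simulation. This is a sketch under stated assumptions: the helper names, the true parameters (mean 0, variance 1), the seed, and the trial count are all chosen here for illustration. It relies on the standard result that the expected value of the maximum likelihood variance estimate is (n-1)/n times the true variance.

```python
import random

def ml_estimates(xs):
    """Maximum likelihood estimates (mean, variance with divisor n)."""
    n = len(xs)
    mu = sum(xs) / n
    v = sum((x - mu) ** 2 for x in xs) / n
    return mu, v

def mean_v_ml(n, trials=20000, true_mu=0.0, true_sigma=1.0, seed=1):
    """Average of v_ML over many simulated samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(true_mu, true_sigma) for _ in range(n)]
        total += ml_estimates(xs)[1]
    return total / trials

print(ml_estimates([3.0]))       # one observation: the variance estimate is 0
print(ml_estimates([1.0, 3.0]))  # two observations: the mean is taken to be midway

for n in (2, 5, 20):
    # The simulated average should be near (n-1)/n, below the true variance of 1.
    print(n, mean_v_ml(n), (n - 1) / n)
```

The shortfall is largest for small n and fades as n grows.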

### Minimum Message Length (MML)

Wallace and Boulton (1968) derived the uncertainty region for the normal distribution from first principles. Later it was seen to be a special case of a general form using the Fisher information.

#### Fisher Information

The off-diagonal term of the Fisher information is given by the expectation of:

$$\frac{\partial^2 L}{\partial \mu\,\partial v} = -\frac{n\mu - (x_1 + \dots + x_n)}{v^2}$$

and in expectation (i.e. on average), this is zero, because the expected value of each $x_i$ is $\mu$.

The second derivative of $L$ w.r.t. $\mu$ is:

$$\frac{\partial^2 L}{\partial \mu^2} = \frac{n}{v} = \frac{n}{\sigma^2}$$

The second derivative of $L$ w.r.t. $v$ is:

$$\frac{\partial^2 L}{\partial v^2} = -\frac{n}{2v^2} + \frac{1}{v^3}\sum_{i=1}^{n}(x_i-\mu)^2$$

and in expectation this is

$$-\frac{n}{2v^2} + \frac{n v}{v^3} = \frac{n}{2v^2} = \frac{n}{2\sigma^4}$$

The Fisher information, the determinant of the matrix of expected second derivatives, is therefore

$$F = \frac{n}{v} \cdot \frac{n}{2v^2} = \frac{n^2}{2v^3} = \frac{n^2}{2\sigma^6}$$

(Note, the above is with respect to $\mu$ and $v$. Now $v = \sigma^2$, so $dv/d\sigma = 2\sigma$.
To calculate the Fisher information with respect to $\mu$ and $\sigma$, the above must be multiplied by $(dv/d\sigma)^2 = 4\sigma^2$, which gives
$2n^2/\sigma^4$,
as can also be confirmed by forming $\partial L/\partial\sigma$ and $\partial^2 L/\partial\sigma^2$ directly. [--L.A. 1/12/2003])
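Both forms of the Fisher information can be checked numerically. The sketch below is an illustration, not the original author's method: it uses central finite differences, and the data and helper names are invented here. It relies on the fact that at mu equal to the sample mean and v equal to the divisor-n variance, the observed second derivatives coincide with their expectations, since there x1+...+xn = n.mu and SUM (xi-mu)^2 = n.v.

```python
import math

def nll(xs, mu, v):
    """Negative log likelihood of xs under a normal with mean mu and variance v."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return 0.5 * n * math.log(2 * math.pi) + 0.5 * n * math.log(v) + ss / (2 * v)

def hessian(f, a, b, h=1e-3):
    """Central finite-difference second derivatives of f(a, b): (f_aa, f_ab, f_bb)."""
    f_aa = (f(a + h, b) - 2 * f(a, b) + f(a - h, b)) / h ** 2
    f_bb = (f(a, b + h) - 2 * f(a, b) + f(a, b - h)) / h ** 2
    f_ab = (f(a + h, b + h) - f(a + h, b - h)
            - f(a - h, b + h) + f(a - h, b - h)) / (4 * h ** 2)
    return f_aa, f_ab, f_bb

xs = [1.0, 2.0, 4.0, 5.0, 8.0]  # hypothetical observations
n = len(xs)
mu_hat = sum(xs) / n
v_hat = sum((x - mu_hat) ** 2 for x in xs) / n
sigma_hat = math.sqrt(v_hat)

# Parameterised by (mu, v): determinant should match n^2 / (2 v^3).
h_mm, h_mv, h_vv = hessian(lambda m, v: nll(xs, m, v), mu_hat, v_hat)
det_mu_v = h_mm * h_vv - h_mv ** 2
print(det_mu_v, n ** 2 / (2 * v_hat ** 3))

# Parameterised by (mu, sigma): determinant should match 2 n^2 / sigma^4.
g_mm, g_ms, g_ss = hessian(lambda m, s: nll(xs, m, s * s), mu_hat, sigma_hat)
det_mu_sigma = g_mm * g_ss - g_ms ** 2
print(det_mu_sigma, 2 * n ** 2 / sigma_hat ** 4)
```

The ratio of the two determinants is (dv/d sigma)^2 = 4 sigma^2, the change-of-variable factor.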

#### Minimum Message Length Estimators

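$$\text{msgLen} = -\log h(\mu,v) + L + \tfrac{1}{2}\log F + \text{constant}$$

$$= \underbrace{-\log h(\mu,v)}_{h} + \underbrace{\frac{n}{2}\log(2\pi) + \frac{n}{2}\log v + \frac{1}{2v}\sum_{i=1}^{n}(x_i-\mu)^2}_{L} + \underbrace{\frac{1}{2}\log\frac{n^2}{2} - \frac{3}{2}\log v}_{F} + \text{constant}$$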

Differentiate w.r.t. $\mu$:

$$\frac{\partial\,\text{msgLen}}{\partial \mu} = -\frac{\partial}{\partial \mu}\log h(\mu,v) + \frac{n}{v}\left(\mu - \frac{x_1 + \dots + x_n}{n}\right)$$

and w.r.t. v:

$$\frac{\partial\,\text{msgLen}}{\partial v} = -\frac{\partial}{\partial v}\log h(\mu,v) + \frac{n-3}{2v} - \frac{1}{2v^2}\sum_{i=1}^{n}(x_i-\mu)^2$$

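If the prior is $h(\mu,v) \propto 1/v$ (improper), then $\partial h/\partial\mu = 0$ and

$$\mu_{MML} = \frac{x_1 + \dots + x_n}{n} = \mu_{ML}$$

With such a prior, $\partial h/\partial v \propto -1/v^2$, so $-\frac{\partial}{\partial v}\log h = -\frac{1}{h}\frac{\partial h}{\partial v} = \frac{1}{v}$ and

$$\frac{\partial\,\text{msgLen}}{\partial v} = \frac{1}{v} + \frac{n-3}{2v} - \frac{1}{2v^2}\sum_{i=1}^{n}(x_i-\mu)^2$$

$$= \frac{n-1}{2v} - \frac{1}{2v^2}\sum_{i=1}^{n}(x_i-\mu)^2$$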

Setting this to zero:

$$v_{MML} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\mu)^2$$

A divisor of (n-1), rather than n, is also the "well known" correction for the bias in $v_{ML}$. There it is an ad-hoc adjustment, whereas here it is derived in a justified way for MML.
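The estimators can be checked against the message length itself. The following sketch assumes the improper prior h(mu,v) = 1/v, taken unnormalised, and drops the additive constant. The function names and data are illustrative only, and at least two observations are needed for the (n-1) divisor.

```python
import math

def msg_len(xs, mu, v):
    """Message length up to an additive constant, with prior h(mu, v) = 1/v."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    neg_log_prior = math.log(v)  # -log(1/v)
    neg_log_lik = 0.5 * n * math.log(2 * math.pi) + 0.5 * n * math.log(v) + ss / (2 * v)
    half_log_fisher = 0.5 * math.log(n * n / (2 * v ** 3))  # F = n^2 / (2 v^3)
    return neg_log_prior + neg_log_lik + half_log_fisher

def mml_estimates(xs):
    """MML estimates under the 1/v prior: sample mean and divisor-(n-1) variance."""
    n = len(xs)
    mu = sum(xs) / n
    v = sum((x - mu) ** 2 for x in xs) / (n - 1)
    return mu, v

xs = [1.0, 2.0, 4.0, 5.0, 8.0]  # hypothetical observations
mu_mml, v_mml = mml_estimates(xs)
v_ml = v_mml * (len(xs) - 1) / len(xs)  # divisor-n estimate, for comparison

print(mu_mml, v_mml, msg_len(xs, mu_mml, v_mml))
print(mu_mml, v_ml, msg_len(xs, mu_mml, v_ml))  # slightly longer message
```

Perturbing v away from the divisor-(n-1) value in either direction lengthens the message, consistent with the stationary point derived above.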

#### Measurement Accuracy

In the case of continuous distributions, such as $N_{\mu,\sigma}$, the likelihood function is a probability density function. To turn it into a genuine probability, it must be multiplied by the measurement accuracy. For example, if observations are measured to two decimal places, then the probability of an observation $x = x_0.x_1x_2 \pm 0.005$ is $N_{\mu,\sigma}(x) \times 0.01$. Assuming $\sigma \gg 0.01$, the measurement accuracy, if it is included, "passes through" the calculations above untouched and does not affect the estimators. It does, however, affect the overall message length, adding $-\log(0.01)$ for each of the $n$ observations.
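A small sketch shows why the accuracy term leaves the estimators alone. The name `eps` for the measurement accuracy, the candidate parameter values, and the data are assumptions made here for illustration. The approximation of density times `eps` is only reasonable when sigma is much larger than `eps`.

```python
import math

def nll_density(xs, mu, v):
    """Negative log of the product of normal densities."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return 0.5 * n * math.log(2 * math.pi) + 0.5 * n * math.log(v) + ss / (2 * v)

def nll_probability(xs, mu, v, eps):
    """Negative log probability when each x is known to within +/- eps/2.

    Approximates each probability as density * eps, so it adds -n.log(eps),
    a term that does not depend on mu or v.
    """
    return nll_density(xs, mu, v) - len(xs) * math.log(eps)

xs = [4.91, 5.32, 5.08, 4.74, 5.01]  # hypothetical data, two decimal places
candidates = [(5.0, 0.04), (5.0, 0.05), (5.1, 0.04)]  # (mu, v) pairs to compare

def best_candidate(eps):
    """Candidate (mu, v) giving the shortest cost for a given accuracy."""
    return min(candidates, key=lambda p: nll_probability(xs, p[0], p[1], eps))

for eps in (0.01, 0.1):
    mu, v = best_candidate(eps)
    print(eps, (mu, v), nll_probability(xs, mu, v, eps))
```

The preferred candidate is the same for both accuracies. Only the total changes, by n.log(10) between `eps = 0.1` and `eps = 0.01`.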

### MML v. SMML

MML is an approximation to strict minimum message length (SMML) inference. As cautioned elsewhere, if MML's simplifying assumptions (i.e. h(params) nearly constant over uncertainty region & likelihood function nearly constant over uncertainty region and over measurement accuracy) do not hold then either more accurate approximations should be used or the above equations must only be used with reservations. This is simply a matter of common sense.

### Notes

• C. S. Wallace & D. M. Boulton. An Information Measure for Classification. The Computer Journal 11(2) pp.185-194, August 1968.
• See also the Special Issue on Clustering and Classification, The Computer Journal, F. Murtagh (ed), 41(8), 1998.