Multistate and Multinomial Distributions
The MML estimator for an M-state distribution gives θ_{i} = (n_{i}+1/2)/(N+M/2), where n_{i} is the number of observations of state_{i} during a total of N observations, N = ∑_{i=1..M} n_{i}.
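As a quick illustration (a sketch, not code from this page; the function name `mml_estimate` is assumed), the estimator can be computed directly from the state counts:

```python
def mml_estimate(counts):
    """MML probability estimates for an M-state distribution:
    theta_i = (n_i + 1/2) / (N + M/2)."""
    N = sum(counts)
    M = len(counts)
    return [(n + 0.5) / (N + M / 2) for n in counts]

# Counts (3, 1, 0) over N = 4 trials, M = 3 states:
est = mml_estimate([3, 1, 0])
print(est)  # [3.5/5.5, 1.5/5.5, 0.5/5.5] -- note no state gets probability zero
```

Unlike the maximum-likelihood estimate n_{i}/N, the MML estimate never assigns probability zero to an unobserved state.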
In general, the uncertainty region for the MML estimate of k parameters θ = ⟨θ_{1},θ_{2}, ...,θ_{k}⟩ has volume approximately √(12^{k}/F(θ)), where F(θ) is the [Fisher information]. Note that k = M−1 for the multistate distribution.
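For the simplest case, a 2-state source (k = 1), the M-state Fisher formula derived below gives F(θ) = N/(θ(1−θ)), so the uncertainty region width is easy to compute. A small sketch (the function name is assumed, not from the page):

```python
import math

def uncertainty_width(theta, N):
    """Width of the MML uncertainty region for a 2-state source (k = 1),
    using F(theta) = N / (theta * (1 - theta)) and width = sqrt(12 / F)."""
    F = N / (theta * (1 - theta))
    return math.sqrt(12 / F)

print(uncertainty_width(0.5, 100))  # sqrt(12 * 0.25 / 100) = sqrt(0.03) ~ 0.173
```

The region shrinks as 1/√N, and is widest at θ = 1/2 where the data are least informative per observation.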
3 States, M = 3
A 3-state source has two parameters, θ_{1} and θ_{2} in [0,1]; i.e., θ = ⟨θ_{1},θ_{2}⟩. It is convenient to define θ_{3}, also in [0,1], where θ_{3} = 1−θ_{1}−θ_{2}, but θ_{3} is not a third (free) parameter. We observe n_{1} occurrences of state_{1}, n_{2} of state_{2} and n_{3} of state_{3}, where N = n_{1}+n_{2}+n_{3}. The likelihood is LH = θ_{1}^{n1}·θ_{2}^{n2}·θ_{3}^{n3}, so ...
 − log LH = − n_{1} log θ_{1} − n_{2} log θ_{2} − n_{3} log(1−θ_{1}−θ_{2})
 − d/dθ_{1} {log LH} = − n_{1}/θ_{1} + n_{3}/(1−θ_{1}−θ_{2})
 − d/dθ_{2} {log LH} = − n_{2}/θ_{2} + n_{3}/(1−θ_{1}−θ_{2})
 − d^{2}/dθ_{1}^{2} {log LH} = n_{1}/θ_{1}^{2} + n_{3}/(1−θ_{1}−θ_{2})^{2}
 − d^{2}/dθ_{2}^{2} {log LH} = n_{2}/θ_{2}^{2} + n_{3}/(1−θ_{1}−θ_{2})^{2}
 − d^{2}/(dθ_{1} dθ_{2}) {log LH} = n_{3}/(1−θ_{1}−θ_{2})^{2} = − d^{2}/(dθ_{2} dθ_{1}) {log LH}
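As a sanity check (a sketch, not from the page), the second derivatives above can be verified numerically with a central finite difference on the negative log-likelihood:

```python
import math

def neg_log_lh(t1, t2, n1, n2, n3):
    """- log LH for a 3-state source with theta_3 = 1 - theta_1 - theta_2."""
    return -(n1 * math.log(t1) + n2 * math.log(t2)
             + n3 * math.log(1 - t1 - t2))

# Central difference for -d^2/dtheta_1^2 {log LH} at (0.2, 0.3), counts (2, 3, 5):
n1, n2, n3 = 2, 3, 5
t1, t2, h = 0.2, 0.3, 1e-5
numeric = (neg_log_lh(t1 + h, t2, n1, n2, n3)
           - 2 * neg_log_lh(t1, t2, n1, n2, n3)
           + neg_log_lh(t1 - h, t2, n1, n2, n3)) / h**2
analytic = n1 / t1**2 + n3 / (1 - t1 - t2)**2   # = 2/0.04 + 5/0.25 = 70
print(numeric, analytic)
```

The two values agree to several decimal places, as expected.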
The expectation of n_{1} over the data space is N·θ_{1}, and similarly for n_{2} and n_{3}, so the Fisher information is ...
F(θ)
 = | N/θ_{1}+N/θ_{3}    N/θ_{3}          |
   | N/θ_{3}            N/θ_{2}+N/θ_{3}  |

 = (N^{2}/θ_{3}^{2}) · | (1−θ_{2})/θ_{1}    1                |
                       | 1                  (1−θ_{1})/θ_{2}  |

 = (N^{2}/θ_{3}^{2}) · { (1−θ_{1})(1−θ_{2})/(θ_{1}θ_{2}) − 1 }

 = (N^{2}/θ_{3}^{2}) · (1−θ_{1}−θ_{2})/(θ_{1}θ_{2})

 = N^{2}/(θ_{1}·θ_{2}·θ_{3})
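The determinant can be checked numerically (a sketch, not from the page) by building the 2×2 expected Fisher matrix directly from its entries:

```python
# Expected Fisher information matrix for the 3-state case, checked against
# the closed form F = N^2 / (theta_1 * theta_2 * theta_3).
N = 100.0
t1, t2 = 0.2, 0.3
t3 = 1 - t1 - t2

a = N / t1 + N / t3   # entry (1,1): -E[d^2/dtheta_1^2 log LH]
d = N / t2 + N / t3   # entry (2,2): -E[d^2/dtheta_2^2 log LH]
b = N / t3            # off-diagonal cross term
det = a * d - b * b
closed_form = N**2 / (t1 * t2 * t3)
print(det, closed_form)  # both ~ 333333.33
```

With N = 100 and θ = (0.2, 0.3, 0.5), both expressions give 100²/0.03 ≈ 333333.33.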
M States
It can be shown that for M states, i.e., M−1 parameters, and probabilities θ_{1}, θ_{2}, ..., θ_{M−1}, and θ_{M} = 1−θ_{1}−...−θ_{M−1}, θ = ⟨θ_{1},...,θ_{M−1}⟩, that F(θ) = N^{M−1} / (θ_{1}·θ_{2}...θ_{M}).
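This general claim can also be checked numerically (a sketch, not from the page): the (M−1)×(M−1) expected Fisher matrix has diagonal entries N/θ_{i} + N/θ_{M} and off-diagonal entries N/θ_{M}, and its determinant should equal N^{M−1}/(θ_{1}...θ_{M}).

```python
def det(m):
    """Determinant by Gaussian elimination (matrix here is positive
    definite, so no pivoting is needed)."""
    m = [row[:] for row in m]
    n = len(m)
    d = 1.0
    for i in range(n):
        d *= m[i][i]
        for j in range(i + 1, n):
            f = m[j][i] / m[i][i]
            for k in range(i, n):
                m[j][k] -= f * m[i][k]
    return d

N = 50.0
theta = [0.1, 0.2, 0.3, 0.4]   # M = 4 states
M = len(theta)
# Expected Fisher matrix: N/theta_M everywhere, plus N/theta_i on the diagonal.
F = [[N / theta[-1] + (N / theta[i] if i == j else 0.0)
      for j in range(M - 1)] for i in range(M - 1)]
prod = 1.0
for t in theta:
    prod *= t
print(det(F), N**(M - 1) / prod)  # both ~ 52083333.3
```

The agreement holds for any M, since the matrix is a diagonal matrix plus a rank-one term, whose determinant collapses to the stated closed form.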
Demonstration
Use the HTML FORM below to generate a data sample for specified probabilities and length N. The 'code' button calculates message lengths for various codes. Note that the approximations used may break down for very small values of N.
 Requirements: 1 ≤ M ≤ 10, θ_{[1,M]} > 0 (will be normalised), N ≥ 0, and sample ∈ [0, M−1]^{N}.
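A stand-alone equivalent of the form's sampling step might look like the following sketch (the function name and seed handling are assumptions, not the page's own code):

```python
import random

def multistate_sample(weights, N, seed=0):
    """Draw a sample of length N over states 0..M-1 from positive
    (unnormalised) weights, which are normalised first as the
    requirements state."""
    total = sum(weights)
    theta = [w / total for w in weights]
    rng = random.Random(seed)
    return rng.choices(range(len(theta)), weights=theta, k=N)

sample = multistate_sample([2, 1, 1], 20)
print(sample)  # 20 values, each in [0, M-1]
```

The resulting counts n_{i} can then be fed to the MML estimator and message-length calculations described above.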
Notes
 C. S. Wallace & D. M. Boulton. An Information Measure for Classification. Computer Journal, 11(2), pp.185–194, Aug 1968 (see the appendix), and ...
 D. M. Boulton & C. S. Wallace. The Information Content of a Multistate Distribution. J. Theor. Biol., 23, pp.269–278, 1969.
When these papers were written, different notions of the information content of a sequence were in use in the literature. W&B showed that, if the calculations are done correctly and all information is truly taken into account, these notions give essentially the same answer.
 And a delightful piece of trivia about dice, via Dean McKenzie [7/1999]:
 'Several decades ago, the Harvard statistician Frederick Mosteller had an opportunity to test the [dice-tossing] model against the behavior of real dice tossed by a real person. A man named Willard H. Longcor, who had an obsession with throwing dice, came to him with an amazing offer to record the results of millions of tosses. Mosteller accepted, and some time later he received a large crate of big manila envelopes, each of which contained the results of twenty thousand tosses with a single die and a written summary showing how many runs of different kinds had occurred. "The only way to check the work was by checking the runs and then comparing the results with theory," Mosteller recalls. "It turned out [Longcor] was very accurate." Indeed, the results even highlighted some errors in the then-standard theory of the distribution of runs.'
 Peterson, I. (1998) The Jungles of Randomness. Penguin, London. pp.7–8. (originally published 1998 by Wiley, New York).