Minimum Message Length (MML)

Minimum message length (MML) inference was devised by Chris Wallace and David Boulton c. 1968 [WB68] and developed by Chris Wallace [Wal05] and many colleagues. MML is a Bayesian method of inference:

Bayes's theorem: pr(H&D) = pr(H).pr(D|H) = pr(D).pr(H|D); pr(H|D) = pr(H&D) / pr(D) ∝ pr(H).pr(D|H)
Shannon: msgLen(E) = I(E) = -log₂(pr(E)) bits
so: msgLen(H&D) = msgLen(H) + msgLen(D|H) = msgLen(D) + msgLen(H|D)

for hypothesis H, data D, event E. MML is a practical realisation [All05, All18] of Ockham's razor. Some have assumed that MML is the same as maximum a posteriori (MAP) inference, but in general it is not. Unlike the later minimum description length (MDL) principle, MML favours explicit priors and fully parameterized models.
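
As a small illustration of the message-length identities, the following Python sketch compares two hypotheses about a coin by their two-part message lengths. The priors, the head probabilities and the data are made up for the example, and the names msg_len, likelihood and two_part are hypothetical helpers, not part of any MML library.

```python
import math

def msg_len(pr: float) -> float:
    """Shannon message length, in bits, of an event with probability pr."""
    return -math.log2(pr)

# Assumed, illustrative numbers: two hypotheses about a coin.
prior = {"fair": 0.9, "biased": 0.1}    # pr(H)
p_head = {"fair": 0.5, "biased": 0.8}   # pr(head | H)
data = "HHTHHHHH"                       # made-up data: 7 heads, 1 tail

def likelihood(h: str, d: str) -> float:
    """pr(D|H) for a sequence of independent tosses."""
    p = p_head[h]
    return math.prod(p if toss == "H" else 1.0 - p for toss in d)

# Two-part message: first state H, then state D given H.
two_part = {h: msg_len(prior[h]) + msg_len(likelihood(h, data)) for h in prior}
best = min(two_part, key=two_part.get)

# The same joint length, split the other way: msgLen(D) + msgLen(H|D).
pr_data = sum(prior[h] * likelihood(h, data) for h in prior)
posterior = {h: prior[h] * likelihood(h, data) / pr_data for h in prior}

for h in prior:
    print(f"{h:7s} msgLen(H)+msgLen(D|H) = {two_part[h]:.3f} bits, "
          f"msgLen(D)+msgLen(H|D) = {msg_len(pr_data) + msg_len(posterior[h]):.3f} bits")
print("shortest two-part message:", best)
```

With only two discrete hypotheses, the shortest two-part message simply picks the hypothesis of highest posterior probability; the difference between MML and MAP shows up once continuous parameters have to be stated to some precision.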

Key points are that every continuous (real, floating point) variable has some limited measurement accuracy and that every continuous parameter has some optimal limited precision to which it should be inferred and stated. A consequence is that even continuous data and continuous parameters have non-zero probabilities (and hence finite message lengths), not just probability densities, and therefore Bayes's theorem still applies as is. Interestingly, there are many cases where even a discrete parameter must be estimated to less precision than its discreteness would seem to allow.
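
The point about measurement accuracy can be sketched numerically. Assuming a Normal model and a datum x recorded to accuracy eps (both chosen arbitrarily here), the datum stands for the interval [x - eps/2, x + eps/2], which has a genuine non-zero probability of roughly density(x) times eps, and hence a finite message length. The function names below are hypothetical.

```python
import math

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Normal probability density at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Normal cumulative distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def datum_probability(x: float, eps: float, mu: float, sigma: float) -> float:
    """Probability of the interval of width eps centred on x."""
    return normal_cdf(x + eps / 2, mu, sigma) - normal_cdf(x - eps / 2, mu, sigma)

def msg_len_datum(x: float, eps: float, mu: float, sigma: float) -> float:
    """Message length, in bits, of a datum x measured to accuracy eps."""
    return -math.log2(datum_probability(x, eps, mu, sigma))

# Assumed, illustrative values.
x, eps, mu, sigma = 1.3, 0.1, 0.0, 1.0

exact = datum_probability(x, eps, mu, sigma)
approx = normal_pdf(x, mu, sigma) * eps   # density times measurement accuracy

print(f"interval probability = {exact:.6f}, density * eps = {approx:.6f}")
print(f"message length at eps   = {msg_len_datum(x, eps, mu, sigma):.3f} bits")
print(f"message length at eps/2 = {msg_len_datum(x, eps / 2, mu, sigma):.3f} bits")
```

Halving eps costs about one extra bit per datum, since the interval probability roughly halves. The sketch covers only the data side; the optimal precision for stating a parameter is a separate calculation that is not attempted here.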

Some statistical models that have been MML-ed include:

Discrete, ch.2 [All18]: Binomial, 2-state; Multinomial, k-state; Integer, Geometric, Poisson, Universal distributions, ch.3 [All18].
Continuous, ch.4 [All18]: Normal (Gaussian); Linear regression; von Mises - Fisher (vMF) and von Mises distributions, ch.9 [All18]; Student's t-Distribution.
Structured: Mixture models (clustering, classification), ch.7 [All18]; (Hidden) Markov models, PFSA; Classification- (decision-) trees and graphs etc., ch.8 [All18]; Regression- and Model-trees; Sequence segmentation; Megalithic stone circles!; Bayesian networks; Supervised learning; Unsupervised learning; Trees and Graphs, ch.11 [All18].

MML has theoretical support in the form of Kolmogorov complexity.

Strict MML (SMML) is a sort of MML "gold standard". Unfortunately, SMML is computationally intractable for all but simple problems; happily, accurate and feasible approximations to SMML exist [WF87].

References

[All05] L. Allison, 'Models for Machine Learning and Data Mining in Functional Programming', Journal of Functional Programming, 15(1), pp.15-32, doi:10.1017/S0956796804005301, January 2005.
[All18] L. Allison, 'Coding Ockham's Razor', Springer, doi:10.1007/978-3-319-76433-7, 2018.
[Wal05] C. S. Wallace, 'Statistical and Inductive Inference by Minimum Message Length', Springer, doi:10.1007/0-387-27656-4, 2005.
[WB68] C. S. Wallace & D. M. Boulton, 'An Information Measure for Classification', The Computer Journal, 11(2), pp.185-194, doi:10.1093/comjnl/11.2.185, August 1968.
[WF87] C. S. Wallace & P. R. Freeman, 'Estimation and inference by compact coding', Journal of the Royal Statistical Society, series B, 49(3), pp.240-265, jstor:2985992, 1987.