
The reasons why
minimum message length (MML) inference
works are quite elementary and were long
hidden in plain sight^{(a)},
so it is surprising that it was not in use before
1968,
and more surprising still that there is any debate about it.
 Even a continuous attribute (variable) of a
datum can only be measured to some limited
accuracy, ±ε/2, ε > 0,
so
 every datum that is possible under a model has a probability
that is strictly > 0,
not just a probability density (pdf).
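As a concrete illustration of the point above, under a Normal model a datum recorded to accuracy ±ε/2 has probability approximately ε times the pdf at that point, and strictly greater than zero. (A sketch only; the model N(0,1), the value of ε, and the datum x are invented for the example.)

```python
# Sketch: a "continuous" datum measured to +/- eps/2 has a genuine
# probability, ~ eps * pdf(x), strictly > 0.  Assumptions: model N(0,1),
# eps and x chosen here purely for illustration.
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

eps = 0.01        # measurement accuracy of the instrument
x = 1.234         # a recorded datum, really x +/- eps/2

exact = normal_cdf(x + eps / 2) - normal_cdf(x - eps / 2)  # true probability
approx = eps * normal_pdf(x)                               # the usual shortcut
print(exact, approx)  # both strictly > 0, and very close for small eps
```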
 Any continuous parameter of a model (theory, hypothesis, ...)
can only be inferred (estimated) to some limited
precision^{(b)}, ±δ/2,
δ > 0, so
 every parameter estimate that is possible under a prior
has a probability that is strictly > 0,
not just a probability density.
 So,
Bayes's
theorem can always be used^{(c)}, as is,
 pr(H&D)
= pr(H).pr(D|H)
= pr(D).pr(H|D),
for hypothesis H and datum D.
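A tiny numeric check of the identity, using a made-up discrete example (two hypotheses, one datum d; all the numbers are invented):

```python
# Check pr(H&D) = pr(H).pr(D|H) = pr(D).pr(H|D) on a toy discrete example.
pr_H = {"H1": 0.7, "H2": 0.3}                          # prior over hypotheses
pr_D_given_H = {"H1": 0.2, "H2": 0.9}                  # likelihood of datum d

pr_HandD = {h: pr_H[h] * pr_D_given_H[h] for h in pr_H}  # pr(H).pr(D|H)
pr_D = sum(pr_HandD.values())                            # marginal pr(D)
pr_H_given_D = {h: pr_HandD[h] / pr_D for h in pr_H}     # posterior pr(H|D)

for h in pr_H:  # both factorisations give the same joint probability
    assert abs(pr_HandD[h] - pr_D * pr_H_given_D[h]) < 1e-12
print(pr_H_given_D)
```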
However, that is not to say that making MML work in
useful applications is easy; in fact it can be quite
difficult.
After the self-evident observations above, a lot of hard work on
efficient encodings, search algorithms, code books, invariance,
Fisher information, fast approximations, robust heuristics,
adaptations to specific problems,
and all the rest, remained to be done.
Fortunately, MML has been made to work in many
general and useful applications including, but not limited to, those in these lists,
[1],
[2] &
[3], and
in other areas such as
bioinformatics [4],
say.
 BTW, given Bayes,
 pr(H&D) = pr(H).pr(D|H)
= pr(D).pr(H|D),
and hence
 pr(H|D) ∝ pr(H).pr(D|H),
it is sometimes claimed that MML inference is just MAP inference but,
in general, this is not the case.
MML requires that one set not just the "height", pdf(H),
but also choose the set of
distinguishable^{(d)} hypotheses
{H_{1}, H_{2}, ...}
and the optimal precision ("width") of each one.
(It could be argued that MML is MAP done properly.)
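The distinction can be sketched in code. Below is a deliberately naive two-part message-length calculation, not the Wallace-Freeman approximation that chooses precision via Fisher information; the whole set-up (binomial data, a uniform prior on p, dyadic grids of hypotheses) is invented for illustration. Stating p more finely costs first-part bits, so the shortest total message here settles on a coarse estimate p = 0.25 even though the MAP (here also maximum-likelihood) value is k/n = 0.3:

```python
# Naive two-part MML sketch: choose both a precision delta and an estimate p.
# Assumptions (all invented for illustration): k successes in n Bernoulli
# trials, uniform prior on p over [0,1], hypotheses at centres of a grid
# of m = 1/delta equal cells.
from math import comb, log2

n, k = 10, 3   # toy data

def neg_log2_likelihood(p):
    # second part of the message: -log2 pr(D | H)
    return -log2(comb(n, k) * p**k * (1 - p)**(n - k))

def message_length(p, delta):
    # first part: -log2(prior mass of the cell) = -log2(delta * 1.0)
    # for the uniform prior; second part: the data given H.
    return -log2(delta) + neg_log2_likelihood(p)

best = None  # (message length, p, delta)
for m in (2, 4, 8, 16, 32, 64, 128):   # try successively finer grids
    delta = 1.0 / m
    for i in range(m):
        p = (i + 0.5) * delta          # centre of cell i
        msg = message_length(p, delta)
        if best is None or msg < best[0]:
            best = (msg, p, delta)
print(best)  # shortest message: p = 0.25 at delta = 0.5, not the MAP p = 0.3
```

With more data (larger n) the likelihood term sharpens and a finer grid wins; that trade-off between stating H precisely and fitting D well is what MML optimises.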

  L.A., 9/2011.

 MML Reading List:

[W&B],
[W&F],
[book],
[history],
& see
[CSW].
 ^{(a)}
If you must invent something,
the best kind of thing to invent is something
that can be "got" easily, but only once it has been described:
the "doh, I could have done that" moment!
There were other, somewhat related, theoretical ideas around in the
1960s,
but MML arrived with a practical computer program to solve
an important inference problem.
 ^{(b)}
(i) ε comes with the data but
working out the optimal value of δ may not be easy.
(ii) Given multiple continuous parameters,
an optimal region of precision is not rectangular in general
but its area (volume) is in any case > 0.
(iii) Even a discrete parameter may have
an optimal precision that is coarser than
its discreteness suggests.

^{(c)} If you
accept priors.


 ^{(d)} Distinguishable
given some amount of data.

