On MML

The reasons why minimum message length (MML) inference works are quite elementary and were long hidden in plain sight^(a), so the fact that it was not in use before 1968 is surprising, and that there is any debate about it is more surprising.

Even a continuous attribute (variable) of a datum can only be measured to some limited accuracy, ±ε/2, ε > 0, so
every datum that is possible under a model has a probability that is strictly > 0, not just a probability density (pdf).
Any continuous parameter of a model (theory, hypothesis, ...) can only be inferred (estimated) to some limited precision^(b), ±δ/2, δ > 0, so
every parameter estimate that is possible under a prior has a probability that is strictly > 0, not just a probability density.
So, Bayes's theorem can always be used^(c), as is,
pr(H&D) = pr(H).pr(D|H) = pr(D).pr(H|D), for hypothesis H and datum D.
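
As a rough illustration of the first observation, here is a minimal sketch in Python. It assumes a normal model with made-up values of the mean, standard deviation, datum x and accuracy ε; none of these come from the text, and the function names are hypothetical. It shows that a datum measured to ±ε/2 has a genuine probability, roughly pdf(x).ε, and hence a finite message length.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu, sigma):
    """Cumulative distribution of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def datum_probability(x, eps, mu, sigma):
    """Probability that a measurement falls in [x - eps/2, x + eps/2]."""
    return normal_cdf(x + eps / 2, mu, sigma) - normal_cdf(x - eps / 2, mu, sigma)

def message_length_bits(p):
    """Length in bits of an optimal code word for an event of probability p."""
    return -math.log2(p)

# Illustrative values only (assumptions, not from the text).
x, eps, mu, sigma = 1.7, 0.1, 0.0, 1.0
p = datum_probability(x, eps, mu, sigma)
approx = normal_pdf(x, mu, sigma) * eps   # the usual pdf(x).eps approximation
print(p, approx, message_length_bits(p))
```

The exact interval probability and pdf(x).ε agree closely when ε is small relative to the scale of the model, which is why the approximation is commonly used.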

However, that is not to say that making MML work in useful applications is easy; in fact it can be quite difficult. After the self-evident observations above, a lot of hard work on efficient encodings, search algorithms, code books, invariance, Fisher information, fast approximations, robust heuristics, adaptations to specific problems, and all the rest, remained to be done. Fortunately, MML has been made to work in many general and useful applications including, but not limited to, those in these lists, [1], [2], & [3], and in other areas such as bioinformatics [4].

BTW, given Bayes, pr(H&D) = pr(H).pr(D|H) = pr(D).pr(H|D), so pr(H|D) ~ pr(H).pr(D|H),
it is sometimes claimed that MML inference is just MAP inference but, in general, this is not the case. MML requires that one sets not just the "height", pdf(H), but also chooses the set of distinguishable^(d) hypotheses {H₁, H₂, ...} and the optimal precision ("width") of each one. (It could be argued that MML is MAP done properly.)
-- L.A., 9/2011.
MML Reading List: [W&B], [W&F], [book], [history], & see [CSW].
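
To make the point about "width" concrete, here is a toy sketch in Python. It is a deliberate simplification and not strict MML: it assumes a binomial model for n coin tosses, a uniform prior on the parameter θ, and a code book restricted to k equal-width cells whose centres are the only statable hypotheses (real optimal code books are not uniform in general). All function names are hypothetical. The code book must be fixed before the data are seen, so the sketch chooses k to minimise the expected two-part message length, where the first part costs -log pr(H) and the second costs -log pr(D|H).

```python
import math

def two_part_length(heads, n, theta, k):
    """Bits to name one of k equal-probability cells, then the data given theta."""
    first = math.log2(k)                      # pr(H) = 1/k under a uniform prior
    tails = n - heads
    second = -(heads * math.log2(theta) + tails * math.log2(1.0 - theta))
    return first + second

def best_length(heads, n, k):
    """Shortest two-part message for this data over the k cell centres."""
    centres = [(i + 0.5) / k for i in range(k)]
    return min(two_part_length(heads, n, t, k) for t in centres)

def expected_length(n, k):
    """Average over head counts 0..n, which a uniform prior makes equally probable."""
    return sum(best_length(h, n, k) for h in range(n + 1)) / (n + 1)

def optimal_cells(n, k_max=60):
    """Number of cells (1 / precision) with the least expected message length."""
    return min(range(1, k_max + 1), key=lambda k: expected_length(n, k))

for n in (10, 100):
    k = optimal_cells(n)
    print(n, k, expected_length(n, k))
```

Too few cells and the stated hypothesis fits the data poorly; too many and the first part of the message grows without enough saving in the second. The best precision is therefore finite and depends on the amount of data, which is something plain MAP estimation does not address.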

^(a) If you must invent something, the best kind of thing to invent is something that can be "got" easily, but only once it has been described -- the "doh, I could have done that" moment! There were other, somewhat related, theoretical ideas around in the 1960s, but MML arrived with a practical computer program to solve an important inference problem.
^(b) (i) ε comes with the data but working out the optimal value of δ may not be easy. (ii) Given multiple continuous parameters, an optimal region of precision is not rectangular in general but its area (volume) is in any case > 0. (iii) Even a discrete parameter may have an optimal precision that is less than what appears to be possible from its discreteness.
^(c) If you accept priors. [Diagram; labels: statisticians, frequentists, Bayesians, loss function-ists, estimate-ers, MML-ists, ...]
^(d) Distinguishable given some amount of data.