A Brief History of MML 

Chris WallaceA seminar to

This is an edited transcript of the seminar.
The following conventions are used: The school does have a video tape of the seminar  requests should be made to the CSSE general office.  LA 6/2004. 
For me the whole business I suppose started to present itself to me as a bit of a problem when I was an honours student studying physics. An experimental group in the school had got very excited by a recent result that they had got. They had been measuring the intensity of a certain type of cosmic ray and discovered that the intensity seemed to go up and down according to the sidereal time. That is to say according to the attitude of the earth with respect to the distant stars. Now if that was so, it would have indicated a preferred direction for the cosmic rays to be coming from, and they were of a type which, it was thought, would not have a preferred direction. So this was quite an exciting result. And the statistical analysis of their intensity records showed that the results they had were, according to the standard statistical tests, significant to about the 1% level. That is to say, you would have expected to get such results by chance only 1% of the time. Now that is a reasonably good significance level. Most studies in medicine and psychology and the more difficult sciences jump up and down if they get a result which is even 5% significant. But it occurred to me listening to their presentation, at a seminar rather similar to this one, that the fit which their presumed siderealtime variation gave to their data required them to say where the peak of this wave was coming to within about one hour, plus or minus one hour, say two hours. But if you fitted a curve with its peak outside this range the significance level dropped like a stone. So what?
Well then I said to myself, well they would have got just as excited had they found the peak at a different time of the day. And if the peak was sort of about two hours wide, well that would mean anywhere in twelve different positions round the clock (24hour clock) would have been just as exciting. [...]
So the chances of their getting a result by pure chance that would have made them excited wasn't one in a hundred, it was one in a hundred divided by twelve which is only about one in eight. And then it further occurred to me, well they would have got just as excited had they discovered something which showed two peaks every day, or a peak every year, or whatever. So there were really rather a lot of hypotheses that they would have embraced if they found that they gave a fit to the data with a significance of 1%. And when you throw that in, the real significance of the result wasn't one in a hundred, or even one in eight, but something probably more like one in five. In other words there would be at least a one in five chance that purely random data would have thrown up something that would have got them excited.
Well, while accidents of the order of one in a hundred don't happen often, accidents of the order of chances one in five happen at least once a week. So I didn't think they had a decent result. I raised my little voice in the seminar and tried to explain my reservations but, as usual, was howled down; remember I was only an honours student and that's the way you've got to treat honours students. And they went on collecting data. But over the next year the effect disappeared. I didn't say I told you so because by that time I had become a research student and they have got to be even more careful about what they say.
But it got me thinking, [...] if you are talking about the fit of a hypothesis to some data, and the fit looks good, you really have to discount this by how many hypotheses you would have contemplated before choosing the one that fitted the data best. Put it another way, the more precisely you have to specify your hypothesis out of a range of possibilities in order to get a good fit to the data, the less plausible that hypothesis really becomes given the data.
So anyway, I put that thought to the back of my mind and forgot all about it. Later on as a research student I got involved in some rather complex statistical analyses of data from a different sort of cosmic ray and got introduced to Bayesian statistics because that seemed to be necessary to get more or less unbiased results. So I did that and put that on the side. Well OK, by the time I had assumed the full and unquestionable authority of a junior lecturer I had had the odd thought about how one should go about weighing evidence and had had a little bit of experience with statistics. [...]
It happened that I got a research student coming to do a masters. He was an electrical engineer named David Boulton who came from New Zealand where all good things come from  like my parents  and for some reason he decided he would do his masters under my supervision. (No he wasn't my first, he was my second student.) And at that time, which is quite a while back, it was about 1966, one of the very promising techniques for building computer logic was a thing called a tunneldiode which could switch far faster than the then available transistors. So his masters was to be on making use of tunnel diodes in logic  logic circuits. [...] This went on and we got some results but after about six months it became clear to both of us that this was going to be a deadend, as it indeed proved to be. The things were just too difficult to handle; the results were difficult to make reliable and by the time you had made them reliable they weren't all that fast anyway. So here he was, poor student, come over from New Zealand to do a masters and half way through it was obvious that it just wasn't going anywhere. So what to do?
[...] A year or two before I had come across a problem which was presented to me by a geologist friend of how to classify samples of data from a large population. They were in fact samples of sediments from oil wells and as one put down oil wells in different places one encounters different sorts of sediments. The origins and nature of these wasn't entirely understood and he thought it would be a nice idea to get some sort of classification as to what sort of sediments were what. [...] At the time I had done nothing much more than to keep him happy more or less with something which today would be called a kmeans description of the data. That worked, more or less. But then I forgot about it. Anyway, it occurred to me when David Boulton ran out of things to do that maybe it would be a good idea to go back to that classification problem and see if we could put it in a proper statistical framework: Given data from some large population is it reasonable to regard the samples that you've got (there may be a thousand different samples) as representing different kinds of thing? Is the population, in other words, made up of several different classes of thing?
Now David was an engineer with a good decent science background and he also knew something about information theory. And he had the bright idea that he had always thought of theories as a way of concisely representing data and his classic example was the periodic table of the elements, something which some of you may know about. [...] It is a way of arranging the 92 elements (or as it was originally done [when] only about 70 elements [were] known), in a table in a way which reveals a lot of relationships and similarities among the chemical elements. Ones in the same column of the table have very similar chemistries. Ones in the same row of the table have rather similar masses. And so if you know the table, you already know a lot about these elements, just represented in a simple geometric arrangement of where you have listed the element names.
So his cut on doing classification was that if you do a decent classification and describe what each class looks like and then what class each thing in your sample belongs to, you have already said a lot about the thing because if you've said it belongs to a particular class then you know it's got to look pretty much like the description of the things in the class. And so his take on it was that it was all a matter of getting the data and using some classification in some way to encode the data more concisely  data compression. Now I thought well yeah, well maybe, but look mate this is statistics, what one really wants to do is to do a proper Bayesian statistical analysis here, but taking account of how precisely you have to specify the properties of each class and thing. And the more classes you bung in, the more properties you'll have to describe, so the more precise, or the more complex, the whole description of the population becomes. Let's go and do this properly with Bayesian statistics, putting on proper prior distributions on numbers of classes and prior probabilities on parameters and so forth.
[...] We argued about this for some weeks, waving arms and shouting at each other, and then we decided well look this was no good, we will have to prove our points by going away and doing the maths and writing the algorithms and seeing what happens. So peace fell on the subterranean rooms where we were working for a week or two, then we came together again and I looked at his maths and he looked at my maths and we discovered that we had ended up at the same place. There really wasn't any difference between doing Bayesian analyses the way I thought one ought to do Bayesian analyses and doing data compression the way he thought you ought to do data compression. Great light dawned  wrote a program, got some data from somebody who had been looking at seal skulls from around the world, whacked it in, cranked the handle, and said oh there's six species of seal here. So we went to an expert and said how many species of seal are there and which ones were supposed to be which, and it was a pretty good match except that we had confused two seal species, which are now incidentally regarded as being the same species, they are just two different populations, and unfortunately split up some of the species into two classes which on inspection turned out to be male and female of the species. So we were really chuffed, wrote a paper, got it published and David got his masters.
Then in '68 I came down here to Monash (I've been here a long, long time) and David, for his sins, decided to follow me and do a PhD  on elaborating the classification algorithm we had so that it could build a hierarchic tree of classes, not just one class, two class, three class, four class,... but superclasses and these dividing into smaller classes and whatever  the kind of things that taxonomists do in the biological arena. And so he slaved away at it and we both knew what was going on by then, so I was able to give him some help, and [he] finally got a decent PhD out of it. And there the matter rested for a while. Except that towards the end of this exercise it had begun to dawn on me that this approach to working out models, or forming hypotheses from data, while it did seem to work in the problem of forming a classification of a population, was not necessarily restricted to that arena and could have wider applications.
So David went off and did analogue computers and such but I, in spare time, and even heads of department occasionally had spare time in those days, pursued this idea of seeing if I could generalise this approach of data compression, or Bayesian analysis done right, to attack in principle more or less any problem of inductive inference. By inductive inference here I mean specifically the kind of thing that scientists try to do, which is by taking a whole lot of data try to come up with general propositions about the source of this data and which are true of lots of the data, even data that you haven't seen yet  the discovery of at least approximations to general laws, general propositions, by looking at specific instances. And I did come up with a formalism which has now entered the literature as what is called the Strict Minimum Message Length formalism which at least gives a formal answer to the question of how to do it, at least in arenas where you have a fairly well defined body of hypotheses from which you wish to select. It might be quite broad, like the set of all possible ways of classifying a population or, as Mike Georgeff later suggested, the set of all possible finitestate automata which could represent the grammar of a language, but the formalism was there. I could prove a few things about it, but unfortunately the mathematical complexity of it made it almost unusable. So after a while I developed, using the work that Boulton and I had done in Snob, at getting tractable approximations to this general idea and ended up with things which are pretty usable.
Had very little success in getting any of this work published because the statisticians did not want to know on the whole and indeed one paper, which I still consider to be a pretty good paper, was rejected by a notable statistical journal on the basis of a referee's comment that he found it objectionable that you could somehow improve maximum likelihood by adding obscure terms to it. He didn't say it was wrong, he didn't say it didn't work, it was just objectionable. So that was that. Right. I went on to other things.
But then out of the blue a mad Irishman descended on us armed with survey data from piles of rubble which he had found in Ireland, and he was persuaded that these were in fact megalithic stone circles, like Stonehenge only being Irish not quite as big. And an equally mad engineer in Britain called Thom had achieved considerable fame by proposing that the shapes of these megalithic circles were in fact carefully constructed geometric shapes. Although roughly circular, they weren't truly circular and they were actually constructed from Pythagorean triangles (like 3, 4, 5, and 5, 12, 13, whatever, the right angled triangles with integer numbered sides) together with a whole lot of other stuff. And our mad Irishman didn't really think this was very likely and he wanted to do a proper statistical analysis. Now statistical analyses of Thom's results had been attempted but were inconclusive  basically conventional statistics just is not equipped to resolve questions of that complexity. So John Patrick came with all of his survey data. He'd read something of the work on classification and thought this was maybe a promising approach. Anyway he did a PhD on it, in the course of which we developed and refined some of the approximations to minimum message length and out of it came some pretty reasonable results, which actually did lend evidence to at least one of the geometric shapes that Thom had proposed but could not really find any evidence for any of the others. So this was duly published and excited the attention of a couple of the statisticians who had themselves attempted to analyse Thom's data. One of them, Peter Freeman, then Professor of Statistics at Leicester, got sufficiently interested in it to come out here for a sabbatical year and work with me to see if there really was anything in it. With his assistance, we together ended up writing a paper which actually made sense to the statisticians, at least enough for it to get published. Phew! After quite a number of years this was. So I got interested in it again, did some more work with Peter Freeman on some rather more complex models, again got published, and this was great.
Now if we can stop there and just go back and think about what was really happening in the rest of the world. David Boulton and I had got started in '67, 1967. But we had been well gazumped by a chap in the United States, Ray Solomonoff, in I think [.../1964]. And he had come up with the notion that he wasn't really interested so much in getting theories out of data. His emphasis was, if you get some data from some source, can we predict what future data might come from the same source? So his emphasis was on prediction. And his take on the subject was, well, let us suppose that we can represent the data that we've got in binary with a very long binary string. And now let us take a universal Turing machine, and I'm simplifying here a little bit but that is to say an honest to God computer which will accept a binary input. And feed just random binary bits into that computer and look at what comes out. Well what comes out will also be a binary string and you throw away all of the inputs which made it produce garbage which wasn't very interesting, or failed, or halt and catch fire, or loop forever, or all of the other things that computers can do. And you look only at those random strings which, when input, made the computer output a binary string which was precisely the data that you had. And then you said OK, these random input binary strings are somehow encoding the data because when we bung them in this computer it makes the computer output the data that we've got. So let us put those strings in again but tack some more random binary digits on the end and see what we can get. Well we'll get our original data in each case plus some further binary strings which look like maybe future data from the same source, and by noticing how often [in] this process, this gives us future data of a particular kind, we get a probability over what might be the future data that emerges. Now this sounds completely bizarre but Solomonoff managed to prove that if the data comes from a source which produces data from any computable probability distribution then his prediction technique, given enough data to work on, will converge to making predictions based on precisely that computable probability distribution. In other words, he proves it works. And it doesn't matter how complicated the source, what we're talking about, whether it's the incidence of rabies in French foxes or the, I don't know, the motion of the planets, whatever, if this data displays any computable regularity this process will detect it, pick it up, and use it to make as accurate a prediction as it is possible to make of what is going to come next.
What has this to do with minimum message length? Well it turns out that when you look at it, the strings which went into the computer which have most effect on making the predictions of future data are the strings which encoded the available data most concisely. In other words data compression. And the better the compression the more weight was given to that string in making the predictions as to what's happening next. So although he was not using this technique to pick on a model of the source of the data, he was using it to average over all possible models of the data and in effect assigning to each possible model a posterior probability which turned out to be precisely the posterior probability that you end up giving to a model found by minimum message length.
So there is a very, very close correspondence between the work. The mathematical pursuit was quite different: Turing machines are quite awful things to work with on a theoretical basis, whereas I'd stuck to, [...] if not finite, at least countable sets of hypotheses which you can enumerate. The hypotheses which can be somehow used by a Turing machine are not countable unfortunately; well they are countable but they cannot be enumerated because you can't tell by looking at it whether a computer program makes any sort of sense at all  whether it will ever give any output, in general.
OK, so in terms of practical application our development had immediate applications, and has been applied, and quite a lot of people around the world are using some of the techniques that we've developed. Solomonoff's work gives a very strong theoretical respectability to the whole approach because he can prove very powerful theorems about what is going to happen of a quite different nature to the kind of thing that we can prove about MML from statistical theory. But it is much harder to get anything practically useful out of Solomonoff's approach. Now we didn't become aware of Solomonoff's work until we were fairly well down the track, it would have been about '72 or '73, something like that although he had published, I think, originally in ['64]. I mean he'd published in an IEEE journal on information theory [Information and Control], I think, which was not the kind of thing that your average statistician would read. I certainly hadn't read it. And I only became aware of it through reading a Scientific American article written by [yet] another bloke. [...]
Solomonoff's work had been read by a [...] very famous Russian statistician, Kolmogorov, who'd seen in it a basis for coming up with a definition of what is randomness. I mean statisticians and probability people all talk about randomness but nobody knew really what it meant. Well Kolmogorov latched onto the idea of Solomonoff's work that if you had a binary string for given data that Solomonoff worked with then you could find some input to a Turing machine which made it output this string, but that the input that you had to put in was shorter than the string which came out, the only way this could happen would be if the string wasn't really random. So he latched onto the idea that the real definition of randomness is, at least for binary strings and by extension to any sort of data, if you can compress it, it ain't random.
Unfortunately, there was a bit of a defect in the way he developed this theory. It turns out to be quite trivial, but he never spotted it and he never really succeeded in getting his project to work. He thought it would lead to a theory of statistical inference but he couldn't make it go because of this defect in his theory. The defect had the horrible result that if you took any random string of digits and applied his criterion of randomness to it then, whatever the string was, you would find infinitely many prefixes of the string (initial sequences of the string) which were nonrandom according to his measure. Which is bizarre.
[...] Another chap, American this time, Chaitin, who gave a talk here a couple of years ago roughly. He spotted what was wrong, set it right. What was wrong was Kolomogorov hadn't thought of the point that if you have got a finite binary string and you want to show that it's compressible, you have to get an input for a Turing machine which will make it output that string and then stop! In other words the input has got to know how long the output is meant to be. Add that little bit in and the theory now works, well Chaitin made it work.
[...] So he wrote the article in Scientific American [1975] mentioning Solomonoff's work. I looked at Chaitin's work and it was all to do with randomness, and frankly I wasn't all that interested in random data. I wanted to know what you could do with something which isn't random, like data from the real world, but it did put me onto Solomonoff's work.
Again round about much the same time I'd been getting IBM technical bulletins and one of them had an article by a chap called Rissanen who had come up with a bright idea for looking at data coming out of, the output of say a chemical plant, the inputs and the outputs of this chemical plant, or any sort of plant, and just from that data working out a model of what was going on in this plant. Basically if you like, looking at a very complicated sort of Markov process and making an estimate of what the Markov process is. And lo and behold when I looked at the maths it looked remarkably like the maths for minimum message length, except that one crucial term looked to me to be the wrong way round  he had a minus sign where he should have had a plus sign. So I wrote to him and suggested he have another look at this and furnishing some of my work. [...] He went on to invent and popularise what is called minimum description length which is basically very much the same idea. And one has to concede that anybody can mistake a plus for a minus, I'm always doing it myself; he had come up with the same idea independently. [...] Once he got the sign right it all clicked and he began talking sensibly about it.
So what have we got here? We've got Solomonoff, to some extent Chaitin and Kolmogorov, certainly Chaitin, and David Boulton and myself all coming up with more or less the same idea at more or less the same time, which I guess might mean that the idea's time had come, except that obviously it hadn't because thirty years has gone by and the world still doesn't know about it! But we're working on that.
Anyway back home, we kept beavering away. Well Mike Georgeff came along and, as I've mentioned, had this idea that maybe you could use this scheme actually to estimate the structure of something, namely the structure of a grammar just given sentences from the grammar, at least in simple formal languages. This idea had actually been suggested by a chap called Gaines who had suggested [LA:1976] that one could use a model of a grammar which was essentially a finitestate automaton of a particular sort. And Gaines had given some examples of strings, simple strings in a simple grammar, and shown that his approach could recover the finitestate machine which represented the grammar from these strings but he didn't know how to "stop". He kept on making models which were more and more and more complex until you ended up with a model which could produce precisely, but only, the given examples. In other words a grammar which says these are the sentences and there are no others. And obviously that wasn't what he was looking for. But he said, well look if you stop it right at the right spot, you get the grammar which represents the given sentences but could also generate other sentences which look sort of similar. So Mike suggested well why don't we have a go at this. So we did the not very complicated maths and wrote some fairly horrible programs and lo and behold it all worked [see TR32], very slowly, but it worked, and that resulted in a free trip to Pisa where we gave a talk at an artificial intelligence conference there. And this was sort of the first introduction of these ideas to the artificial intelligence community.
Since then of course we have gone a lot further, or a fair bit further, and with Kevin [Korb] here and a bunch of students we have pursued learning other sorts of structures, in particular causal networks and binary networks, from just raw data. And that seems to work pretty well too. With Trevor Dix and Lloyd Allison they developed some calculations making some use of this theory to answer questions about the relatedness, or otherwise, of DNA strings, and in Trevor's case how fragments of sequences could be put together to make a coherent whole. I think I've got that right. And Ingrid [Zukerman] has been making a bit of use of it in looking at the structure of discourse where one of the parties is a computer trying desperately to understand what the human on the end of a telephone is actually talking about. So there has been a fair bit of progress and there is still an awful lot more progress that could be made, I think. [David Dowe has also pushed the idea along dealing with less usual statistical models.]
The artificial intelligence community, in particular machine learning because really that's what we are trying to do, is beginning to take some notice of these techniques; I think it should be taking a bit more notice. I have had some vehement opposition from a certain quarter but he hasn't actually yet managed to prove us wrong. No doubt he's working on it. But certainly in so far as the basic theory goes, whether you put it in the framework of a la Solomonoff of using a simple universal Turing machine to decode encoded descriptions of the data, or whether we put it in the Bayesian statistical framework of choosing from a possibly large but still not unlimited set of hypotheses which might fit the data, choosing the one that really does, there are lots of good theorems which say this is the right way to go, one of the most powerful actually produced by a theoretical statistician at Stanford who has managed to prove that if you use this technique and the data is indeed coming from a probability distribution which is either computable, or computable given a few uncomputable constants, then (and you go the route that we have of looking for the shortest possible encoding) [...] within a finite time you will get the right answer. You only need a finite amount of data to find the real honest to God probability distribution from which this data is coming. That is a very powerful theorem, it is very general, it's got rates of convergence and all sorts of things coming out of it. I don't understand it myself  it uses mathematics that I'm not familiar with  but it's nice to know that apparently we are doing something respectable. And nobody's really been able to show that it doesn't work  come up with any convincing example where you get the wrong answer by using this technique.
So anyway, that's the end of my history. It's still openended. There's still a great deal of work to be done. You may say what the hell has it got to do with computing but it is indeed very significant, I think, that the people that got things out of this, or came up with these ideas, were an engineer working with RCA and using computers all the time, that was Solomonoff, and who knew a lot about Turing machines (still does), another engineer, David Boulton, who knew a lot about coding and information theory, a lapsed physicist, me, who knew a bit of statistics and a fair bit about the workings of machines, and Rissanen who was also an engineer. Statisticians just didn't come into it. Learning theorists don't come into it. I think you had to have the background in dealing with real numbers crunching through real processes in a machine and coming up with real answers and thinking along those terms, rather than elegant mathematical theories, to be led down this path. It is a very simpleminded path. What it's saying is that the most convincing story about some data is the shortest one. Full stop. OK, it's not easy to turn that precept quantitatively into something that you can use, but that's basically what we've done. The fundamental idea is trivial, but very strong and I think it still has a lot in it which remains to be found out. I have been thoroughly subverted by Kevin Korb and Charles Twardy  don't go around talking to philosophers, they muck your brain around  but I am now persuaded that, for instance, we can show that although given present knowledge, the present state of the universe, we can make predictions, deduce things about the future, we cannot actually deduce things about the past. Well you can but you get stupid answers. When we are reasoning about the past we actually have to use inductive logic, which is the basis of statistical and inductive inference. So OK, this is amusing if true. If it is true it solves a couple of conundrums which have been puzzling philosophers of science for a while, but it remains to be seen whether it is. So anyway, that's the kind of thing I'm thinking about at the moment, utterly pointless, but there it is. I'm retired. I'm allowed to do that. Any questions?
Q: inaudible.
A: The point that it is the same as the Bayesian approach, even that hasn't sunk in. Because until you come to realize the importance of considering the precision of statement of parameters the ordinary Bayesian approach doesn't get you there. You can't equate posterior probability density with posterior probability. Until you do the business about Fisher informations and precisions of estimation you can't use a Bayesian technique to give you, to lead you to a hypothesis which you can plausibly say, is the hypothesis of highest posterior probability, because you can't assign any posterior probability to any hypothesis if they've got real parameters  in the conventional Bayesian sense. So that message still has got to get across.
What is happening is that people who wouldn't touch this sort of analysis with a bargepole are in fact ending up doing things which amount to the same thing. There's a hot topic in work on using neural nets to learn supervised classification which is going for "wide margin" classifiers. If you look at the mathematics of what they're actually doing, what the wide margin is doing is precisely taking account of how precisely you have to specify the parameters of the network. So they are putting that into their things.
If you look at structural risk minimisation, another apparently unrelated thing of Vapnik's which is based on VC dimension, you find that in certain cases which are pretty general, there are at least three ways of expressing the VC dimension. If you look at it closely they correspond to three different ways of encoding the data. And the different ways depend on,... well OK one can be better than the others and which comes best depends on how precisely you have to specify estimates.
So the actual practice of this, the mathematical consequences, are getting into peoples' heads, but they're getting there by all sorts of funny routes which is encouraging  it means there are all sorts of justifications for doing it.
Q: inaudible.
A: Well it Snob was produced by David Boulton and myself to do classification, that's it. If you use what we... It was a very specific application and it wasn't until after we'd done it that we realized the principles behind it were capable of being generalized to lots of other problems.
We just saw it as, if you want to do classification this is the way you damn well ought to do it. Well the principle is: Find that classification of the sample which allows you to encode the data in the sample, most concisely. Full stop. That's what drove it. That's what the algorithm,... I mean you can look at the algorithm and that's obviously what it's doing. Or you can look at the algorithm and also say what it's obviously doing is picking the model of highest Bayesian posterior probability. So it's very obvious.
Q: inaudible.
A: Oh my message is I made a big mistake by getting into academic work.
[nervous laughter from the audience]
No, I mean that. And to be a little pessimistic, I don't see a huge future for people wanting to do the kind of academic work which I've been doing all my life and which most of you have been, well the older ones among us, have been doing all their lives. I don't think it's respected, I don't think it's going to be resourced, and one has to begin to wonder what the hell the point was. Well on that unhappy note, I'll stop.
[applause]
(Added later by LA)
The MML book is an excellent source on MML.
Transcription (& mistakes) by L. Allison (2004).
www: 
