I am not talking about the whole of inductive inference at all, rather I will be talking about a subset of it which is arguing or reasoning from particular observations, or data, to general propositions. I know there are lots of other things which are called inductive inference but on the whole I will not be talking about them. This particular problem of course has been well done over.
Before I launch into it I want to make a couple of caveats. Throughout this talk, when I talk about a general proposition such as "all cats have claws," for instance, I will be wanting you to regard the general proposition as being a little bit qualified by common sense and scepticism, in that if I say, or anybody says, "all cats have claws," we must understand this to be a statement which is generally true but there may be the odd cat that does not have claws, and there may be some cats who, from genetic defects, have something on the ends of their feet but we are not sure whether they are properly to be called claws, or not, and so forth.
Now you may say this is begging the question a little bit in that if we say "all men are mortal" then by golly that's what we mean, and we don't admit of any exceptions, and if we did admit of any exceptions then we would not say all men are mortal.
But I want to put to you that even with such an absolute statement as that, when we actually come to use it in deductive reasoning about the real world and combine that assertion or compare it with observable data we are always, if we are sensible, prepared to accept the possibility that what we observe we have observed incorrectly or inaccurately, and that when regarded as a proposition about what we observe, even if the proposition is held in some ideal sense to be absolutely true, we must always regard it as being somewhat approximate and somewhat prone to exception.
OK. Just by way of obvious illustration, supposing we have for instance a very good, very well tested general law that if we put something floating in the water, the buoyancy force upwards equals the weight of the displaced water -- never significantly challenged since Archimedes. But realistically, if we are going to use this general assertion to deal with the buoyancy force that we have actually measured, and the weight of water that we have actually measured, we will be dead lucky if we get agreement to within 0.1%. And even if we are extra careful we will have to realise well, surface tension does affect things a little bit, and there could have been waves on the water, and if that boat is actually moving at even 1mm/sec then you have hydrodynamic forces to worry about as well. And so in practice the law, while entirely usable as stated in an unqualified way like this, is not so interpreted. It is interpreted with a little bit of caution and likewise the business about cats having claws. There is always the odd exception.
I want to make the point further that even before we consider the induction of general propositions, we really have to account for a more primitive form of induction -- the invention of names, common nouns, verbs. I regard the use of a noun like "cat" as the outcome of inductive reasoning, in that we observed a large number of particular animals, or particular small things scurrying around outside, and have come to the conclusion, through a process of induction, that a number of these are sufficiently similar that it is worth regarding them as forming a class and inventing a name for that class and that name is cats. Now that is an act of induction. It leads to a conclusion that there are things called cats that are worth talking about and, like any induction, it is fallible. But before we have got through this step we cannot even think about the inductive inference of general propositions because we cannot form a general proposition without common nouns and common verbs. Before we can have a universally quantified term in a proposition we must have invented the thing which is to be universally quantified. We cannot talk about "all cats" until we have decided there are cats and what we mean by cats.
So much for background. I will in fact attempt to give some account of that primitive sort of inference as well as the more ordinary sort. What do we start off with? I am going to try and simplify it. We have got some data or observations. We have got some background knowledge and, in the light of that backgound knowledge, we think that [data] D requires some explanation because it is to some extent surprising. It is certainly not implied by our background knowledge. I will suppose however that it does not contradict our background knowledge. So when I say surprising here I mean not that it causes us to jump in our chairs but that we find it unpredictable. Our background knowledge gave us no necessary reason to expect such data. So, having collected enough such data we may infer a theory about it, by some inductive process, which I will call H, a hypothesis. And we will certainly want that our background knowledge, which for our purposes I am not going to challenge, this is accepted knowledge.
The background knowledge and the data does not contradict the theory and we would like that the background knowledge together with the theory, at least roughly, implies the data, or renders it less surprising -- gives us some credible account of what we have observed. So that we hope the data will be less surprising if we assume the truth of this hypothesis.
Before I can go further, I have to introduce the idea that data, or the results of observations, or whatever, certainly data that can be communicated between one human being and another must be represented in some sort of observation language. We have a message which tells us what that data might be. It could be a sentence in English. It could be a roll of magnetic tape that a computer could read, that has come out of an instrument -- it could be whatever. But it is a message in some sort of code or language. I will call the language in which the data is presented an observation language. And the way in which we choose to represent the data may make use of such background knowledge as we have to shorten or make the representation of the data reasonably efficient. For instance, such things as the circumstances of the observations, what sort of instrument we have used, any well-known laws of nature held to be applicable, can be thrown in to help us reduce the volume of data without losing any information from it. Let us suppose that having done that, we end up with a message of some length [|D|] which tells us what we have observed.
I will now want to define a particular way of recoding what our data is which I will call an explanation in the hope that it corresponds, at least roughly, to one of the normal English senses of the word.
Suppose that we have a sender, S, who knows the background information and the data, and a receiver, R, who knows the background only. And we suppose that we would like to form a message which S can send to R and which will tell R the whole of the data -- give him every bit of detail about that data. There are many ways you could perhaps form such a message. I will want to concentrate on a particular way of doing it which I will call an explanation message. And an explanation message will have two parts. The first part will be a statement of some hypothesis that we may have inferred from inspecting the data. The second part is just a statement of the data but restated in a new code or language (not the observation language) which we expect to be efficient if H were true. Thus we could omit from the second part of the message anything in D which is in fact implied by the background knowledge and H. It would not be necessary to include that in the second part of the message.
Well that's what I call an explanation message then: A statement of a hypothesis followed by a statement of the data couched in terms which assume the truth of the hypothesis.
Now comes the interesting bit. We will regard an explanation, and the hypothesis on which it is based, as acceptable if the length of the explanation is shorter than the original length of the data as stated in the observation language. Now the length of that explanation is of course the sum of the lengths of the first part which states the hypotheses, and the second part which states the data given the hypothesis, or assuming the hypothesis. Now all of this would be great stuff, and we will say that out of all explanations of the data we will prefer the one being the shortest explanation.
So essentially this is just some form of Ockham's razor or the parsimony principle -- nothing very surprising. And nothing very useful unless we can actually reason sensibly about what these lengths actually are. So to procede any further, other than just giving a sort of motherhood principle that explanations should not be too long winded, we need to have some way of systematically, and accurately, and uncontroversially, of attaching a length to a message -- measuring the lengths of messages.
Well the ground work for one way of doing this was, of course, established by Shannon in his theory of information, and he showed that using optimal encoding, or the best possible language, the length of a message announcing some event or occurence, X, would necessarily be minus the logarithm of the probability of that occurence. So something which is improbable will require a long message to state it. Something which is very much expected anyway can be announced in a short message.
When we make the appropriate substitution in this we find that the length of an explanation is the length required to state the hypothesis which depends on the probability of that hypothesis, whatever that may mean. The second part of the length depends on the probability of the data given the hypothesis. And using a bit of Bayes's theorem, and whatever, we end up with this being minus the log of the joint probability of hypothesis and data.
Well it comes as no surprise then if we say actually we are talking about Bayes's theorem, and the probability of the hypothesis that occurs here we can think of as a Bayesian prior probability for a hypothesis -- what probability the receiver would attach to the hypothesis if he doesn't know anything about the data but only has his background knowledge.
Well, we will say that not only do we require an explanation to be shorter than the original statement of the data for it to be acceptable at all, but out of all such explanations, that meet that criterion, we would prefer, and infer from the data that one that gives the shortest explanation of the data. This is essentially the Bayesian choice rule of choosing the inference which has the highest posterior probability given the data because that can be written as the joint probability [H&D] divided by the probability of the data which doesn't change if we contemplate different hypotheses. So you might say, well I've got this far, ten minutes down the track and so far we have caught up with Bayes. So what? Well there is a difference.
The line I'm plugging differs from Bayesian acceptance theory in that we can cope quite well in this context with something Bayesian theory doesn't cope with very well at all, and that is the situation that the hypothesis we infer from the data might have some free real-valued parameters. We should write it as h(θ) where we don't know the value of θ and θ itself has to be inferred. For example, our hypothesis might be that there are things called electrons, and that these have certain properties, and that this is explaining something that we see going on in a vacuum with a hot wire in it. But to flesh out that hypothesis we would have to give values for the properties of the electron, in particular its mass, its electronic charge, its spin -- whatever happens to be relevant. These are real-valued numbers. Our hypothesis has to advance values for these. We cannot very well have a prior probability for any particular value in a continuum. All we can do -- deal with in Bayesian terms -- is a probability density, and unfortunately when we go back and interpret this in terms of probability densities we end up with a posterior probability density over hypotheses, and that does not lead, in any acceptable way, to a way of choosing a particular hypothesis or a particular estimate for the mass of the electron, or its charge, or whatever. However, when we do it in terms of message lengths, it all comes out in the wash and we do end up with (well the fact that we may have real-valued parameters causes one to scratch one's head over some algebra, but you get) some sensible answers coming out at the end.
OK. So in one sense all we have done is restate the Bayesian view of inductive inference but try to tidy it up and extend its application to the estimation of real-valued parameters, which Bayesian inference does have some problem with.
If that was all we had done it would not be terribly interesting. However there is another way of looking at the question of message lengths, an alternative to Shannon's theory, where we consider what happens if our receiver of the explanation, instead of regarding R as a person or agent with some prior probability beliefs about possible hypotheses, and background knowledge, we regard it as a universal Turing machine.
Now, a universal Turing machine is basically just a computer. The particular universal Turing machine I want to discuss can read things recorded on an input tape -- a message. It can output onto another tape, and it has a work tape, or many of them for that matter, of essentially infinite length, on which it can record and read back anything it chooses. OK, so it is a computer with a nice big disc or tape attached to it, and the ability to read data in from the outside world. Having read the input tape it cannot rewind it; it cannot go back and have another look at the input. And having output, it cannot recall the output and change it -- that's it.
Now, you will recall that our data is presented in some observation language or code. For any such observation language or code there is a trivial way of rewriting it in the binary language of 0s and 1s, and this is done in telephones and increasingly in TV and whatever. And so we will think of our computer also as just inputting and outputting 0s and 1s -- strings if 0s and 1s. In our terms now, this is going to be the receiver of the message. The message will be fed to it as a strings of 0s and 1s on the input tape and, to prove that the computer has actually understood that message and recovered that data from it, we are going to require the computer to output a copy of the data in the original observation language, again just as a string of 0s and 1s. So now an explanation is an input tape which when fed into a Turing machine causes it to output an output tape which is the priginal form of the data. And an explanation is acceptable if the input tape that we find is shorter than the output tape which is produced and, of all the possible explanations, we will prefer the shortest.
So what has changed? Well there is no mention in any of this model, on the face of it, of probabilities. And in particular, there appears to be no mention of prior probabilities of hypotheses or whatever. So what's happened? Well the probabilities are there but they are inherent in the design of the universal Turing machine. You can have lots and lots and lots of different universal Turing machines, all looking like this but with different internal works inside the black box in the middle. Now in one sense what's inside the black box does not matter very much -- which is why they are called universal Turing machines -- in that whatever is inside the guts of one Turing machine, you can make it behave like any other universal Turing machine by feeding it an appropriate input. You can reprogram it to imitate any other machine, and this is why they are called universal. But how much input you have to give the computer to make it imitate another one will depend on its internal design and on the internal design of the one you want it to imitate. So in a sense then, if we think of message lengths, a la Shannon, as reflecting the logarithm of probabilities, the length of the tape that we have to shove into a computer to make it behave like another computer, or to make it behave as if it accepted the hypothesis H, is a measure of the probability of the hypothesis H or, if you like, the prior probability of the computer we are imitating. The longer the input we have to stuff in to make the computer accept hypothesis H the less probable that hypothesis is for this paricular computer. So the choice of computer does determine the choice of our prior probabilities. This all sounds a bit weird but let's put it another way.
Consider an explanation. It contains two parts -- I've drawn it out as a long piece of tape here. The first part is in some sense a statement of a hypothesis. Now I will say a little more about just how a hypothesis is stated. The second part is the data recoded in a code or language which will be efficient if the hypthesis was true. Alright, well what does H really amount to here? When the Turing machine, T, has read H but before ir has started to read this part, when it has read H it is in effect a new and different Turing machine, one which is prepared to accept input in a code which would be efficient if H were to be true. So what is H if we think of it from a Computer Science point of view? H is a program which will cause the Turing machine to decode the code we would lke to use for recoding data if we believe H to be true. To give a simple example, supposing that we have got some data which we have obtained by picking up lots of small objects, getting their masses in a balance, applying forces to them, and seeing how fast these objects accelerate. So our data is a whole lot of triples, each recording the mass, the force applied, and the acceleration produced. And we could record these in some observational language just as a series of numbers.
Now having looked at this, and being at least
as clever as Newton, we suddenly realise that,
by golly, on the whole the force is very closely
equal to the mass times the acceleration.
So we advance an explanation
which begins with an assertion of
Newton's second law of motion,
So what goes on the front here? Well basically a little program to read in triples of numbers, multiply the first in each triple by the second, add on a small correction which is the third, and spit out the result. A pretty elementary form of Newton's second law of motion but that is all we need -- something which decodes a code which would be efficient were the hypothesis true.
Well at that point I ran out of transparencies so I'll stop and revert to a blackboard.
Let me try to answer "what can we say about this view of inference to suggest that it is worth serious consideration?" Well there's a few things that we can say which, I think, merit its consideration. The first one is that an H which is accepted as the result of looking at some data is automatically falsifiable in Popperian terms. Any H, which allows the given data to be encoded in a way shorter than its original representation, must be such that there could be some data that could not be so encoded and indeed the encoding assuming H would be longer than the naive representation. This is an immediate consequence of Shannon's theory of information. So any hypothesis accepted as a basis of this process is subject to rejection by data which is at least conceivable, even if we do not observe it. There is observable data which could cause us to reject the hypothesis. That's nice.
Second part is that it does improve, in a certains sense, in two directions, I think, on Bayesian inference: It can handle continua of theories, as I've mentioned before, without any problems. The other point however is that Bayesian inference has a little bit of a problem about it. We have the rule that we would like to say
Kevin Korb will give you one answer as to how to get out of this. I will give you another and I say then that the thing which embodies our prior probabilities is the universal Turing machine, T, which is the receiver of our message, of our explanation.
Why did we choose that particular Turing machine? Well because our background knowledge and our prior experience had led us to expect that was an appropriate Turing machine to have. It incorporates our prior beliefs about the way the world behaves. But where did these beliefs come from? Well at some previous stage, we must have been more ignorant, and we had an earlier Turing machine representing our more primitive beliefs about the state of the world but we found that in explaining lots of data to this [earlier] Turing machine, lots of the time we had to always incorporate the same little stuff up the front of our hypotheses. That whenever we were talking about physical quantities, it seemed to be a darned good idea to start the first part of the explanation, the hypothesis stating part, with something which told the Turing machine how to multiply numbers together. Turing machines don't ecessarily know that -- you have to tell them, it's a little subroutine.
And so we would say, after a little while,
well by golly it looks to me in any situation
where there are quantitative observations
it's a good chance that multiplication will be involved,
so I'll build myself a new Turing machine which
now has multiplication built in, and
instead of having to put it in a rather long
piece of tape to tell it how to do multiplication
everytime I wanted to do something like that in decoding the data,
I've just got to put in something very short.
So I have changed the prior beliefs incorporated into the Turing machine --
it now knows about multiplication.
And a little later on,
I've probably got a quite well educated Turing machine
which knows about division and sines and cosines
and Bessel functions and linked lists and
So I have explained how a more primitive Turing machine gets updated when we find that we are constantly having to give it the same little bit of input and all of our hypotheses, or many of them, tend to incorporate the same little bit of digital machinery. Well we just build it into another Turing machine.
Where did this more primitive Turing machine come from? Well it came from the same kind of process, and even more primitive Turing machine. And we have some sort of regress but is it infinite? Well the answer is no! We come to a simplest possible universal Turing machine. No matter how you define Turing machines, there is a simplest possible Turing machine. If you try to simplify it any further it ceases to be universal; it looses the capacity to imitate all Turing machines. There is in fact even some little bit known about what the simplest universal Turing machine looks like and it is really remarkably simple. It has something of the order of 100 internal states and that's all, which is extraordinarily simple. And it knows nothing; I mean it really knows nothing. You would have to feed a tape of hundreds and hundreds and hundreds of binary digits into it to tell it how to add two numbers together. So in my account at any rate, one is led rather naturally to an expression of complete ignorance.
Complete ignorance is defined by that state of knowledge represented by the simplest possible universal Turing machine. Notice incidentally that if we start with that simplest possible universal Turing machine, we don't need, strictly, ever to build smarter ones. All we need do is feed this a tape which will cause it to imitate smarter ones. There's no difference, except in how fast it will run, between a very simple Turing machine which has been educated, and a Turing machine which has been built smart. That's the whole idea of universality.
We do have that second advantage then, that I think there is a natural representation of total ignorance [writes on board]. Total ignorance has some slightly surprising features when defined in this way. If we think of a number, we may say oh well, it's a likely sort of number or an unlikely sort of number. If we have a machine which is totally ignorant then the sort of numbers like, oh I don't know, 3145926 is much more probable than this one, 3145922, because this one [the former] can be generated by a shorter input than that one because this one is just the first umpteen digits of π, and that is generated by a very easy little program, whereas this one [the latter] hasn't got any pattern to it at all, and so would take a good deal more input to generate. So total ignorance is not quite what we would expect.
It is possible that the world is unkind to us and we go and collect some data which is entirely chaotic. We look at the sky and there's patterns of stars on it with no real sense to it at all. That doesn't stop naive people from seeing patterns in it. But if we adopt this view of inductive inference, we will find that random data is indeed inexplicable: there will be no acceptable explanation of it, no hypothesis about the data which enables the data to be recoded concisely. And that's nice.
It means that this account of inference, inductive inference, suggests that what we accept as good inferences are really reflecting something which is really going on in the data -- some real regularity or pattern in the way the world is behaving. And we have the assurance that if the data was indeed random, then with extremely high probability, going to one as the amount of data increases, we would find no explanation of the data. So that's nice. What else is nice?
If the data has indeed any computable regularity then, in the limit, as the amount of data increases, the hypothesis preferred by this rule [MML] will converge to a representation of that regularity. There are even some slightly weaker results that if there is some regularity in the data, but it cannot be represented by a computable function, then the data may still be explicable, at least roughly, and that our hypothesis will converge to that computable function which is closest to the uncomputable one from which the data is really observed.
OK, so one has to take any asymptotic results like this with a grain of salt because it doesn't say how much data is going to be necessary to achieve this convergence. But at least we have the assurance that the process, as we get more and more data, is going to get us closer and closer to hypotheses that describe genuine regularities in the real world and we will not get stuck with some hypothesis about the real world which is simply wrong.
Finally, I have presented this essentially as a normative view of inductive inference. I am prepared to argue that it is, at least approxumately, a descriptive view of the kind of inductive inference we are talking about, although I would not attempt necessarily to say that the inductive inferences of a particulat thinker or scientist in a particular situation necessarily follow my prescriptions. What I would say is that those inductive inferences which a scientific culture ends up generally accepting could all satisfy the criteria which I have put up here -- they all do indeed allow the raw data we get from the outside world to be recoded in a considerably more concise fashion. Now obviously I have not got time to argue the point but I have illustrated it, for instance with Newton's laws of motion. You can think of dozens of examples for yourself if you like.
Finally I have said nothing about how we go about finding -- given data -- how we go about finding a good hypothesis. What I've talked about is testing whether the hypothesis is acceptable and whether or not it is better than some other hypothesis, given the data. And indeed there's not an awful lot I can say about finding the hypothesis in general.
In the most general case where we are including among the set of conceivable hypotheses all computable functions, or all computable rules, then one can show that there is no algorithm for finding the best hypothesis. If there was, it would be able to solve the halting problem in Computer Science which is known to be undecidable.
However, in sufficiently restricted cases one can conduct a systematic, and sometimes even reasonably efficient, search over a constrained set of hypotheses about data, and get a computer to automate this search, and come up with at least a good hypothesis, if not necessarily the best. This has been done in quite a number of problem domains, all of course fairly simple, but in many of them reaching conclusions which were not obvious to human beings and which in some cases have significantly refined the best theories that humans have previously come up with. The most recent, for example, in which David Dowe was using a program which we developed a while back, into which was fed numbers characterising properties of days, like how windy they were, how much rain there'd been, and whatever, and whether or not there had been a bushfire, and getting the computer to work out a hypothesis of this data in the form of a rule predicting whether or not there was likely to be a bushfire given the properties of the day. And the rule it came up with by automating this process indeed works better than the rules currently employed by the CFA.
You can get it to learn simple grammars given sentences in a simple language. The best explanation of those sentences is the grammar from which they are drawn; you can learn that and so on and so forth.
Well I'll stop there. But that is the final argument: It works.
-- Applause --
Kevin Korb then wound up proceedings and discussion moved to the tea room.
This seminar was presented by
Chris Wallace (CSW) on
[*] These three make good general references for Minimum Message Length (MML) inference, including as they do, one of the first papers (1968), an important generalization (1987, & it's in statistical terms), and the MML book (2005).
↑ © L. Allison, www.allisons.org/ll/ (or as otherwise indicated).
Created with "vi (Linux)", charset=iso-8859-1, fetched Saturday, 17-Aug-2019 21:51:53 EDT.
Free: Linux, Ubuntu operating-sys, OpenOffice office-suite, The GIMP ~photoshop, Firefox web-browser, FlashBlock flash on/off.