A General Objective for Inductive Inference 

C. S. Wallace & M. P. Georgeff

1 1 (V1).(log(t/12)+1)  log(V1)!  .SUM_{i} log p_{i} 2 2 
and that p_{i} will be estimated by approximately (n_{i}+1/2)/(t+V/2) where n_{i} is the number of times the i^{th} symbol is produced from this state. See Wallace and Boulton (1968) for the derivation.
The length of a behaviour is the sum over all transitions of minus the log of the probability p_{i} of the transition.
In computing his "acceptable sets", Gaines was compelled to develop each PFSA of a given number of states which could produce D, and then for each PFSA estimate its transition probabilities from the transition counts in its behaviour. However, we are able to develop the best PFSA (by our measure) with less exhaustive, although still tedious search. If we consider a single state of the PFSA, the length of the (b) component of m specifying the transition probabilities of this state, plus the length of all Huffman code words in the control sentences specifying transitions from the state, is approximately
1 (V1).(log(t/12)+1)  log(V1)!  SUM_{i} (n_{i}+1/2).log p_{i} 2 
This expression is closely approximated from below by
(t+V1)! log  (V1)! PROD_{i} n_{i}! 
which is in fact an exact expression for the relevant parts of an explanation in which m specifies only the structure of the PFSA M, and in which the probability of a behaviour of M is computed by integrating over all possible transition probability distributions rather than by specifying in m some estimated transition probability distribution for each state (Boulton and Wallace 1969). The latter expression has the advantage that it can be computed incrementally while following the behaviour of a trial PFSA. It is also possible to compute the (c) components incrementally. Thus we are able to abandon a trial PFSA as soon as the incremental computed length of a partiallycompleted explanation exceeds the length of the shortest known explanation.
Consider the data string
where "/" is a delimiter.
(The delimiter is known a priori to return to the start state.)
Gaines finds PFSAs up to 7 states with of course increasingly
probable
behaviours.
Some of the machines are shown in
Figure 1(b).
Because there is a marked increase in probability
(decrease in length of control sentences
Figure 1(a) shows how the length of explanation of the best nstate machine varies with n over the range 1 to 7. It is clearly seen that the best theory, under our measure, is the 4state machine. Thus in this case the quantitative measure proposed above has yielded the same best theory as Gaines' intuitions. It also corresponds to the grammar derived by Feldman, Gips, Horning and Reder (1969) and Evans (1971) using heuristic schemes.
Note that the length of the behaviour (i.e. the control sentences) of the 7state machine is the minimum over all machines. Although the length of behaviour decreases up to the 7 state machine, an increase in the number of states thereafter produces no decrease in length.
Note also that neither the 1state nor 7state machines provide an explanation of the data string, as the data string may be written in a binary code with length 66.
In a number of other examples from Gaines we similarly find the machine preferred by Gaines, except in one case (his Figure 3) where we find a 2state machine to be better than his 6state machine.
Consider the problem of trying to fit one or more straight line segments to a set of data points in a twodimensional field, where some points may be noise, and the number of line segments is unknown.
We suppose the data to comprise N coordinate points
(x_{p}, y_{p}) (p=1,...,N)
of points in a square field of size
The control sentence w_{p} giving the data sentence x_{p} y_{p} then has two parts. The first declares point p to be either noise or a member of line `l'. If it is noise, the second part specifies x_{p}, y_{p} using a code word of length 2.N.log(R/epsilon). If it is a member of line `l', the second part specifies the position of the point by giving its position along the line segment and its distance from the line. The former assumes a uniform distribution of points along the line, the latter a Gaussian distribution.
In Figure 2 we show one such set of points (Figure 2a) together with the best single line theory (2b) and the best two line theory (2c), assuming a RMS scatter of line points of 0.3. The length of the data (that is, the "null" theory) is 317.7, the length of the one line explanation is 294.9, and two line explanation 298.7. The one line theory is preferred by our measure. The points treated as noise by the explanation are shown solid.
In Figure 3(a) we show another set of points. Again assuming an RMS scatter of 0.3, Figure 3(b) shows the best twoline theory, giving an explanation of length 309.0 vs a data string length again of 317.7. For this set there is no oneline explanation, the best oneline theory giving a length of 320.8.
The above approach may be compared with the technique used by Bolles and Fischler (1981) for model fitting in the presence of "gross errors". They first filter out the gross errors by proposing a model of the data on the basis that enough of the data points are close to the model to suggest their compatibility with it (i.e. their deviations are small enough to be measurement errors). The user has to specify a suitable threshold on the number of compatible points required. This threshold is somewhat arbitrary, and usually task dependent. Thus in the examples shown in Figures 2 and 3 a low threshold would select both the one line and two line theory for both sets of data, whereas a high threshold would select only the one line theory in both cases.
They then use some smoothing techniques (e.g. least squares), and finally evaluate the fit of model and data by testing, on the basis of some simple statistics, the randomness of the residuals. We would agree with this principle in the context of the fitting problem. However, rather than using a small suite of statistical tests for randomness in the residuals, our approach produces a string of residuals (the control sentence string) which has the greatest possible randomness in the sense that it admits of no more concise encoding. Our approach also provides a direct measure for the selection among competing theories.
Given a random sample from an unknown multivariate distribution, numerical taxonomy or classification techniques attempt to construct and test hypotheses that treat the population as comprising several distinct classes. It is usually assumed that the distribution of variables within a class has a relatively simple form. In an explanation based on a classification hypothesis, the sentence m states the number of classes, and then for each class gives its relative abundance and the distribution parameters for each variable. For each member of the sample, the corresponding control sentence w effectively specifies the class to which the member is supposed to belong, and then gives the variable values for that member using a code optimised for its class.
Search for a good explanation of this form can be conducted efficiently if the distribution forms assumed within each class are sufficiently regular. The Snob program (Wallace and Boulton 1968) uses a combination of analytic optimisation and heuristic tactics to search for that classification of multivariate sample which gives the shortest explanations. Samples of over 1000 things each having over 100 attributes can be classified in the order of an hour's computer time. [NB. written in 1983.] The program determines the number of classes without the use of any arbitrary thresholds or other criteria of fit or significance. It has been found to give useful results in a variety of fields including demography, psychiatric diagnosis, and zoology.
It has been suggested (Thom 1967) that the neolithic builders of the stone circles found in the British Isles used a precise unit of length and a number of families of geometric shapes based on integer Pythagorean triangles in their construction. Patrick (1978), using the same approach as described herein, constructed two explanations for survey data of a number of Irish stone circles. One explanation used Thom's theory, and the second used a theory that the circles were roughly circular and locally smooth, with no defined unit of length. His analysis showed that the Irish circles were better explained using the simpler theory, but that there was some evidence of a preference for bilateral symmetry. It should be noted that previous analyses using conventional statistical methods had been unable to arrive at any definite conclusion (Freeman, 1976; Boradbent, 1956).
In Section 5 we mentioned how the length of an explanation could be used to avoid exhaustive search while still guaranteeing that we find an optimal theory. However, in most cases where the set of possible theories M is large, such a technique is still too time consuming to be practical. In such situations analytic and heuristic techniques may be used to search for and improve good, but not necessarily optimal, theories.
One such technique is to use explanation length to select among competing subtheories at some low level of abstraction, which in turn can form the basis (i.e., the `data') for theories at a higher level of abstraction. There is no guarantee that such an approach will lead to the best global theory, but it is reasonable to expect in most natural domains that the resulting global theory will at least be nearoptimal.
For example, in objectrecognition (vision), the approach of section 6 could first be used to find straightline segments in a set of points. The resulting lines can then be further `explained' by a higherlevel theory which described the lines in terms of shapes or objects. Again, the length of explanation could be used to choose among possible higherlevel descriptions. This technique of composing theories is common in natural sciences.
A later version of Snob (Boulton and Wallace, 1973) illustrates such composition of theories. In this version, the hypothesied class structure is itself "explained" using a higherlevel theory, viz an hypothesized hierarchic tree in which each class is described by successive dichotomy of higherlevel classes. The sentence m specifying the classes may then be regarded as itself an explanation, in which the first part specifies a number of unrelated root classes, and the second part is a string of control sentences each specifying a tree of dichotomies by which a root class is elaborated into a set of terminal classes.
As Feldman has remarked (Feldman, 1972) the search for a general method of choosing the `best' theory or model to account for given data is beset by philosophical and practical difficulties. Our contribution to this problem includes three features.
First, the introduction of the length of the control string for a probabilistic Turing machine provides a measure of the degree of fit between a theory (the PTM) and the data D. In sufficiently regular cases, this measure is equivalent to the log likelihood function, but it generalises to cases in which the likelihood function may not be computable.
Second, we have introduced a measure of the complexity of a model or theory (the length of the sentence m) which is compatible with the measure of fit. Other measures of fit and model complexity have been proposed, but as Feldman observes, it is not clear how they may or should be combined to provide an overall criterion of choice among competing theories.
Because our measures are compatible, we may simply add them to give a measure (the length of explanation) which can be used to choose the best of a set of theories, and is open to theoretical analysis. Moreover, our measure provides an absolute criterion for rejecting theories. For instance, we can show that any theory accepted by our measure is necessarily falsifiable. That is, if a theory M gives an explanation for some data D, there exists a data string C such that M cannot explain DC. Compatibility also allows us to deal naturally with the composition of theories at different levels of abstraction.
Third, with the introduction of the language L_{1}, we have clarified the role and construction of prior probability distributions over families of models. The language provides a natural framework in which prior knowledge or expectations about simple features or subsystems of a model may be combined to define a prior distribution over a complex family. It also defines the otherwise undefined Universal Turing machine assumed in some other attempts to describe induction via computational complexity (e.g. Solomonoff (1964)). Further, this framework makes it clear that any reasonable assignment of prior probabilities to models is dominated by the relative complexity of the models rather than by subjective expectations.
An important aspect of L_{1} is that it allows us to handle continua of models. Previous proposals to use Bayesian inference (e.g. Hunt (1975)) have failed to provide a general measure for inductive inference because they could not cope with continua.
Boulton, D. M. and Wallace, C. S. "An Information Measure for Hierarchical Classification", Computer Journal 16(3) 1973.
Boulton, D. M. and Wallace, C. S. "The Information Content of a Multistate Distribution", J. Theor. Biology 23 pp269278, 1969.
Evans, T. G. "Grammatical Inference Techniques in Pattern Analysis", in Software Engineering, Vol 2, ed. Tou, J. T., Academic Press, N.Y., pp183202.
Feldman, J. A., Gips, J., Horning, J. and Reder S. "Grammatical Complexity and Inference", A. I. Memo No, 89, Comp. Sci. Dept. Stanford Uni. Stanford, California, USA, 1969.
Feldman, J. "Some Decidability Results on Grammatical Inference and Complexity", Info. and Control, 20, pp244262, 1972.
Gaines, B. R. "Behaviour Structure Transformations under Uncertainty", Int. J. ManMachine Studies, 8, pp337365, 1976.
Hunt, E. B. Artificial Intelligence, Academic Press, N.Y., 1975.
Patrick, J. D. "An Information Measure Comparative Analysis of Megalithic Geometries", PhD thesis, Monash, 1978.
Salomaa, A., Formal Languages, Academic Press, N.Y., 1973.
Solomonoff, R. "A Formal Theory of Inductive Inference I & II", Info and Control, 7, pp122 and 224254, 1964.
Thom, A. Megalithic Sites in Britain, Oxford University Press, London, 1976.
Wallace, C. S. and Boulton D. M. "An Information Measure for Classification", Computer Journal 11(2) 1968.
Wallace, C. S. "An Invariant Bayes Method for Point Estimation", Class. Soc. Bull. 3(3), 1975.

[This completes TR32, the original printed copy running to 18 pages. Note, the original used the notation lg(x), rather than len(x) as here, for lengths.]
Some Further Reading:
9/1999: HTML and added mistakes by Lloyd Allison.
www: 
↑ © L. Allison, www.allisons.org/ll/ (or as otherwise indicated). Created with "vi (Linux)", charset=iso88591, fetched Saturday, 18Jan2020 20:43:35 EST. Free: Linux, Ubuntu operatingsys, OpenOffice officesuite, The GIMP ~photoshop, Firefox webbrowser, FlashBlock flash on/off. 