A General Objective for Inductive Inference
C. S. Wallace & M. P. Georgeff
1 1 -(V-1).(log(t/12)+1) - log(V-1)! - -.SUMi log pi 2 2
and that pi will be estimated by approximately (ni+1/2)/(t+V/2) where ni is the number of times the ith symbol is produced from this state. See Wallace and Boulton (1968) for the derivation.
The length of a behaviour is the sum over all transitions of minus the log of the probability pi of the transition.
In computing his "acceptable sets", Gaines was compelled to develop each PFSA of a given number of states which could produce D, and then for each PFSA estimate its transition probabilities from the transition counts in its behaviour. However, we are able to develop the best PFSA (by our measure) with less exhaustive, although still tedious search. If we consider a single state of the PFSA, the length of the (b) component of m specifying the transition probabilities of this state, plus the length of all Huffman code words in the control sentences specifying transitions from the state, is approximately
1 -(V-1).(log(t/12)+1) - log(V-1)! - SUMi (ni+1/2).log pi 2
This expression is closely approximated from below by
(t+V-1)! log ---------------- (V-1)! PRODi ni!
which is in fact an exact expression for the relevant parts of an explanation in which m specifies only the structure of the PFSA M, and in which the probability of a behaviour of M is computed by integrating over all possible transition probability distributions rather than by specifying in m some estimated transition probability distribution for each state (Boulton and Wallace 1969). The latter expression has the advantage that it can be computed incrementally while following the behaviour of a trial PFSA. It is also possible to compute the (c) components incrementally. Thus we are able to abandon a trial PFSA as soon as the incremental computed length of a partially-completed explanation exceeds the length of the shortest known explanation.
Consider the data string
where "/" is a delimiter.
(The delimiter is known a priori to return to the start state.)
Gaines finds PFSAs up to 7 states with of course increasingly
Some of the machines are shown in
Because there is a marked increase in probability
(decrease in length of control sentences
Figure 1(a) shows how the length of explanation of the best n-state machine varies with n over the range 1 to 7. It is clearly seen that the best theory, under our measure, is the 4-state machine. Thus in this case the quantitative measure proposed above has yielded the same best theory as Gaines' intuitions. It also corresponds to the grammar derived by Feldman, Gips, Horning and Reder (1969) and Evans (1971) using heuristic schemes.
Note that the length of the behaviour (i.e. the control sentences) of the 7-state machine is the minimum over all machines. Although the length of behaviour decreases up to the 7 state machine, an increase in the number of states thereafter produces no decrease in length.
Note also that neither the 1-state nor 7-state machines provide an explanation of the data string, as the data string may be written in a binary code with length 66.
In a number of other examples from Gaines we similarly find the machine preferred by Gaines, except in one case (his Figure 3) where we find a 2-state machine to be better than his 6-state machine.
Consider the problem of trying to fit one or more straight line segments to a set of data points in a two-dimensional field, where some points may be noise, and the number of line segments is unknown.
We suppose the data to comprise N coordinate points
(xp, yp) (p=1,...,N)
of points in a square field of size
The control sentence wp giving the data sentence xp yp then has two parts. The first declares point p to be either noise or a member of line `l'. If it is noise, the second part specifies xp, yp using a code word of length 2.N.log(R/epsilon). If it is a member of line `l', the second part specifies the position of the point by giving its position along the line segment and its distance from the line. The former assumes a uniform distribution of points along the line, the latter a Gaussian distribution.
In Figure 2 we show one such set of points (Figure 2a) together with the best single line theory (2b) and the best two line theory (2c), assuming a RMS scatter of line points of 0.3. The length of the data (that is, the "null" theory) is 317.7, the length of the one line explanation is 294.9, and two line explanation 298.7. The one line theory is preferred by our measure. The points treated as noise by the explanation are shown solid.
In Figure 3(a) we show another set of points. Again assuming an RMS scatter of 0.3, Figure 3(b) shows the best two-line theory, giving an explanation of length 309.0 vs a data string length again of 317.7. For this set there is no one-line explanation, the best one-line theory giving a length of 320.8.
The above approach may be compared with the technique used by Bolles and Fischler (1981) for model fitting in the presence of "gross errors". They first filter out the gross errors by proposing a model of the data on the basis that enough of the data points are close to the model to suggest their compatibility with it (i.e. their deviations are small enough to be measurement errors). The user has to specify a suitable threshold on the number of compatible points required. This threshold is somewhat arbitrary, and usually task dependent. Thus in the examples shown in Figures 2 and 3 a low threshold would select both the one line and two line theory for both sets of data, whereas a high threshold would select only the one line theory in both cases.
They then use some smoothing techniques (e.g. least squares), and finally evaluate the fit of model and data by testing, on the basis of some simple statistics, the randomness of the residuals. We would agree with this principle in the context of the fitting problem. However, rather than using a small suite of statistical tests for randomness in the residuals, our approach produces a string of residuals (the control sentence string) which has the greatest possible randomness in the sense that it admits of no more concise encoding. Our approach also provides a direct measure for the selection among competing theories.
Given a random sample from an unknown multivariate distribution, numerical taxonomy or classification techniques attempt to construct and test hypotheses that treat the population as comprising several distinct classes. It is usually assumed that the distribution of variables within a class has a relatively simple form. In an explanation based on a classification hypothesis, the sentence m states the number of classes, and then for each class gives its relative abundance and the distribution parameters for each variable. For each member of the sample, the corresponding control sentence w effectively specifies the class to which the member is supposed to belong, and then gives the variable values for that member using a code optimised for its class.
Search for a good explanation of this form can be conducted efficiently if the distribution forms assumed within each class are sufficiently regular. The Snob program (Wallace and Boulton 1968) uses a combination of analytic optimisation and heuristic tactics to search for that classification of multivariate sample which gives the shortest explanations. Samples of over 1000 things each having over 100 attributes can be classified in the order of an hour's computer time. [NB. written in 1983.] The program determines the number of classes without the use of any arbitrary thresholds or other criteria of fit or significance. It has been found to give useful results in a variety of fields including demography, psychiatric diagnosis, and zoology.
It has been suggested (Thom 1967) that the neolithic builders of the stone circles found in the British Isles used a precise unit of length and a number of families of geometric shapes based on integer Pythagorean triangles in their construction. Patrick (1978), using the same approach as described herein, constructed two explanations for survey data of a number of Irish stone circles. One explanation used Thom's theory, and the second used a theory that the circles were roughly circular and locally smooth, with no defined unit of length. His analysis showed that the Irish circles were better explained using the simpler theory, but that there was some evidence of a preference for bilateral symmetry. It should be noted that previous analyses using conventional statistical methods had been unable to arrive at any definite conclusion (Freeman, 1976; Boradbent, 1956).
In Section 5 we mentioned how the length of an explanation could be used to avoid exhaustive search while still guaranteeing that we find an optimal theory. However, in most cases where the set of possible theories M is large, such a technique is still too time consuming to be practical. In such situations analytic and heuristic techniques may be used to search for and improve good, but not necessarily optimal, theories.
One such technique is to use explanation length to select among competing sub-theories at some low level of abstraction, which in turn can form the basis (i.e., the `data') for theories at a higher level of abstraction. There is no guarantee that such an approach will lead to the best global theory, but it is reasonable to expect in most natural domains that the resulting global theory will at least be near-optimal.
For example, in object-recognition (vision), the approach of section 6 could first be used to find straight-line segments in a set of points. The resulting lines can then be further `explained' by a higher-level theory which described the lines in terms of shapes or objects. Again, the length of explanation could be used to choose among possible higher-level descriptions. This technique of composing theories is common in natural sciences.
A later version of Snob (Boulton and Wallace, 1973) illustrates such composition of theories. In this version, the hypothesied class structure is itself "explained" using a higher-level theory, viz an hypothesized hierarchic tree in which each class is described by successive dichotomy of higher-level classes. The sentence m specifying the classes may then be regarded as itself an explanation, in which the first part specifies a number of unrelated root classes, and the second part is a string of control sentences each specifying a tree of dichotomies by which a root class is elaborated into a set of terminal classes.
As Feldman has remarked (Feldman, 1972) the search for a general method of choosing the `best' theory or model to account for given data is beset by philosophical and practical difficulties. Our contribution to this problem includes three features.
First, the introduction of the length of the control string for a probabilistic Turing machine provides a measure of the degree of fit between a theory (the PTM) and the data D. In sufficiently regular cases, this measure is equivalent to the log likelihood function, but it generalises to cases in which the likelihood function may not be computable.
Second, we have introduced a measure of the complexity of a model or theory (the length of the sentence m) which is compatible with the measure of fit. Other measures of fit and model complexity have been proposed, but as Feldman observes, it is not clear how they may or should be combined to provide an overall criterion of choice among competing theories.
Because our measures are compatible, we may simply add them to give a measure (the length of explanation) which can be used to choose the best of a set of theories, and is open to theoretical analysis. Moreover, our measure provides an absolute criterion for rejecting theories. For instance, we can show that any theory accepted by our measure is necessarily falsifiable. That is, if a theory M gives an explanation for some data D, there exists a data string C such that M cannot explain DC. Compatibility also allows us to deal naturally with the composition of theories at different levels of abstraction.
Third, with the introduction of the language L1, we have clarified the role and construction of prior probability distributions over families of models. The language provides a natural framework in which prior knowledge or expectations about simple features or subsystems of a model may be combined to define a prior distribution over a complex family. It also defines the otherwise undefined Universal Turing machine assumed in some other attempts to describe induction via computational complexity (e.g. Solomonoff (1964)). Further, this framework makes it clear that any reasonable assignment of prior probabilities to models is dominated by the relative complexity of the models rather than by subjective expectations.
An important aspect of L1 is that it allows us to handle continua of models. Previous proposals to use Bayesian inference (e.g. Hunt (1975)) have failed to provide a general measure for inductive inference because they could not cope with continua.
Boulton, D. M. and Wallace, C. S. "An Information Measure for Hierarchical Classification", Computer Journal 16(3) 1973.
Boulton, D. M. and Wallace, C. S. "The Information Content of a Multistate Distribution", J. Theor. Biology 23 pp269-278, 1969.
Evans, T. G. "Grammatical Inference Techniques in Pattern Analysis", in Software Engineering, Vol 2, ed. Tou, J. T., Academic Press, N.Y., pp183-202.
Feldman, J. A., Gips, J., Horning, J. and Reder S. "Grammatical Complexity and Inference", A. I. Memo No, 89, Comp. Sci. Dept. Stanford Uni. Stanford, California, USA, 1969.
Feldman, J. "Some Decidability Results on Grammatical Inference and Complexity", Info. and Control, 20, pp244-262, 1972.
Gaines, B. R. "Behaviour Structure Transformations under Uncertainty", Int. J. Man-Machine Studies, 8, pp337-365, 1976.
Hunt, E. B. Artificial Intelligence, Academic Press, N.Y., 1975.
Patrick, J. D. "An Information Measure Comparative Analysis of Megalithic Geometries", PhD thesis, Monash, 1978.
Salomaa, A., Formal Languages, Academic Press, N.Y., 1973.
Solomonoff, R. "A Formal Theory of Inductive Inference I & II", Info and Control, 7, pp1-22 and 224-254, 1964.
Thom, A. Megalithic Sites in Britain, Oxford University Press, London, 1976.
Wallace, C. S. and Boulton D. M. "An Information Measure for Classification", Computer Journal 11(2) 1968.
Wallace, C. S. "An Invariant Bayes Method for Point Estimation", Class. Soc. Bull. 3(3), 1975.
[This completes TR32, the original printed copy running to 18 pages. Note, the original used the notation lg(x), rather than len(x) as here, for lengths.]
Some Further Reading:
9/1999: HTML and added mistakes by Lloyd Allison.
↑ © L. Allison, www.allisons.org/ll/ (or as otherwise indicated).
Created with "vi (Linux)", charset=iso-8859-1, fetched Friday, 23-Aug-2019 08:30:19 EDT.
Free: Linux, Ubuntu operating-sys, OpenOffice office-suite, The GIMP ~photoshop, Firefox web-browser, FlashBlock flash on/off.