
Computer Journal, Vol 11, No 2, August 1968, pp 185-194

An information measure for classification.

By C. S. Wallace* and D. M. Boulton*

This paper derives a measure of the goodness of a classification based on information theory. A classification is regarded as a method of economical statistical encoding of the available attribute information.

The measure may be used to compare the relative goodness of classifications produced by different methods or as the basis of a classification procedure.

A classification program, `SNOB', has been written for the University of Sydney KDF 9 computer, and first tests show good agreement with conventional taxonomy.

(First received December 1967)

1. Introduction

In all fields of discourse, the basic objects of concern are classified, and names given to the classes, to enable us to make general statements whose meaning applies to many objects rather than to a single object. For such a classification to be useful, the objects within a single class must essentially be equivalent at some level of discourse. The problem of generating a useful classification, exemplified by taxonomy, may be stated as follows:

Given a set of S things and for each a set of D measurements (attributes), to form a partition of the set of things, or, equivalently, a partition of the D-dimensioned measurement space within which each thing may be represented by a point, such that the things within each subset, or region of measurement space, may usefully be treated as equivalent in some discussion.

Many classification processes have been devised in answer to this problem (Sokal and Sneath, 1963; Williams and Dale, 1965). These methods have usually been directed towards producing classes such that members of the same class are as `similar' as possible and/or members of different classes are as `dissimilar' as possible. Such aims, while not necessarily equivalent to the general aim described above, can obviously be expected in practice to produce classifications which well serve the general aim. Unfortunately, the different measures of similarity between things and between classes of things which have been used in these processes result in significantly different classifications, and it is usually left to the user to choose that method which produces the most useful result. Moreover, it is difficult in many of these processes to separate a measure of the success of a classification from the process used to generate it. There is no readily applicable objective criterion firmly based on the original aim of the classification which can be used to compare the relative success of different processes.

The aim in this paper is to propose a measure of the goodness of a classification, based on information theory, which is completely independent of the process used to generate the classification.

2. The information measure

A classification may be regarded as a method of representing more briefly the information contained in the S x D attribute measurements.

These measurements contain a certain amount of information which without classification can be recorded directly as S lists of the D attribute values. If the things are now classified then the measurements can be recorded by listing the following:

1. The class to which each thing belongs.
2. The average properties of each class.
3. The deviations of each thing from the average properties of its parent class.

If the things are found to be concentrated in a small area of the region of each class in the measurement space then the deviations will be small, and with reference to the average class properties most of the information about a thing is given by naming the class to which it belongs. In this case the information may be recorded much more briefly than if a classification had not been used. We suggest that the best classification is that which results in the briefest recording of all the attribute information.

In this context, we will regard the measurements of each thing as being a message about that thing. Shannon (1948) showed that where messages may be regarded as each nominating the occurrence of a particular event among a universe of possible events, the information needed to record a series of such messages is minimised if the messages are encoded so that the length of each message is proportional to minus the logarithm of the relative frequency of occurrence of the event which it nominates. The information required is greatest when all frequencies are equal.

The messages here nominate the positions in measurement space of the S points representing the attributes of the things. If the expected density of points in the measurement space is everywhere uniform, the positions of the points cannot be encoded more briefly than by a simple list of the measured values. However, if the expected density is markedly non-uniform, application of Shannon's theorem will allow a reduction in the total message length by using a brief encoding for positions in those regions of high expected density.

If we consider the universe of all possible sets of D experimental measurements, we can form no reasonable expectation of the expected density in measurement space. However, if the observed distribution of the S given points is markedly non-uniform, the economies of Shannon's theorem can be realised in part by adding to the S messages a further message which approximately describes the average density distribution of the S points, and using this approximate non-uniform distribution as the basis for encoding the attribute messages. The message length needed to describe the average density distribution must be considered as adding to the lengths of the messages now used to encode the attributes of the things, because the receiver of the composite message has no a priori knowledge of the distribution.

The greater the detail given about the distribution, the more efficiently may the attributes of individual things be encoded. The `best' classification will result from an optimum compromise between the efficiency of attribute encoding and the length of message needed to specify the distribution.

We regard a classification as one means, among many, of stating and exploiting a non-uniform density distribution in measurement space. For instance, factor analysis may perhaps be regarded as a technique for describing and exploiting a concentration of the points near a flat subspace. A classification sets up a model of the distribution in which the space is divided, not necessarily by planes, into regions (classes), the distribution within each region being described by a separate, simple, distribution function centred on an area of high density.

A thing within a particular region has its attributes encoded on the basis of the density, in its neighbourhood, of the distribution function for that region. That is, it is regarded as a member of the `class' corresponding to the region, and the message describing it is encoded on the basis of the expected distribution of attribute values for members of that class, and contains the name of that class.

The S x D measurements are thus encoded into a message having five component parts:

1. the number of classes;
2. a dictionary of class names;
3. a description of the distribution function for each class;
4. for each thing, the name of the class to which it belongs;
5. for each thing, its attribute values in the code set up for its class

We assume already known, and not included in the message, the following:

(a) The number of things S.
(b) The number of attributes D.
(c) The nature (discrete, continuous, etc.) of each attribute.

We now proceed to discuss the form of the component messages 1 to 5 above.

3. Class identification format

3.1 Number of classes

We prefer to make no assumption about the expected frequency of different numbers of classes, and will therefore encode the number of classes on the basis that all values within a range of, say, 1 to 100 are equally likely. The message length is therefore constant, and this component will henceforth be ignored.

3.2 Dictionary of class names

In the message about each thing is included a class name or label to identify the density distribution used to encode its attributes. To keep the total length of these labels as short as possible, the label for each class will be chosen with a length dependent on the relative abundance of that class. The labels are therefore not known a priori and must be described in a dictionary.

If there are n[t] members of class t (t = 1,2 ..., T) then the label used in the description of a thing to say that it belongs to class t will occur n[t] times in the total message. If we estimate the relative frequency of class t as

p[t]     ( SUM[t] p[t] = 1 )

then the information needed to quote the class membership of all things is

      T
  - SUM  n[t] ln p[t]                                    (1)
     t=1

The choice of p[t] values which minimises (1) is

p[t] = n[t]/S

However, the p[t] are not known a priori and hence neither are the labels. It is shown in the appendix that when the information needed to describe the labels (class name dictionary) is taken into account, the overall information cost is minimised by tolerating an error of order 1/sqrt(S) in a choice of p[t], thus reducing the information needed for the class name dictionary, but causing the value of (1) to be an average slightly above its optimum. With the optimum choice of accuracy in selecting the p[t], the information needed by the class name dictionary is found in the appendix to be approximately

(T-1)     S                  1    T
----- ln --  -  ln(T-1)!  -  - SUM  ln p[t]              (1a)
  2      12                  2   t=1

The expected additional cost due to the non-optimum selection of p[t] is shown in equations (22) and (23) of the appendix to be about

(T-1)/2                                                  (1b)
The total length of class identification labels and class name dictionary is given from (1), (1a) and (1b) as

T-1                              T
---(ln(S/12)+1) - ln(T-1)! - SUM (n[t]+1/2)ln(p[t])      (2)
 2                              t=1

We estimate p[t] by n[t]/S.
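As a sketch (our own Python, with hypothetical class sizes), the combined label-plus-dictionary length (2), with p[t] estimated as n[t]/S, can be computed directly; note that ln(T-1)! is available as lgamma(T):

```python
import math

def label_cost(counts):
    """Eqn (2): length in nats of all class labels plus the class-name
    dictionary, estimating p[t] = n[t]/S (all classes assumed non-empty)."""
    S, T = sum(counts), len(counts)
    dictionary = (T - 1) / 2 * (math.log(S / 12) + 1) - math.lgamma(T)  # lgamma(T) = ln (T-1)!
    labels = -sum((n + 0.5) * math.log(n / S) for n in counts)
    return dictionary + labels

# for the same S and T, unequal classes are cheaper to label than equal ones
assert label_cost([90, 10]) < label_cost([50, 50])
```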

4. The class distribution model

We here treat only two kinds of attribute, continuous measurements (e.g. length) and unordered multistate (e.g. colour or sex). The treatment of ordered multistate data will be deferred to another communication, as it appears to warrant rather more discussion than is appropriate here.

In common with most workers in the field, we presume that within a single class, attributes are uncorrelated. The measure can, however, be extended to accommodate correlations, and we hope in another paper to show that considerable advantages might follow from such an extension.

Having assumed no correlation, the distribution within a class can be separated into independent distributions for each attribute. In calculating the information required for recording attribute values, we assume for simplicity that missing values are noted by a message element of fixed length, independent of the classification. Their contribution to the total message length can then be ignored.

4.1 Multistate attributes

For each class, the distribution of a multistate attribute is specified by the relative probability of occurrence of each state within the class. A label is created to nominate each state of the attribute, and the value of this attribute is specified for members of this class by including in their attribute messages the label nominating the appropriate state. Different labels will in general be used in different classes to nominate the same state of the same attribute.

In order that the labels used within a class may be recognised, the description of the class itself must include a description of the labels used for each state.

The considerations applying to the choice of state labels and their description are exactly analogous to those applying to the choice of class labels and the class name dictionary. However, if the number of things in class t having value m for multistate attribute d is n[m,d,t], the relative frequency of occurrence of state m of attribute d in class t will be estimated by

p[m,d,t] = (n[m,d,t] + 1)/(n[d,t] + M[d])                (3)

where M[d] is the number of states possible for attribute d, and n[d,t] is the number of members of class t having known values for attribute d, i.e.

SUM[m]  n[m,d,t]

If some values are missing, n[d,t] < n[t].

The slight bias towards (1/M[d]) introduced by (3) into the estimate of p[m,d,t] has the useful effect that newly-measured things not in the original sample of S are not absolutely prevented from being assigned to class t by having a value for attribute d not possessed by any existing member of the class. Also, our approximate estimate of total message length (eqn. (4) below) will diverge if any p[m,d,t] is estimated as zero. Biased estimates of the form (3) are discussed by Good (1965).
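A minimal sketch of estimate (3) (our own Python; the counts are invented): the estimate stays strictly positive even for a state no current member exhibits, and the estimates for one attribute still sum to one:

```python
def state_prob(n_mdt, n_dt, M_d):
    """Eqn (3): biased estimate of the relative frequency of one state,
    shrunk slightly towards 1/M_d and never exactly zero."""
    return (n_mdt + 1) / (n_dt + M_d)

# a class with 20 known values of a 3-state attribute, state counts 12/8/0
probs = [state_prob(c, 20, 3) for c in (12, 8, 0)]
assert abs(sum(probs) - 1) < 1e-12    # still a proper distribution
assert probs[2] > 0                   # unseen state not absolutely excluded
```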

The total information needed to encode values of attribute d in class t, including the description of the labels used within the class, is given by the expression, analogous to (2):

(M[d]-1)                                    M[d]
--------(ln(n[d,t]/12)+1) - ln(M[d]-1)! - SUM (n[m,d,t]+1/2)ln p[m,d,t]     (4)
    2                                       m=1

4.2 Continuous attributes

We assume that within a class the values of a continuous attribute d are normally distributed with mean mu[d,t] and standard deviation sigma[d,t]. Other forms of distribution could be used without any difficulty, in principle. In fact, the classification could allow a selection of forms, in which case further class description messages would be needed to specify the form employed.

To simplify the following discussion, d and t subscripts will be dropped, as we will be referring for the rest of Section 4.2 to the distribution of a single continuous attribute within a single class.

It is assumed that the range of measurement for the population as a whole is known a priori as is the accuracy of measurement. If a measurement x[s] is quoted to a least count e, i.e. to an accuracy of +/-(e/2) then the probability of getting such a measurement from the distribution (mu,sigma) is approximately

(K e/sigma) exp(-(x[s] - mu)^2/(2 sigma^2))              (5)

where K = 1/sqrt(2 pi).

The information used to encode the n = n[t] such measurements is therefore

SUM[s in t] [ln(sigma/(K e)) + (x[s] - mu)^2/(2 sigma^2)]     (6)

Here, the values mu and sigma are the values quoted in the description of the class and used as the basis for the encoding of the measurements, and will not in general be fully accurate, because quotation to full accuracy would require an excessive amount of information.

Suppose that the nominal mean mu is quoted to a least count of b, and hence does not exactly equal the true mean

    1    n
m = - SUM (x[s])
    n   s=1

Then if the mean is assumed to be in a range of size a, but no a priori assumption of its probability distribution within this range is made, the information required to encode mu is

ln(a/b)                                                  (7)

Writing f = mu - m, f may be assumed to be uniformly distributed over a range -b/2 to +b/2, and

Expectation f^2 = (1/3)(b/2)^2 = b^2/12                  (8)

Likewise, suppose that the nominal standard deviation sigma is quoted to a least count of c within a range 0 to p. The information required for its encoding is

ln(p/c)                                                  (9)

The total information requirement for attribute d in class t is found by summing (6), (7) and (9).

I = ln(p/c) + ln(a/b) + SUM[s=1..n] [ln(sigma/(K e)) + (x[s] - mu)^2/(2 sigma^2)]

  = ln(p a/(c b)) + n ln(sigma/(K e)) + SUM[s=1..n] (x[s] - m - f)^2/(2 sigma^2)

  = ln(p a/(c b)) + n ln(sigma/(K e)) + (n z^2 + n f^2)/(2 sigma^2)     (10)

where z is the true sample standard deviation, and

n z^2 = SUM[s=1..n] (x[s] - m)^2

We now find the optimum value of b.

The expectation value E(I) of I may be found by replacing f^2 by its expectation b^2/12. The parts of E(I) dependent on b are then found from (10) to be

-ln b + n b^2/(24 sigma^2)                               (11)

Differentiating (11) with respect to b gives for the best b

b = sigma sqrt(12/n)                                     (12)

Substituting (12) in (10) and putting f^2 = b^2/12 = sigma^2/n gives

E(I) = ln(p/c) + ln((a/sigma) sqrt(n/12)) + n ln(sigma/(K e)) + n z^2/(2 sigma^2) + 1/2     (13)

We now determine the optimum value of sigma.

The parts of (13) dependent on sigma are

-ln sigma + n ln sigma + n z^2/(2 sigma^2)

  = (n-1) ln sigma + n z^2/(2 sigma^2)                   (14)

Differentiating (14) with respect to sigma gives for the best sigma

sigma = z sqrt(n/(n-1))                                  (15)

Now sigma is quoted to a least count c, and hence will in general differ from the value given by (15). Write

sigma = z sqrt(n/(n-1)) + g                              (16)

where |g| < c/2 and the expectation of g^2 is c^2/12.

We now determine the optimum value of c. The parts of (13) depending on c or sigma are

-ln c + (n-1) ln sigma + n z^2/(2 sigma^2)               (17)

Before substituting (16) in (17) it will be simpler to rewrite (16) as

sigma = w(1+h)                                           (18)

where

w = z sqrt(n/(n-1))                                      (19)

and

h = g/w                                                  (20)

Substituting (18) in (17) gives

-ln c + (n-1) ln w + (n-1) ln(1+h) + n z^2/(2 w^2 (1+h)^2)     (21)

The parts of (21) dependent on the choice of c are

-ln c + (n-1) ln(1+h) + n z^2/(2 w^2 (1+h)^2)            (22)

Since h, the fractional error in quoting sigma, is expected to be small, we expand (22) in powers of h up to h^2, to give

-ln c + (n-1)(h - h^2/2) + n z^2 (1 - 2h + 3h^2)/(2 w^2) + ...

  = -ln c + h [(n-1) - n z^2/w^2] + h^2 [-(n-1)/2 + 3 n z^2/(2 w^2)]     (23)

From (19), we have n z^2/w^2 = n-1, and so (23) becomes

-ln c + h^2 [-(n-1)/2 + 3(n-1)/2]

  = -ln c + h^2 (n-1)                                    (24)

Now, h = g/w and the expectation of g^2 is c^2/12. The expectation of (24) is therefore

-ln c + (n-1) c^2/(12 w^2)                               (25)

Differentiating (25) with respect to c gives for the best c

c = w sqrt(6/(n-1))                                      (26)

With this choice of c, the expected value of h^2 is

c^2/(12 w^2) = 1/(2(n-1))                                (27)
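The same numerical check works for c. This sketch (our own Python; n and w are arbitrary) evaluates the c-dependent expected parts of E(I), eqn (25), and confirms both the optimum of eqn (26) and the expected squared fractional error of eqn (27):

```python
import math

def c_cost(c, n, w):
    """The c-dependent expected parts of E(I), eqn (25):
    -ln c + (n-1) c^2 / (12 w^2)."""
    return -math.log(c) + (n - 1) * c * c / (12 * w * w)

n, w = 50, 2.0
c_opt = w * math.sqrt(6 / (n - 1))    # eqn (26)
for c in (0.5 * c_opt, 2.0 * c_opt):
    assert c_cost(c_opt, n, w) < c_cost(c, n, w)

# eqn (27): the resulting expected value of h^2
assert abs(c_opt ** 2 / (12 * w * w) - 1 / (2 * (n - 1))) < 1e-12
```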

The expected value for I with the optimum choice of b, sigma and c is found by substitution in (13)

E(I) = ln(p/(w sqrt(6/(n-1)))) + ln((a/(w(1+h))) sqrt(n/12))
       + n ln(w(1+h)/(K e)) + n z^2/(2 w^2 (1+h)^2) + 1/2

     = ln((a p/w^2) sqrt(n(n-1)/72)) + n ln(w/(K e)) + (n-1) ln(1+h)
       + (n-1)/(2(1+h)^2) + 1/2                          (28)

Expanding (28) in powers of h up to h^2 gives

E(I) = ln((a p/w^2) sqrt(n(n-1)/72)) + n ln(w/(K e)) + (n-1) h^2 + n/2 + O(h^3)

Substituting the expectation of h^2 from (27) gives

E(I) = ln((a p/w^2) sqrt(n(n-1)/72)) + n (ln(w/(K e)) + 1/2) + 1/2 + O(h^3)

     = ln((a/w) sqrt(n/12)) + ln((p/w) sqrt((n-1)/6))
       + n (ln(w/(K e)) + 1/2) + 1/2 + O(h^3)            (29)


In the latter form, the three major terms of (29) show respectively the information required to encode the mean, standard deviation and individual variations about the mean.

In practice we take the ranges within which the mean and standard deviation are quoted to be 4 sigma[p] and sigma[p] respectively, where sigma[p] is the standard deviation of the population as a whole.

The form (29) is an approximation based on the assumption of a large class population, and requires some modification for classes having very few members.
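As a sketch of how (29) behaves (our own Python; the data, ranges a and p, and least count e are invented, and the constant 1/2 and O(h^3) terms are dropped), a tightly clustered attribute costs fewer nats to record than a widely spread one with the same number of members:

```python
import math

def normal_attr_length(xs, e, a, p):
    """The three major terms of eqn (29) for one continuous attribute in one
    class: cost of the mean, of the standard deviation, and of the n deviations."""
    n = len(xs)
    m = sum(xs) / n
    w = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    K = 1 / math.sqrt(2 * math.pi)
    return (math.log((a / w) * math.sqrt(n / 12))
            + math.log((p / w) * math.sqrt((n - 1) / 6))
            + n * (math.log(w / (K * e)) + 0.5))

tight = [10 + 0.01 * i for i in range(30)]   # concentrated class
wide = [10 + 1.0 * i for i in range(30)]     # diffuse class, same n
assert normal_attr_length(tight, 0.001, 100, 50) < normal_attr_length(wide, 0.001, 100, 50)
```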

4.3 Notes on class descriptions

Forms (4) and (29) include the lengths of messages needed to specify the distributions of attributes within each class. As written, these forms assume that each class is described independently of all others. That is, the classes are not embedded in an hierarchic structure. Further economies can in some cases be achieved by describing the class distributions in terms of classes of classes, etc., but as the derivation of the resulting message lengths is rather tedious, it will be deferred to a later communication.

It might be thought an unnecessary extravagance to quote a different standard deviation for the same attribute in different classes. However, we have found, e.g. in the classification of six species of fur seals, that significant differences of standard deviation can occur between species. Further, there may be instances in which a difference in standard deviation is a significant aid in distinguishing two classes. Where no significant difference in standard deviation exists between two rather similar classes, an hierarchic description of their attributes could well allow the same value to be used for both classes.

5 Probabilistic interpretation

Combining the results (2), (4) and (29), we see that the total message can be separated into two major parts, one concerned with the dictionary of class names and the description of class properties, and of length at most logarithmically dependent on S, the other comprising a message for each thing giving its class and attributes.

The latter, which will dominate the total length for large S, can be written, for a member s of class t, as


- ln p[t]

- SUM[d multistate] ln[(n[x[d,s],d,t] + 1)/(n[d,t] + M[d])]

+ SUM[d continuous] [ln(sigma[d,t]/(K e)) + (x[d,s] - mu[d,t])^2/(2 sigma[d,t]^2)]     (30)

where x[d,s] is the value (discrete or continuous) of attribute d of the thing.

If the set of S things may legitimately be treated as a sample from a much larger population, then the first term above can be regarded as minus the logarithm of the probability that an arbitrary thing belongs to class t, where a slightly biased estimate of this probability has been used. The second and third terms can be regarded as minus the logarithm of the probability that an arbitrary member of class t would have the attribute values found for thing s, where again slightly biased estimates have been used.

The value of (30) is therefore minus the logarithm of the probability that an arbitrary member of the whole population would be found to be in class t and to have the measurements x[d,s]. That is, (30) is minus the logarithm of the estimated population density in measurement space of class t in the neighbourhood of s. A classification which minimises the information measure will therefore assign each thing to the class with the highest expected density in the neighbourhood of the thing. The classes will not overlap in measurement space, and so (30) will represent minus the logarithm of the probability that an arbitrary member of the population would have measurements x[d,s].
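A sketch of this assignment rule for a single continuous attribute (our own Python; the class abundances and parameters are invented): each candidate class is scored by its value of (30), and the thing goes to the class with the smallest score, i.e. the highest estimated density:

```python
import math

K = 1 / math.sqrt(2 * math.pi)

def length_30(x, n_t, mu, sigma, S, e):
    """Value of (30) for one continuous attribute: the class-name term
    -ln(n_t/S) plus the attribute-encoding term."""
    return (-math.log(n_t / S)
            + math.log(sigma / (K * e))
            + (x - mu) ** 2 / (2 * sigma ** 2))

S, e = 100, 0.1
classes = {"small": (60, 5.0, 1.0), "large": (40, 12.0, 2.0)}  # n_t, mu, sigma
best = min(classes, key=lambda t: length_30(6.0, *classes[t], S, e))
assert best == "small"   # the denser class in the neighbourhood of x = 6.0
```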

The total length of all attribute messages, found by summing (30) over all s, represents minus the logarithm of the probability of obtaining all the observed measurements given the following complex hypothesis about the set of S things:

(a) The set of S things is a sample drawn from a population which is the union of T subpopulations.
(b) Each subpopulation t occurs with a certain relative abundance, estimated by n[t]/S.
(c) Within each subpopulation t, the measurements x[d,s] are distributed normally (continuous attributes) or multinomially (multistate attributes), with parameters as estimated in Sections 4.2 and 4.1.
(d) Within each subpopulation, different measurements are uncorrelated.
(e) Each thing is a member of a particular subpopulation.

Minimisation of (30) by choice of the class name labels, class distribution parameters and assignment of things to classes is therefore equivalent to a maximum likelihood estimation of the relative abundance, distribution parameters, and membership of the subpopulations. However, the number of classes T could not be so estimated, as the value of (30) decreases monotonically with increasing number of classes.

The length of the class name dictionary and class description parts of the total message, which is not included in (30), can now be seen to play the role of minus the logarithm of the a priori probability of the hypothesis elements (a) to (c). Minimisation of the total message length is equivalent to choosing the hypothesis to maximise the posterior probability including these a priori factors.

The inclusion of any form of prior probability function may be considered objectionable on the grounds that its form is unknowable. However, it should be noted that, if the prior probability that some parameter lies in the neighbourhood of a value x is written as

f(x) dx

the form employed by us corresponds to the assumption that f(x) is uniform over a range based on the spread of values observed in the whole set. The important parts of our form arise from the factor dx, not from the shape of f(x).

The effect of including a priori factors on the estimates of subpopulation distribution parameters is to introduce a small bias into the estimates of continuous standard deviations and multistate probabilities. The normal maximum likelihood estimate of standard deviation

sqrt( SUM (x[i] - xbar)^2 / n )

becomes from (15)

sqrt( SUM (x[i] - xbar)^2 / (n-1) )

which is actually a less biased estimate. Likewise, the encoding system derived in the appendix for the class description of multistate attributes effectively yields estimates of the relative probabilities of occupation of different states slightly different from their observed relative frequency of occupation. Although we have not investigated this question in detail, it appears that the estimates are on average biased slightly towards equality of probability in all states. The bias appears to be similar in form to, although rather less than, the bias we deliberately introduce for other reasons in (3).

The relative effects of the above biases decrease with increasing S, but the inclusion of the class description message length has the important result that the total message length exhibits a minimum with respect to T, the number of classes. Thus, minimisation of the information measure produces a classification having a number of classes determined by the measure itself, and not by the introduction of an arbitrary stopping rule such as is used in association analysis (Lance and Williams, 1965), or by the arbitrary selection of a certain similarity level on a dendrogram.

As S increases, the decreasing relative importance of the class description information makes it more likely that weakly separated classes will be distinguished in the classification which minimises the information measure. In effect, the increasing information content of the data makes the weak difference between the classes become significant.

6 Discussion

There are obvious similarities between the information measure and methods employed by others. For instance, the information statistic of Lance and Williams (1966) gives a measure between two groups of things which is the information needed to encode their (discrete) attributes on a code based on the union of the groups minus the information needed using different codes for each group. However, it does not include the information needed to specify the class membership of each thing, nor that needed to specify the codes.

The method of cluster analysis based on centroid sorting with a squared Euclidean distance similarity coefficient, in so far as it leads to a classification which minimises the total over classes of squared distances from class centroids, may be regarded as a maximum likelihood estimation of class membership and centroid positions of the class distributions, using the simplifying assumptions of:

(a) uncorrelated normal distribution of each attribute within each class;
(b) distribution standard deviations equal over all attributes and classes;
(c) all classes of equal abundance.
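This equivalence is easy to verify in one dimension (our own Python; the numbers are invented): when every class has the same sigma and abundance, ranking classes by minus log likelihood is identical to ranking by squared distance to the class centroid:

```python
import math

def neg_log_lik(x, mu, sigma):
    """Minus log of a normal density centred on the class centroid mu."""
    return math.log(sigma * math.sqrt(2 * math.pi)) + (x - mu) ** 2 / (2 * sigma ** 2)

sigma = 1.5                      # common to all classes, per assumption (b)
centroids = [0.0, 3.0, 10.0]
for x in (-2.0, 4.0, 8.5):
    by_lik = min(centroids, key=lambda mu: neg_log_lik(x, mu, sigma))
    by_dist = min(centroids, key=lambda mu: (x - mu) ** 2)
    assert by_lik == by_dist     # same assignment under either criterion
```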

A classification which minimises the information measure will assign each thing to the class having the highest estimated population density in the neighbourhood of the thing. This assignment is that suggested by Sebestyen (1962) in his Bayesian approach to pattern recognition.

Most previous numerical classification systems have been based on one sort or another of inter-individual similarity measure. The wealth of measures proposed in the literature suggests that the notion of similarity has proved difficult to quantify, and difficult to relate to rigorous statistical concepts. It is interesting to note that in a recent comparison of several different similarity coefficients and clustering strategies applied to species of the genus Stylosanthes, L. 't Mannetje (1967) found the best correspondence with accepted taxonomy to be achieved by a coefficient and strategy which are purely empirical, and have no obvious basis in statistical or probability theory.

The information measure does not involve any attempt to measure similarity, either between individuals or between classes. Rather, it is based on comparisons between an individual, on the one hand, and a class on the other, and uses the well-defined concept of the probability that a member of a class of known distribution would be found to have certain measured attributes. The question of how in a given context one can best measure the similarity between two individuals is therefore replaced by the question of how in a given context the distribution of a class can best be described. We have in this paper tried to use the most colourless possible class distribution descriptions. However, in a particular context, such as the classification of species, one might well be able, by appeal to the theory of evolution which is supposed to account for interspecific variation, to find a form of class distribution which more nearly models the observed distribution than do our assumed forms. The form of distribution in measurement space to be expected of a class in a particular context appears to us to be more amenable to systematic investigation than does the form of similarity measure, if any, appropriate to that context.

7 Results

We have not as yet applied the information measure to classifications arrived at by other numerical methods. We have, however, written a program for the University of Sydney KDF 9 computer which attempts to minimise the information measure. The only data body so far submitted to the program for which there existed an established classification is a set of some 30 measurements of 217 seal skulls, the sample comprising six species. (See Table 1.)

Table 1:
Description of seal skull sample

Species                     Males   Females
A. tropicalis tropicalis      15        9
A. australis                  31        3
A. tropicalis gazella         60       18
A. pusillus                   13        6
A. doriferus                  34       10
A. forsteri                   16        2

The program did not have access to either the sex of the skulls or the location of capture.

The program distinguished seven classes which corresponded fairly closely to those male and female groups of the six species which were represented by a sample of reasonable size. (See Table 2).

There are three main exceptions: first, the two species A. pusillus and A. doriferus were indistinguishable to the program. The only division was on the basis of sex. This indistinguishability is apparent from an inspection of the skulls themselves. It is difficult to note any difference for either sex between skulls from the two species. Further, an inspection of the means and standard deviations of the measurements shows that the differences between pairs of means of the same measurement for a particular sex are seldom greater than about 0.3 standard deviations.

Secondly, half of the group of 15 A. tropicalis tropicalis males were grouped with the females. These seven males are the younger members of the group and tend to look more like females. This same fact of age could explain a number of other small misclassifications. The four A. pusillus and six A. doriferus males grouped with the A. australis males are both collections of the younger members of their respective groups, and likewise for the three A. forsteri males misclassified.

Thirdly, the grouping of the A. tropicalis tropicalis and A. forsteri males is probably due to a combination of too small sample size and notable similarity between the two groups. Their means for the same measurement seldom differ by more than about 0.5 standard deviations.

8. Acknowledgements

This work was carried out in the Basser Computing Department, School of Physics, University of Sydney. We are grateful for the use of the KDF 9 computer and other facilities of the School of Physics. One of the authors (D. M. Boulton) was engaged on this work whilst in receipt of a University of Sydney Post-Graduate Studentship. We are indebted to Miss Judith King of the British Museum, who supplied us with all the seal skull measurements, and also to Mrs. E. Vincent for her time and patience in typing this paper.

Table 2
Program versus zoological classification for six species of seal
(column order, males then females: A trop. tropicalis, A australis, A trop. gazella, A pusillus, A doriferus, A forsteri)

PROGRAM           MALES                           FEMALES
CLASS    t.t. aus. gaz. pus. dor. for.   t.t. aus. gaz. pus. dor. for.
  4        7    7                  12
  3            23         4    6    3
  6        1    1   59
  1                       9   28
  7        7                               8                         1
  2                                        1        17
  5                                             2    1    5   10    1

9. Appendix

The optimum encoding of multistate data.

The statement of the class of each thing and the statement of values of multistate attributes present the same problem of how to optimise the labels used to encode this information and how to optimise the description of the label set employed.

Given a set of N things, we wish to produce a message saying in which of M states each thing lies. This message may be regarded as the concatenation of N `labels', one per thing, each `label' being a code name for the state of the corresponding thing. For the M names of the M states to be distinguishable from one another, the length (in units of log_2(e) bits, i.e. natural units) of the name for the mth state must be -ln p[m], where

SUM_m p[m] <= 1                                            (1)

Maximum economy is achieved by choosing the p[m] to satisfy the equality sign in (1).

If a number r[m]·N of the things have state m, the total length of the message giving all N states is

- N SUM_m r[m] ln p[m]                                     (2)

Expression (2) is minimised by choosing p[m] = r[m]. However, the receiver of the message has no a priori knowledge of r[m], and cannot deduce from the message the labels employed. Thus he is unable to decode the message as it stands.
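As a numerical illustration of expression (2) and of the optimality of the choice p[m] = r[m], consider the following sketch; the frequencies and sample size are invented for the example:

```python
import math

def message_length(r, p, N):
    """Total length, in natural units, of a message giving the states of N
    things with state frequencies r[m], using labels of length -ln p[m]."""
    return -N * sum(rm * math.log(pm) for rm, pm in zip(r, p))

# Invented frequencies for a 3-state attribute over N = 100 things.
r = [0.5, 0.3, 0.2]
N = 100

# Encoding with p[m] = r[m] is never beaten by any other label set.
best = message_length(r, r, N)
for p in ([1/3, 1/3, 1/3], [0.6, 0.2, 0.2], [0.4, 0.4, 0.2]):
    assert message_length(r, p, N) >= best
```

The inequality illustrated is Gibbs' inequality: -SUM r[m] ln p[m] is minimised over all label sets {p[m]} exactly at p[m] = r[m].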

Assume that the receiver has exact knowledge of N and M, and that there has been prior arrangement between sender and receiver that for this N and M the labels used will be based on one of an agreed finite set of M-tuples {p[m]}. (A similar agreement, or agreement on a general rule, is needed for any other values of M and N. The complete collection of agreements, or the general rule, constitutes the `code book' for all such communications.)

The message of length (2) must be preceded by an additional message (the `class name dictionary') which tells the receiver which one of this finite set of M-tuples has been selected by the sender. The sender will choose that M-tuple of p[m] values which most nearly matches the relative abundances r[m] in order most nearly to minimise (2). If the agreed set of M-tuples contained one M-tuple {p[m]} for each possible M-tuple {r[m]}, then the minimum of (2) could always be achieved. However, there is a large number of possible M-tuples {r[m]}, and the length of the class name dictionary which nominates the particular M-tuple {p[m]} being used would therefore become considerable.

A better compromise is to agree on a relatively small set of M-tuples {p[m]}, thus reducing the average length of the class name dictionary. In general, the values p[m] adopted will no longer exactly equal the values r[m], and so on average the length of the main part of the message, as given by (2), will be slightly greater than the minimum possible.

We would like to determine how many M-tuples {p[m]} should be employed, over what range of values of {r[m]} each should be used, how much information will be needed to nominate a particular M-tuple, and by how much the expression (2) will on average exceed its minimum.

For some range of values of {r[m]}, we employ a fixed M-tuple of names whose lengths are minus the logarithms of p[m], where SUM_m p[m] = 1. Write

r[m] = p[m] + a[m]   (m = 1, 2, . . ., M)                  (3)

Then, since both {r[m]} and {p[m]} sum to one,

SUM_m a[m] = 0                                             (4)

The additional cost D incurred in the length of the main part of the message due to the use of the M-tuple {p[m]} instead of the optimum M-tuple {r[m]} is, from (2), given by

D = -N{ SUM_m (p[m]+a[m]) ln p[m] - SUM_m (p[m]+a[m]) ln(p[m]+a[m]) }
  =  N SUM_m (p[m]+a[m]) ln(1 + a[m]/p[m])                 (5)

Since a[m]/p[m] is expected to be small, we expand (5) to the second power of a[m]/p[m], and using (4), obtain

D = (N/2) SUM_m a[m]^2/p[m]                                (6)
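The accuracy of this quadratic approximation to the extra cost can be checked numerically; the sketch below uses invented values of N, p[m] and a[m], with the a[m] summing to zero as in (4):

```python
import math

# Exact extra cost N·SUM (p[m]+a[m])·ln(1 + a[m]/p[m]) versus the
# quadratic approximation (N/2)·SUM a[m]^2/p[m], for a small perturbation.
N = 1000
p = [0.5, 0.3, 0.2]
a = [0.01, -0.004, -0.006]              # sums to zero
assert abs(sum(a)) < 1e-12

exact = N * sum((pm + am) * math.log(1 + am / pm) for pm, am in zip(p, a))
approx = (N / 2) * sum(am * am / pm for pm, am in zip(p, a))

# For perturbations this small the two agree to well within one per cent.
assert abs(exact - approx) / approx < 0.01
```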

Now the possible M-tuples {r[m]} range over the (M-1)-dimensional simplex G defined by

SUM_m r[m] = 1                                             (7)

r[m] >= 0   (all m)                                        (8)

The density of M-tuples is uniform throughout G.

If we assume that all M-tuples within the simplex are equally likely to be encountered, the information needed to nominate the use of a particular label set {p[m]} will, in the best encoding, be minus the logarithm of the fraction of the volume of G over which the labels {p[m]} are employed. We therefore wish to choose the region of G within which {p[m]} is used to have the largest volume for a given average value of D over the region. The regions of application of different M-tuples {p[m]} must not overlap, and must together fill G. We choose regions of hyper-parallelepiped shape, since they can be packed together. We now attempt to find the optimum dimensions of the region of application of a particular M-tuple {p[m]}.

The contour in G of points yielding a given additional cost D is the intersection of the M-dimensional ellipsoid F defined by (6) with the hyperplane (7) containing G. The contour is an (M-1)-dimensional ellipsoid E. The linear transformation T defined by

b[m] = a[m] / sqrt(p[m])                                   (9)

and correspondingly

q[m] = r[m] / sqrt(p[m])                                  (10)

maps F into the sphere F' defined by

(N/2) SUM_m b[m]^2 = D                                    (11)

and G into the simplex G' defined by

SUM_m sqrt(p[m]) q[m] - 1 = 0                             (12)

q[m] > 0   (all m)                                        (13)

Since the centre of F lies in G, the centre of F' lies in G'. Thus the contour in G' of points with additional cost D, namely the intersection of F' with the hyperplane (12), is an (M-1)-dimensional sphere of radius R equal to the radius of F', hence

D = (N/2) R^2                                             (14)

By virtue of the spherical symmetry in G' of the cost D about the M-tuple {b[m] = 0}, the optimum parallelepiped region of application of this M-tuple must, under T, map into an (M-1) dimensional hypercube C in G' centred on the point {b[m] = 0}. Let the length of the sides of C be x. Since T is linear, all points within C are equally likely to be encountered. The average squared distance of all points in C from the centre of C is

(1/x^(M-1)) INT dy[1] INT dy[2] . . . INT dy[M-1] SUM_i y[i]^2  =  (M-1)x^2/12      (15)

where the y[i] are a set of (M-1) rectangular coordinates aligned with the edges of C, each integral being taken over the range (-x/2, x/2).

Hence, from (14), the average additional cost D over the region is

D_av = N(M-1)x^2/24                                       (16)
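Both averages used here, the mean squared distance (M-1)x^2/12 within the hypercube and the resulting average cost, can be checked by simulation; the values of M, x and N below are arbitrary illustrative choices:

```python
import random

# Monte Carlo check of the mean squared distance from the centre of an
# (M-1)-dimensional hypercube of side x, and of the average extra cost.
random.seed(1)
M, x, N = 5, 0.2, 1000
samples = 200_000

mean_sq = sum(
    sum(random.uniform(-x / 2, x / 2) ** 2 for _ in range(M - 1))
    for _ in range(samples)
) / samples
assert abs(mean_sq - (M - 1) * x ** 2 / 12) < 1e-4

# The average of D = (N/2)·(squared distance) over the cube is N(M-1)x^2/24.
D_av = (N / 2) * mean_sq
assert abs(D_av - N * (M - 1) * x ** 2 / 24) < 0.05
```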

Since T is linear, the fraction of G' occupied by C equals the fraction of G occupied by the region of application of {p[m]}. To find the message length needed to nominate the choice of {p[m]}, we must find the volumes of C and G'.

The volume of C is

x^(M-1)                                                   (17)

To find the volume of G', consider the M-dimensional simplex S having G' as one face and the origin {q[m] = 0} as the opposite vertex. The other vertices of S are the vertices of G', viz. the points (in q-coordinates)

(1/sqrt(p[1]), 0, 0, ..., 0),  (0, 1/sqrt(p[2]), 0, ..., 0),  . . .,  (0, 0, ..., 0, 1/sqrt(p[M]))
The volume of S is therefore

1 / ( M! PRODUCT_m sqrt(p[m]) )                           (18)

The volume of S is also given by Vh/M, where V is the volume of G' and h is the perpendicular distance of the origin from the plane (12). Since the sum of the squares of the coefficients of q[m] in (12) is SUM_m p[m] = 1, h is found by substituting the coordinates {q[m] = 0} of the origin into the left-hand side of (12). Hence

h = 1

and from (18)

V = 1 / ( (M-1)! PRODUCT_m sqrt(p[m]) )                   (19)

Combining (17) and (19), we get that the length of the message needed to nominate {p[m]} is

- ln { PRODUCT_m sqrt(p[m]) x^(M-1) (M-1)! }              (20)

The total expected additional message length due to the need to nominate {p[m]} and to the non-optimum match between {p[m]} and {r[m]} is found by adding (20) and (16) to be

N(M-1)x^2/24 - (1/2) SUM_m ln p[m] - (M-1) ln x - ln(M-1)!      (21)

The value of (21) is minimised by choosing

x^2 = 12/N                                                (22)

giving (21) a minimum value of

(M-1)/2 - (1/2) SUM_m ln p[m] + ((M-1)/2) ln(N/12) - ln(M-1)!
   = ((M-1)/2)(ln(N/12) + 1) - (1/2) SUM_m ln p[m] - ln(M-1)!   (23)
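The optimising choice x^2 = 12/N can be confirmed by a direct numerical search over the x-dependent part of the total additional cost; the values of N and M below are illustrative (they happen to match the size of the seal sample and the number of program classes):

```python
import math

# Minimise the x-dependent part of the additional cost,
# N(M-1)x^2/24 - (M-1)·ln x, by scanning a fine grid of x values.
N, M = 217, 7

def cost(x):
    return N * (M - 1) * x ** 2 / 24 - (M - 1) * math.log(x)

xs = [i / 10000 for i in range(1, 20000)]
best_x = min(xs, key=cost)

# The minimiser satisfies x^2 = 12/N to within the grid resolution.
assert abs(best_x ** 2 - 12 / N) < 1e-3
```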

The total additional cost incurred with any particular set of abundances {r[m]} will depend on the difference between {r[m]} and the nearest {p[m]}, and hence on the details of how the parallelepiped regions are packed together. We do not wish to consider these details for all N and M, and in any case there are likely to be many different packing patterns of near-optimum efficiency. It would be appropriate to obtain the additional cost we expect to incur with a particular {r[m]} by averaging the additional cost over all M-tuples {p[m]} which include {r[m]} in their ranges of application.

The additional cost incurred by using a particular {p[m]}, given a particular {r[m]}, comprises two parts. The first part, D, due to the non-optimum match between the label set used and the relative frequencies of the states, is given by (6). The second, due to the need to nominate the label set used, is given by (20), substituting the optimum value of x from (22); that is, by

- (1/2) SUM_m ln p[m] + ((M-1)/2) ln(N/12) - ln(M-1)!           (24)

It is easily seen that the region in G including all the {p[m]} having {r[m]} in their ranges of application is approximately centred on {r[m]}, and is similar in position, size and shape to the region of application of an M-tuple {p[m]} having p[m] = r[m]. Because this region, over which we wish to average, is approximately centred on {r[m]}, we may approximate the average value of (24) in the region by substituting r[m] for p[m] in (24). The value of (6) depends on the differences a[m] between p[m] and r[m], with a weighting dependent on p[m]. If the variation in the weighting over the region of averaging is neglected, (6) may be replaced by

D = (N/2) SUM_m a[m]^2/r[m]                                     (25)

If now the region of averaging is taken as being identical to the range of application of p[m] where p[m] = r[m], the average value of (25) is given by (16), substituting the optimum value of x from (22), as

(M-1)/2                                                         (26)

Combining (24) and (26) gives the expected additional cost incurred with relative frequencies of states r[m] as

((M-1)/2)(ln(N/12) + 1) - (1/2) SUM_m ln r[m] - ln(M-1)!        (27)

This expression represents the expected total cost additional to the message length which would be required to give the states of the N things using optimum label lengths of   -ln r[m]. The expected total message length is therefore

((M-1)/2)(ln(N/12) + 1) - (1/2) SUM_m ln r[m] - ln(M-1)! - N SUM_m r[m] ln r[m]

   = ((M-1)/2)(ln(N/12) + 1) - ln(M-1)! - SUM_m (n[m] + 1/2) ln r[m]    (28)

where n[m] = N·r[m] is the number of things in state m. The above form is essentially that used in the body of the paper for the information needed to record the class of each thing, and to record the values of multistate variables.
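As a sketch, the final expression can be evaluated directly from the state counts n[m]; the counts below are invented for illustration, and the comparison shows the overhead over the bare cost -SUM n[m] ln r[m] of stating the states with the frequencies assumed already known:

```python
import math

def multistate_message_length(counts):
    """Expected total message length, in natural units, for stating the
    states of N things, including the cost of nominating the label set."""
    N, M = sum(counts), len(counts)
    r = [n / N for n in counts]
    return (((M - 1) / 2) * (math.log(N / 12) + 1)
            - math.log(math.factorial(M - 1))
            - sum((n + 0.5) * math.log(rm) for n, rm in zip(counts, r)))

counts = [60, 30, 10]                      # invented counts: N = 100, M = 3
length = multistate_message_length(counts)

# Bare cost of stating the states when the r[m] are already known:
bare = -sum(n * math.log(n / sum(counts)) for n in counts)
assert length > bare                       # the dictionary overhead is positive
```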

It is worth noting that in size and shape, the region over which an M-tuple p[m] is applied is essentially similar to the region of expected error in the estimation of the probabilities of a multinomial distribution based on a sample of size N. Thus the message transmitted to nominate the M-tuple is sufficient to convey essentially all the available information about the probability of occurrence of each state.

10. References

Good, I.J. (1965). The Estimation of Probabilities: An Essay on Modern Bayesian Methods, Research Monograph No. 30, The M.I.T. Press, Cambridge, Mass.
Lance, G.N., and Williams, W.T. (1965).  Computer programs for monothetic classification (Association Analysis), Computer Journal, Vol. 8, pp.246-249.
Lance, G.N., and Williams, W.T. (1966).  Computer programs for hierarchical polythetic classification (Similarity Analysis), Computer Journal, Vol. 9, pp.60-64.
't Mannetje, L. (1967).  A Comparison of Eight Numerical Procedures in a Taxonomic Study of the Genus Stylosanthes Sw., C.S.I.R.O. Division of Tropical Pastures, Brisbane, Qld.
Sebestyen, G.S. (1962).  Decision Making Processes in Pattern Recognition, Macmillan, New York.
Shannon, C.E. (1948).  A Mathematical Theory of Communication, Bell System Tech. J., Vol. 27, p.379 and p.623.
Sokal, R.R., and Sneath, P.H.A. (1963).  Numerical Taxonomy, W.H. Freeman and Co., San Francisco and London.
Williams, W.T., and Dale, M.B. (1965).  Fundamental Problems in Numerical Taxonomy (in Advances in Botanical Research 2).

* The authors were members of the Basser Computing Department, School of Physics, University of Sydney, Sydney, Australia, when this work was performed. C. S. Wallace (1933-2004) became the foundation Professor of Computer Science at Monash University, Clayton, Victoria, Australia, later Emeritus Professor in the School of Computer Science and Software Engineering, Monash University, Clayton, Victoria, Australia. David Boulton completed a PhD, The Information Measure Criterion for Intrinsic Classification, in Computer Science, Monash University, 1975.
- L. Allison 9/2000, 10/2004.

This HTML version was created to make this early and important MML paper more accessible. Thanks to the CSSE office staff involved.

School of Computer Science and Software Engineering, Monash University, Australia 3800
