Mixtures

The HTML FORM below allows the probability density functions for the two normal distributions N_μ₁,σ₁ & N_μ₂,σ₂, scaled by w₁ & w₂ respectively (w₁+w₂=1), to be plotted with their mixture in the ratio w₁ : w_{2_.}

You can vary μ1, σ1, μ2, σ2, & w_i and press the ''plot'' button to see the result.

Above: w₁:w₂ mixture of N_μ₁,σ₁ and N_μ₂,σ₂ plotted with w₁.N_μ₁,σ₁ and w₂.N_μ₂,σ₂.

The above is a very simple example. It is possible to have mixtures of more than two component classes, mixtures of multi-variate distributions, mixtures of different kinds of distribution, and so on.

The Model

Given S things, each thing having D attributes (measurements), a mixture model attempts to describe the things as coming from a mixture of T classes (clusters):

The number of classes, T,
a dictionary of class names,
the distribution function for each class,
for each thing, the class to which it belongs (although see fractional assignment later),
for each thing, the attribute values using a code based on the class to which the thing belongs.

The following are assumed to be common knowledge: The number of things, S, the number of attributes, D, the nature of each attribute.

Number of Classes

The number of classes can be coded in any suitable code for (smallish) integers.

Class Dictionary

This specifies a code word for each class. The choice of classes is described by a multistate distribution.

Class Distributions.

Each class distribution is defined by the distribution parameters for those attributes that are important to it, e.g. mean and standard deviation for a normal distribution, or the value probabilities for a multistate distribution, etc.. A class need not specify all attributes; it can optionally fall back on "general population" class properties.

Fractional Assignment

The discussion above assumed that each thing was assigned wholly to one class or another. If two or more closes overlap strongly, e.g. if they are not several standard deviations apart, such a 'definite assignment' description and code are not optimal.

Nuisance Parameters in Definite Assignment

e.g. Consider a 50:50 mixture of C₀=N_-1,1 and C₁=N_1,1 and a thing, t_i=0.0. Now, t_i could be in either class, and we do not care which. However with definite assignment, as above, we are forced to specify that t_i is in class C₀, or that it is in class C₁, at a cost of 1-bit. Because of the position of t_i the subsequent cost of stating its one attribute is the same in either case.

The are actually two alternative (sub-)hypotheses here: H₀ that t_i is in C₀, and H₁ that t_i is in C₁. Since we do not care about H₀ v. H₁, we should add their probabilities together. This shortens the message length, and similar considerations apply to every thing, t_j, that could be in more than one class.

Class memberships have typical characteristics of nuisance parameters: Their number increases in proportion with the amount of data. If classes are close enough, then regardless of the amount of data, an inference method which uses definite assignment (such as the 1968 Snob) will not be able to detect the separate classes. A second problem is that estimates of class parameters are biased under definite assignment because the model, in effect, becomes a mixture of non-overlapping, truncated distributions.

No-Nuisance Coding

There are two ways to look at a method of coding the things efficiently.

The first view is to "borrow bits" from later in the message. The transmitter considers the code for things t_i+1,... . If this starts with a '0', t_i is coded as being in class C₀ otherwise C₁. Either way, the receiver decodes t_i, then considers the fact that the transmitter had placed it in C_i, where i=0 or 1, and therefore understand i to be the first bit (which need not therefore be transmitted explicitly) of the rest of the message. Thus a bit is saved.

The second view of the matter is to consider the distributions for C₀, C₁, and their mixture. Thing t_i has some probability, p, under class C₀. Because of the form of this example, t_i also has probability p under C₁. It therefore has probability p+p=2p under the mixture; consider code lengths, -log(2p) = -log(p) - 1.

Generalisation

We have been using an example where thing t_i has equal probability of coming from C₀ and C₁. This was only to keep the arithmetic simple. Similar considerations apply when a thing is not exactly mid-way between classes, and when there are two or more attributes, three or more classes, etc..

Benefits of Fractional Assignment.

Using fractional assignment of things to classes, and given enough data, it is possible to distinguish, i.e. infer, classes that are arbitrarily close together, and even classes that have the same mean but different variances. Infered class distribution parameters are also unbiased.

Notes

C. S. Wallace and D. M. Boulton (1968). An information measure for classification. Computer Journal, 11(2), August 1968, pp.185-194.
Introduced the minimum message length (MML) theory of unsupervised classification (mixture modelling, clustering) and a resulting computer program [SNOB], perhaps the first concrete application of Shannon's theory of communication to inductive inference. The first version (only) of Snob used definite assignment.
C. S. Wallace and P. R. Freeman. Estimation and Inference by Compact Encoding. J. Royal Stat. Soc. B, 49 pp.240-265, 1987 [paper].
T. Edgoose & L. Allison. Minimum Message Length Hidden Markov Modelling. IEEE Data Compression Conf., pp.169-178, 1998.
An extension to sequences, or time series, of multi-variate data.
L. Allison. Models for machine learning and data mining in functional programming, J. Functional Programming, 15(1), pp.15-32, Jan. 2005.
Includes a generalised expectation-maximization (EM) algorithm for mixture-modelling over arbitrary probability distributions; also see [IP].
L. Allison, Mixture Models, Ch.7 in Coding Ockham's Razor, Springer 2018 [doi:10.1007/978-3-319-76433-7_7].