MML Glossary

Bayes (1702-1761), as in Bayes's theorem P(H&D) = P(H).P(D|H) = P(D).P(H|D).

Bayesian: Styles of inference (machine learning, statistics, etc.) that rely on Bayes's theorem and the use of priors.

Classification: See supervised and unsupervised classification.

Conditional Probability: P(B|A), the probability of B given A.

Conjugate prior: A family of prior distributions is conjugate for the likelihood f(x|θ) if the posterior distribution is in the family whenever the prior is in the family; e.g., the beta family is conjugate for the parameter of a binomial distribution (likelihood). [Farr 1999, p.33]
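
A minimal sketch of the beta-binomial example, assuming a Beta(a, b) prior on the success probability: after observing some successes and failures the posterior is again a beta distribution, with the counts simply added to the prior parameters. The function names here are illustrative only, not taken from any library.

```python
def beta_binomial_update(a, b, successes, failures):
    """Return the posterior Beta parameters after binomial data.

    Prior: Beta(a, b) on the success probability.
    Posterior: Beta(a + successes, b + failures), i.e. the same family.
    """
    return a + successes, b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Assumed example: uniform prior Beta(1, 1), then 7 heads and 3 tails.
post_a, post_b = beta_binomial_update(1, 1, 7, 3)
print(post_a, post_b, beta_mean(post_a, post_b))
```

Because the posterior stays in the family, it can serve directly as the prior for the next batch of data.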

Consistent: An estimator is consistent if its estimates converge to the true parameter value (assuming that the model class includes the true model) as more and more data are made available.

Data Mining: Machine learning + some aspects of databases, with the emphasis on (very) large data sets and efficient and robust (and sometimes ad hoc) methods. (If you know of a better, short definition, tell me.)

Data Space = Sample Space: Set of values from which data are drawn, e.g., Throw={head, tail}.

Estimate: Theta-hat, a value of a parameter, theta, inferred from (i.e., fitted to) data.

Estimator: A function (mapping) from the data-space to the space of parameter values.

Expected Future Data: The distribution of future data given by the weighted (by posterior probability) average of the predictions of all hypotheses (models, parameter estimates).

Fisher, R. A. (1890-1962).

Independent: A and B are independent if P(A&B)=P(A).P(B).

Invariant: An estimator is invariant if f'(e(D)) = e'(f(D)), where f is a monotonic transformation on the data space, f' is the corresponding transformation on the parameter space, and e and e' are the estimator applied before and after the transformation, respectively.

Joint Probability: E.g. P(A&B), the joint probability of A and B. See conditional and independent.

Kullback-Leibler distance: Between two probability distributions {p_i} and {q_i}, KL({p_i},{q_i}) = ∑_i p_i.log₂(p_i/q_i). Always ≥0. Not necessarily symmetric. (NB. The distance can be thought of as the extra cost, in bits, of coding data from {p_i} using the "wrong" model {q_i}. An integral replaces ∑ for continuous distributions.)
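
The definition can be sketched for discrete distributions as follows. This is a minimal illustration, assuming the distributions are given as equal-length lists of probabilities; `kl_distance` is a name chosen here, not a library function.

```python
import math


def kl_distance(p, q):
    """KL({p_i},{q_i}) in bits for discrete distributions given as lists.

    Terms with p_i == 0 contribute nothing; if q_i == 0 where p_i > 0
    the distance is infinite.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total


p = [0.5, 0.5]
q = [0.9, 0.1]
# The two directions generally differ, which shows the asymmetry.
print(kl_distance(p, q), kl_distance(q, p))
```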

Likelihood: P(D|H), where D is the data set (training data), and H is a hypothesis (parameter estimate, model, theory).

MAP: Maximum a posteriori estimation. MML is equivalent to MAP in the simplest cases only; this is not true in general (e.g., with one or more continuous parameters, and in other cases).

Markov Model of order k: A model of a series x₁, x₂, x₃, ... in which P(x_t=e) can depend only on the k preceding values, x_{t-k} to x_{t-1}.
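
As a rough sketch, an order-k Markov model over a symbol series can be estimated by counting how often each symbol follows each length-k context. The relative-frequency estimate and the name `fit_markov` are assumptions made for illustration; no smoothing is applied, and this is not an MML estimator.

```python
from collections import Counter, defaultdict


def fit_markov(series, k):
    """Estimate P(x_t = e | x_{t-k} .. x_{t-1}) by relative frequency.

    Returns a dict mapping each length-k context (a tuple) to a dict of
    next-symbol probabilities.
    """
    counts = defaultdict(Counter)
    for t in range(k, len(series)):
        context = tuple(series[t - k:t])
        counts[context][series[t]] += 1
    model = {}
    for context, ctr in counts.items():
        total = sum(ctr.values())
        model[context] = {sym: n / total for sym, n in ctr.items()}
    return model


model = fit_markov("ABABABAAB", k=1)
print(model)
```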

MDL: Minimum Description Length, since J. Rissanen, Parameter Estimation by Shortest Description of Data, Proc JACE Conf. RSME, pp.593-, 1976. Also see MML below.

Message Length: The length, usually in bits, of a message in an optimal code encoding some event (or data D). Often a two-part message, of length -log₂(P(H)) - log₂(P(D|H)). "Message" after Shannon's mathematical theory of communication (1948).
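
A minimal sketch of the two-part length, assuming a discrete hypothesis with a known prior probability and a known likelihood for the data (the function name and the numbers are hypothetical). It does not cover the extra terms needed when parameters are continuous.

```python
import math


def two_part_message_length(prior_h, likelihood_d_given_h):
    """Length in bits of a two-part message: first H, then D given H."""
    return -math.log2(prior_h) - math.log2(likelihood_d_given_h)


# Hypothetical example: P(H) = 1/4 costs 2 bits, P(D|H) = 1/8 costs 3 bits.
print(two_part_message_length(0.25, 0.125))
```

Comparing this total across hypotheses trades the cost of stating H against how well H compresses D.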

Minimum EKL Estimator, MinEKL: The parameter estimate for a distribution (or model or hypothesis) that minimises the KL distance between the distribution and Expected Future Data, i.e., maximises the likelihood of Expected Future Data.
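
To make the idea concrete, here is a small sketch assuming a finite set of hypotheses about a coin with made-up posterior weights, and a grid of candidate estimates. All names and numbers are illustrative assumptions, not part of any published implementation.

```python
import math


def kl_bits(p, q):
    """KL distance in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def expected_future(posterior, models):
    """Posterior-weighted average of the hypotheses' distributions."""
    n = len(models[0])
    return [sum(w * m[i] for w, m in zip(posterior, models)) for i in range(n)]


def min_ekl(posterior, models, candidates):
    """Candidate distribution closest (in KL) to the expected future data."""
    target = expected_future(posterior, models)
    return min(candidates, key=lambda c: kl_bits(target, c))


# Three hypotheses [P(head), P(tail)] with assumed posterior weights.
models = [[0.2, 0.8], [0.5, 0.5], [0.8, 0.2]]
posterior = [0.1, 0.3, 0.6]
# Candidate estimates: P(head) on a grid 0.01, 0.02, ..., 0.99.
candidates = [[h / 100, 1 - h / 100] for h in range(1, 100)]
print(expected_future(posterior, models))
print(min_ekl(posterior, models, candidates))
```

In this unrestricted two-outcome case the chosen candidate coincides with the posterior-weighted average itself; with a restricted model family the two would generally differ.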

Mixture Model: The weighted average of two or more models, especially a mixture of probability distributions in unsupervised classification.
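
As an illustration, the density of a mixture is the weighted sum of its component densities. The sketch below assumes normal (Gaussian) components and weights that sum to 1; the names and parameter values are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def mixture_pdf(x, weights, components):
    """Weighted average of component densities; weights must sum to 1."""
    return sum(w * normal_pdf(x, mu, sigma)
               for w, (mu, sigma) in zip(weights, components))


weights = [0.3, 0.7]                   # mixing proportions
components = [(0.0, 1.0), (5.0, 2.0)]  # (mean, std. dev.) of each class
print(mixture_pdf(1.0, weights, components))
```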

MML: Minimum Message Length, since C.S.Wallace & D.M.Boulton, An Information Measure for Classification, Computer Jrnl., 11(2) pp.185-194, 1968.

Multivariate: Data, distribution etc. having multiple attributes (variables).

Observation: A data item, e.g., from an experiment.

Ockham: As in Ockham's razor. Also Occam.

Odds ratio: Simply the ratio of two probabilities, P(A)/P(B). Also as in posterior odds-ratio P(H₁|D)/P(H₂|D)=P(H₁).P(D|H₁)/(P(H₂).P(D|H₂)).

Prior: Before, particularly "before actual data are seen", as in prior probability distribution of parameters and/or models, P(H).

Posterior: After, particularly "after actual data are seen", as in posterior probability distribution of parameters and/or models, P(H|D)=P(H&D)/P(D)=P(H).P(D|H)/P(D).
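
A small worked sketch of the posterior formula, assuming two hypotheses about a coin and made-up data (three heads in a row). It also prints the posterior odds-ratio of the two hypotheses. The function name and numbers are illustrative assumptions.

```python
def posterior(priors, likelihoods):
    """P(H|D) for each hypothesis, from P(H) and P(D|H)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(H&D)
    evidence = sum(joint)                                 # P(D)
    return [j / evidence for j in joint]


# H1: fair coin, P(head) = 0.5.  H2: biased coin, P(head) = 0.8.
priors = [0.5, 0.5]
likelihoods = [0.5 ** 3, 0.8 ** 3]  # P(D|H1), P(D|H2) for three heads
post = posterior(priors, likelihoods)
print(post, post[0] / post[1])      # posteriors, then P(H1|D)/P(H2|D)
```

With equal priors the posterior odds-ratio reduces to the likelihood ratio P(D|H₁)/P(D|H₂), since P(D) cancels.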

Regression: To model, fit or infer, but particularly to fit a function (line, polynomial, etc.) through points {(x_i,y_i)} where y is dependent on x.
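
As a simple illustration of fitting a line through points, the sketch below uses ordinary least squares. That is one common fitting criterion chosen here for brevity, not an MML method; the function name is hypothetical and the x values are assumed not to be all equal.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = a + b*x to a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


# Points lying exactly on y = 1 + 2x.
print(fit_line([(0, 1), (1, 3), (2, 5), (3, 7)]))
```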

Sample Space: The space (set of values) over which a random variable ranges. Also called the Data Space, especially in computing.

Strict MML (SMML): See Farr and Wallace (2002).

Supervised Classification: To infer a function, c:S→T, a classification function, given examples (training data) drawn from S×T.

Univariate: Data, distribution etc. having one attribute.

Unsupervised Classification: To infer a mixture model from examples (data).

Variable (1): Random variable.

Variable (2): An attribute of an observation (thing), e.g., a column of a data-set.

von Mises (-Fisher, vMF): Probability distributions on directions in R^D.

Wallace, C. S. (1933-2004).

Some sources

G. Farr, Information Theory and MML Inference, School of Comp. Sci. and Software Eng., Monash University 1997-1999
G. Farr & C. S. Wallace. The Complexity of Strict Minimum Message Length Inference, The Computer Journal, 45(3), pp.285-292, 2002
C. S. Wallace & D. M. Boulton, An Information Measure for Classification, The Computer Journal, 11(2), pp.185-194, August 1968
C. S. Wallace & P. R. Freeman, Estimation and Inference by Compact Coding, J. Royal Stat. Soc., 49(3), pp.240-265, 1987
C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length, Springer, 2005