Inference 

Q: What is the difference between a hypothesis and a theory?

Introduction

People often distinguish between
It is argued that although these distinctions are sometimes of practical convenience, that is all: they are all really one and the same process of inference.

A (parameter estimate of a) model (class) is not a prediction, at least not a prediction of future data, although it might be used to predict future data. It is an explanation of, a hypothesis about, the process that generated the given (past) data. It can be a good explanation or a bad explanation. Naturally we prefer good explanations.

Model Class

For example, polynomial models form a model class for sequences of points.

Model

For example, the general cubic equation
y = ax^{3}+bx^{2}+cx+d
is a model for such sequences of points, with four parameters: a, b, c & d.

It is usually the case that a model has a fixed number of parameters (e.g., four above), but this can become blurred if hierarchical parameters or dependent parameters crop up. Some writers reserve 'model' for a model (as above) where all the parameters have fixed, e.g., inferred, values; if there is any ambiguity I will (try to) use model instance, or fully parameterised model, for the latter. E.g., in these terms the normal distribution, N(μ,σ), is a model and N(0,1) is a model instance.

Parameter Estimation

For example, if we estimate the parameters of a cubic to be
a=1, b=2, c=3 & d=4,
we get the particular cubic polynomial
y = x^{3}+2x^{2}+3x+4
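
A minimal sketch of the distinction between a model and a model instance, in Python. The function name `cubic` and the sample x values are my own choices for illustration; the parameter values a=1, b=2, c=3, d=4 are those of the example.

```python
from typing import Callable


def cubic(a: float, b: float, c: float, d: float) -> Callable[[float], float]:
    """The general cubic is the model; fixing a, b, c, d gives a model instance."""
    def y(x: float) -> float:
        # Horner's rule for a*x^3 + b*x^2 + c*x + d
        return ((a * x + b) * x + c) * x + d
    return y


# The model instance with a=1, b=2, c=3, d=4.
instance = cubic(1, 2, 3, 4)
values = [instance(x) for x in (0, 1, 2)]
print(values)
```

Here `cubic` plays the part of the model and `instance` the part of the fully parameterised model.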

Hypothesis Complexity

Overfitting

Overfitting often appears as selecting a too complex model for the data. E.g., given ten data points from a physics experiment, a 9^{th}-degree polynomial could be fitted through them, exactly. This would almost certainly be a ridiculous thing to do. That small amount of data is probably better described by a straight line or by a quadratic with any minor variations explained as "noise" and experimental error. Parameter estimation provides another manifestation of overfitting: stating an estimated parameter value too precisely is also overfitting.
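
The ten-point example can be tried numerically. The sketch below is only an illustration under assumptions of my own: the data are a made-up line y = 2x + 1 with a fixed "noise" of +/-0.5, the straight line is fitted by ordinary least squares, and the 9^{th}-degree polynomial is evaluated in Lagrange form. It compares the two fits on the data and at a new point lying between two of the data points.

```python
import math

# Hypothetical data: ten points from y = 2x + 1 plus a fixed "noise" of +/-0.5.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]


def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def interpolate(xs, ys, x):
    """Value at x of the unique degree-(n-1) polynomial through all n points."""
    total = 0.0
    for k, (xk, yk) in enumerate(zip(xs, ys)):
        term = yk
        for j, xj in enumerate(xs):
            if j != k:
                term *= (x - xj) / (xk - xj)
        total += term
    return total


def rms(errors):
    """Root mean squared value of a list of errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


slope, intercept = fit_line(xs, ys)
line_rms = rms([slope * x + intercept - y for x, y in zip(xs, ys)])
poly_rms = rms([interpolate(xs, ys, x) - y for x, y in zip(xs, ys)])

# Compare both fits with the noise-free line at x = 0.5, between two data points.
x_new = 0.5
truth = 2.0 * x_new + 1.0
line_err = abs(slope * x_new + intercept - truth)
poly_err = abs(interpolate(xs, ys, x_new) - truth)

print(f"RMS on the data:  line {line_rms:.3f}, degree-9 {poly_rms:.3f}")
print(f"error at x = 0.5: line {line_err:.3f}, degree-9 {poly_err:.3f}")
```

The degree-9 fit has zero error on the given data yet is far worse than the line at the new point, which is the sense in which it "explains" the noise rather than the process. The alternating noise is a deliberately unkind choice; other noise patterns give smaller, but typically still noticeable, swings.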
Classical statistics has developed a variety of significance tests to judge whether a model is justified by the data. An alternative is described below.

MML

Attempts to minimise the discrepancy between given data, D, and values implied by a hypothesis, H, almost always result in overfitting, i.e. a hypothesis (model, parameter estimate,...) that is too complex. E.g., if a quadratic gives a certain root mean squared (RMS) error, then a cubic will in general give a smaller RMS value. Some penalty for the complexity of H is needed to give teeth to the so-called "law of diminishing returns". The minimum message length (MML) criterion is to consider a two-part message (remember [Bayes]): first a statement of the hypothesis, H, and then a statement of the data given the hypothesis, D|H. Their lengths add, msgLen(H&D) = msgLen(H) + msgLen(D|H), where msgLen(H) = -log P(H) and msgLen(D|H) = -log P(D|H), and the hypothesis giving the shortest total message is preferred.
The first part of the two-part message can be considered to be a "header", as in data compression or data communication. Many file compression algorithms produce a header in the compressed file which states a number of parameter values etc., which are necessary for the data to be decoded. The use of a prior, P(H), is considered to be controversial in classical statistics.

Notes

The idea of using compression to guide inference seems to have started in the 1960s.
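
As a rough illustration of the two-part message idea, the sketch below codes coin-toss data with the parameter p stated to varying precision. It is not the full MML machinery: the uniform prior over equal-width cells of [0, 1], the use of cell midpoints as the stated values of p, and the data (70 heads, 30 tails) are all assumptions made for the example. Part 1 costs -log2 P(H) bits and part 2 costs -log2 P(D|H) bits.

```python
import math


def two_part_length(heads, tails, cells):
    """Return (best p, part-1 bits, part-2 bits) when p is stated as the
    midpoint of one of `cells` equal-width cells of [0, 1]."""
    part1 = math.log2(cells)  # uniform prior over the cells: -log2(1/cells)
    best_p, best_part2 = None, math.inf
    for i in range(cells):
        p = (i + 0.5) / cells  # midpoints are never exactly 0 or 1
        # -log2 P(D|H) for the observed sequence of tosses
        part2 = -(heads * math.log2(p) + tails * math.log2(1.0 - p))
        if part2 < best_part2:
            best_p, best_part2 = p, part2
    return best_p, part1, best_part2


heads, tails = 70, 30  # hypothetical data
totals = {}
for bits in range(0, 11):
    cells = 2 ** bits
    p, part1, part2 = two_part_length(heads, tails, cells)
    totals[cells] = part1 + part2
    print(f"cells={cells:5d}  p={p:.4f}  part1={part1:5.1f}  "
          f"part2={part2:7.2f}  total={part1 + part2:7.2f}")

best_cells = min(totals, key=totals.get)
print("shortest two-part message uses", best_cells, "cells")
```

Stating p more precisely always lengthens the header, while the saving in the second part soon levels off, so the shortest total is reached at a fairly coarse precision. This is one way to see why stating a parameter value too precisely counts as overfitting.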
More Formally
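
Maximum Likelihood

The maximum likelihood principle is to choose H so as to maximise P(D|H).

E.g., Binomial Distribution (Bernoulli Trials)

A coin is tossed N times, landing 'heads' #head times and landing 'tails' #tail = N-#head times. We want to infer p=P(head). The likelihood of <#head,#tail> is:
P(#head|p) = p^{#head} (1-p)^{#tail}
(a binomial coefficient, which does not depend on p, is omitted). Taking logs and setting the derivative with respect to p to zero gives the maximum likelihood estimate p' = #head/N.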
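
A small numerical check of the coin example, assuming the usual Bernoulli likelihood p^{#head} (1-p)^{#tail}. The function names and the grid search are mine, added only to confirm the closed-form estimate #head/N; they are not part of any particular library.

```python
import math


def log_likelihood(p, heads, tails):
    """log of p^heads * (1-p)^tails, taking 0*log(0) as 0."""
    total = 0.0
    if heads:
        total += heads * math.log(p) if p > 0 else -math.inf
    if tails:
        total += tails * math.log(1 - p) if p < 1 else -math.inf
    return total


def ml_estimate(heads, tails):
    """Closed-form maximum likelihood estimate, #head / N."""
    return heads / (heads + tails)


def grid_argmax(heads, tails, steps=1000):
    """Brute-force check: the grid value of p with the largest likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: log_likelihood(p, heads, tails))


for heads, tails in [(7, 3), (1, 0), (10, 0)]:
    print(heads, tails, ml_estimate(heads, tails), grid_argmax(heads, tails))
```

The last two cases are the troubling ones: a single head, or ten heads out of ten, both give an estimate of exactly 1.0, i.e. a claim that tails is impossible.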
To sow some seeds of doubt, note that if the coin is thrown just once, the estimate for p must be either 0.0 or 1.0, which seems rather silly, although one could argue that such a small number of trials is itself rather silly. Still, if the coin were thrown 10 times and happened to land heads 10 times, which is conceivable, an estimate of 1.0 is still not sensible.

E.g., Normal Distribution

Given N data points, x_1, ..., x_N, the maximum likelihood estimators for the parameters of a normal distribution, μ', σ', are given by:
μ' = (1/N) Σ_{i=1..N} x_i
σ'^{2} = (1/N) Σ_{i=1..N} (x_i - μ')^{2}
Note that σ'^{2} is biased, e.g., if there are just two data values it is implicitly assumed that they lie on opposite sides of the mean, which plainly is not necessarily the case, so the variance tends to be under-estimated.
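
The bias can be seen in a short simulation. The sketch below assumes the standard maximum likelihood estimators (sample mean, and sum of squared deviations divided by N); the two data values, the random seed and the number of trials are arbitrary choices of mine. `statistics.variance` from the Python standard library divides by N-1 and is used for comparison.

```python
import random
import statistics


def ml_normal(xs):
    """Maximum likelihood estimates (mean, variance) for a normal distribution."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # divides by N, not N-1
    return mu, var


# Just two data values: the estimated mean falls midway between them.
mu, var_ml = ml_normal([4.0, 6.0])
var_unbiased = statistics.variance([4.0, 6.0])  # divides by N-1
print(mu, var_ml, var_unbiased)

# Average the ML variance over many samples of size 2 drawn from N(0,1),
# whose true variance is 1.
rng = random.Random(0)
trials = 20000
avg_var_ml = sum(
    ml_normal([rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)])[1]
    for _ in range(trials)
) / trials
print(f"average ML variance for N=2: {avg_var_ml:.3f}")
```

For samples of size two the average of σ'^{2} comes out close to half the true variance, in line with the expected value (N-1)/N times σ^{2}.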

Notes

This section follows Farr (p.17..., 1999)


© L. Allison, www.allisons.org/ll/ (or as otherwise indicated).