On the Selection of the Order of a Polynomial Model

C. S. Wallace

 LA home Computing CSW  Polynomial also see MML  Continuous  Structured Maths  Orthogonal
 Abstract: A report by Cherkassy, Mulier and Vapnik [1] has compared the performance of several methods for selecting the order of a polynomial approximation to a function t(x) given only the values of t(x) for some set of x-values, where the given values are corrupted by i.i.d. Gaussian noise of unknown variance. They compare various "Classical" methods with a new method based on the concept of Vapnik-Chernovenkis (VC) dimension, and conclude that the latter gives the most reliable prediction of the value of t(x) at unseen values of x. This work replicates their investigation and extends it by including a method based on Minimum Message Length (MML) estimation [2,3]. The results largely confirm the previous results, but show that MML is generally superior to the VC method in terms of average squared prediction error.

 Chris Wallace wrote on this topic   (i) while visiting Royal Holloway College, London, 1997, and   (ii) in Proc. of the 14th Biennial Australian Statistical Conf., Queensland, 1998. Also see: C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length, Springer-Verlag, Information Science and Statistics, 2005, for more on MML in general, and §6.7.2, pp.272-275, for function approximation in particular.

1 The problem

The framework of the problem follows that considered by Cherkassy et al[1] (hereafter referred to as CMV). There is an unknown target function t(x) defined on the interval [-1,1]. (CMV use the interval [0,1] but the choice is immaterial.) We are given some training data comprising N pairs {(xn, yn) : n = 1 ... N)} where the x-values are chosen randomly from the Uniform distribution on [-1,1], and for each n,
yn = t(xn) + en
where the noise values {en : n = 1 ... N} are i.i.d. as Normal(0,v). The target function and the noise variance v are regarded as unknown, but the x-values and N are given.

The task is to construct some polynomial function f(d,x) of degree d which may be used to predict the value of t(x) in the interval -1 < x < 1. In what follows, we will only consider degrees up to 20, and for any degree will consider only the polynomial which minimizes the squared error on the training data,
S(d) = ∑n (yn - f(d,xn))2
Thus, the model selection problem is reduced to the choice of the degree d.

The success of a chosen approximation is measured by its expected squared error, i.e. the average value of (f(d,x) - t(x))2 in [-1, 1]. Use of the approximation for extrapolation beyond this interval is not considered. Because in this work we wished to study approximations to both regular and irregular target functions, we have estimated this average squared prediction error by a Monte-Carlo estimate
ESPE(f(d, *)) = (1 / M) ∑m (f(d,xm) - t(xm))2
where the test points {xm : m = 1 ... M} were chosen randomly and uniformly in [-1, 1]. The number of test points used is the Max (N, 100). Of course, in comparing the performance of different methods on some training data, the same set of test points was used for the estimation of the error of all the methods.

2 The experimental protocol

An experiment is defined by a target function, the number N of training data points, and a signal-to-noise ratio R, where R is defined as the variance of the target function about zero, divided by the noise variance v. In an experiment, 1000 training data sets are generated. For each data set, or case, all least-squares polynomial approximations of degrees up to 20 are found by standard means, and the training data error S(d) computed for each of the 21 polynomials. (If N < 22, only degrees up to N-2 are used.) These error values are then given to each of the selection methods being compared, and each method selects its preferred choice among the polynomials. The test error for a method in the current case is then the estimated prediction error achieved by the polynomial the method has chosen. Note again that all selection methods must choose from the same set of 21 polynomials, and their choices are "scored" on the same set of test data. Thus if two selection methods choose the same degree in some case, they are choosing the same polynomial and will incur the same (estimated) prediction error for this case.

After the 1000 cases of an experiment have been completed, we obtain for each selection method:
• Its average ESPE, averaged over all cases;
• The Standard Deviation of its ESPE;
• For each degree, the number of cases in which it chose that degree and its average ESPE on those cases;
• The 5%, 25%, 50%, 75%, 95% and 99% percentiles and the maximum of its ESPE distribution over the 1000 cases.

3 The Selection Methods

For all methods other than MML, the selection process as described by CMV is to choose the polynomial which minimizes
g(p,N) S(d)
where p = (d+1)/N, and g(,) is a function characteristic of the selection method.

CMV consider three "classical" methods derived from the literature:
• Finite Prediction Error (FPE), Akaike [4]
g(p,N) = (1+p)/(1-p)
• Schwartz's criterion (SCH) [5]
g(p,N) = 1 + 0.5 ln(N) p / (1-p)
• Generalized Cross-Validation (GCV), Craven & Wahba [6]
g(p,N) = 1 / (1-p)2
and a fourth, due to Shibata, which we did not include as it seemed uncompetitive on CMV's results.

CMV introduced a method based on VC dimension [7]:
g(p,N) = 1 / (1 - √(p - p ln(p) + ln(N)/(2N)))
Note that for VC, g(p,N) diverges for some degree less than N-2, which for small N may limit the maximum degree considered by VC to less than the maximum degree considered by other methods.

3.1 Selection by MML

Minimum Message Length[2] seeks to minimize the length of a "message" which encodes the data (in this problem the training y-values) by first stating a probabilistic model for the data, and then encoding the data using a code which would be optimal (in Shannon's sense) were the stated model true. Here, the model is assumed to be polynomial with Gaussian noise, so the model description need only specify:
the degree d of the polynomial,
the co-efficients of the polynomial, and
the estimated variance (or SD) of the noise.
The optimal encoding of these features of the model depends on our prior beliefs (or ignorance) about their likely values. In this work, we attempt to make neutral, or uninformative, prior assumptions, specifically:

3.1.1

All degrees from 0 to 20 are considered equally likely a priori, so all degrees will be coded with a code word of length ln(21) 'nits', or "natural bits." As all degrees have the same code length, the coding of the model degree has no influence on the choice of model.

3.1.2

In coding estimates of the noise variance, v, and the polynomial coefficients, some scale of magnitude may be assumed. Here, we use the variance of the given y-values to determine such a scale. Define
V = (1/N) ∑n yn2

In encoding a polynomial model of degree d, we suppose that the noise and each of the (d+1) degrees of freedom of the polynomial may be expected a priori to contribute equally (in very rough terms) to the observed variance V of the y-values. Defining u = √(V / (d+2)), we assume the Standard Deviation of the "noise" has a Negative Exponential prior density with mean u, and that each of the co-efficients of the polynomial has independently a N(0,u2) prior density. If the co-efficients were the usual co-efficients of the successive powers of x, the latter assumption would be unreasonable, and highly informative. Instead, we represent a d-degree polynomial as
f(d,x) = ∑j=0..d (aj Qj(x))
where the set {Qj(*) : j = 0 ... d} is a set of polynomials, Qj(*) being of degree j, which are orthonormal on the interval [-1, 1]. That is, the average on [-1, 1] of Qj(x)Qk(x) = 1 if j = k, and zero otherwise. The orthonormal polynomials represent effectively independent modes of contributing to the variance of f(d,*), and it seems reasonable to assume independent prior densities for the co-efficients {aj : j = 0 ... d}.

With these assumptions, the overall prior density for the unknown parameters {s, {aj}} (where s = √v) is
h(s,{a}) = (1/u) e-s/u * ∏j (1/(√(2π) u)) e-aj2 / (2 u2)

The amount of information needed to encode the estimates s and {aj} depends on the precision to which these are stated in the message. As shown in[2], the optimum precision is inversely proportional to the square root of the Fisher information F(s,{aj}), which in this case is given by
F(s,{aj}) = 2 (N / s2)d+2 |M|
where M is the co-variance matrix of the orthonormal polynomials evaluated at the given x-values:
Mj,k = (1/N) ∑n Qj(xn) Qk(xn)

As the given x-values are randomly and uniformly distributed in [-1, 1], M is typically near-diagonal.

3.1.3 Coding the data values

Once the model polynomial f(d,*) and the noise SD s have been stated, the given y-values can be coded simply by their differences from f(), these differences being coded as random values from the Normal density N(0,v). The message length required for this is then given by
L2 = (N/2) log (2 π v) + (1/2v) S(d)

3.1.4 The total message length

As shown in [2], the total message length is approximately given by
I1 = (1/2) log F(s,{aj}) - log h(s, {aj}) + L2 - ((d+2)/2) log(2 π) + (1/2) log ((d+2) π)
where the last two terms arise from the geometry of an optimum quantizing lattice in (d+2)-space.

The noise variance v = s2 is estimated as
v = (1 / (N-d-1)) ∑n (f(d,xn) - yn)2
and the co-efficients {aj} are estimated by conventional least-squares fit. These estimates do not exactly minimize the message length I1, but the difference is small.

The MML method selects that degree d which minimizes I1 as calculated above.

4 Test results

The C/Unix programs used to obtain these results are obtainable from the author.

4.1 CMV's test function

The CMV report contains only low-precision box-plots of their results, on the single target function
y = (sin (2 π x))2,   for x in [0, 1]
As our program works in the x-interval [-1, 1], we transformed this to
y = (sin (π (x+1)))2.
Edited program output is shown below for values of N and signal-noise ratio copying those shown in CMV. The editing is restricted to removing some of the detailed output for all but the first test case, which is unedited to exhibit the information provided in our tests.

The output provides a picture of the target function, and a copy of the C code statements which evaluate it. Then, each test condition gives an output headed by a line of '=' characters. The next line gives the test conditions: number of replications (here always 1000), number of data points (N), and signal/noise ratio. It also gives the resulting noise Standard Deviation, and the maximum degree of polynomial considered (MaxD) and the possibly smaller maximum degree (MaxD(VC)) used by the VC method.

Following lines give, in a column for each selection method, the average ESPE(AV), its standard deviation (SD), and percentile points and maximum of the ESPE distribution over the cases. Finally, the average ESPE is shown for those cases where a method has selected a polynomial of particular degree (avERR), together with the number of times it chose that degree (CNT).

Note that our results may not be directly comparable with CMV's: there may be a difference in our respective definitions of signal/noise ratio. In our work, the "signal" strength is calculated as the variance of the target function about zero, whereas CMV may use the variance about its mean. However, much the same range of test conditions is covered in both works.

The columns for the different methods are headed by their abbreviated names, MML, V-C etc.. The first column, headed BEST, gives the results which would be obtained if in each case, the polynomial were selected which gives the smallest ESPE as calculated on test data. Thus, BEST shows the best results which could possibly be obtained, but these results are of course unrealizable, as they are based on knowledge of the (noise-free) values of the target function at the test x-values.
```
----------------------------------------------------------------
a = sin (π * (x + 1.0));  y = a * a;
Target mean =    0.500,   SD about mean =    0.354,  SD about 0 =    0.612
Minimum =     0.0000,   Maximum =     1.0000

,''',                         ,''',
,     ,                       ,     ,

'       '                     '       '
,
'                             '         '
,           ,                 ,           ,

,             ,               ,             ,

'               '             '               '

'                 '           '                 '
,                   ,
'                   '
,                     ,       ,                     ,

'                       '     '                       '
,,'                         ',,,'                         ',

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N =  10, S/N ratio  10.00 NoiseSD   0.061  MaxD  8  MaxD(VC)  4
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0804;     0.1857;     3.5005;    15.8055;    16.3748;    13.0748;
SD     0.0561;     0.2633;    24.2763;    63.8077;    65.4801;    56.2695;
5pc     0.0054;     0.0078;     0.1099;     0.0091;     0.0091;     0.0082;
25pc     0.0206;     0.0385;     0.1289;     0.0863;     0.0911;     0.0733;
50pc     0.0822;     0.1236;     0.1471;     0.7974;     0.8637;     0.5827;
75pc     0.1297;     0.1880;     0.4338;     5.4448;     5.5033;     4.2212;
95pc     0.1615;     0.6075;     9.4489;    60.7315;    63.7380;    48.9659;
99pc     0.1886;     1.3700;    73.2666;   306.5231;   378.0335;   277.2285;
Max       0.2481;     3.0411;   543.0019;   771.4965;   771.4965;   736.6776;
--------------
DEG  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;
0  0.137 312;  0.141 222;  0.137 564;  .....   0;  .....   0;  .....   0;
1  0.142  51;  0.281  33;  0.205  32;  .....   0;  .....   0;  .....   0;
2  0.128  38;  0.406  27;  0.815  35;  7.587   2;  7.587   2;  2.365   7;
3  0.132   6;  0.698  23;  6.320 138; 13.099  17; 14.732  12; 12.184  26;
4  0.084 113;  0.303 177; 10.891 231; 39.709  78; 40.725  75; 24.368 113;
5  0.085   8;  0.421  30;  ---   ---  18.996 214; 22.030 210; 15.918 249;
6  0.033 373;  0.106 426;  ---   ---  10.882 340; 10.883 344;  6.819 363;
7  0.028  56;  0.112  52;  ---   ---  18.547 195; 18.577 196; 21.611 138;
8  0.018  43;  0.095  10;  ---   ---   7.068 154;  6.939 161;  5.450 104;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N =  10, S/N ratio   5.00 NoiseSD   0.122  MaxD  8  MaxD(VC)  4
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0932;     0.1751;     2.0351;    47.3772;    48.5253;    25.5923;
50pc     0.1068;     0.1339;     0.1422;     1.7633;     1.9880;     0.9442;
95pc     0.1656;     0.5032;     6.6747;   168.5623;   177.3814;   109.8585;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N =  10, S/N ratio   3.30 NoiseSD   0.186  MaxD  8  MaxD(VC)  4
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.1042;     0.1924;     1.5155;    56.3134;    61.4636;    40.5831;
50pc     0.1154;     0.1402;     0.1392;     2.5108;     2.8616;     0.9503;
95pc     0.1789;     0.5076;     3.1223;   230.9929;   252.1516;   163.1735;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N =  10, S/N ratio   2.50 NoiseSD   0.245  MaxD  8  MaxD(VC)  4
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.1119;     0.1823;     0.7567;    97.0134;   100.8881;    55.3633;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N =  20, S/N ratio  10.00 NoiseSD   0.061  MaxD 18  MaxD(VC) 11
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0223;     0.0652;     1.1778;     2.2633;     2.3883;     1.5057;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N =  20, S/N ratio   5.00 NoiseSD   0.122  MaxD 18  MaxD(VC) 11
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0319;     0.0743;     1.9007;     7.2120;     8.0623;     3.9577;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N =  20, S/N ratio   3.30 NoiseSD   0.186  MaxD 18  MaxD(VC) 11
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0434;     0.0917;     2.6293;    16.1960;    16.7600;    10.8869;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N =  20, S/N ratio   2.50 NoiseSD   0.245  MaxD 18  MaxD(VC) 11
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0533;     0.1169;     1.8853;    36.5171;    38.5030;    24.9102;
18  .....   0;  .....   0;  ---   ---   .....   0;  .....   0;  .....   0;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N =  30, S/N ratio  10.00 NoiseSD   0.061  MaxD 20  MaxD(VC) 19
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0064;     0.0187;     0.2288;     1.1916;     1.2417;     0.6701;
50pc     0.0021;     0.0047;     0.0043;     0.0099;     0.0116;     0.0053;
95pc     0.0234;     0.0748;     0.3239;     3.6852;     4.2520;     0.9907;
99pc     0.1026;     0.1919;     1.9990;    26.2156;    26.2156;    15.0542;
Max       0.1604;     2.1525;    92.2390;   106.6274;   106.6274;   106.6274;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N =  30, S/N ratio   5.00 NoiseSD   0.122  MaxD 20  MaxD(VC) 19
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0125;     0.0281;     0.2706;     2.1147;     2.2608;     0.9889;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N =  30, S/N ratio   3.30 NoiseSD   0.186  MaxD 20  MaxD(VC) 19
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0211;     0.0382;     0.2834;     2.7250;     3.3901;     1.4695;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N =  30, S/N ratio   2.50 NoiseSD   0.245  MaxD 20  MaxD(VC) 19
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0288;     0.0543;     0.3746;    11.0609;    12.3142;     1.8985;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 100, S/N ratio   2.00 NoiseSD   0.306  MaxD 20  MaxD(VC) 20
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0085;     0.0115;     0.0137;     0.3339;     0.3204;     0.3212;
50pc     0.0076;     0.0094;     0.0093;     0.0110;     0.0106;     0.0106;
Max       0.0621;     0.1866;     2.0669;   122.5578;   122.5578;   122.5578;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 100, S/N ratio   1.00 NoiseSD   0.612  MaxD 20  MaxD(VC) 20
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0301;     0.0525;     0.1094;     0.1180;     0.0697;     0.0684;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 100, S/N ratio   0.50 NoiseSD   1.225  MaxD 20  MaxD(VC) 20
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0921;     0.1581;     0.1409;     3.2680;     3.1393;     3.1387;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 100, S/N ratio   0.25 NoiseSD   2.449  MaxD 20  MaxD(VC) 20
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.1755;     0.2703;     0.1863;     1.2425;     1.1484;     1.1897;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 100, S/N ratio   0.18 NoiseSD   3.402  MaxD 20  MaxD(VC) 20
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.2394;     0.3867;     0.2454;    13.1019;     0.7891;     0.8816;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 100, S/N ratio   0.13 NoiseSD   4.711  MaxD 20  MaxD(VC) 20
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.3508;     0.6396;     0.3554;     2.9207;     1.7376;     2.4722;

--------------------------------------------------------------
```
The results clearly show that none of the methods FPE, SCH, or GCV is competitive with VC or MML, except under conditions of low noise and large N, when all methods give similar results. Of the three, GCV comes closest. These three methods will not be further discussed.

Comparing MML and VC, MML gives lower average errors in all tests except for N = 100, S/N ≤ 0.5. In these tests, VC almost always chose degree 0 or 1, whereas MML often selected a higher degree. Other results on other target functions suggest that VC outperforms MML only when, by virtue of the shape of the target function, low N or large noise, the best approximating polynomial is of degree 0 or 1. Even then, the MML average error is only about twice the VC average, whereas (as in most of the tests shown above), the VC error can be 10 or 20 times the MML error on average, indeed hundreds of times greater on some targets, as will appear below.

The VC method is based on theory which does not assume that the target function belongs to the model family from which the approximation is drawn (in this case, the polynomials). MML, however, is in part motivated by theory which does make this assumption. It is curious that the only target/test conditions under which VC consistently betters MML are ones where this assumption is largely valid.

Examination of the percentile points of the error distributions shows that the MML and VC median errors are usually close. The VC average error is increased by a small number of cases giving very large VC errors. For instance, the test with N = 30, S/N = 10 shows VC actually with a slightly smaller median error than MML, but with an average error about 20 times greater.

The claim in CMV that although VC is based on worst-case bounds, it generally provides the best average error performance, is not supported by these results. Further, although CMV hints that the VC method has been designed to minimize the error exceeded in only 5% of cases, our results show that the 95-th percentile of the VC error distribution is usually larger (often much larger) than the MML 95% point. For instance, the N = 10, S/N = 10 test shows 95% points of 9.4 (VC) versus 0.6 (MML).

4.2 A logarithmic function

Here, the target function is a segment of the Log function close to its divergence at zero. It is therefore quite difficult to approximate well with a polynomial of modest degree. The full results are presented.

```
---------------------------------------------------------------

y = log (x + 1.01);
Target mean =   -0.275,   SD about mean =    0.927,  SD about 0 =    0.967
Minimum =    -4.6052,   Maximum =     0.6981

,,,,,,,'''''''
,,,,,'''''
,,,,'''''
,,,'''
,,'''
,,''
,'
,''
,
,'
,

'

'

,

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N =  20, S/N ratio  30.00 NoiseSD   0.032  MaxD 18  MaxD(VC) 11
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0117;     0.0510;     0.5492;     1.4009;     1.4758;     1.0794;
SD     0.0263;     0.2047;     8.2916;    11.0360;    11.1183;    10.6686;
5pc     0.0006;     0.0018;     0.0014;     0.0013;     0.0012;     0.0013;
25pc     0.0016;     0.0063;     0.0045;     0.0065;     0.0065;     0.0048;
50pc     0.0038;     0.0146;     0.0132;     0.0227;     0.0246;     0.0163;
75pc     0.0109;     0.0409;     0.0371;     0.1179;     0.1399;     0.0590;
95pc     0.0475;     0.1525;     0.2229;     2.6842;     3.0647;     1.2720;
99pc     0.1138;     0.5033;     7.3907;    23.5344;    24.0526;    13.8202;
Max       0.4116;     4.0862;   240.2782;   240.2782;   240.2782;   240.2782;
--------------
DEG  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;
0  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
1  0.363   2;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
2  0.175   2;  0.168  13;  0.151   6;  0.156   3;  0.156   3;  0.156   3;
3  0.088  16;  0.069 137;  0.084  54;  0.084  22;  0.069  15;  0.088  32;
4  0.037  41;  0.029 234;  0.035 151;  0.038  68;  0.053  58;  0.040  82;
5  0.029  70;  0.063 216;  0.020 216;  0.030 127;  0.028 114;  0.027 165;
6  0.012 137;  0.047 191;  0.114 192;  0.125 106;  0.155 100;  0.087 146;
7  0.009 189;  0.029 102;  0.871 162;  0.700 149;  0.723 145;  0.600 171;
8  0.006 192;  0.046  69;  2.321 116;  4.370 121;  3.881 137;  4.014 126;
9  0.004 149;  0.102  19;  1.126  73;  1.640 125;  1.678 127;  1.420 118;
10  0.003 104;  0.075  10;  0.775  25;  1.865  90;  1.820  96;  0.888  68;
11  0.003  58;  0.050   7;  0.066   5;  2.735  80;  2.504  88;  4.020  40;
12  0.003  27;  0.089   1;  ---   ---   1.219  40;  2.212  46;  1.063  24;
13  0.002  10;  0.504   1;  ---   ---   1.207  39;  1.246  38;  2.173  13;
14  0.002   2;  .....   0;  ---   ---   1.783  16;  1.586  18;  0.561   7;
15  0.001   1;  .....   0;  ---   ---   0.537   9;  0.537   9;  0.097   3;
16  .....   0;  .....   0;  ---   ---   0.049   1;  0.251   2;  .....   0;
17  .....   0;  .....   0;  ---   ---   8.295   3;  8.295   3;  0.229   2;
18  .....   0;  .....   0;  ---   ---   0.004   1;  0.004   1;  .....   0;
```
The above results again show a marked improvement of MML over VC, with the other three methods worse again. Note that the VC error quantiles up to the 75% mark are actually less than the MML quantiles: again it is the presence of a small number of very bad choices which lets VC down. It is interesting that VC's bad choices are not neccessarily of excessive degree. It made poor choices of polynomials of degrees 7, 8 and 9, and even some of degree 6, although MML also used these degrees frequently.

The very large average VC error for degree 8 suggests that VC sometimes chose a polynomial which had a reasonable fit to the training data, but which had excursions far outside the range of the training data. MML is dissuaded from such choices because, to maintain a good fit to the training data, the co-efficients of the estimated polynomial would have to be specified to high relative precision, requiring a long model description.

Another test on the same target function is shown below, where the training data contained much more information.
```
::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 300, S/N ratio  10.00 NoiseSD   0.097  MaxD 20  MaxD(VC) 20
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0006;     0.0013;     0.0012;     0.0009;     0.0009;     0.0009;
```

In this test, MML and VC performed about equally, but were bettered by all the "classical" methods. These methods should not be dismissed in situations when there are abundant training data, although in no case that we have found do the classical methods give notably lower errors than MML. Note also that in the test above, the classical methods fairly often used degree 20, and may well have chosen higher degrees had the program permitted. That is, the limitation to degree 20 may well have been the only thing preventing these methods from the serious overfitting which they exhibit in other tests.

4.3 A Function with Discontinuous Derivative

This target, a shifted version of the absolute value function, was used to test performance on a non-analytic form.

```
---------------------------------------------------------------
y = fabs (x + 0.3) - 0.3;
Target mean =    0.245,   SD about mean =    0.355,  SD about 0 =    0.432
Minimum =    -0.3000,   Maximum =     1.0000

,
,'
,'
,'
,'
,'
,'
,'
,'
',                                       ,'
',                                   ,'
',                               ,'
',                           ,'
',                       ,'
',                   ,'
',               ,'
',           ,'
',       ,'
',   ,'
','

::::::::::::::::::::::::::::::::::::::::::::;;;;;
(Moderate data available. MML somewhat better than VC, both better
than classical methods.)
1000 Cases, N =  50, S/N ratio  10.00 NoiseSD   0.043  MaxD 20  MaxD(VC) 20
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0007;     0.0024;     0.0093;     0.0963;     0.0965;     0.0531;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
(Plentiful data, all methods comparable.)
1000 Cases, N = 300, S/N ratio   3.00 NoiseSD   0.144  MaxD 20  MaxD(VC) 20
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0008;     0.0012;     0.0015;     0.0012;     0.0011;     0.0012;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
(Sparse data, MML much better than VC, classical methods poor.)
1000 Cases, N =  20, S/N ratio  10.00 NoiseSD   0.043  MaxD 18  MaxD(VC) 11
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0024;     0.0110;     0.1448;     1.2638;     1.4441;     0.7647;
SD     0.0030;     0.0281;     1.6201;     9.0421;     9.1974;     8.2531;
5pc     0.0007;     0.0011;     0.0010;     0.0011;     0.0012;     0.0010;
25pc     0.0011;     0.0025;     0.0019;     0.0028;     0.0035;     0.0021;
50pc     0.0016;     0.0036;     0.0034;     0.0166;     0.0260;     0.0054;
75pc     0.0026;     0.0062;     0.0081;     0.1836;     0.2495;     0.0467;
95pc     0.0062;     0.0444;     0.1684;     4.0199;     4.8170;     1.8431;
99pc     0.0151;     0.1589;     2.0720;    23.9615;    28.7447;    12.3919;
Max       0.0444;     0.3620;    33.7324;   238.2127;   238.2127;   238.2127;
(Note the huge maximum errors of the classical methods.)
--------------
DEG  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;
0  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
1  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
2  0.015  12;  .....   0;  0.080   1;  .....   0;  .....   0;  .....   0;
3  0.006 111;  0.005 510;  0.005 252;  0.006  29;  0.005  17;  0.006  64;
4  0.004  43;  0.032 112;  0.042  87;  0.060  36;  0.060  26;  0.047  54;
5  0.002 369;  0.003 278;  0.012 419;  0.042 180;  0.050 148;  0.026 294;
6  0.002 120;  0.036  59;  0.410 126;  0.439 124;  0.469 118;  0.361 151;
7  0.001 167;  0.045  34;  0.213  72;  2.332 132;  2.336 132;  2.286 132;
8  0.001 103;  0.072   5;  0.250  30;  0.702 135;  1.029 144;  0.551 107;
9  0.001  37;  0.114   1;  5.028  11;  2.095 110;  2.010 124;  2.277  79;
10  0.001  27;  0.018   1;  2.466   2;  0.851  83;  1.010  88;  0.586  48;
11  0.001   9;  .....   0;  .....   0;  2.372  60;  2.677  71;  0.480  29;
12  0.001   2;  .....   0;  ---   ---   3.225  60;  3.159  69;  3.684  22;
13  .....   0;  .....   0;  ---   ---   2.496  28;  2.318  37;  0.673  12;
14  .....   0;  .....   0;  ---   ---   2.366   8;  2.021  10;  8.503   2;
15  .....   0;  .....   0;  ---   ---   4.235  10;  4.235  10;  1.436   3;
16  .....   0;  .....   0;  ---   ---   7.112   4;  5.719   5;  2.161   3;
17  .....   0;  .....   0;  ---   ---   0.166   1;  0.166   1;  .....   0;
18  .....   0;  .....   0;  ---   ---   .....   0;  .....   0;  .....   0;
(Note the large VC errors for degrees 6...9, especially 9.)

::::::::::::::::::::::::::::::::::::::::::::;;;;;
(Large amount of noisy data. All methods do well, but VC surprisingly
the worst, even at the median.)
1000 Cases, N = 900, S/N ratio   1.00 NoiseSD   0.432  MaxD 20  MaxD(VC) 20
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0018;     0.0029;     0.0040;     0.0026;     0.0025;     0.0026;
SD     0.0019;     0.0054;     0.0403;     0.0951;     0.0959;     0.0909;
5pc     0.0008;     0.0013;     0.0025;     0.0011;     0.0011;     0.0011;
25pc     0.0012;     0.0024;     0.0029;     0.0017;     0.0017;     0.0017;
50pc     0.0016;     0.0030;     0.0032;     0.0023;     0.0025;     0.0023;
75pc     0.0022;     0.0035;     0.0038;     0.0032;     0.0032;     0.0032;
95pc     0.0032;     0.0044;     0.0098;     0.0052;     0.0043;     0.0051;
99pc     0.0039;     0.0054;     0.0106;     0.0069;     0.0053;     0.0069;
Max       0.0054;     0.0062;     0.0119;     0.0175;     0.0091;     0.0175;
--------------
DEG  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;
0  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
1  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
2  .....   0;  .....   0;  0.010 113;  .....   0;  .....   0;  .....   0;
3  0.003   9;  0.003 588;  0.003 882;  0.003  63;  0.003 243;  0.003  65;
4  0.003   4;  0.004  58;  0.006   1;  0.003  39;  0.003  67;  0.003  40;
5  0.002 319;  0.002 315;  0.004   4;  0.002 375;  0.002 485;  0.002 382;
6  0.002 106;  0.003  17;  .....   0;  0.002  96;  0.003  55;  0.002  95;
7  0.002 162;  0.004  18;  .....   0;  0.002 140;  0.003  90;  0.002 138;
8  0.002 262;  0.004   4;  .....   0;  0.002  90;  0.003  35;  0.002  88;
9  0.002  39;  .....   0;  .....   0;  0.003  41;  0.003   9;  0.003  41;
10  0.002  77;  .....   0;  .....   0;  0.003  56;  0.004   9;  0.003  55;
11  0.002  12;  .....   0;  .....   0;  0.004  30;  0.005   4;  0.004  28;
12  0.001   7;  .....   0;  .....   0;  0.005  15;  0.006   3;  0.005  15;
13  0.001   1;  .....   0;  .....   0;  0.004  18;  .....   0;  0.004  18;
14  .....   0;  .....   0;  .....   0;  0.005   7;  .....   0;  0.005   7;
15  .....   0;  .....   0;  .....   0;  0.006  12;  .....   0;  0.006  11;
16  0.001   1;  .....   0;  .....   0;  0.006   4;  .....   0;  0.006   4;
17  .....   0;  .....   0;  .....   0;  0.006   6;  .....   0;  0.006   5;
18  0.002   1;  .....   0;  .....   0;  0.006   3;  .....   0;  0.006   3;
19  .....   0;  .....   0;  .....   0;  0.012   3;  .....   0;  0.012   3;
20  .....   0;  .....   0;  .....   0;  0.010   2;  .....   0;  0.010   2;
(VC appears unduly reluctant to use degrees over 3.)
```

4.4 A Discontinuous function

This discontinuous function gives real problems for a polynomial approximation.

```
------------------------------------------------------------------
if (x < 0.0) y = 0.1; else y = 2*x - 1;
Target mean =    0.050,   SD about mean =    0.411,  SD about 0 =    0.414
Minimum =    -1.0000,   Maximum =     1.0000

,
,'
,
,'
,
,'
,
,'
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,                 ,
,'
,
'
,'
,'
,
'
,'
,'
,
,'

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N =  50, S/N ratio  10.00 NoiseSD   0.041  MaxD 20  MaxD(VC) 20
KEY       BEST;        MML;        V-C;        FPE;        SCH;        GCV;
AV     0.0225;     0.0482;     2.6485;     4.3638;     4.3657;     4.0354;
SD     0.0115;     0.0653;    13.6189;    22.4116;    22.4113;    22.1482;
5pc     0.0091;     0.0124;     0.0142;     0.0131;     0.0131;     0.0129;
25pc     0.0147;     0.0197;     0.0241;     0.0307;     0.0307;     0.0276;
50pc     0.0203;     0.0283;     0.0468;     0.1728;     0.1751;     0.1327;
75pc     0.0277;     0.0524;     0.4985;     1.5407;     1.5407;     1.1032;
95pc     0.0431;     0.1567;    12.2793;    17.2450;    17.2450;    15.9468;
99pc     0.0667;     0.2634;    46.1423;    67.2347;    67.2347;    64.1314;
Max       0.0914;     0.9774;   265.7027;   515.2273;   515.2273;   515.2273;
--------------
DEG  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;
0  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
1  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
2  0.083   3;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
3  0.068   5;  .....   0;  0.063   1;  .....   0;  .....   0;  .....   0;
4  0.053  16;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
5  0.040  51;  0.072  34;  0.034  31;  .....   0;  .....   0;  .....   0;
6  0.035  20;  0.115   4;  0.066   4;  .....   0;  .....   0;  .....   0;
7  0.030  92;  0.034  76;  0.038  93;  0.025   5;  0.022   4;  0.026  11;
8  0.029  36;  0.033   3;  3.592  10;  8.971   4;  8.971   4;  8.971   4;
9  0.024 105;  0.057 247;  0.147 125;  0.898  18;  0.898  18;  0.519  32;
10  0.025  32;  0.067  11;  5.823  28;  9.092  20;  9.092  20;  9.608  19;
11  0.021 121;  0.044 120;  0.873 100;  1.886  45;  1.973  43;  1.115  79;
12  0.021  35;  0.050  20; 12.382  46; 14.389  52; 14.670  51; 13.100  56;
13  0.018 118;  0.047 185;  0.685 117;  1.027 115;  1.035 114;  0.869 128;
14  0.019  44;  0.062  32; 14.069  51; 22.168  64; 21.839  65; 20.454  69;
15  0.016 105;  0.034  87;  1.334  84;  1.579 129;  1.628 125;  1.378 134;
16  0.018  38;  0.044  30;  6.240  56;  8.462  64;  8.098  67;  6.298  62;
17  0.014  66;  0.045  75;  1.482  79;  1.947 124;  1.916 126;  1.851 112;
18  0.015  25;  0.052  17;  1.834  51;  3.954  82;  3.954  82;  3.613  79;
19  0.015  61;  0.038  41;  0.583  81;  0.478 158;  0.478 158;  0.523 135;
20  0.016  27;  0.043  18;  5.869  43;  3.113 120;  3.041 123;  3.967  80;
```

This test shows all methods other than MML making very poor choices among the available polynomials. Even VC gives an average error about 50 times worse than MML, and a median error twice as bad. VC's poor choices are not always in the models of highest degree: it gives poor average errors for all degrees above 7. MML often chose higher degrees, but managed to keep low average errors. Similar results were found on this target function for other test conditions.

5 Discussion

5.1 Comparison with CMV

Our results largely corroborate CMV regarding the comparison of VC and "classical" methods. Except when there is abundant information in the data, VC is more reliable and gives lower median and average errors. CMV claims that their implementation of the V-C principle minimizes a guaranteed bound on the error which is exceeded with some specified risk. CMV does not explicitly state what level of risk is accepted in their implementation, but hints in the last paragraph that the risk is 5%, i.e., that they seek to minimize a bound on the 95-th percentile of the error distribution. Our results show the VC method usually gives the lowest median (50-th percentile) error of all methods, but its 95-th percentile often exceeds that of MML. It seems that either the CMV implementation is designed with a 50% risk in mind, or the bound on the 5% risk derived from the VC principle is not very tight.

The CMV report does not show average errors for the methods compared, only the 5, 25, 50, 75 and 95% percentiles. Their failure to compute average errors conceals the tendency of VC to give a small number of extremely large errors. The CMV statement that "... the best worst-case estimates generally imply the best average-case estimates" (of error) cannot be supported.

5.2 The MML method

It may be thought that the obvious superiority of MML over a wide range of target functions and test conditions is due to its being a Bayesian method, and as such, (unfairly?) using knowledge not available in the data. This is a misconception.

In fact, the implementation of MML tested here was designed to assume as little 'prior' information as possible. The 'prior' assumed for the parameters of the polynomial model is not actually a genuine prior: the scale assumed for the parameter distributions is determined by the variance V of the data y-values, and so is determined by the data themselves, not any prior beliefs. The only genuinely prior beliefs used in the MML method are the assumptions that (a) all degrees are equally likely, and (b) components of the explained variance (i.e. the coefficients) and the unexplained variance are all likely to be of similar size. These beliefs are misleading rather than helpful with respect to the tests reported here.

In the majority of tests, the best polynomial is of modest degree, and more importantly, the co-efficients of the best polynomial decrease in size with increasing power of x. Only in the last, discontinuous, target function is the spectrum of coefficients anything like uniform.

MML does use information not used by the other methods. It uses the variance of the observed y-values (as a natural scale for encoding estimates of the parameters of the model) and the determinant of the covariance matrix |M| of the given x-values (as indicating the precision to which the MML message should state the estimated parameters). This information is inherent in the data, not imported from any prior knowledge of what the best approximating polynomial is likely to be.

If, as our results strongly suggest, use of these features of the given data improves the quality of the approximating model, there seems no reason for refusing to use them. Indeed, use of an inference principle which ignores them seems almost perverse.

There is an argument in favour of using only the information from the errors produced by each of the candidate polynomial models. It can be argued that, if a large, low-order polynomial (e.g. a linear function of x) is added to the target, keeping the absolute noise variance unchanged, then the estimates of the existence and coefficients of higher-order polynomial components should be unaffected. In other words, if much of the variance of the given data can be explained by some low-order polynomial, our estimates of the higher-order components should be based only on what is left unexplained. The classical and VC methods have this "additive" behaviour, whereas MML does not. In the present implementation, the prior used in coding the coefficient of a high-order polynomial component is independent of the values estimated for lower-order components. To the extent that this prior affects the message-length cost of including and coding a high-order coefficient, the MML choice of whether or not to choose the higher order will be affected by the size of the lower-order components of the target function, which contribute to the data variance and hence to the prior used for higher orders.

The MML implementation could easily be modified to restore the "additive" property if this argument is found convincing. The prior used for the coefficient of order 0 could be based on the observed data variance about zero, as at present, but the prior used for the order-1 component could then be based on the residual variance of the data about the estimated order-0 model, and so on. That is, the prior density assumed for the coefficient aj would be conditional on the values, not only of the observed y-value variance V, but also on the estimated values of the coefficients of degree less than j. The prior assumed for the unexplained variance v in a model of degree d would be conditional on V and all estimated coefficients. However, we doubt if such a modification would much affect the results on the tests reported here. If anything, the MML performance should be improved, since the above conditional prior on the parameters is in closer accord with with the parameters of the best models than is the independant prior form assumed in this work.

6 Acknowledgements

This work was assisted by an Australian Research Council grant A49330662 and with facilities and assistance provided by Royal Holloway College.

Chris Wallace, 1997]

7. References

[1] V. Cherkassky, F. Mulier and V. Vapnik, Comparison of VC-method with classical methods for model selection, Proc. 1996 World Congress on Neural Networks.

[2] C. S. Wallace and P. R. Freeman, Estimation and inference by compact coding, J. of the Royal Statistical Society series B, 49(3), pp.240-265, 1987.

[3] ??? but there is a bibliography of all Chris Wallace's publications [here].

[4] H. Akaike, Statistical predictor information, Annals of the Inst. of Statistical Mathematics, 22, pp.203-217, 1970.

[5] G. Schwarz, Estimating the dimension of a model, Annals of Statistics, 6(2), pp.461-464, 1978.

[6] P. Craven and G. Wahba, Smoothing noisy data with spline functions, Numerische Mathematik, 31, pp.377-403, 1979.

[7] V. Vapnik, Statistical Learning Theory, Springer Verlag, 1995.