On the Selection of the Order of a Polynomial Model
C. S. Wallace

1 The problem
2 The experimental protocol
3 The Selection Methods
3.1 Selection by MML
3.1.1
All degrees from 0 to 20 are considered equally likely a priori, so every degree is coded with a code word of length ln(21) 'nits', or "natural bits." As all degrees have the same code length, the coding of the model degree has no influence on the choice of model.
3.1.2
3.1.3 Coding the data values
3.1.4 The total message length
4 Test results
The C/Unix programs used to obtain these results are obtainable from the author.
4.1 CMV's test function
 a = sin (π * (x + 1.0)); y = a * a;
Target mean = 0.500, SD about mean = 0.354, SD about 0 = 0.612
Minimum = 0.0000, Maximum = 1.0000

[Character plot of the target function]

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 10, S/N ratio 10.00 NoiseSD 0.061 MaxD 8 MaxD(VC) 4
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0804;   0.1857;   3.5005;  15.8055;  16.3748;  13.0748;
SD      0.0561;   0.2633;  24.2763;  63.8077;  65.4801;  56.2695;
5pc     0.0054;   0.0078;   0.1099;   0.0091;   0.0091;   0.0082;
25pc    0.0206;   0.0385;   0.1289;   0.0863;   0.0911;   0.0733;
50pc    0.0822;   0.1236;   0.1471;   0.7974;   0.8637;   0.5827;
75pc    0.1297;   0.1880;   0.4338;   5.4448;   5.5033;   4.2212;
95pc    0.1615;   0.6075;   9.4489;  60.7315;  63.7380;  48.9659;
99pc    0.1886;   1.3700;  73.2666; 306.5231; 378.0335; 277.2285;
Max     0.2481;   3.0411; 543.0019; 771.4965; 771.4965; 736.6776;

KEY         BEST;        MML;         VC;        FPE;        SCH;        GCV;
DEG    avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;
0      0.137 312;  0.141 222;  0.137 564;  .....   0;  .....   0;  .....   0;
1      0.142  51;  0.281  33;  0.205  32;  .....   0;  .....   0;  .....   0;
2      0.128  38;  0.406  27;  0.815  35;  7.587   2;  7.587   2;  2.365   7;
3      0.132   6;  0.698  23;  6.320 138; 13.099  17; 14.732  12; 12.184  26;
4      0.084 113;  0.303 177; 10.891 231; 39.709  78; 40.725  75; 24.368 113;
5      0.085   8;  0.421  30;             18.996 214; 22.030 210; 15.918 249;
6      0.033 373;  0.106 426;             10.882 340; 10.883 344;  6.819 363;
7      0.028  56;  0.112  52;             18.547 195; 18.577 196; 21.611 138;
8      0.018  43;  0.095  10;              7.068 154;  6.939 161;  5.450 104;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 10, S/N ratio 5.00 NoiseSD 0.122 MaxD 8 MaxD(VC) 4
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0932;   0.1751;   2.0351;  47.3772;  48.5253;  25.5923;
50pc    0.1068;   0.1339;   0.1422;   1.7633;   1.9880;   0.9442;
95pc    0.1656;   0.5032;   6.6747; 168.5623; 177.3814; 109.8585;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 10, S/N ratio 3.30 NoiseSD 0.186 MaxD 8 MaxD(VC) 4
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.1042;   0.1924;   1.5155;  56.3134;  61.4636;  40.5831;
50pc    0.1154;   0.1402;   0.1392;   2.5108;   2.8616;   0.9503;
95pc    0.1789;   0.5076;   3.1223; 230.9929; 252.1516; 163.1735;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 10, S/N ratio 2.50 NoiseSD 0.245 MaxD 8 MaxD(VC) 4
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.1119;   0.1823;   0.7567;  97.0134; 100.8881;  55.3633;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 20, S/N ratio 10.00 NoiseSD 0.061 MaxD 18 MaxD(VC) 11
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0223;   0.0652;   1.1778;   2.2633;   2.3883;   1.5057;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 20, S/N ratio 5.00 NoiseSD 0.122 MaxD 18 MaxD(VC) 11
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0319;   0.0743;   1.9007;   7.2120;   8.0623;   3.9577;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 20, S/N ratio 3.30 NoiseSD 0.186 MaxD 18 MaxD(VC) 11
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0434;   0.0917;   2.6293;  16.1960;  16.7600;  10.8869;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 20, S/N ratio 2.50 NoiseSD 0.245 MaxD 18 MaxD(VC) 11
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0533;   0.1169;   1.8853;  36.5171;  38.5030;  24.9102;

18     .....   0;  .....   0;              .....   0;  .....   0;  .....   0;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 30, S/N ratio 10.00 NoiseSD 0.061 MaxD 20 MaxD(VC) 19
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0064;   0.0187;   0.2288;   1.1916;   1.2417;   0.6701;
50pc    0.0021;   0.0047;   0.0043;   0.0099;   0.0116;   0.0053;
95pc    0.0234;   0.0748;   0.3239;   3.6852;   4.2520;   0.9907;
99pc    0.1026;   0.1919;   1.9990;  26.2156;  26.2156;  15.0542;
Max     0.1604;   2.1525;  92.2390; 106.6274; 106.6274; 106.6274;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 30, S/N ratio 5.00 NoiseSD 0.122 MaxD 20 MaxD(VC) 19
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0125;   0.0281;   0.2706;   2.1147;   2.2608;   0.9889;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 30, S/N ratio 3.30 NoiseSD 0.186 MaxD 20 MaxD(VC) 19
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0211;   0.0382;   0.2834;   2.7250;   3.3901;   1.4695;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 30, S/N ratio 2.50 NoiseSD 0.245 MaxD 20 MaxD(VC) 19
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0288;   0.0543;   0.3746;  11.0609;  12.3142;   1.8985;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 100, S/N ratio 2.00 NoiseSD 0.306 MaxD 20 MaxD(VC) 20
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0085;   0.0115;   0.0137;   0.3339;   0.3204;   0.3212;
50pc    0.0076;   0.0094;   0.0093;   0.0110;   0.0106;   0.0106;
Max     0.0621;   0.1866;   2.0669; 122.5578; 122.5578; 122.5578;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 100, S/N ratio 1.00 NoiseSD 0.612 MaxD 20 MaxD(VC) 20
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0301;   0.0525;   0.1094;   0.1180;   0.0697;   0.0684;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 100, S/N ratio 0.50 NoiseSD 1.225 MaxD 20 MaxD(VC) 20
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0921;   0.1581;   0.1409;   3.2680;   3.1393;   3.1387;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 100, S/N ratio 0.25 NoiseSD 2.449 MaxD 20 MaxD(VC) 20
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.1755;   0.2703;   0.1863;   1.2425;   1.1484;   1.1897;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 100, S/N ratio 0.18 NoiseSD 3.402 MaxD 20 MaxD(VC) 20
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.2394;   0.3867;   0.2454;  13.1019;   0.7891;   0.8816;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 100, S/N ratio 0.13 NoiseSD 4.711 MaxD 20 MaxD(VC) 20
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.3508;   0.6396;   0.3554;   2.9207;   1.7376;   2.4722;
4.2 A logarithmic function
Here, the target function is a segment of the log function close to its divergence at zero. It is therefore quite difficult to approximate well with a polynomial of modest degree. The full results are presented.

 y = log (x + 1.01);
Target mean = -0.275, SD about mean = 0.927, SD about 0 = 0.967
Minimum = -4.6052, Maximum = 0.6981

[Character plot of the target function]

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 20, S/N ratio 30.00 NoiseSD 0.032 MaxD 18 MaxD(VC) 11
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0117;   0.0510;   0.5492;   1.4009;   1.4758;   1.0794;
SD      0.0263;   0.2047;   8.2916;  11.0360;  11.1183;  10.6686;
5pc     0.0006;   0.0018;   0.0014;   0.0013;   0.0012;   0.0013;
25pc    0.0016;   0.0063;   0.0045;   0.0065;   0.0065;   0.0048;
50pc    0.0038;   0.0146;   0.0132;   0.0227;   0.0246;   0.0163;
75pc    0.0109;   0.0409;   0.0371;   0.1179;   0.1399;   0.0590;
95pc    0.0475;   0.1525;   0.2229;   2.6842;   3.0647;   1.2720;
99pc    0.1138;   0.5033;   7.3907;  23.5344;  24.0526;  13.8202;
Max     0.4116;   4.0862; 240.2782; 240.2782; 240.2782; 240.2782;

KEY         BEST;        MML;         VC;        FPE;        SCH;        GCV;
DEG    avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;
0      .....   0;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
1      0.363   2;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
2      0.175   2;  0.168  13;  0.151   6;  0.156   3;  0.156   3;  0.156   3;
3      0.088  16;  0.069 137;  0.084  54;  0.084  22;  0.069  15;  0.088  32;
4      0.037  41;  0.029 234;  0.035 151;  0.038  68;  0.053  58;  0.040  82;
5      0.029  70;  0.063 216;  0.020 216;  0.030 127;  0.028 114;  0.027 165;
6      0.012 137;  0.047 191;  0.114 192;  0.125 106;  0.155 100;  0.087 146;
7      0.009 189;  0.029 102;  0.871 162;  0.700 149;  0.723 145;  0.600 171;
8      0.006 192;  0.046  69;  2.321 116;  4.370 121;  3.881 137;  4.014 126;
9      0.004 149;  0.102  19;  1.126  73;  1.640 125;  1.678 127;  1.420 118;
10     0.003 104;  0.075  10;  0.775  25;  1.865  90;  1.820  96;  0.888  68;
11     0.003  58;  0.050   7;  0.066   5;  2.735  80;  2.504  88;  4.020  40;
12     0.003  27;  0.089   1;              1.219  40;  2.212  46;  1.063  24;
13     0.002  10;  0.504   1;              1.207  39;  1.246  38;  2.173  13;
14     0.002   2;  .....   0;              1.783  16;  1.586  18;  0.561   7;
15     0.001   1;  .....   0;              0.537   9;  0.537   9;  0.097   3;
16     .....   0;  .....   0;              0.049   1;  0.251   2;  .....   0;
17     .....   0;  .....   0;              8.295   3;  8.295   3;  0.229   2;
18     .....   0;  .....   0;              0.004   1;  0.004   1;  .....   0;
::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 300, S/N ratio 10.00 NoiseSD 0.097 MaxD 20 MaxD(VC) 20
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0006;   0.0013;   0.0012;   0.0009;   0.0009;   0.0009;

In this test, MML and VC performed about equally, but were bettered by all the "classical" methods. These methods should not be dismissed in situations where training data are abundant, although in no case that we have found do the classical methods give notably lower errors than MML. Note also that in the test above, the classical methods fairly often used degree 20, and may well have chosen higher degrees had the program permitted. That is, the limitation to degree 20 may well have been the only thing preventing these methods from the serious overfitting which they exhibit in other tests.

4.3 A Function with Discontinuous Derivative
This target, a shifted version of the absolute value function, was used to test performance on a non-analytic form.

 y = fabs (x + 0.3) - 0.3;
Target mean = 0.245, SD about mean = 0.355, SD about 0 = 0.432
Minimum = -0.3000, Maximum = 1.0000

[Character plot of the target function]

::::::::::::::::::::::::::::::::::::::::::::;;;;;
(Moderate data available. MML somewhat better than VC, both better than classical methods.)
1000 Cases, N = 50, S/N ratio 10.00 NoiseSD 0.043 MaxD 20 MaxD(VC) 20
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0007;   0.0024;   0.0093;   0.0963;   0.0965;   0.0531;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
(Plentiful data, all methods comparable.)
1000 Cases, N = 300, S/N ratio 3.00 NoiseSD 0.144 MaxD 20 MaxD(VC) 20
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0008;   0.0012;   0.0015;   0.0012;   0.0011;   0.0012;

::::::::::::::::::::::::::::::::::::::::::::;;;;;
(Sparse data, MML much better than VC, classical methods poor.)
1000 Cases, N = 20, S/N ratio 10.00 NoiseSD 0.043 MaxD 18 MaxD(VC) 11
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0024;   0.0110;   0.1448;   1.2638;   1.4441;   0.7647;
SD      0.0030;   0.0281;   1.6201;   9.0421;   9.1974;   8.2531;
5pc     0.0007;   0.0011;   0.0010;   0.0011;   0.0012;   0.0010;
25pc    0.0011;   0.0025;   0.0019;   0.0028;   0.0035;   0.0021;
50pc    0.0016;   0.0036;   0.0034;   0.0166;   0.0260;   0.0054;
75pc    0.0026;   0.0062;   0.0081;   0.1836;   0.2495;   0.0467;
95pc    0.0062;   0.0444;   0.1684;   4.0199;   4.8170;   1.8431;
99pc    0.0151;   0.1589;   2.0720;  23.9615;  28.7447;  12.3919;
Max     0.0444;   0.3620;  33.7324; 238.2127; 238.2127; 238.2127;
(Note the huge maximum errors of the classical methods.)

KEY         BEST;        MML;         VC;        FPE;        SCH;        GCV;
DEG    avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;
0      .....   0;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
1      .....   0;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
2      0.015  12;  .....   0;  0.080   1;  .....   0;  .....   0;  .....   0;
3      0.006 111;  0.005 510;  0.005 252;  0.006  29;  0.005  17;  0.006  64;
4      0.004  43;  0.032 112;  0.042  87;  0.060  36;  0.060  26;  0.047  54;
5      0.002 369;  0.003 278;  0.012 419;  0.042 180;  0.050 148;  0.026 294;
6      0.002 120;  0.036  59;  0.410 126;  0.439 124;  0.469 118;  0.361 151;
7      0.001 167;  0.045  34;  0.213  72;  2.332 132;  2.336 132;  2.286 132;
8      0.001 103;  0.072   5;  0.250  30;  0.702 135;  1.029 144;  0.551 107;
9      0.001  37;  0.114   1;  5.028  11;  2.095 110;  2.010 124;  2.277  79;
10     0.001  27;  0.018   1;  2.466   2;  0.851  83;  1.010  88;  0.586  48;
11     0.001   9;  .....   0;  .....   0;  2.372  60;  2.677  71;  0.480  29;
12     0.001   2;  .....   0;              3.225  60;  3.159  69;  3.684  22;
13     .....   0;  .....   0;              2.496  28;  2.318  37;  0.673  12;
14     .....   0;  .....   0;              2.366   8;  2.021  10;  8.503   2;
15     .....   0;  .....   0;              4.235  10;  4.235  10;  1.436   3;
16     .....   0;  .....   0;              7.112   4;  5.719   5;  2.161   3;
17     .....   0;  .....   0;              0.166   1;  0.166   1;  .....   0;
18     .....   0;  .....   0;              .....   0;  .....   0;  .....   0;
(Note the large VC errors for degrees 6...9, especially 9.)

::::::::::::::::::::::::::::::::::::::::::::;;;;;
(Large amount of noisy data. All methods do well, but VC surprisingly the worst, even at the median.)
1000 Cases, N = 900, S/N ratio 1.00 NoiseSD 0.432 MaxD 20 MaxD(VC) 20
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0018;   0.0029;   0.0040;   0.0026;   0.0025;   0.0026;
SD      0.0019;   0.0054;   0.0403;   0.0951;   0.0959;   0.0909;
5pc     0.0008;   0.0013;   0.0025;   0.0011;   0.0011;   0.0011;
25pc    0.0012;   0.0024;   0.0029;   0.0017;   0.0017;   0.0017;
50pc    0.0016;   0.0030;   0.0032;   0.0023;   0.0025;   0.0023;
75pc    0.0022;   0.0035;   0.0038;   0.0032;   0.0032;   0.0032;
95pc    0.0032;   0.0044;   0.0098;   0.0052;   0.0043;   0.0051;
99pc    0.0039;   0.0054;   0.0106;   0.0069;   0.0053;   0.0069;
Max     0.0054;   0.0062;   0.0119;   0.0175;   0.0091;   0.0175;

KEY         BEST;        MML;         VC;        FPE;        SCH;        GCV;
DEG    avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;
0      .....   0;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
1      .....   0;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
2      .....   0;  .....   0;  0.010 113;  .....   0;  .....   0;  .....   0;
3      0.003   9;  0.003 588;  0.003 882;  0.003  63;  0.003 243;  0.003  65;
4      0.003   4;  0.004  58;  0.006   1;  0.003  39;  0.003  67;  0.003  40;
5      0.002 319;  0.002 315;  0.004   4;  0.002 375;  0.002 485;  0.002 382;
6      0.002 106;  0.003  17;  .....   0;  0.002  96;  0.003  55;  0.002  95;
7      0.002 162;  0.004  18;  .....   0;  0.002 140;  0.003  90;  0.002 138;
8      0.002 262;  0.004   4;  .....   0;  0.002  90;  0.003  35;  0.002  88;
9      0.002  39;  .....   0;  .....   0;  0.003  41;  0.003   9;  0.003  41;
10     0.002  77;  .....   0;  .....   0;  0.003  56;  0.004   9;  0.003  55;
11     0.002  12;  .....   0;  .....   0;  0.004  30;  0.005   4;  0.004  28;
12     0.001   7;  .....   0;  .....   0;  0.005  15;  0.006   3;  0.005  15;
13     0.001   1;  .....   0;  .....   0;  0.004  18;  .....   0;  0.004  18;
14     .....   0;  .....   0;  .....   0;  0.005   7;  .....   0;  0.005   7;
15     .....   0;  .....   0;  .....   0;  0.006  12;  .....   0;  0.006  11;
16     0.001   1;  .....   0;  .....   0;  0.006   4;  .....   0;  0.006   4;
17     .....   0;  .....   0;  .....   0;  0.006   6;  .....   0;  0.006   5;
18     0.002   1;  .....   0;  .....   0;  0.006   3;  .....   0;  0.006   3;
19     .....   0;  .....   0;  .....   0;  0.012   3;  .....   0;  0.012   3;
20     .....   0;  .....   0;  .....   0;  0.010   2;  .....   0;  0.010   2;
(VC appears unduly reluctant to use degrees over 3.)

4.4 A Discontinuous function
This discontinuous function gives real problems for a polynomial approximation.

 if (x < 0.0) y = 0.1; else y = 2*x - 1;
Target mean = 0.050, SD about mean = 0.411, SD about 0 = 0.414
Minimum = -1.0000, Maximum = 1.0000

[Character plot of the target function]

::::::::::::::::::::::::::::::::::::::::::::;;;;;
1000 Cases, N = 50, S/N ratio 10.00 NoiseSD 0.041 MaxD 20 MaxD(VC) 20
KEY       BEST;      MML;       VC;      FPE;      SCH;      GCV;
AV      0.0225;   0.0482;   2.6485;   4.3638;   4.3657;   4.0354;
SD      0.0115;   0.0653;  13.6189;  22.4116;  22.4113;  22.1482;
5pc     0.0091;   0.0124;   0.0142;   0.0131;   0.0131;   0.0129;
25pc    0.0147;   0.0197;   0.0241;   0.0307;   0.0307;   0.0276;
50pc    0.0203;   0.0283;   0.0468;   0.1728;   0.1751;   0.1327;
75pc    0.0277;   0.0524;   0.4985;   1.5407;   1.5407;   1.1032;
95pc    0.0431;   0.1567;  12.2793;  17.2450;  17.2450;  15.9468;
99pc    0.0667;   0.2634;  46.1423;  67.2347;  67.2347;  64.1314;
Max     0.0914;   0.9774; 265.7027; 515.2273; 515.2273; 515.2273;

KEY         BEST;        MML;         VC;        FPE;        SCH;        GCV;
DEG    avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;  avERR CNT;
0      .....   0;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
1      .....   0;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
2      0.083   3;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
3      0.068   5;  .....   0;  0.063   1;  .....   0;  .....   0;  .....   0;
4      0.053  16;  .....   0;  .....   0;  .....   0;  .....   0;  .....   0;
5      0.040  51;  0.072  34;  0.034  31;  .....   0;  .....   0;  .....   0;
6      0.035  20;  0.115   4;  0.066   4;  .....   0;  .....   0;  .....   0;
7      0.030  92;  0.034  76;  0.038  93;  0.025   5;  0.022   4;  0.026  11;
8      0.029  36;  0.033   3;  3.592  10;  8.971   4;  8.971   4;  8.971   4;
9      0.024 105;  0.057 247;  0.147 125;  0.898  18;  0.898  18;  0.519  32;
10     0.025  32;  0.067  11;  5.823  28;  9.092  20;  9.092  20;  9.608  19;
11     0.021 121;  0.044 120;  0.873 100;  1.886  45;  1.973  43;  1.115  79;
12     0.021  35;  0.050  20; 12.382  46; 14.389  52; 14.670  51; 13.100  56;
13     0.018 118;  0.047 185;  0.685 117;  1.027 115;  1.035 114;  0.869 128;
14     0.019  44;  0.062  32; 14.069  51; 22.168  64; 21.839  65; 20.454  69;
15     0.016 105;  0.034  87;  1.334  84;  1.579 129;  1.628 125;  1.378 134;
16     0.018  38;  0.044  30;  6.240  56;  8.462  64;  8.098  67;  6.298  62;
17     0.014  66;  0.045  75;  1.482  79;  1.947 124;  1.916 126;  1.851 112;
18     0.015  25;  0.052  17;  1.834  51;  3.954  82;  3.954  82;  3.613  79;
19     0.015  61;  0.038  41;  0.583  81;  0.478 158;  0.478 158;  0.523 135;
20     0.016  27;  0.043  18;  5.869  43;  3.113 120;  3.041 123;  3.967  80;

This test shows all methods other than MML making very poor choices among the available polynomials. Even VC gives an average error about 50 times worse than MML, and a median error twice as bad. VC's poor choices are not always in the models of highest degree: it gives poor average errors for all degrees above 7. MML often chose higher degrees, but managed to keep low average errors. Similar results were found on this target function for other test conditions.

5 Discussion
5.1 Comparison with CMV
Our results largely corroborate CMV regarding the comparison of VC and "classical" methods. Except when there is abundant information in the data, VC is more reliable and gives lower median and average errors. CMV claim that their implementation of the VC principle minimizes a guaranteed bound on the error which is exceeded with some specified risk. CMV do not explicitly state what level of risk is accepted in their implementation, but hint in the last paragraph that the risk is 5%, i.e., that they seek to minimize a bound on the 95th percentile of the error distribution.
Our results show the VC method usually gives the lowest median (50th percentile) error of all methods, but its 95th percentile often exceeds that of MML. It seems that either the CMV implementation is designed with a 50% risk in mind, or the bound on the 5% risk derived from the VC principle is not very tight. The CMV report does not show average errors for the methods compared, only the 5th, 25th, 50th, 75th and 95th percentiles. Their failure to compute average errors conceals the tendency of VC to give a small number of extremely large errors. The CMV statement that "... the best worst-case estimates generally imply the best average-case estimates" (of error) cannot be supported.

5.2 The MML method
It may be thought that the obvious superiority of MML over a wide range of target functions and test conditions is due to its being a Bayesian method, and as such, (unfairly?) using knowledge not available in the data. This is a misconception. In fact, the implementation of MML tested here was designed to assume as little 'prior' information as possible. The 'prior' assumed for the parameters of the polynomial model is not actually a genuine prior: the scale assumed for the parameter distributions is determined by the variance V of the data y-values, and so is determined by the data themselves, not any prior beliefs. The only genuinely prior beliefs used in the MML method are the assumptions that
In the majority of tests, the best polynomial is of modest degree, and more importantly, the coefficients of the best polynomial decrease in size with increasing power of x. Only in the last, discontinuous, target function is the spectrum of coefficients anything like uniform.

MML does use information not used by the other methods. It uses the variance of the observed y-values (as a natural scale for encoding estimates of the parameters of the model) and the determinant of the covariance matrix M of the given x-values (as indicating the precision to which the MML message should state the estimated parameters). This information is inherent in the data, not imported from any prior knowledge of what the best approximating polynomial is likely to be. If, as our results strongly suggest, use of these features of the given data improves the quality of the approximating model, there seems no reason for refusing to use them. Indeed, use of an inference principle which ignores them seems almost perverse.

There is an argument in favour of using only the information from the errors produced by each of the candidate polynomial models. It can be argued that, if a large, low-order polynomial (e.g. a linear function of x) is added to the target, keeping the absolute noise variance unchanged, then the estimates of the existence and coefficients of higher-order polynomial components should be unaffected. In other words, if much of the variance of the given data can be explained by some low-order polynomial, our estimates of the higher-order components should be based only on what is left unexplained. The classical and VC methods have this "additive" behaviour, whereas MML does not. In the present implementation, the prior used in coding the coefficient of a high-order polynomial component is independent of the values estimated for lower-order components.
To the extent that this prior affects the message-length cost of including and coding a high-order coefficient, the MML choice of whether or not to choose the higher order will be affected by the size of the lower-order components of the target function, which contribute to the data variance and hence to the prior used for higher orders.

The MML implementation could easily be modified to restore the "additive" property if this argument is found convincing. The prior used for the coefficient of order 0 could be based on the observed data variance about zero, as at present, but the prior used for the order-1 component could then be based on the residual variance of the data about the estimated order-0 model, and so on. That is, the prior density assumed for the coefficient a_{j} would be conditional not only on the observed y-value variance V, but also on the estimated values of the coefficients of degree less than j. The prior assumed for the unexplained variance v in a model of degree d would be conditional on V and all estimated coefficients.

However, we doubt that such a modification would much affect the results on the tests reported here. If anything, the MML performance should be improved, since the above conditional prior on the parameters is in closer accord with the parameters of the best models than is the independent prior form assumed in this work.

6 Acknowledgements
This work was assisted by an Australian Research Council grant A49330662 and with facilities and assistance provided by Royal Holloway College.
[© Chris Wallace, 1997]
7. References

