Multiple Factor Analysis by MML EstimationC. S. Wallace 

Abstract: Previous work by Wallace and Freeman (1992) on the application of Minimum Message Length (MML) estimation to the Factor Analysis model of a multivariate Gaussian population is extended to allow several common factors. The extension is not trivial, raising problems in the choice of an appropriate prior for the factors, and in the evaluation of the Fisher information of the model. However, the resulting complications appear to cancel out, giving an estimator similar to the singlefactor case. The estimator has been extensively tested on simulated data, and compared with the maximum likelihood and Akaike Information Criterion (AIC) estimator. The MML estimator is found to be substantially more accurate, to provide consistent estimates of factor scores, and to recover the number of common factors more reliably than a likelihoodratio test among maximum likelihood models. Keywords: Minimum Message Length, MML, factor analysis, multivariate Gaussian.
1. Introduction1.1 The modelWe have developed and tested a Minimum Message Length (MML) estimator for a version of the well known Factor Analysis model for a multivariate population, here assumed to have a Gaussian density. The work extends an earlier work (Wallace and Freeman 1992), which addressed the restricted problem with a single factor. Here, we consider models with an unknown number of factors. In some respects the earlier work generalizes in a straightforward way, and we assume familiarity with it. For consistency, we retain and extend the perhaps unusual notation of the singlefactor paper. Hereafter, that paper is referred to as WF. The data is a sample of N data vectors
where parameter μ_{k} is the mean in dimension k, parameter σ_{k} is the "specific variation" in dimension k, parameters {a_{j}, j=1...J} are a set of J "Factor load" vectors each of dimension K, parameters {v_{j}, j=1...J} are a set of J "score" vectors each of dimension N, and the NK variates {r_{nk}, n=1...N, k=1...K} are regarded as having independent N(0,1) distributions (Harman 1967). The conceptual model underlying factor analysis is that there exist some hidden, or latent, variables which have a linear effect on the K variables actually measured. The aim of a factor estimation method is to discover, to the extent revealed by the data, the number J of these variables, their values {v_{nj}} for each member of the sample, and their effects on the measured variables, represented in the model by the factor load vectors {a_{j}}. The model further assumes that each measured variable is subject, independently of all other variables, to some Normallydistributed variation or measurement error, r_{nk}, which is not affected by the hidden variables. The magnitude of this "specific" variance is modelled by the set of parameters {σ_{k}}. In Maximum likelihood (ML) treatment of the factor model, the model actually estimated is less complete. As ML is incapable of consistent simultaneous estimation of both factor loads and factor scores, the latter are treated not as parameters but as random variables with N(0,1) distributions, and the inferred model contains no inference of their individual values. The model (1.1) contains ambiguities. First, the factor load vector a_{j} and factor score vector v_{j} appear in the model only via their product. Thus the data cannot reveal their individual lengths. (By "length", we mean throughout the Euclidean vector length √(a^{T}a) = a.) This ambiguity is conventionally resolved by assuming the scores {v_{nj}} to be parameters with independent unit Normal priors. An alternative resolution is to constrain the scores so that Σ_{n} v_{nj}^{2}=N for all j. This alternative is mathematically more difficult and will not be pursued. A more troublesome ambiguity is that the data cannot reveal
the individual factor loads. For each load vector
a_{j},
define the scaled load t_{j} given by
1.2 MMLMinimum Message Length (Wallace and Freeman 1987) or MML is a
form of Bayesian model selection and estimation principle.
Using the result from Information Theory that the most
efficient binary encoding of an event or value of
probability P has length MML chooses the model and parameter estimates which minimize the total message length. Among its virtues are its invariance under nonlinear transformations of the parameter space, and its avoidance of the overfitting to which Maximum Likelihood (ML) is prone. For convergence theorems see Barron and Cover (1991), Wallace (1989). If the set of models contemplated is a family of models of
increasing order and number of parameters, as is the case in
the Factor problem, MML chooses the order and parameter
values in one step.
If regarded simply as an
orderselection process, it shares with Schwartz (1978) the
property of penalizing the loglikelihood by an amount of order
Both Bozdogan and Schwartz thus argue that the choice of
order should be based on maximizing a quantity like
For certain parameters of the factor model,
The aim of this paper is not to show how prior information can be included in a Factor analysis (although the work presented here could be extended to do so) but rather to show that even when an uninformative prior is assumed, MML gives results different from, and superior to, the results from maximizing the likelihood. 1.3 Related WorkPress and Shigemasu (1989) surveyed several Bayesian approaches to Factor analysis. A property shared by them all is that point estimates of factor scores and loads are obtained as either modes or means of the posterior density, or, in most methods, modes of a marginal posterior density. Such estimators are not invariant under nonlinear transformations of the parameter space. For instance, they would give different results were the factor load vectors to be specified in Polar rather than Cartesian coordinates. Press and Shigemasu described a new formulation which allows an explicit expression for the marginal posterior of factor scores. Adopting its mean as an estimate, they find a posterior density for the loads, and are able to give credibility intervals for elements of the load vectors. The method appears to require an informative prior for the loads, and thus bypasses questions of the number and identification of latent factors. Akaike (1987) describes his Information Criterion (AIC) as
a means of choosing a model which has maximal predictive
accuracy in the sense of minimizing the expected KullbackLeibler
distance between the population and model distributions.
He proposed AIC, basically
2. PriorsIn this paper, we attempt to adopt priors which are as nearly as possible uninformative. In considering the choice of a prior for the latent variable effects (the "factor loadings" a_{j}) the estimate of σ_{k} for each measured variable may be taken, in default of any other measure, as determining a natural scale for that variable. The effects of hidden variable j on the measured variables, when expressed in these natural scales, are given by the vector t_{j} and the entire factor model is most "naturally" expressed in terms of measured and hidden variables represented in the "naturally scaled" space. We therefore suppose the prior density over the length and direction of a factor load vector to be colourless in the scaled space. In particular, we suppose t_{j} to have a spherically symmetric prior density, and each of the J such vectors to be an independent realization of this density. This assumption does not carry over to the priors assumed for the estimated OLV vectors {β_{j}}. If the "true" scaled effects of the hidden variables are given by the TLV vectors {t_{j}}, we have no reason to suppose that these vectors will be mutually orthogonal. Indeed, we assume their directions to be independently uniformly distributed, this being the vaguest possible prior for their directions. Pro tem, assume the TLVs to have independent spherically symmetric prior densities of Kdimensional Normal form with scale parameter ρ. If a set of J vectors {u_{j}} are drawn from this density, the joint density of their lengths {u_{j}} is
If now a set of J mutuallyorthogonal vectors {c_{j}} is constructed such that
the joint distribution of the lengths {c_{j}} is proportional to (Muirhead, 1982, p.107)
The density (2.2) has the same form as (2.1) save for a "correction" factor which suppresses the density if two or more vectors have similar length, and inflates the density when one or more vectors are small. Thus, starting from the rather vague prior over the TLVs, namely that they have independent sphericallysymmetric Normal densities, we are led to conclude that an appropriate prior for the OLVs actually estimated shows their lengths not to be independently distributed. It has been suggested that one might prefer to model the OLV prior directly, presumably assuming the OLV lengths to be independent. However, our conceptual model is that the observed data are linearly affected by several distinct hidden variables, and there seems no prior reason to suppose that the effects of these variables on the measured variables should bear any relation to one another such as orthogonality. If this argument is accepted, then we expect the lengths of the OLVs rarely to be close to equality. A pair of OLVs can have equal length only in the unlikely coincidence that a pair of hidden variables (TLVs) happens to have equal lengths and to be orthogonal. Form (2.2) expresses the prior belief that such coincidences are unlikely. The lengths of the v and β
vectors for an OLV are confounded in the model,
as only their product affects the likelihood.
Thus, the estimator really estimates the product
where C is a normalization constant independent of ρ. The factor 2^{J} J! is included because the model is unchanged by simultaneous negation of β_{j} and v_{j}, and the labelling of the OLVs is arbitrary. An estimator was constructed using the prior (2.3).
It performed well in trials on simulated data provided the
lengths of the load vectors were not greatly dissimilar.
However, when the lengths ranged over a ratio of five or more,
the estimator tended to overestimate the shortest and
underestimate the longest. This bias arises because, for large K,
the multivariate Normal prior implies a TLV
length prior (2.1) with a relatively narrow peak at
Unfortunately, the Normal TLV prior is the only form for which we were able to find an explicit expression for the corresponding OLV prior. As a compromise, we adopted the following OLV prior for load and score lengths:
and C_{NKJ} is a normalization constant discussed later. This form assumes a very diffuse load length prior, and retains a "correction" factor similar to that of the Normal form (2.3). Simple geometric arguments show that the joint OLV length density derived from any smooth, independent, TLV length density will exhibit a "correction factor" of roughly this form. The prior (2.4) is a generalization of the "Cauchy" prior used in WF, and is far more diffuse than the Normal form. The joint prior density of the directions of the OLVs is assumed to have the uniform value
is the surface area of the unit Dsphere, and we similarly assume the joint prior density of the directions of the score vectors to be uniformly
As in WF, we assume the specific variances {σ_{k}} to have independent prior densities proportional to 1/σ_{k} in some finite range, and the means {μ_{k}} to have independent uniform priors in some finite range. The assumed priors import little useful information into the analysis. Those for means and specific variances have the conventional "uninformative" forms, the priors of load and score vectors presume no preference for any direction, and the prior for load strengths expresses only the vaguest expectation that they will be of the same order as the specific variations. 3. The Information MatrixMML chooses the model and parameter estimate θ which minimize
where h(θ) is the prior density of the parameter vector θ,
and 1/√I(θ)
is an approximation to
the volume in θspace spanned by the
expected estimation error.
I(θ) is the
determinant of the matrix of expected partial second derivatives of T
with respect to the components of θ. For many
estimation problems, I(θ)
is well approximated
by the Fisher Information, but for the factor model,
the derivatives of h() cannot be neglected.
As in WF (section 3), we take into account the derivatives of
Using Σ_{n} v_{n} = 0 from WF p.198, and changing to scaled components gives
The calculation in WF generalizes directly to J>1 factors giving
Note that (3.2) is the determinant of a matrix which includes derivatives with respect to all K components of each OLV β_{j} and all N components of each score vector v_{j}. Thus, it includes the sensitivity of T to differential changes of {β_{j}} and {v_{j}} which do not preserve the mutual orthogonality of OLVs and score vectors. However, in computing (3.2), use has been made of the fact that the derivatives are evaluated for mutuallyorthogonal vectors. The question of enforcing the derivatives to preserve orthogonality is addressed below. First, we switch to the polar parameterization of the model, using length and direction rather than Cartesian components for the OLVs and score vectors. In this change of parameters, the sensitivity determinant transforms as the square of a density, giving
Consider some pair of OLVs β_{j} and β_{l}, and their associated score vectors v_{j} and v_{l}. The expression (3.3) contains contributions from the sensitivity of T to changes in direction of β_{j}, β_{l}, v_{j} and v_{l}. Let us consider changes in the directions of β_{j} and β_{l} corresponding to differential rotations of these vectors in the plane containing them both. Let θ_{j} be an angle giving the direction of β_{j} in the plane, and θ_{l} be an angle giving the direction of β_{l} in the plane. Similarly, let φ_{j}, φ_{l} be angles giving the directions of v_{j}, v_{l} in the plane containing them both. Then I_{3}(θ) contains a contribution due to the expected second differentials of T with respect to θ_{j}, θ_{l}, φ_{j} and φ_{l}. It can be shown that this contribution amounts to a multiplicative factor of I_{3} given by
and that there are no crossderivatives with other parameters of the model with nonzero expectation. As noted above, (3.4) arises from the unconstrained variation of the four parameters θ_{j}, θ_{l}, φ_{j} and φ_{l}. In fact, the four parameters are constrained by the requirement of orthogonality, so that there are really only two parameters, say θ and φ, with
When T is expressed in terms of these two parameters, and its second differentials with respect to them calculated, it is found that together they contribute to the sensitivity determinant a multiplicative factor of the form
rather than the form (3.4). The above expression shows that when factors j and l have equal size, a rotation can be applied to the OLVs and score vectors which has no effect on the likelihood or the priors, and hence no effect on T. The reader might be surprised by the fact that this degeneracy becomes apparent only when the orthogonality constraint is taken into account, since such rotations are possible, and have no effect on T, whether or not the OLVs are explicitly constrained to be orthogonal. The reason that the degeneracy is not apparent in the expression I_{3}(θ) is that I_{3}(θ) indicates the effects of perturbations of the parameters on T only to second order. When the OLVs and score vectors are not explicitly constrained, simultaneous variation of two OLVs and two score vectors is required to produce a perturbation having no effect on T, and so the degeneracy would be revealed only by a fourthorder analysis treating differentials up to the fourth. Once the orthogonality constraint is applied, however, the simultaneous variation involves only two, rather than four, dimensions of parameter space, and so is revealed by a secondorder analysis using second partial differentials. Modifying I_{3}(θ) by replacing a factor of form (3.4) by a factor of form (3.6) for every pair of OLVs gives the final sensitivity determinant as
4. The MML EstimatorIn this section, σ_{k} , β_{j} , v_{j}, etc. refer to estimates rather than "true" parameter values. The MML estimator is chosen to minimize
Omitting constant terms, we have
where {y_{nk}}
are the scaled data
Note that the factors of the form
This cancellation is not an accident, as these factors arise from a singularity in the mapping of TLVs onto OLVs which affects both the prior and I in similar ways. Consider the example of a model with only two latent vectors (J=2) in K dimensions. The space of possible TLVs has 2K dimensions, but the space of possible OLVs has only 2K1, as the OLVs lose one degree of freedom by being constrained to be orthogonal. For each OLV pair, there is a onedimensional manifold of TLV pairs which map onto the OLV pair. Normally, the mapping is unique: such a manifold of TLV pairs maps to a single OLV pair. However, consider a set of TLV pairs in which all pairs lie in the same fixed plane of Kspace, both TLVs of all pairs have the same fixed length, and the two TLVs of each pair are orthogonal. This set is a 1dimensional manifold, as there is only one degree of freedom: the angular orientation of the TLV pair in the fixed plane. The matrix
has the same value for all pairs in the set. However, rather than mapping into a single OLV pair, such a special set of TLV pairs maps into a 1dimensional manifold of OLV pairs identical to the set of TLV pairs. Normally, the prior density in a 1dimensional manifold of TLV pairs condenses onto a single OLV pair, but for a special TLV set, as above, the prior density is spread over a manifold of OLV pairs, so these OLV pairs acquire a zero prior density. Similarly, it is normally the case that if two OLV pairs differ, then they are images of two different TLV manifolds, giving rise to two different data distributions. However, if the two OLV pairs are members of the image of the same special TLV set, they give rise to the same data distribution, and so the perturbation which changes one pair into the other (the θ of (3.6)) has no effect on T, and so I is zero. An OLV pair is in the image of a special TLV set just when the two OLVs have equal length, so just in this case the prior density of the pair drops to zero, and I becomes zero.
We now proceed to describe the MML estimator.
Then the estimates minimizing MessLen satisfy (Wallace and Freeman 1992):
Equations (4.2) to (4.6) may be solved by functional iteration. After each iteration, the revised vectors {β_{j}} are reorthogonalized by evaluating
To start the iteration one may set
As in WF, the iteration may drive an OLV to zero if the data are consistent with a model having fewer than J factors. In that case, the iteration is restarted with reduced J. The iteration does not directly require or yield estimates of the score vectors. These may be calculated after convergence as
5. Message LengthThe full calculation of the message length is not needed in estimating the parameters of a model with a specified number of OLVs. It is needed for the comparison of models with differing numbers of OLVs, and can be used for comparison of a factor model with a model of different functional form. The full expression for the message length of data vector x using a model with estimated vector parameter θ, prior density h(θ), senstivity matrix I(θ) and probability law P(xθ) is (Wallace and Freeman 1987).
where D is the number of scalar parameters, i.e. the dimension of θ. Using (2.4, 2.5, 2.6, 3.7 and 4.1), we obtain
where D = 2K + J (N+KJ+1), S_{m} is defined in (2.5), y_{nk} in (4.1), and Y is defined before (4.2). The finite ranges used in normalizing the priors for {μ_{k}} and {σ_{k}} are omitted from (5.2) as they affect all models equally. The normalization constant C_{NKJ} (from (2.4)) is not known in closed form. Numerical integration of (2.4) for various N, K and J has shown that log C_{NKJ} depends little on N for N>>1, and for N≥40 is well approximated by the empirical expresssion
where the constants A_{J}, B_{J} are:
6. Testing6.1 Artificial DataTest runs were made on artificial data sets randomly selected from populations with known TLV (and hence OLV) structure. In each run, 1000 populations were tried, all with the same number K of dimensions, the same number J of TLVs and all having the same set of TLV lengths. With no loss of generality, all specific variances {σ_{k}^{2}} were set to one. The populations differed in that, in each population, the TLV directions were randomly and independently selected from a uniform distribution over the Ksphere. For each population, a sample of constant size N was selected and the covariance matrix about the sample mean computed. This matrix was then analysed by both ML and MML methods, seeking estimated models with J1, J and J+1 OLVs. The MML method would sometimes fail to find a model with the requested number of factors, instead finding a reduced number. For most runs, we chose K=16, N=300, and J in the range 2 to 5. The sets of factor lengths were chosen arbitrarily, with most runs including at least one factor small enough to give some difficulty in estimating it. Note that the TLV lengths, which were held constant in each of the 1000 trials in a run, were not selected from the assumed prior distribution of TLV lengths. Hence the population OLV lengths, which varied from population to population, were not drawn from the assumed OLV length prior. In some runs, the population OLVs of each population were generated directly, with a fixed set of equal lengths and random mutuallyorthogonal directions. These "orthogonal" runs were designed to test the MML estimator on populations for which the assumed prior OLV density (2.4) was zero. The score vectors {v_{j}} were not explicitly estimated. ML cannot consistently estimate them together with the load vectors {β_{j}} and specific variances {σ_{k}^{2}}. MML can do so, but the directions of the score vectors were eliminated from the iteration equations (4.24.7), and only the covariance matrix made available to the two estimators. The estimated OLVs and specific variances estimated by the two methods for each population were compared with the population values using criteria described below, and the average error criteria for the run compared. 6.2 Error CriteriaWe chose criteria which are meaningful even if the
estimated number of OLVs J differs from the
population number J.
Recall that all test populations
have S_{1} = Σ_{k} log σ_{k} is a measure of general over or underestimation of specific variance (and so under or overestimation of OLV effects.) S_{2} = Σ_{k} (log σ_{k})^{2} is a simple measure of error in estimation of specific variances. Both S_{1} and S_{2} are zero for exact estimates. In comparing ML and MML estimators, S_{1} and S_{2} are relevant because both estimators agree in estimating the OLVs as eigenvectors of the σscaled covariance Y. Hence good estimates of {σ_{k}} are essential for their estimation of OLVs. KL = ∫ dx P(xθ) [log P(xθ)  log P(xθ)] is the nonsymmetric KullbackLeibler distance between the true population data density and that implied by the estimate θ. It is zero for an exact estimate, otherwise positive. It does not require P(xθ) and P(xθ) to have the samea number of parameters, and is invariant under nonlinear transformations of the parameter space. For each run, the ML and MML estimators were compared on
two bases.
First ("fixed J"), we compare the
estimates produced when each method was asked to find a
model with the true number of J of OLVs.
The MML method
sometimes returned a model with fewer OLVs. Second
("chosen J"), we compared the results when the number
of OLVs was estimated.
The three ML models for
6.3 Results on Artificial DataAll runs showed MML to give the lower average values for
all three error criteria on both fixed J and chosen
The criterion S_{2}, measuring squared error in {σ_{k}}, was typically less than half as large for MML as for ML. The KL criterion was typically about 0.01 to 0.03 less
for MML. This may seem a small difference. However, a
difference of 0.02 shows that if the log likelihoods of the
ML and MML models were compared on another sample with N=300
from the same population, we would expect the MML model to
have the greater likelihood, by a factor of about
Overall, the chosen J and fixed J comparisons gave similar results, with the MML advantage increasing with increasing J. Using the "chosen J" basis, we compared the estimates
J of the number of factors. Generally, ML was far
more likely than MML to find
This result might suggest that ML is superior to MML in
estimating the number of TLVs. However, closer examination
shows the apparent ML advantage to be an illusion. When
the 347 trials were examined in which ML estimated
For runs with strong population TLVs of markedly differing
lengths, it is possible to identify estimated OLVs with
population OLVs with little confusion. In several such
runs, additional error criteria were assessed. One summed
the squared differences between the (unscaled) population
and estimated OLVs, another summed the squared sine of the
angles between population and estimated OLV directions.
When Runs in which the population TLVs had equal length and were mutually orthogonal (and hence gave equallength OLVs) showed no unusual behaviour, despite the fact that such OLV sets have zero prior density in our scheme. The superiority of MML remained evident. Details of the above tests are available in a Technical Report (Wallace 1995). 6.4 An ExamplePress and Shigemasu (1989) analyse some aptitude data from Kendall (1980) with N=48, K=15, using their explicit Bayesian method. They assume four factors, with an informative prior on the load vectors based on experience with similar data. After determining by unspecified methods some hyperparameters appearing in their model priors, they find estimates of the factor scores and loads. (We denote their results "P&S".) The same data (p.282) was analysed by ML and our MML methods, attempting to find models with 3, 4 and 5 factors. The three ML models gave AIC values of 512.0, 543.6 and 540.8 respectively, weakly preferring four factors over five. MML found no 5factor solution, and gave message lengths for 3 and 4 factors respectively 195.3 and 197.1 less than the nofactor message length. Thus MML also prefers four factors, but the posterior odds in favour of four factors over three is only e^{1.8} or about 6:1. The agreement among all three fourfactor models was assessed by evaluating the RMS sine of the angles (in unscaled space) between corresponding load vectors of two models, where the correspondence was chosen to minimize the RMS sine. The angles given by these RMS sines for each pair of models, and the average cosines between matched loads, were:
There is reasonable agreement between the MML and PressShigemasu results, with little doubt as to the correspondence between load vectors. The ML model is hard to reconcile with either of the other two, and has one load vector deviating by at least 65° from all loads of the other two models. This result, details of which are given in Wallace (1995) suggests the two Bayesian methods, despite their different priors and approaches, recover similar views of the data. The weak MML preference for four, rather than three, factors, together with the behaviour of ML on artificial data, gives grounds for doubting that the aberrant ML factor has any validity. 7. Discussion7.1 The Impact of the PriorsThe MML method has been shown by extensive testing to be more accurate than Maximum Likelihood estimation when the data is drawn from a population consistent with the Normal Factor model. It may be asked whether the improved accuracy is partly or wholly the result of the prior information assumed in MML and other Bayesian methods. To some extent, the assumption of a proper prior for TLV lengths must help suppress the improper solutions sometimes observed using ML, as these improper solutions are characterized by large β_{j} and small σ_{k}. However, the effect of our prior is small. It is very diffuse, having infinite second moments, and at least for N>>K, its effect on the message length is swamped by the effects of the sensitivity determinant. We have found experimentally that even when an improper flat prior for TLV lengths is assumed, MML still outperforms ML, and still gives on average larger estimates of specific variance despite the 1/σ prior. The assumed prior density for TLV lengths could have
been of little assistance to the MML method on the test
runs reported in section 6.3, which mostly used data
drawn from populations with TLV lengths of 1.0 or 2.0.
Under the assumed prior, with K=16, the prior
probability of a TLV length being 2.0 or less is only
about 0.06, and for lengths 1.0 or less is only 0.001.
Hence the prior was misleading rather than helpful
under the test conditions. On the same data, a
modified MML method assuming the Kdimensional Normal
prior (2.1) was also tested. This prior has a
hyperparameter ρ determining the length scale of
the prior, and the method estimated ρ from the
data. The modified method gave almost the same
estimates as the original MML method, but because
factor lengths around 1.0 are less severely penalized
by this prior (with ρ estimated at about 0.25), the
minimum message length was more frequently achieved with
The reason MML suppresses improper models is not that they occur in regions of parameter space having low prior density. Rather, it is that their high likelihood is achieved only in a very small region of parameter space. A message using such an improper solution to encode the data would thus have to specify the parameters to high accuracy (many digits of precision) making the message long. No other aspect of our assumed priors can be reasonably considered as importing useful information. The prior over load vector directions is uniform, representing total ignorance. The Normal prior assumed for factor scores serves only to disambiguate the confusion between load and score lengths, and is in any case tacitly assumed in the conventional ML Normal model. 7.2 Choice of Number of FactorsMML appears to provide a fairly sound basis for estimating
the number of factors. It is conservative, erring on the
side of rejecting factors for which the evidence is weak.
If an MML solution with J factors yields a correlation
matrix among the residuals
When compared with AIC selection among ML solutions, MML may more frequently select a small number of factors, but none the less still gives a better model of the population density by KL measure. With the prior here assumed, the MML choice of J may underestimate J when an OLV is small, simply because the prior probability assigned to small (scaled) OLV lengths is small. Experiments reported in (Wallace 1995) showed that, rather than choosing the MML model of smallest message length, one may with advantage choose the MML model with the highest J for which a solution exists. For populations with scaled OLV lengths around 1.0, this scheme may find a model with an additional factor (and slightly higher message length) and on average was slightly better both at estimating J and on our error measures. Where this scheme found a different model, the message length increase over the minimum message length model was on average only about 2.0, which is of little significance. For data with larger factors, both schemes almost always agree. That is, the model with the smallest message length is almost always the model with the largest number of factors found by the MML iteration. 7.3 Assessment of ErrorUnlike some other Bayesian methods, MML does not directly compute or use the posterior density. However, the elements of the sensitivity matrix may be used as indications of probable estimation error. For instance, it is easily shown that the probable error in estimating any one factor score v_{nj} is of order 1/√{1+β_{j}^{2}}. 8. AcknowledgementThis work owes much to the assistance of References


↑ © L. Allison, www.allisons.org/ll/ (or as otherwise indicated). Created with "vi (Linux)", charset=iso88591, fetched Friday, 23Aug2019 08:28:32 EDT. Free: Linux, Ubuntu operatingsys, OpenOffice officesuite, The GIMP ~photoshop, Firefox webbrowser, FlashBlock flash on/off. 