T. Edgoose & L. Allison,
Statistics and Computing,
General-purpose unsupervised classification programs have
typically assumed independence between observations in
the data they analyse. In this paper we report on an
extension to the MML classifier Snob which enables the
program to take advantage of some of the extra information implicit in
ordered datasets (such as time-series). Specifically, the
data is modelled as if it were generated from a first-order Markov process
with as many states as there are classes of observation.
The state of such a process at any point in the sequence determines the
class from which the corresponding observation is generated. Such a
model is commonly referred to as a Hidden Markov Model. We present the MML
calculation of the expected length of a near-optimal two-part message that
states a specific model of this type and then the dataset given that model.
Such an estimate enables us to compare fairly models that differ in the
number of classes they specify, which in turn can guide a robust
unsupervised search of the model space.
The new program, tSnob, is tested against both 'synthetic' data and
a large 'real-world' dataset and is found to make unbiased estimates
of model parameters and to conduct an effective search of
the extended model space.
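
As an illustration of the generative model described above, the following is a minimal sketch, not tSnob's implementation. It assumes one-dimensional Gaussian classes and a uniformly chosen initial state. The function name `sample_markov_classes`, the transition matrix and the class parameters are all hypothetical values chosen for the example.

```python
import random


def sample_markov_classes(transition, class_params, length, seed=0):
    """Sample (states, observations) from a first-order Markov process.

    transition[i][j] is the assumed probability of moving from class i to
    class j. class_params[i] is an assumed (mean, sd) pair describing a
    Gaussian class i. Both are illustrative, not taken from the paper.
    """
    rng = random.Random(seed)
    n_classes = len(transition)
    state = rng.randrange(n_classes)  # assumption: uniform initial state
    states, observations = [], []
    for _ in range(length):
        # The current state determines the class that generates the observation.
        mean, sd = class_params[state]
        states.append(state)
        observations.append(rng.gauss(mean, sd))
        # The next state depends only on the current one (first-order Markov).
        state = rng.choices(range(n_classes), weights=transition[state])[0]
    return states, observations


# Hypothetical two-class example: each class tends to persist over time.
transition = [[0.9, 0.1], [0.2, 0.8]]
class_params = [(0.0, 1.0), (5.0, 1.0)]
states, observations = sample_markov_classes(transition, class_params, length=10)
```

In a classification setting only `observations` would be visible. The state sequence, the transition probabilities and the class parameters are what has to be inferred, which is why the state is called hidden.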