## Decision Trees / Classification Trees

 LA home Computing MML  Glossary  Structured   Mixtures   HMM   DTree   DGraph   Supervised   Unsupervised   Graphs   D-trees also see  More  WWW7

A Decision Tree, more properly a classification tree, is used to learn a classification function which predicts the value of a dependent attribute (variable) given the values of the independent (input) attributes (variables). This solves a problem known as supervised classification because the dependent attribute and the number of classes (values) that it may have are given.

We could try to learn a complex tree that best fits the training data. However, such trees do not generalise well, i.e. they over-fit the training data, they are fitting noise. By calculating a message length for a tree, we will provide a natural stopping criterion that can trade-off the complexity of a tree against its fit to the training data. It may be that the data only justifies a tree with a simple structure.

### Binary Attributes

First consider the case of decision trees over binary attributes (true/false, yes/no, head/tail ...), e.g. my car won't work:

```         @1         @2          @3          @4
in tank?  work?       turns over   problem

case 1:   y          y           n           y
case 2:   n          y           y           n
case 3:   y          n           n           n
...       ... etc.
```
— Cases —

A number of cases or training examples may have been produced from historical records and/or by quizzing a (human) expert.

A decision tree, more properly a classification tree, can predict one attribute (variable) from the other attributes.

```                       [@1]
.   .
y.       .n
.           .
.               .
[@2]              [3:17]
.    .
y.        .n
.            .
.                .
[@3]             [8:2]
.  .
y.    .n
.      .
.        .
[2:6]    [6:4]
```
— Decision Tree —

The tree consists of fork nodes, each labelled with an attribute number (to test), and leaf nodes representing sets of cases, i.e. 2-state distrubutions over the dependent variable (here @4).

To calculate the message length of a decision tree, we must encode its structure, the test (fork) attributes and the leaf-node distributions.

 ``` F . . . . . . F L . . . . . . F L . . . . . . L L ``` — Structure —

#### Structure

Right: Label each node 'F' for fork or 'L' for leaf:

Write out the prefix traversal of the tree:

```F F F L L L L
```

Note that there is exactly one more 'L' than 'F' in the string. Consider all strings of 'F's and 'L's that contain exactly one more 'L' than 'F', and where no prefix of the string has this property. Each such string corresponds to one binary tree and v.v.. We see that setting P(F)=P(L)=0.5, and encoding 'F' by a '1' and 'L' by a '0' gives an efficient code for the structure of the tree. This is the Wallace tree code, already introduced as a code for [integers].

#### Attributes

Given the tree's structure, any one of the 'n' attributes can be encoded in log2(n) bits. However, once a discrete attribute has been selected, it will not be selected again in a sub-tree of that node, so the effective value of n reduces by one in the sub-trees.  i.e. log23 for the root fork, log22 = 1 bit for the next level, log21 = 0 bits for the next level.

The tree description now becomes

``` F @1 F @2 F @3 L L L L
```

Note that coding @1 takes log23-bits,  @2 takes 1-bit, and  @3 takes zero bits.

#### Leaves

 Recall that the MML estimator for a [2-state] distribution is py = (#y+1/2)/(#cases+1) pn = (#n+1/2)/(#cases+1) = 1-py — Leaves —

The "input" attribute (here @1, @2, @3) values are common knowledge – known to transmitter and receiver. Transmitter and receiver can both identify the leaf to which a case belongs by carrying out the tests specified in the fork nodes.

The value of the dependent attribute (here @4) must be coded efficiently for each case, using the decision tree. This requires good estimates of p = P(@4="y"), q = (1-p) = P(@4="n"), in each leaf. Each leaf node represents a [2-state] probability distribution, inferred from the training data. We know how to estimate p and how to calculate the message length (complexity) of such a distribution, i.e. how accurately to state P(@4="y") in each leaf.
Done.

`F` @1 `F` @2 `F` @3 `L` 2state[2:6] `L` 2state[6:4] `L` 2state[8:2] `L` 2state[3:17]

#### Tree Message Length

The message length of a complete tree is the sum of the message lengths of (i) the structure, (ii) the fork attributes, (iii) the leaf distributions.

This lets us compute the length of a two part message

```msgLen(tree) + msgLen(dep' attr' | tree)
```

This gives a criterion to optimise when searching for the best decision tree.

#### Code Efficiency

It might be objected that the code for the structure of the tree allows arbitrarily many trees to be transmitted but that only finitely many trees are possible for a given number of attributes. It is possible in principle to renormalise the distribution (that corresponds to the code) but a moment's thought shows that the amount of probablity "lost" is tiny and quite insignificant.

### m-ary Attributes

If all forks in a tree have 'm' sub-trees, then  #leaves = #forks `x` (m-1) + 1.  Asymptotically, the fraction of nodes that are forks is 1/m, and the fraction of nodes that are leaves is (m-1)/m. Therefore, one would code a fork, F, in log2m bits and a leaf, L, in log2(m/(m-1)) bits.

#### Mixed Arity

Wallace and Patrick (1993) use the arity of the parent of a node to predict the probability that the node is a fork or a leaf. Note that the receiver knows the arity of the parent node if the parent's switching attribute is transmitted before the child trees are transmitted.

### Continuous Valued Attributes

In the case of an input attribute, c, being continuous valued, one of its values from the training data is used as a cut-value.

```      [@c < ? ]
.  .
.      .
<.          . >=
.              .
.                  .
```

Sort the values of the attribute in the training data, e.g.

```   v1 <= v2 <= v3 <= ... <=  v15
```

We need to pick one of these values as the cut value. The training data will tell which gives the best fit to the dependent attribute values, but we still need a prior over c's values.

A uniform prior over the index numbers (here 1..15) is one possibility. However, it is more appealing to bias the choice towards the median, then the quartiles . . .   Note that 1/2 + 2.(1/8) + 4.(1/32) + 8.(1/128) + ... = 1.

```            quartile   median   quartile
|
|         |         |
----|----|----|----|----|----|----|----
Pr:      1/32 1/8  1/32 1/2  1/32 1/8  1/32

MsgLen:   5    3    5    1    5    3    5  bits
```

Renormalise for finite numbers of training values.

### Search

The MML criterion gives an objective function to optimise when searching for the best tree, but it does not tell us how to carry out the search. If the number of attributes is small, it may be possible to enumerate all trees, evaluate each one, and choose the best, but this is infeasible in general.

#### Greedy Search

Given a "current" decision tree, it could be expanded by replacing a leaf node by a fork node on some available attribute. The pay-off, in terms of change in total message length can be evaluated for each such possibility, and the best one chosen. The process starts with a tree consisting of one leaf node. It stops when no further split leads to an improvement.

This greedy strategy is reasonably efficient, but can get trapped in local optima.