
Jon Oliver devised decision graphs (classification graphs) [Oli92, OW92]
to remove the difficulty that
decision trees (classification trees)
have with modelling "disjunctive" functions, for example:
 independent variables x_{1}, x_{2}, x_{3}, x_{4},
 dependent variable y,
 function to be learned
 y = (x_{1} ∧ x_{2}) ∨ (x_{3} ∧ x_{4})
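As a concrete rendering of this target (a small Python sketch of ours, not from the papers):

```python
from itertools import product

def y(x1, x2, x3, x4):
    # the disjunctive target: y = (x1 AND x2) OR (x3 AND x4)
    return (x1 and x2) or (x3 and x4)

# enumerate the truth table; y holds in 7 of the 16 assignments
true_cases = [xs for xs in product([False, True], repeat=4) if y(*xs)]
```

Either conjunct alone suffices to make y true, which is exactly what a single root-to-leaf path in a tree cannot express without duplicated subtrees.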
A decision tree can certainly describe such a function,
 [diagram: a decision tree rooted at a test of x_{1}, with branches (x_{1}=T) and (x_{1}=F)]
but it requires two copies of the subtree for (here)
x_{3} and x_{4}.
If the learning problem is made slightly harder, for example:
 independent variables x_{i}, i=1..9,
 dependent variable y,
 function to be learned
 y = (x_{1} ∧ x_{2} ∧ x_{3}) ∨ (x_{4} ∧ x_{5} ∧ x_{6}) ∨ (x_{7} ∧ x_{8} ∧ x_{9})
the decision tree to describe
it becomes much larger [exercise].
Such trees can be learned, but it generally requires a great deal of
data compared to the intuitive complexity of the function to be learned.
The solution is to "factor out" any common subtrees and
replace the tree by an equivalent
directed acyclic graph (DAG).
Such a DAG may be much smaller than the corresponding tree,
have a smaller message length, and
therefore can be justified by much less data (evidence).
Decision trees and decision graphs describe exactly the same set of functions,
but decision graphs model disjunctive functions more efficiently.
(There is a small penalty for describing a DAG that is in fact a tree.)
The tree data structure is extended, to allow DAGs,
by adding a new kind of node, a join node, in addition
to a tree's forks (tests) and leaves.
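A minimal sketch of such a data structure (our own Python rendering; the node names and fields are illustrative, not the authors' code):

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    counts: dict          # class label -> frequency in this leaf

@dataclass
class Fork:
    variable: str         # variable tested at this node
    children: dict        # outcome -> child node

@dataclass
class Join:
    target: object        # shared subgraph entered via this join

def classify(node, assignment):
    # follow tests (passing through joins) down to a leaf
    while not isinstance(node, Leaf):
        if isinstance(node, Join):
            node = node.target
        else:
            node = node.children[assignment[node.variable]]
    return max(node.counts, key=node.counts.get)

# y = (x1 AND x2) OR (x3 AND x4): one shared x3-x4 subgraph, reached
# via a single join node from both the x1=F and the x2=F branches
shared = Fork("x3", {True: Fork("x4", {True: Leaf({"T": 1}),
                                       False: Leaf({"F": 1})}),
                     False: Leaf({"F": 1})})
j = Join(shared)
root = Fork("x1", {True: Fork("x2", {True: Leaf({"T": 1}), False: j}),
                   False: j})
```

The same Join object is the child of two different forks, which is precisely what makes the structure a DAG rather than a tree.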


 [diagram: a join node, Join_{1}, at which branches of the graph merge]

(The examples above use binary variables
but decision graphs and trees apply to arbitrary
discrete and continuous variables in the obvious way.)
Message Length of a Decision Graph
In learning with MML,
the two-part message length consists of
(i) the message length
of the hypothesis (here a decision graph), plus
(ii) the message length of the data given the hypothesis.
The second part is calculated, as for trees,
using the distribution of the leaf into which a datum falls.
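As one concrete choice for part (ii), the data in a leaf can be costed with an adaptive multinomial code under a uniform prior; this is a standard MML-style code, a sketch of ours rather than necessarily the exact code used in [OW92]:

```python
import math

def leaf_code_length(counts):
    # length in nits (nats) of the class sequence in one leaf:
    # -log Pr(sequence) under a uniform-prior adaptive multinomial code,
    # i.e. log((N+K-1)! / ((K-1)! * prod(n_k!)))
    K = len(counts)
    N = sum(counts)
    return (math.lgamma(N + K) - math.lgamma(K)
            - sum(math.lgamma(n + 1) for n in counts))
```

A pure leaf is cheap and a mixed leaf expensive, e.g. leaf_code_length([4, 0]) is log 5 while leaf_code_length([2, 2]) is log 30, so the second part of the message rewards structure that separates the classes well.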
The calculation of the message length of the decision graph
is generalised from that for decision trees:
 1. Perform a prefix traversal of the decision graph,
encoding forks and leaves (as for decision trees) and
join nodes, but not traversing below any join nodes.
 2. The traversal may have output j ≥ 0 join nodes.
Note that, if j > 0, at least two, but not necessarily all,
of the join nodes output must join with each other [Oli92].
Encode the actual pattern of joining together.
 3. If some join nodes have been joined together in step 2,
continue traversing and encoding the graph from them,
as in step 1.
Otherwise stop.
In step 1, compared with trees, there are now three kinds of node.
Oliver suggests using some fixed probability pr(join node);
given that a node is not a join, it is encoded as a fork or a leaf
by the usual tree method.
In step 2, it is necessary to calculate the number
of possible ways, w, in which at least two of the dangling joins
can be linked together, and to encode the actual way chosen,
at a cost of log w nits.
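One way to make w concrete (our reading of the scheme, hedged; [Oli92] gives the precise code): regard the j dangling join nodes as a set to be partitioned into groups, the nodes in each group merging into one node, with at least one group of size two or more; the chosen pattern then costs log w nits.

```python
import math

def partitions(items):
    # yield every set partition of `items` as a list of blocks
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):          # first merges into an existing block
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part              # or stands alone

def num_join_patterns(j):
    # all set partitions of j joins except "everyone alone",
    # since at least two joins must actually merge
    return sum(1 for p in partitions(list(range(j)))
               if any(len(block) >= 2 for block in p))
```

For example, with j = 3 dangling joins there are w = 4 admissible patterns (three ways to merge one pair, plus merging all three), so stating the pattern costs log 4 ≈ 1.39 nits.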
Search
The search for an optimal decision graph is harder than
that for an optimal decision tree. Oliver and Wallace suggest
a greedy search with some amount of lookahead.
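The papers only sketch the search, so the skeleton below shows just the generic shape of such a greedy loop; the moves, apply_move and message_length callbacks are placeholders of ours, not the authors' API, and real lookahead would score short sequences of moves rather than single ones:

```python
def greedy_search(graph, moves, apply_move, message_length):
    # repeatedly take the single move that most reduces the total
    # two-part message length; stop when nothing improves it
    best_len = message_length(graph)
    while True:
        scored = [(message_length(apply_move(graph, m)), m)
                  for m in moves(graph)]
        if not scored:
            return graph
        new_len, best_move = min(scored, key=lambda t: t[0])
        if new_len >= best_len:
            return graph
        graph, best_len = apply_move(graph, best_move), new_len
```

With a message-length function such as the one sketched earlier plugged in, candidate moves would include splitting a leaf on a variable or joining two leaves.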
References
 [OW92] J. Oliver & C. S. Wallace.
Inferring Decision Graphs.
TR 92/170, Dept. of Computer Science [*],
Monash University, Australia 3800.
 Also in,
Proc. of the 1992 Aust. Joint Conf. on Artificial Intelligence,
pp. 361-367, 1992.

 [Oli92] J. Oliver.
Decision Graphs: An Extension of Decision Trees.
TR 92/173, Dept. of Computer Science [*],
Monash University, Australia 3800.
 Also in,
4th Int. Conf. Artificial Intelligence and Statistics,
Florida, pp. 343-350, 1993.

 [Hicss93] D. Dow, J. Oliver, T. I. Dix, L. Allison & C. S. Wallace,
A decision graph explanation of protein secondary structure prediction.
26th Hawaii Int. Conf. Sys. Sci., 1, pp. 669-678, 1993.
 Also see Bioinformatics.

 [*] Became Faculty of Information Technology, Monash.

