|
|
Encoding Rooted (strict-) Binary Trees
- A strict binary tree, where every node either has
two subtrees (is a Fork), or has zero subtrees (is a Leaf),
can be encoded in 1-bit per edge:
- Perform a prefix traversal of the tree,
- output 'F' for a Fork, 'L' for a Leaf,
- i.e.,
- data Tree = Leaf | Fork Tree Tree
-- data type defn
- code Leaf = "L"
- code (Fork Left Right) = "F"
++ code Left
++ code Right
- e.g.,
- F L F F L L L
-
- Note that the method gives a prefix-code for trees;
the end of the string for a tree can be detected when #L=#F+1.
It is reasonably to give L and F 1-bit codes because the
numbers of Ls and Fs are approximately equal.
- The tree can be reconstructed from the traversal.
-
- Note that encoding then decoding preserves the ordering of sub-trees from
left to right; if this is unnecessary a shorter encoding must be possible.
-
- The method can be extended to non-binary trees
provided the degree of each node is known, for example,
if this is a known constant, or if the degrees are also encoded.
For a tree of constant degree k, #L=(k-1)×#F+1
(so the code-word lengths change).
-
- This method of encoding trees has been used
in Minimum Message Length (MML) inference for
decision- (classification-) trees, and as
the basis of a universal code for
positive integers.
Encoding Rooted n-ary Trees.
- Arbitrary n-ary trees can be encoded in about 2-bits per edge:
- Perform a prefix traversal of the tree,
- output "d", for down, when descending an edge, and
- output "u", for up, when returning up an edge.
- e.g.,
- tree as above,
- d u d d d u d u u d u u
-
- Each edge corresponds to one 'd' and one 'u'.
- The encoding requires about 2-bits per edge.
- The method applies to trees of any degree.
-
- The method gives a prefix-code for trees provided that the end of
the string can be detected; the end can be indicated by appending an
extra 'u' because the root has no ancestor,
- d u d d d u d u u d u u u
- this is unnecessary if the degree of the root is otherwise known.
- The tree can be reconstructed from the traversal.
-
- Note that encoding then decoding preserves the ordering of sub-trees from
left to right; if this is unnecessary a shorter encoding must be possible.
Unrooted n-ary Trees
- An unrooted n-ary tree can be encoded by choosing an
arbitrary vertex (node) as the root and then encoding the tree
as a rooted tree.
This transmits log2|V|-bits of extra information
which can be subtracted.
OuterPlanar Graphs
- An outerplanar graph can be drawn in the plane
so that its vertices lie on a circle;
some edges may run along the circle, and
other edges may run across the interior of the circle without intersecting.
- A tree is an outerplanar graph, and an outerplanar graph is a planar graph.
Encoding Planar Graphs (Plane Graphs)
- A plane graph (a planar graph embedded in the plane)
can be encoded in about 4-bits per edge [Tur84],
in fact in log2(14) bits per edge [Via08].
- graph (after [Tur84])
-
- with spanning tree & traversal (bold),
- encoding of spanning tree: d d u d u u d d u d d u u u
-
- Sweeping around each vertex anti-clockwise,
- encode each "cross-tree" edge by ( . . . ), e.g.,
- d d ( ( u d ) ( ( u ( ( u d ) d ) ) ( u d ) d ) u u ) u
-
- The number of ‘d’s and ‘u’s are approximately equal,
the numbers of (s and )s are equal, and the blandest
assumption is that the number of ‘u’s and (s will be about the same.
The alphabet {d, u, (, )} has size 4 (2-bit codes), and
each edge, of either kind, corresponds to two letters (4-bits).
Any known skew in the alphabet could be used to give better compression.
-
- The graph is planar so edges do not cross.
The graph can be reconstructed by
reconstructing the spanning tree, as before, and
matching up corresponding (s and )s
in a last-in, first-out manner.
-
- [Via08] tightened up the encoding (log214 < 4)
by recognising that certain combinations cannot be produced in the encoding.
-
- Note that the construction encodes more than the planar graph --
it encodes a particular embedding of the graph in the plane, a plane graph.
It also distinguishes the start ("root") vertex;
log2|V|-bits can be saved if this is arbitrary.
-
- If it is possible that the graph is not connected,
the number of connected components must be encoded followed by
each connected component as above.
General Graphs
- G = 〈V, E〉
-
- If loops (〈vi,vi〉) and multi-edges
are not allowed, an
- Undirected Graph has |V|×(|V|-1)/2 possible edges,
- the adjacency matrix can be coded in that many bits if
pr(〈vi,vj〉 exists)=0.5
∀i,j>i,
- and a
- Directed Graph has |V|×(|V|-1) possible edges,
- the adjacency matrix can be coded in that many bits if
pr(〈vi,vj〉 exists)=0.5
∀i,j≠i.
-
- The binomial distribution
can be used if the probability of edge existence is to be
inferred from the graph.
-
- If G is sparse,
pr(edge from vi exists) = p = k / |V| << 1,
- (i) a row of the adjacency matrix can be encoded in about
(|V| - 1) × ( - p log p - (1-p) log(1-p)) bits,
- approximately - log p bits per edge,
- (ii) adjacency lists can
be encoded in about log|V| bits per edge,
- must also encode the length of each list, and
the list should be sorted (so -log(#perms)),
- average list takes about k log|V| - k log k bits.
- Sanity check, (i) matrix, and (ii) lists should be ≡,
- k log p = k ( log|V| - log k ), plausible.
-
- (Similar considerations apply to a dense graph.)
- Regular-enough subgraphs, once identified, imply certain edges
do or do not exist, e.g.,
| | Km |
Km',m'' |
|
| Km | 1s |
? | ? |
| Km',m'' |
? |
0s | 1s |
? |
| 1s | 0s |
| | ? |
? | ? |
- (if Km',m'' disallows within-part edges)
- To be worthwhile, the saving on edges must outweigh the cost of
encoding the membership of a subgraph.
- Finding such subgraphs is hard.
- If the vertex ordering is arbitrary, any special subgraphs
might as well come first and it is only necessary
to encode their kinds and sizes.
If the vertex ordering is not arbitrary,
it takes about log |V|Cm bits
to specify the members of Km, for example.
Motifs
- If a graph is "sufficiently regular" [Tur84], or
if it contains repeated motifs, it must be possible
to obtain a more efficient encoding of the graph using this fact.
-
- E.g., [FM95] suggest that if a graph contains
a bipartite clique, Km,n, the m×n edges of the b.clique
can be replaced by a new "special vertex" and m+n edges involving it.
This transformation, G↔G', results in a graph, G', that has
more vertices and fewer edges (is sparser) but it is not necessarily
compressible unless Km,n subgraphs are frequent in G
compared to "random" graphs, and any compression might be had more directly.
Finding maximal bipartite cliques is hard, but
finding large ones quickly may suffice.
- Other "regular" subgraphs, if frequent in G,
could be used for the same purpose.
Notes
- [FM95] T. Feder & R. Motwani,
Clique partitions, graph compression and speeding up algorithms,
Jrnl. Comp. and Sys. Sci., 51(2), pp.261-272, 1995.
-
- [Tur84] G. Turan,
On the succinct representation of graphs,
Discr. Applied Maths., 8, pp.289-294, 1984.
-
- [Via08] R. Viana,
Quick encoding of plane graphs in log2(14) bits per edge,
Inf. Proc. Lett., 108, pp.150-154, 2008.
|
|