This section considers positive integers, n≥1, unless otherwise stated
(if you have non-negative integers, n≥0, just add one first.)
This is the first, and perhaps the most fundamental, infinite data-space.
A code over positive integers can be used to transmit data from any
enumerable data-space.
The Geometric and Poisson
distribtions are examples of parameteric probability distributions, the
probability distributions below have no parameters.
If you know that an integer, n, lies in the interval [1,N] (or in [0,N-1])
then it can be encoded in log2(N) bits,
(and this is an optimal code if
the probability distribution is uniform).
What to do when there is no such bound N?
Obviously transmit the length of the code word for n first.
But how to transmit the length?
Transmit its length first, of course!
A sound code can in fact be based on this intuitive idea;
note that logk(n)
decreases very rapidly as k increases.
The leading bit of n is necessarily ``1''
so there is no need to transmit it, except that
it can be used as a flag to determine whether the current value
is a length or the final value of n proper;
lengths are thus given a leading ``0''.
Such a prefix code can be used to code integers of arbitrary size.
Unfortunately the length of a code word as a function of n
is neither convex nor smooth although it is
monotonic increasing+.
integer
components
code word
1
1
1
2
2,2
00 10
3
2,3
00 11
4
2,3,4
00 01 100,
e.g., ~ code'(3)++100
5
2,3,5
00 01 101
6
2,3,6
00 01 110
7
2,3,7
00 01 111
8
2,3,4,8
00 01 000 1000,
e.g., ~ code'(4)++1000
9
2,3,4,9
00 01 000 1001
10
2,3,4,10
00 01 000 1010
...
15
2,3,4,15
00 01 000 1111
16
2,3,5,16
00 01 001 10000,
e.g., ~ code'(5)++10000
...
Note, the spaces in the above code words are only there
to make the components stand out.
The code above is valid, but not at all efficient.
In fact we can do better,
i.e. achieve a non-redundant code,
by using lengths minus one:
The probability distribution P(n) has an infinite expectation:
the probability of n is greater than under the 1/(n.(n+1))
distribution (which has an infinite expectation) for large n.
Use the HTML FORM below to encode an integer `n':
Rissanen (1983) gives, r(n),
r(n) = log2*(n) + log2(2.865)
where log2*(n) = log2 n + log2 log2 n + ... NB. +ve terms only
P(n) = 2-r(n)
as a continuous approximation to code-word lengths and
advocates 2-r(n) as a "universal prior" for integers.
The "2.865" is a normalisation constant to make
the distribution (of the approximation) sum to 1.0.
Note that r(n) is continuous but it (the area under the curve) is not convex
being concave at n=2, 4, 16, 216, ... etc.
This can cause problems for some
optimisation algorithms (Allison & Yee 1990).
Notes
P. Elias.
Universal Codeword Sets and Representations for the Integers.
IEEE Trans. Inform. Theory IT-21 pp.194-203, 1975
Introduced this kind of code for the integers
(- Farr 1999)
S. K. Leung-Yan-Cheong & T. M. Cover.
Some Equivalences between Shannon Entropy and Kolmogorov Complexity.
IEEE Trans. Inform. Theory IT-24 pp.331-338, 1978
Investigated this kind of code for the integers
(- Farr 1999)
J. Rissanen.
A Universal Prior for Integers and Estimation by
Minimum Description Length.
Annals of Statistics 11(2) pp.416-431, 1983.
Advocates the use of the log* distribution
and code.
G. Farr. Information Theory and MML Inference.
School of Computer Science and Software Engineering,
Monash University, 1999.
An excellent source on "universal" codes
(and other things).
Note, a code-word always has one more zero than it has ones.
This allows the end of a code word to be detected.
It also allows the word to be decoded.
Note for example that `1' and `10' are not code words.
The code is efficient, that is to say
the sum over all words, w, of 2-|w| is one.
The code is equivalent to giving a tree with code w
a probability of 2-|w|:
This would be difficult to prove combinatorially, but
consider all infinitely long strings over {0,1}.
We make them all equally probable in the sense that if
they are truncated after k-bits,
then the 2k truncated strings are all equally likely,
i.e. of probability 1/2k, for all k.
The sum of the probabilities of the truncated strings of
an arbitrary length k is clearly 1.
Now 0 (pr=0.5) and 1 (pr=0.5) can be taken as the steps in a random walk.
It is known that a one-dimensional random walk returns to the starting point
with probability 1.0.
Taking the set of infinitely long strings over {0,1},
almost every one of them, i.e. all but a subset having total probability 0.0,
will at some point have one more `0' than `1's.
Truncate each such string at the first such point.
The probability of such a truncated string, w, is 1/2|w|.
This also equals the sum of the probabilities
of all of its infinite extensions.
i.e. the sum of the probabilities of all such code words is 1.0.
This can be used as the basis of a code for positive integers:
Enumerate the code words in order of increasing length
and within that, for a given length, lexicographically say:
integer
code word
probability
1
0
1/2
2
100
1/8
3
10100
1/32
4
11000
1/32
5
1010100
1/128 etc.
6
1011000
7
1100100
8
1101000
9
1110000
10
101010100
1/512 etc.
11
101011000
12
101100100
13
101101000
14
101110000
15
110010100
16
110011000
17
110100100
18
110101000
19
110110000
20
111000100
21
111001000
22
111010000
23
111100000
24
10101010100
1/2048
...
...
...
The first code word of length 2k+3 is 1(01)k00 and
the last is 1k+10k+2.
We see that the code-word lengths
increase in smaller and more regular jumps
than is the case for the log* code.
Notes
C. S. Wallace & J. D. Patrick. Coding Decision Trees.
Machine Learning, 11, pp.7-22, 1993.
The tree-code is used to code the topology of
classification trees (also known as
[decision-trees]).
This allows simple trees and complex trees to be compared fairly.
Other information held in the nodes of the trees is also coded.