Information 

Chris Wallace proposed this (slightly) simplified taxonomy of science to place ‘information’ in perspective.
Examples

I toss a coin and tell you that it came down ‘heads’. I have told you some information. How much? A computer scientist immediately says ‘one bit’ (I hope).

I roll a four-sided "dice" with faces labelled {a,c,g,t} and tell you that it landed on ‘c’: two bits of information.

Suppose we have a trick coin with two heads and this is common knowledge. I toss the coin and tell you that it came down ... heads. How much information? Nothing, zero, of course. You knew it would come down heads and I have told you nothing that you did not already know. But if you had not known that it was a trick coin, you would have learned one bit, so the information learned depends on your prior knowledge.

We have a biased coin; it has a head and a tail but comes down heads about 75 in 100 times and tails about 25 in 100 times, and this is common knowledge. I toss the coin and tell you ... tails, two bits of information. A second toss lands ... heads, rather less than one bit, wouldn't you say? (Only 0.42 bits in fact.)

I pull a coin out of my pocket and tell you that it is a trick coin with two heads. How much information have you gained? Well "quite a lot", maybe 20 or more bits, because trick coins are very rare, and you may never even have seen one before.

So if something is certain then it is no information to learn that it has occurred. The less probable an event is, the more information is learned by being told of its happening.

Information: Definition

The amount of information in learning of an event ‘A’ which has probability P(A) is
\[ I(A) = -\log_2 P(A) \ \text{bits.} \]
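The figures quoted in the examples are all consistent with measuring the information in an event of probability P(A) as -log2 P(A) bits. A minimal Python sketch that reproduces them under that assumption follows; the function name `information_bits` is only illustrative and does not come from the original notes.

```python
import math

def information_bits(p: float) -> float:
    """Information, in bits, gained on learning of an event of probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    # log2(1/p) is the same quantity as -log2(p), written to avoid printing -0.0
    return math.log2(1.0 / p)

print(information_bits(0.5))    # fair coin, either face
print(information_bits(0.25))   # four-sided dice, any one face; also 'tails' on the biased coin
print(information_bits(1.0))    # two-headed coin known to be a trick coin
print(information_bits(0.75))   # 'heads' on the biased coin, roughly 0.42
```

An event of probability zero is rejected rather than given infinite information; that is a design choice of this sketch, not something the text prescribes.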
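Entropy

Entropy tells us the average information in a probability distribution over a sample space S. It is defined to be
\[ H = -\sum_{v \in S} P(v) \log_2 P(v) \]

Examples

The fair coin:
\[ H = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \ \text{bit} \]

That biased coin, P(head)=0.75, P(tail)=0.25:
\[ H = -0.75\log_2 0.75 - 0.25\log_2 0.25 \approx 0.81 \ \text{bits} \]

A biased four-sided dice, p(a)=1/2, p(c)=1/4, p(g)=p(t)=1/8:
\[ H = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{8}\cdot 3 + \tfrac{1}{8}\cdot 3 = 1.75 \ \text{bits} \]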
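Taking entropy to be the probability-weighted average of -log2 P(v) over the sample space, the three examples can be checked numerically. This is a sketch only: `entropy_bits` is a hypothetical helper name, and terms with zero probability are taken to contribute nothing (the usual convention).

```python
import math

def entropy_bits(probs):
    """Entropy, in bits, of a discrete distribution given as a sequence of probabilities."""
    if any(p < 0.0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must be nonnegative and sum to one")
    # Zero-probability outcomes are skipped: p*log2(1/p) tends to 0 as p tends to 0.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0.0)

print(entropy_bits([0.5, 0.5]))                 # the fair coin
print(entropy_bits([0.75, 0.25]))               # the biased coin
print(entropy_bits([0.5, 0.25, 0.125, 0.125]))  # the biased four-sided dice
```

The biased coin comes out at less than one bit per toss on average, even though a single ‘tails’ carries two bits, because tails is the rarer outcome.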
Theorem H1

(The result is "classic" but this is from notes taken during talks by Chris Wallace (1988).)

If (p_{i})_{i=1..N} and (q_{i})_{i=1..N} are probability distributions, i.e. each is nonnegative and sums to one, then the expression
\[ \sum_{i=1}^{N} -p_i \log_2 q_i \]
is minimised when q_{i} = p_{i} for all i.
Proof:
First note that to minimise f(a,b,c) subject to g(a,b,c)=0,
we consider f(a,b,c)+λ.g(a,b,c).
We have to do this because a, b & c are not independent;
they are constrained by g(a,b,c)=0.
If we were just to set d/da{f(a,b,c)} to zero
we would miss any effects that ‘a’ has
on b & c through g( ).
We don't know how important these effects are in advance,
but λ will tell us.
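
Applied to the theorem, the argument can be sketched as follows. This is a reconstruction under stated assumptions, not the original notes: the quantity minimised is taken to be the sum of -p_i log q_i over i, the constraint is that the q_i sum to one, and natural logarithms are used because switching to log2 only multiplies everything by the constant 1/ln 2.

\[
\begin{aligned}
F(q_1,\dots,q_N,\lambda) &= -\sum_{i=1}^{N} p_i \ln q_i + \lambda\Big(\sum_{i=1}^{N} q_i - 1\Big) \\
\frac{\partial F}{\partial q_i} &= -\frac{p_i}{q_i} + \lambda = 0
  \quad\Rightarrow\quad q_i = \frac{p_i}{\lambda} \\
\sum_{i=1}^{N} q_i = 1 &\quad\Rightarrow\quad \frac{1}{\lambda}\sum_{i=1}^{N} p_i = 1
  \quad\Rightarrow\quad \lambda = 1 \\
&\quad\Rightarrow\quad q_i = p_i
\end{aligned}
\]

The stationary point is a minimum because each term -p_i ln q_i is convex in q_i (its second derivative, p_i / q_i^2, is nonnegative).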
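Corollary (Information Inequality)

\[ \sum_{i=1}^{N} p_i \log_2 \frac{p_i}{q_i} \ \ge\ 0 \]
with equality if and only if p_{i} = q_{i} for all i.

Kullback-Leibler Distance

The left-hand side of the information inequality,
\[ \sum_{i=1}^{N} p_i \log_2 \frac{p_i}{q_i} \]
is called the Kullback-Leibler distance, or relative entropy, between the distributions (p_{i}) and (q_{i}). It is never negative and, in general, it is not symmetric in p and q.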
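
To make the Kullback-Leibler distance concrete, here is a small sketch that takes it to be the sum over i of p_i log2(p_i / q_i) and evaluates it for the biased and fair coins used earlier. The helper name `kl_bits` is hypothetical, and returning infinity when some q_i is zero while p_i is positive is an assumed convention.

```python
import math

def kl_bits(p, q):
    """Kullback-Leibler distance, in bits: sum of p_i * log2(p_i / q_i)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue          # a zero-probability outcome contributes nothing
        if qi == 0.0:
            return math.inf   # assumed convention: q rules out something p allows
        total += pi * math.log2(pi / qi)
    return total

biased = [0.75, 0.25]
fair = [0.5, 0.5]
print(kl_bits(biased, fair))    # about 0.19 bits
print(kl_bits(fair, biased))    # about 0.21 bits, so the distance is not symmetric
print(kl_bits(biased, biased))  # 0.0, the equality case of the information inequality
```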
Exercise
Notes
 L. Allison 1999
Thanks to Dean McKenzie for the KL references.

© L. Allison, www.allisons.org/ll/ (or as otherwise indicated).