Dynamic Programming

1-D Dynamic Programming Algorithm

The 1-D dynamic programming problem is: Given vertices, v_lo, v_lo+1, ..., v_hi-1, v_hi, directed edges (v_i,v_j) for all i and j where lo≤i<j≤hi, and an edge cost function, edgeCost i j, find an optimal path, [v_i₀, v_i₁, ..., v_{i_m}], from v_lo to v_hi.

This is a special case of the shortest-path problem; here the graph is a complete, directed acyclic graph as determined by the ordering on vertices. It can therefore be solved by a variation on the dynamic programming strategy which is also the basis of Prim's minimum spanning tree algorithm and Dijkstra's single-source shortest paths algorithm.

The algorithm maintains paths from v_lo to a set of "undone" vertices, initially containing v_lo+1, ..., v_hi; it also keeps a set of "done" vertices, initially empty. The starting path from v_lo to v_i consists of the single edge (v_lo,v_i) and is unlikely to be optimal. The algorithm repeatedly finds the best undone vertex, v_b, that is closest to v_lo and moves it to the "done" set. v_b may also reveal shorter paths from v_lo via v_b to remaining undone vertices. Enough information is kept to trace an optimal path backwards from v_hi to v_lo when v_hi becomes done.


module DPA1D (dpa1D) where

dpa1D lo hi edgeCost =
-- Given vertices  lo, lo+1, ..., hi -- and directed,
-- +vely weighted edges (i,j,edgeCost i j) for all i < j,
-- return an optimal path from lo to hi with its cost,
-- i.e. ([lo, i, j, ..., hi], c) where lo < i < j < ... < hi.
 let
  dpa undone done =
    let (b@(_,v,_)) = best undone -- closest (to lo) of the undone
    in if v == hi then b:done
       else dpa (removeAndUpdate b undone) (b:done)

  undone = map (\j -> (lo, j, edgeCost lo j)) [lo+1..hi]  -- initially

  removeAndUpdate (edge@(i,j,c)) edges =
   let
    rmvEdge ((edge'@(_,j',_)):es) =
      if j==j' then map update es  -- drop edge'; seek shortcuts
      else edge' : (rmvEdge es)    -- else j > j'
    -- assumes edge (i,j) => i < j, and that unDone sorted on j.

    update edge'@(_,j',c') =       -- does j help j' ?
      let viaJ = c + edgeCost j j' -- ...(_,j,c)(j,j',c') ?
      in if c' > viaJ then (j,j',viaJ) -- improvement or...
         else edge'                    -- ...not

   in rmvEdge edges

  better (e@(_,_,c)) (e'@(_,_,c')) = if c' >= c then e else e'
  best = foldl1 better

  trace k edges path =         -- trace back optimal soln...
    if k <= lo then k:path     -- ...from k down to lo
    else let ((i,j,c):edges') = edges
         in if j == k then trace i edges' (k:path)
         else trace k edges' path

  (edges @ ((e1@(i,j,cost)):_)) = dpa undone []

 in (trace hi edges [], cost)
-- ---------------------------------------LA--UWA-&-U.York--2004--

(About [Haskell, ghc, *.hs].)

This problem appears in various disguises, for example in inserting line breaks for paragraph layout and in polygon fitting. It also appears in the inference and data-compression problem (Farr & Wallace 2002) described in the next section.

SMML Code-Book for the 2-State (~Binomial) Distribution

A transmitter and a receiver want to share the results of n Bernouilli trials, i.e. n throws of a coin, carried out by the transmitter, by sending them over a data link. They could use a fixed code, e.g. based on pr(H)=0.5 in which case the data would be sent at 1-bit per throw. But they might be able to do better than this if the coin is biased.

The transmitter, who is the one making the trials, is in a position to estimate θ=pr(H) and to send the results using a code based on θ, but the receiver cannot decode such data unless told the estimate in a header to the message. The transmitter and receiver need a "code-book" for (the parameter of) the 2-state distribution.

Just considering the number of heads, #H, that could turn up, this quantity lies in [0..n]; there are n+1 possible values for #H. It is clearly impossible to infer pr(H) more precisely than something like ±1/(2×(n+1)) and it turns out that a sensible precision is much coarser than this.

The code-book contains a partition of [0..n] (which implies a partition of [0.0..1.0]). The transmitter conducts the trials and selects the entry for the part of the partition into which #H falls. The message header is the code-word for that part. The transmitter then transmits the trial results using the code for {H,T} based on the estimate corresponding to the part. The receiver now has the estimate and can decode the results of the trials.

The 1-D DPA can be used to partition [0..n] optimally(F&W 2002). It is convenient to set lo=-1 and hi=n. This example illustrates the correspondence between a path and a partition:

path	[ -1, 1, 2,5 ]
edges	[(-1,1), (1,2), (2,5)]
-----> ----------> ---------------> -1 0 1 2 3 4 5
partition	[[0..1], [2..2], [3..5]]

The cost of an edge (i,j), i.e. part [i+1..j], is the expected message length over those outcomes containing more than i and at most j heads. A prior probability distribution on θ is needed to complete the calculation; the uniform distribution over [0.0..1.0] is used below for simplicity.


module Binomial (module Binomial) where
import DPA1D

binomial n =  -- compute the SMML code-book for n Bernouilli trials
 let
  n' = fromIntegral n  -- often need n in real ratios etc.
  n1'= n'+1            -- ditto, |[0..n]|

  codeTrials p h n =   -- p = pr(h), h = #head, n = #trials
    let t = n-h
    in -(if h > 0 then (fromIntegral h)*log p     else 0.0)
       -(if t > 0 then (fromIntegral t)*log (1-p) else 0.0) -- no 0*inf 

  interval lo hi =  -- expected msg len due to single interval [lo..hi]
    let cases  = hi-lo+1           -- |[lo..hi]|
        cases' = fromIntegral cases
        est    = fromIntegral(lo+hi) / (2.0 * n')
    in (cases' * log(n1'/cases')                   -- uniform prior
        + codeTrials est (cases*(hi+lo) `div` 2) (cases*n)  -- m.l. 
       ) / n1'                                    -- in expectation

  edgeCost i j = interval (i+1) j  -- (i,j) covers [i+1..j]

  (path, c) = dpa1D (-1) n edgeCost
  parts = zip (map (+1) path) (tail path)  -- partition [0..n]
 in (parts, c)
-- --------------------------------------------LA--UWA-&-U.York--2004--

A simple main program runs the algorithm through its paces:

 
module Main where
import DPA1D
import Binomial

main =
  putStrLn ("SMML code-book for 2-state (~binomial) distribution\n" ++
         "key: n, partition, Expected msg length (bits)")
 >> putStrLn( foldr (\n -> \theRest ->
      let (p, msgl) = binomial n
      in (show n) ++ ", " ++ (show p) ++ ", "
         ++ (show (msgl/log 2)) ++ "\n" ++ theRest
      ) ""
     [1..30] )  -- a range of n
-- ---------------------------------------LA--UWA-&-U.York--2004--

E.g.

SMML code-book for 2-state (~binomial) distribution
key: n, partition, Expected msg length (bits)
1, [(0,1)], 1.0
2, [(0,2)], 2.0
3, [(0,0),(1,3)], 2.88
4, [(0,0),(1,3),(4,4)], 3.77
5, [(0,0),(1,4),(5,5)], 4.58
6, [(0,0),(1,5),(6,6)], 5.43
7, [(0,3),(4,7)], 6.25
8, [(0,0),(1,5),(6,8)], 7.04
9, [(0,0),(1,5),(6,9)], 7.83
10, [(0,0),(1,4),(5,9),(10,10)], 8.63
...

Farr and Wallace (2002) show that the SMML code-book problem is feasible, that is computable in polynomial time, for probability distributions having 1-D discrete parameter spaces. "The problem in general is [...] NP-hard." "The complexity of the trinomial [3-state] case remains open" but they give a good heuristic for it; the trinomial problem has two parameters, θ₀=pr(0) and θ₁=pr(1), pr(2)=1-θ₀-θ₁.

Approximate Code-Book for the Normal (Gaussian) Distribution

The normal distribution causes us two difficulties: Its parameter space, (μ,σ), is two dimensional and both parameters are continuous. We can appoximate its code-book by considering μ|σ and by quantising the range of σ. This is to make two simplifying assumptions: An approximate code-book entry corresponds to a rectangle in parameter space which is suboptimal in general, and σ is quantised.

μ is an "origin" parameter: There is no nett effect on the likelihood if μ and all data are increased by the same value x. σ is a "scale" parameter: If μ=0 (see above), there is no nett effect on the likelihood if σ and all data are multiplied by the same value x>0. Taking advantage of these facts, σ quanta are chosen so that their log values are equally spaced and, for a given σ interval, the range of μ is divided into some number of equal sized intervals.

The expected "cost" of using one estimate N_mEst,sEst per "patch" (mLo,mHi)×(sLo,sHi) is the cost of sending the estimate plus the expected KL-distance from the (untransmittable) maximum likelihood estimate to it. This expected KL-distance be calculated by integration of the KL-distance from N_maxLH to N_mEst,sEst over the patch. Making the patches smaller reduces this cost to the data but it also increases the cost of stating the estimates, giving a trade-off to be optimised.


module Normal (module Normal) where
import DPA1D

normal nData mLwb mUpb sLwb sUpb nS =
 let      -- m mean,   s sigma
  nS' = fromIntegral nS
  sRange  = sUpb - sLwb
  logSlwb = log sLwb
  dLogS   = (log sUpb - logSlwb) / nS'
  sVal n  = exp(logSlwb + (fromIntegral n)*dLogS)

  sInterval i j =
   let si = sVal i
       sj = sVal j
   in ( log(sRange/(sj-si))   -- code sEst, uniform prior
      + snd( doMean si sj )
      ) * (sj - si) / sRange  -- expectation due to [i..j]

  doMean sLo sHi =  -- given [sLo...sHi], optimally divide...
   let              -- ...[mLwb...mUpb] into equal parts
    sEst = (sLo + sHi) / 2

    divide n =
      let n' = fromIntegral n
          dM = (mUpb - mLwb) / n'
          mEst = dM / 2
          eMsgL =
           log n'   -- code for mEst, uniform prior
           + nData * (integrateKL 0 dM sLo sHi mEst sEst)
                   / ((sHi - sLo) * dM)
           -- i.e. + nData * avg data "error"
      in (n, eMsgL) -- divide

    best (xc:xcs) =  -- linear-search for the (only) min
      let b (xc@(x,c)) ((xc'@(x',c')):xcs) =
            if c' < c then b xc' xcs else xc
      in b xc xcs

   in best (map divide [1.. ])  -- doMean

  (sNums, c) = dpa1D 0 nS sInterval
  sVals = map sVal sNums
  sIntvls = zip sVals (tail sVals)
  mNs = map (\(sLo,sHi) -> fst(doMean sLo sHi)) sIntvls

 in ((zip sIntvls mNs),  c)   -- normal


integrateKL m1Lo m1Hi s1Lo s1Hi m2 s2 =  -- integrate KL over m1 & s1
 let f m1 = ((m1-m2)**3) / (6*sqr s2) + m1*(log s2 - 0.5)
     g s1 = (s1**3) / (6*sqr s2) - s1*(log s1 - 1)
     -- NB d/dx {x.log x - x} = x/x + log x - 1 = log x
 in (f m1Hi - f m1Lo)*(s1Hi - s1Lo) + (m1Hi-m1Lo)*(g s1Hi - g s1Lo)

sqr x = x*x  -- maybe faster than x**2 ?
-- --------------------------------------------LA--UWA-&-U.York--2004--


module Main where
import DPA1D
import Normal

main =
  putStrLn ("Approximate code-book for Normal (Gaussian) distribution\n"
   ++ "key: ((sigmaLo,sigmHi),nMuIntervals)... E |H|+n*KL_dist (nits)" )
 >> putStrLn(
      let (is,c) = (normal 30  0.0 2.0  0.1 1.0  100)
      in (foldr (\x -> \rest -> (show x) ++ "\n" ++ rest) "" is)
	 ++ show c
    )
-- ---------------------------------------LA--UWA-&-U.York--2004--

E.g. n=30; μ 0.0..2.0; σ 0.1..1.0 (100 quanta):

Approximate code-book for Normal (Gaussian) distribution
key: ((sigmaLo,sigmHi),nMuIntervals)... E |H|+n*KL_dist (nits)
((0.10,0.16),24)
((0.16,0.25),15)
((0.25,0.40),10)
((0.40,0.63),6)
((0.63,1.0), 4)
4.32656332488434

References

G. E. Farr and C. S. Wallace. The complexity of strict minimum message length inference. Comp. J., 45(3), pp.285-292, 2002.: Note that they give expected message lengths for transmitting #H only, excluding the trial results.