Lookup

Lookup Tables

A table is used to store elements. An element may be inserted in the table, searched for (looked up) in the table, and sometimes deleted from the table. The idea of a table is very general and important in information processing systems. For example, a university data base might contain tables of departments, staff, courses, lectures, students, exam results. A compiler might use tables of keywords, variables, types, constants, procedures and functions. A table is a mutable data structure, it has state, it changes with time. Its properties are therefore dynamic and are okup expressed by pre and postconditions (assertions):

Type:

  table of Element_Type

Operations:

  clear  :table
  insert :Element_Type × table
  delete :Element_Type × table
  lookup :Element_Type × table -> boolean
  empty  :table -> boolean

Rules, for all x:Element_Type and T:table :

  1. clear(T)
     assert empty(T)
     assert not lookup(x,T)

  2. assert T=T' and not lookup(x, T)
     insert(x, T)
     assert lookup(y,T) = x=y or lookup(y,T')

  3. assert T=T' and not lookup(x,T')
     insert(x,T)
     delete(x,T)
     assert T=T'

The Table Abstract Data Type.

Clear, insert and delete change the state of the table.

When the table has just been cleared, lookup returns false for any element. When an element is added to the table, lookup returns true for that element and the contents are otherwise unaffected. If a new element is added to the table and then deleted there is no net effect on the table. Two further situations need to be considered. If an element is in the table and an attempt is made to insert it again, the result could either be an error (4a) or no change (4b). If an element is not in the table and an attempt is made to delete it, the result could either be an error (5a) or no change (5b):

4a assert lookup(x,T)      4b assert T=T' and lookup(x,T)
   insert(x, T) % error       insert(x, T)
                              assert T=T'  % no change

5a assert not lookup(x,T)  5b assert T=T' and not lookup(x,T)
   delete(X, T) % error       delete(x, T)
                              assert T=T'  % no change

Application Dependent Cases.

Each possibility is reasonable in some application.

Sometimes the element type consists of a key plus some other attributes or data. The key might be a number or a name (string); it is something unique to identify an element. In this case the lookup operation usually returns the associated attributes rather than just true or false. If there is no extra data we call the element itself the key.

It is important for a program to be efficient: to use little space and to run quickly. How these aims can be achieved depends on the properties of the data and particularly the key, on the operations on the table and on the frequency and pattern of use of the different operations.

We say that the key type is sparse if the key values in use form a very small fraction of the number of possible values in the type. We call it a dense type if most of the possible values come into play. These are only loose definitions and each case must be judged on its merits. For example, if keys are alpha-numeric strings they are necessarily sparse because there are infinitely many possible strings. Even if the string length is limited but reasonably large there are a very great many strings. On the other hand, the days of the week form a dense enumerated type. The sparseness of (some) key-types has a large influence on the method of choice for implementing a table. Indeed many of the data structures described in this book have been devised precisely to deal with sparse types efficiently.

Sets by Tables

A table where the element is all key can also be thought of as a set. If the element type is a dense index type then the table can be implemented as a bit mask. Lookup is done by the membership test, in, addition and deletion by set union and difference. There is a limit of the range of element values in some programming languages the provide a built-in `set' data-type. A large, dense set can be implemented by an array of boolean so this can also be used to implement the corresponding table. Conversely, a sparse set can be implemented by the methods for tables described below.

Table by Unsorted Array

A table to hold a limited number of elements with a sparse key can be implemented using an array and a counter. Initially the table is empty and the counter is set to zero. When a new element is inserted the counter is incremented and the new element stored at that index. The elements are in no particular order other than that of time of insertion. Lookup is by the linear-search algorithm which steps sequentially through the array.

If there are n elements in the table, the search takes O(1) time in the best case and O(n) time in the average and worst cases. Deletion first requires a search to find the element to delete; after that it takes O(1) time to actually remove it: it is over-written with the last element in the array and the counter is decremented. On the other hand, insertion is fast and takes a constant amount of time, provided that the element is known to be absent in which case a search is unnecessary.

Notes.

See the [String Searching] problem which is related to linear search.

Table by Sorted Array

A person does not use linear search to find a name in a telephone directory, except perhaps for Dustin Hoffman's character in `The Rain Man'. Rather the directory is opened at a likely place, and the key compared with names on that page. If the key is alphabetically before the names on the page, the search is continued in the front part of the directory otherwise the back part. The fact that the directory is alphabetically ordered or sorted is essential to this process. The binary-search algorithm makes use of an ordering of the key type in a similar way. The table is implemented by a sorted array. Two "pointers" are used to delimit the search area, initially the occupied part of array. The middle element is compared with the key and either the lower pointer moved up or the upper pointer moved down as appropriate.


eg. Search for 3 in array containing 1 2 3 4 5 6 7 8

  lo              mid            hi
   ?   ?   ?   ?   5   ?   ?   ?

select lower half


  lo      mid     hi
   ?   ?   3   ?   -   -   -   -

select upper half


          lo  mid hi
   -   -   ?   4   -   -   -   -

select lower half


          lo  hi
   -   -   3   -   -   -   -   -
           ^
           3 is found

Example Binary Search.

The process terminates when the search has been restricted to a single element which must be the search key if it is present.


{Search for key in A[1..N]   (Language=Pascal)}

lo := 1;  hi := N+1;

while lo < hi-1 do {i.e. there are at least 2 elements}
begin
{INVARIANT: search key in A[1..N] iff key in A[lo..hi-1]}

   mid := (lo+hi) div 2;
   {assert: lo < mid < hi}

   if key >= A[mid] then {must be in right part}
      lo := mid
   else {i.e. key < A[mid],  search left}
      hi := mid
end;

found := A[lo]=key {?is it there?}

Binary Search

Change the data in the HTML FORM below, click `go', and experiment. The region of the array from `lo' to `hi-1' is [bracketted] and a[mid] is starred `*' in the trace:

Correctness

The algorithm is short but there are some subtle points to observe. Firstly, if the key is present it lies between `lo' inclusive and `hi' exclusive - this is the loop invariant. The invariant is true at the start of the first iteration because lo=1 and hi=n+1. The loop body ensures that the invariant remains true at the start of each subsequent iteration. This is so because of the choice of test in the central if statement. If the mid element is less than or equal to the key and lo is moved up to mid the invariant is still true. The converse is that the mid element is greater than the key and when hi is moved down the invariant is still true. Note the connection between the hi-1 in the invariant and the `>=' test. When hi=lo+1 the search has narrowed to just one element, A[lo..hi-1] = A[lo..lo] = A[lo], which can be compared to the search key for equality.

Termination

The algorithm terminates because the two integer-valued pointers are moving closer together and the exit test must eventually succeed. If hi>=lo+1 then hi>lo+2 and so lo<mid<hi. Therefore regardless of whether `lo' is moved up to mid or hi is moved down to mid, lo and hi move closer together at each iteration and closer to satisfying the termination test.

Testing

It is very easy to get the details of binary search wrong when coding it so it is a good idea to test it on at least the following cases:

an empty array!
an array of one element, (a) equal to the key or (b) less than it or (c) greater than it
several elements with the key less than the 1st or
the key greater than the last element
the key equal to the 1st, a middle, or the last element
the key not in the array and falling between various positions

Complexity

Finally, the algorithm takes O(log(n)) time as each iteration halves the search area. Note that it is not worth testing for equality within the loop and exiting. To do so would add an extra test and time to every iteration. The average number of iterations saved would be limited by


1 × 1/2 + 2 × 1/4 + 3 × 1/8 + ...

which is a convergent series; it converges to two. So the average number of iterations saved would be bounded by two but the (averge and worst-case) loss in time due to the extra test in each iteration executed would be unbounded [Dijkstra].

Binary-search takes O(log(n)) time in all cases which is much faster than linear search. If the table does not change once created, ie. there is no insertion or deletion, it can be sorted once in O(n*log(n)) time; see the chapter on sorting. Insertion and deletion must keep the array sorted and they take O(n) time at worst and on average which is relatively slow. This is because insertion and deletion to the left of position n require elements to be moved up or down to make space or to fill the hole. Using a sorted array may have an extra benefit if the elements are also processed sequentially, for example to print alphabetical lists.

See also the section on binary search trees in the chapter on trees, and solving equations f(x)=0.

Hash Tables

For an apparently crazy idea hashing works remarkably well. A hash function maps key values onto the index type of an array which implements the hash table. Elements are accessed directly using the hash index as an array index. The surprising part is that the best hash functions are essentially pseudo-random functions. The ideal hash function maps the actual key values uniformly onto the index type; unfortunately the ideal is not always achieved.

For many purposes, a "reasonable" hash function on strings is to take the binary value of the 1st character, multiply by three, add the binary value of the 2nd character, multiply by three, and so on, finally adding the binary value of the last character. The hash value is this total modulo the size of the hash table.

The multiplication by three in the hash function ensures that every character contributes to the least significant bit of the hash index. Changing any one character in the string by one will give a different hash index if the size of the hash table is even, and it is often a power of two. Multiplication by an even number would cause the effect of the leading characters of a long string to be lost (exercise: why?).

It is impossible to design a good hash function for all circumstances. In general it is necessary to analyse the statistical properties of the key type and design a hash function to suit. For example, if a hash table is used to manage a table of identifiers in a compiler, (some) programmers are fond of names like x1, x2, x3, y1, ..., z1, ... some of which are liable to hash to the same index. Such an occurrence is called a collision. Collisions are in fact inevitable because the key type is infinite and cannot otherwise be squeezed into a finite index range. The solution is to hash the key and examine the table at that index. If a different key is already at that position then a collision has occurred. A probing strategy then looks at a sequence of positions until a blank is found.

Initial Hash Table: ----- carol ----- bill anne ----- ----- insert("dan") ------>-| |collision, hash("dan")=hash("bill") | | | |probe | | |probe v v v ----- carol ----- bill anne ----- ----- Final Hash table: ----- carol ----- bill anne dan -----: Collisions and Linear Probing.

Linear probing is the simplest probing strategy. It looks at successive elements, modulo the table size, until a blank is found. The new element is inserted there.

The same probing strategy is used in the search.

Entries may cluster together in blocks as a result of collisions as the table fills up. Linear probing amounts to using a linear search on the collision blocks. Its average time complexity is linear in the size of the average block. A good hash function will minimise the size of blocks.

Quadratic probing is a more sophisticated probing strategy.

The step-size increases linearly so the indices generated, i, i+1, i+3, i+6, ..., are described by a quadratic in the step number modulo the table size. Quadratic probing is not guaranteed to probe every location in the table - an insert could fail while there is still an empty location however hash tables are rarely allowed to get full - see later. The same probing strategy is used in the associated search routine.

The virtue of quadratic probing is that the probe locations for two near-miss keys, as opposed to two colliding keys, are not closely related and this reduces the likelihood of large collision blocks forming. With linear probing on the other hand, collision blocks following from similar indices tend to join together slowing the search process. For example, if one set of keys all hash to `6' and another set hash to `7' then the collision blocks of these sets will over-run and merge with each other under linear-probing. With quadratic probing on the other hand, the probes after 6 are at positions 7, 9, 12, 16, ... and the probes after 7 are at positions 8, 10, 13, 17, .... This tends to give small disjoint blocks.

Simulations can give an estimate of the performance of a hash table. The time taken by insertion and lookup is dependent on the number of collisions. The results given here are based on randomly generated keys with a uniform distribution. This is the best possible case; typical results may therefore be somewhat worse.

fill  collisions/look
 10%   0.04
 20%   0.09
 30%   0.15
 40%   0.27
 50%   0.39
 60%   0.68
 70%   0.92
 80%   1.72
 90%   2.29
100%   4.92

Average Collisions per Lookup (Linear Probing)

fill  collisions/look
 10%   0.04
 20%   0.10
 30%   0.15
 40%   0.25
 50%   0.38
 60%   0.53
 70%   0.67
 80%   1.09
 90%   1.40

Average Collisions per Lookup (Quadratic Probing)

It can be seen that with either probing strategy, a hash table less than 60% full is very efficient indeed. Average insertion and search time is essentially constant, O(1). Quadratic probing performs better for a nearly full table.

If the number of entries in the table is liable to grow, a common strategy is to increase the table size when the number of entries reaches a certain percentage of the table's current capacity or when the collision rate reaches some critical level. Elements may have to be re-hashed when moved into the enlarged table. The table may also be shrunk if the number of entries falls below some limit.

Another way of dealing with collisions is known as chaining. Here, each hash table entry contains a pointer to a list of colliding elements. The lists are created in a collection. Deletion is more easily implemented under chaining. A linear search is used within a list. Provided the hash function is good, each list is kept short and the operations are fast. See the chapter on lists for the necessary techniques.

The hash function randomises the order in which elements are stored in the hash table. This makes it difficult to process the elements sequentially and may rule against the use of a hash table in some applications.

Notes

See Rabin's [String Searching] algorithm for an application of hashing.

Tables by Tree Structures

Various tree-based data-structures can be used to implement tables efficiently, e.g.

Exercises

A table is implemented by an array of 1000 elements; initially there are 600 members. What are the approximate relative efficiencies of using
1. linear search and an unsorted array
2. binary search and a sorted array
3. hashing
under the following circumstances
1. 1000 insertions, 1000 deletions and 1000 lookups (interleaved)
2. 100 insertions, 100 deletions and 2800 lookups (interleaved)
Implement a delete operator for a hash table under linear-probing. Think carefully about what happens if x and y collide, and the sequence
```
insert(x,T); insert(y,T); delete(x,T); lookup(y,T)
```
is performed. One possible technique is to mark entries such as x as `deleted'. Search terminates at an empty array slot but continues over a deleted slot. Performance may deteriorate if there are many deletions. Think of other techniques.
Long: Implement a small data base of telephone numbers. The key type is a name, assumed unique. The associated data is a telephone number of up to 10 digits.

The data base commands are:

I name number -- insert a new entry

D name -- delete an entry

P name -- print the number associated with name

(i) Implement a suitable table as a class or module (depending on your programming language) which uses a hash-table internally and exports the table operations.
(ii) Implement an alternative table class/module which uses a sorted array and binary-search internally and exports the table operations.
(iii) Implement a command processor which makes use of a table class/module. Ideally each part should be carried out by a different programmer. Consider the issue of errors, such as trying to print a missing entry, and fully specify the module interface before starting to write code. Test the components in isolation before putting them together into a complete system.