A table is used to store elements. An element may be inserted in the table, searched for (looked up) in the table, and sometimes deleted from the table. The idea of a table is very general and important in information processing systems. For example, a university data base might contain tables of departments, staff, courses, lectures, students, exam results. A compiler might use tables of keywords, variables, types, constants, procedures and functions. A table is a mutable data structure, it has state, it changes with time. Its properties are therefore dynamic and are expressed by pre and postconditions (assertions):
Clear, insert and delete change the state of the table.
When the table has just been cleared,
lookup returns false for any element.
When an element is added to the table,
lookup returns true for that element
and the contents are otherwise unaffected.
If a new element is added to the table and then deleted
there is no net effect on the table.
Two further situations need to be considered.
If an element is in the table and an attempt is made
to insert it again, the result could either be an
Sometimes the element type consists of a key plus some other attributes or data. The key might be a number or a name (string); it is something unique to identify an element. In this case the lookup operation usually returns the associated attributes rather than just true or false. If there is no extra data we call the element itself the key.
It is important for a program to be efficient: to use little space and to run quickly. How these aims can be achieved depends on the properties of the data and particularly the key, on the operations on the table and on the frequency and pattern of use of the different operations.
We say that the key type is sparse if the key values in use form a very small fraction of the number of possible values in the type. We call it a dense type if most of the possible values come into play. These are only loose definitions and each case must be judged on its merits. For example, if keys are alpha-numeric strings they are necessarily sparse because there are infinitely many possible strings. Even if the string length is limited but reasonably large there are a very great many strings. On the other hand, the days of the week form a dense enumerated type. The sparseness of (some) key-types has a large influence on the method of choice for implementing a table. Indeed many of the data structures described in this book have been devised precisely to deal with sparse types efficiently.
A table where the element is all key can also be thought of as a set. If the element type is a dense index type then the table can be implemented as a bit mask. Lookup is done by the membership test, in, addition and deletion by set union and difference. There is a limit of the range of element values in some programming languages the provide a built-in `set' data-type. A large, dense set can be implemented by an array of boolean so this can also be used to implement the corresponding table. Conversely, a sparse set can be implemented by the methods for tables described below.
A table to hold a limited number of elements with a sparse key can be implemented using an array and a counter. Initially the table is empty and the counter is set to zero. When a new element is inserted the counter is incremented and the new element stored at that index. The elements are in no particular order other than that of time of insertion. Lookup is by the linear-search algorithm which steps sequentially through the array.
If there are n elements in the table, the search takes O(1) time in the best case and O(n) time in the average and worst cases. Deletion first requires a search to find the element to delete; after that it takes O(1) time to actually remove it: it is over-written with the last element in the array and the counter is decremented. On the other hand, insertion is fast and takes a constant amount of time, provided that the element is known to be absent in which case a search is unnecessary.
A person does not use linear search to find a name in a telephone directory, except perhaps for Dustin Hoffman's character in `The Rain Man'. Rather the directory is opened at a likely place, and the key compared with names on that page. If the key is alphabetically before the names on the page, the search is continued in the front part of the directory otherwise the back part. The fact that the directory is alphabetically ordered or sorted is essential to this process. The binary-search algorithm makes use of an ordering of the key type in a similar way. The table is implemented by a sorted array. Two "pointers" are used to delimit the search area, initially the occupied part of array. The middle element is compared with the key and either the lower pointer moved up or the upper pointer moved down as appropriate.
Change the data in the HTML FORM below, click `go', and experiment. The region of the array from `lo' to `hi-1' is [bracketted] and a[mid] is starred `*' in the trace:
The algorithm is short but there are some subtle points to observe. Firstly, if the key is present it lies between `lo' inclusive and `hi' exclusive - this is the loop invariant. The invariant is true at the start of the first iteration because lo=1 and hi=n+1. The loop body ensures that the invariant remains true at the start of each subsequent iteration. This is so because of the choice of test in the central if statement. If the mid element is less than or equal to the key and lo is moved up to mid the invariant is still true. The converse is that the mid element is greater than the key and when hi is moved down the invariant is still true. Note the connection between the hi-1 in the invariant and the `>=' test. When hi=lo+1 the search has narrowed to just one element, A[lo..hi-1] = A[lo..lo] = A[lo], which can be compared to the search key for equality.
The algorithm terminates because the two integer-valued pointers are moving closer together and the exit test must eventually succeed. If hi>=lo+1 then hi>lo+2 and so lo<mid<hi. Therefore regardless of whether `lo' is moved up to mid or hi is moved down to mid, lo and hi move closer together at each iteration and closer to satisfying the termination test.
It is very easy to get the details of binary search wrong when coding it so it is a good idea to test it on at least the following cases:
Finally, the algorithm takes O(log(n)) time as each iteration halves the search area. Note that it is not worth testing for equality within the loop and exiting. To do so would add an extra test and time to every iteration. The average number of iterations saved would be limited by
Binary-search takes O(log(n)) time in all cases which is much faster than linear search. If the table does not change once created, ie. there is no insertion or deletion, it can be sorted once in O(n*log(n)) time; see the chapter on sorting. Insertion and deletion must keep the array sorted and they take O(n) time at worst and on average which is relatively slow. This is because insertion and deletion to the left of position n require elements to be moved up or down to make space or to fill the hole. Using a sorted array may have an extra benefit if the elements are also processed sequentially, for example to print alphabetical lists.
For an apparently crazy idea hashing works remarkably well. A hash function maps key values onto the index type of an array which implements the hash table. Elements are accessed directly using the hash index as an array index. The surprising part is that the best hash functions are essentially pseudo-random functions. The ideal hash function maps the actual key values uniformly onto the index type; unfortunately the ideal is not always achieved.
For many purposes, a "reasonable" hash function on strings is to take the binary value of the 1st character, multiply by three, add the binary value of the 2nd character, multiply by three, and so on, finally adding the binary value of the last character. The hash value is this total modulo the size of the hash table.
The multiplication by three in the hash function ensures that every character contributes to the least significant bit of the hash index. Changing any one character in the string by one will give a different hash index if the size of the hash table is even, and it is often a power of two. Multiplication by an even number would cause the effect of the leading characters of a long string to be lost (exercise: why?).
It is impossible to design a good hash function for all circumstances. In general it is necessary to analyse the statistical properties of the key type and design a hash function to suit. For example, if a hash table is used to manage a table of identifiers in a compiler, (some) programmers are fond of names like x1, x2, x3, y1, ..., z1, ... some of which are liable to hash to the same index. Such an occurrence is called a collision. Collisions are in fact inevitable because the key type is infinite and cannot otherwise be squeezed into a finite index range. The solution is to hash the key and examine the table at that index. If a different key is already at that position then a collision has occurred. A probing strategy then looks at a sequence of positions until a blank is found.
Linear probing is the simplest probing strategy. It looks at successive elements, modulo the table size, until a blank is found. The new element is inserted there.
The same probing strategy is used in the search.
Entries may cluster together in blocks as a result of collisions as the table fills up. Linear probing amounts to using a linear search on the collision blocks. Its average time complexity is linear in the size of the average block. A good hash function will minimise the size of blocks.
Quadratic probing is a more sophisticated probing strategy.
The step-size increases linearly so the indices generated, i, i+1, i+3, i+6, ..., are described by a quadratic in the step number modulo the table size. Quadratic probing is not guaranteed to probe every location in the table - an insert could fail while there is still an empty location however hash tables are rarely allowed to get full - see later. The same probing strategy is used in the associated search routine.
The virtue of quadratic probing is that the probe locations for two near-miss keys, as opposed to two colliding keys, are not closely related and this reduces the likelihood of large collision blocks forming. With linear probing on the other hand, collision blocks following from similar indices tend to join together slowing the search process. For example, if one set of keys all hash to `6' and another set hash to `7' then the collision blocks of these sets will over-run and merge with each other under linear-probing. With quadratic probing on the other hand, the probes after 6 are at positions 7, 9, 12, 16, ... and the probes after 7 are at positions 8, 10, 13, 17, .... This tends to give small disjoint blocks.
Simulations can give an estimate of the performance of a hash table. The time taken by insertion and lookup is dependent on the number of collisions. The results given here are based on randomly generated keys with a uniform distribution. This is the best possible case; typical results may therefore be somewhat worse.
If the number of entries in the table is liable to grow, a common strategy is to increase the table size when the number of entries reaches a certain percentage of the table's current capacity or when the collision rate reaches some critical level. Elements may have to be re-hashed when moved into the enlarged table. The table may also be shrunk if the number of entries falls below some limit.
Another way of dealing with collisions is known as chaining. Here, each hash table entry contains a pointer to a list of colliding elements. The lists are created in a collection. Deletion is more easily implemented under chaining. A linear search is used within a list. Provided the hash function is good, each list is kept short and the operations are fast. See the chapter on lists for the necessary techniques.
The hash function randomises the order in which elements are stored in the hash table. This makes it difficult to process the elements sequentially and may rule against the use of a hash table in some applications.
Various tree-based data-structures can be used to implement tables efficiently, e.g.