A Bit-String Longest-Common-Subsequence Algorithm
Inf. Proc. Lett., Vol.23, Dec' 1986, pp.305-310, [doi:10.1016/0020-0190(86)90091-8].
L. Allison*, T.I. Dix*, Department of Computer Science, University of Western Australia, Nedlands, Western Australia 6009.
Abstract: A longest-common-subsequence algorithm is described which operates in terms of bit or bit-string operations. It offers a speedup of the order of the word-length on a conventional computer.
Keywords: longest common subsequence, edit distance, bit string
The longest-common-subsequence (LCS) problem is to find the maximum possible length of a common subsequence of two strings, 'a' of length |a| and 'b' of length |b|. Usually an actual LCS is also required. For example, using the alphabet A, C, G and T of genetic bases, an LCS of 'GCTAT' and 'CGATTA' is 'GTT' of length three.
Here an algorithm which requires O(|a|*|b|) operations on single bits or O(ceiling(|a|/w)*|b|) operations on w-bit computer words or O(|b|) operations on |a|-bit bit-strings, for a fixed finite alphabet, is presented. Although falling into the same complexity class as simple LCS algorithms, if w is greater than any additional multiplicative cost, this algorithm will be faster. If |a|<=w the algorithm is effectively linear in |b|. (An alphabet larger than |b| can be effectively reduced to |b| by sorting 'b' in time O(|b|*log(|b|)) and using index positions in the sorted string.)
The LCS problem is related to the edit-distance  or evolutionary-distance problem   where the minimum cost of editing string 'a' to string 'b' must be found. The elementary edit operations are to insert or delete one character or to change one character. There is a cost function, d(.,.), which can be extended to strings in the obvious way. A common choice for the elementary edit-costs is
The LCS problem and the edit-distance problem find applications in computer-science and molecular-biology     . An LCS algorithm can compare two files and, by finding what they have in common, compute their differences. It can be used to correct spelling errors and to compare two genetic sequences for homology (similarity).
Under plausible assumptions  , the worst case complexity of an LCS or edit-distance algorithm on strings over an infinite alphabet must be O(|a|*|b|) time. Masek and Paterson  give an O(|a|*|b|/log(min(|a|,|b|))) edit-distance algorithm for strings over a finite alphabet with modest restrictions on d(.,.). However this is faster than simple algorithms only for strings longer than about 200,000 characters. For special cases there are faster algorithms: for similar strings  , and for strings with few matching characters , but their worst-case running time is at least O(|a|*|b|).
The algorithm presented here certainly has complexity O(|a|*|b|) for a finite alphabet, but asymptotically the improvement in speed is close to w. The algorithm is insensitive to the number of matches between strings and to their similarity.
1.1. Matrix L
Conventionally a matrix Lij is defined:
This leads to simple dynamic-programming algorithms  upon which this one is based. A reflected representation of matrix L is typically used; the above representation is chosen to make the reading of certain bit-strings easier.
L has many interesting properties:
The rows and columns of Lij contain ascending values for increasing i and j. This prompted Hunt and Szymanski  and Hirschberg  to "contour" L and develop fast algorithms for special cases.
2. Bit-String Algorithm
The values in the rows of L, increase by at most one. This makes it possible to represent the information in L by a bit-matrix:
Rowi has either the same number of bits set or one more bit set than the previous row, rowi-1. New bits are set towards the left of a row. The length of an LCS of 'a' and 'b' is the number of bits in the top row. The 1's in a particular row tend to drift to the right as we look (up) at the next row. This is because as more of 'b' is used, an LCS of a given length can be found using no more of, and possibly less of, 'a'. The 1's mark the contour lines of matrix L.
Each letter in the alphabet defines a bit-string when it is compared with the elements of 'a':
Precomputing these alphabet-strings contributes O(|alphabet|*ceiling(|a|/w)+|a|) to the time complexity. For a fixed alphabet this is O(|a|); for a non-fixed alphabet this could be O(|a|*|b|) at worst. If the alphabet is small, |alphabet|<<|b|, the contribution to the total time can be ignored.
2.2 Matrix M
To calculate rowi, we use the ith character in string 'b', bi, to select the bi-string from the set of alphabet-strings. The 1's in rowi-1 "cut" the bi-string into segments:
Each segment extends from the position of a 1 in rowi-1 rightward to the position to the left of the next 1. If the left-hand bit of rowi-1 is zero, the left-most segment extends from the left of the bi-string to the position left of the first 1. Rowi is formed by setting only the right-most 1 bits in each segment of the bi-string. If a segment is all zero, the bit defining the left end of the segment should be set in rowi (that is the segmenting bit in rowi-1). This can be ensured by first or-ing rowi-1 into the bi-string.
It may be convenient to imagine an invisible 1 at position |a|+1 for a left-most segment which is zero, but this is unnecessary for the algorithm.
A 1 in rowi-1 marks the shortest piece of 'a' that can be used to make an LCS of a certain length with b1..i-1. Bringing bi into use, the best that can be done to extend that LCS is to use the first bi in 'a' to the left of the 1, if possible. This is marked by the right-most 1 bit in the next segment to the left.
In more detail, let x = rowi-1 Or bi-string.
Each segment of x can be read as a binary number. There is a "folk-technique" that decrementing a binary number changes the low-order bits  up to and including the least-significant 1. The correct decrement for each segment can be found by a logical left-shift of rowi-1 with a 1 carry-in:
A non-equivalence of the result and x sets the changed bits:
And-ing this with x gives the right-most bits in each segment:
This is rowi, row11 in this example.
rowi = x And ((x - (rowi-1 << 1)) != x) where x = rowi-1 Or bi-string Or And != bit-string operations << logical left-shift bit-string; right-hand bit set - |a|-bit integer subtraction
The bit-string operations can be done in units of a word-length. The shift and subtraction can also be done one word at a time, taking care that the carry-out-bit and the borrow-bit are propagated.
The number of bits set in a word can be counted in O(log2(w)) time; this is attributed to D.Muller in 1954 in . The number of 1's in the final row therefore can be calculated in O(ceiling(|a|/w)*log2(w)).
2.3 An LCS
An LCS, as opposed to just its length, can be obtained in at least two ways. Hirschberg's recursive technique  can be used to find an LCS in linear-space at the cost of slowing the algorithm by a factor of two.
Alternatively all rows of M can be kept, the space required is |a|*|b| bits, ceiling(|a|/w)*|b| words. This is quite practical for strings of the order of 1000 characters. Then an LCS can be recovered by finding the "corners" of the contours in L which stand out as patterns in M (see Fig 2). This takes O(|a|+|b|) time. The emboldened characters in Gig. 2 indicate an LCS.
3. Simulation Results
Both the simple O(|a|*|b|) algorithm   and the bit-string algorithm to calculate the length of an LCS have been implemented in C and run on a Vax-11/750 with a 32-bit word. For random strings and an alphabet of size 4 the resulting speedups were:
For comparison, an alphabet of size 256 resulted in:
A similar number of operations are in the inner loop of each algorithm. The bit-string algorithm is well suited to optimization since it works across rows of matrix M and alphabet-strings. For small alphabets, the bit-string algorithm clearly provides a considerable speedup.
The time complexity of the bit-string LCS algorithm on a computer with w-bit words for a fixed finite alphabet is
This gives a speedup over simple O(|a|*|b|) algorithms. The time does not depend on properties of 'a' and 'b'. For random strings and moderate alphabets, the bit-string LCS algorithm will be faster than the special case algorithms.
The algorithm might also be programmed on a vector-computer if carry-out and borrow bits can be propagated from word to word in a vector-shift and a vector-subtract operation. It could be "pipelined" in a diagonal fashion on a vector or parallel machine because the least significant bit (or word) of rowi+1 can be calculated as soon as the least significant bit (or word) of rowi has been calculated.
NB. The C-code is [here].
* as of 2005:
* as of 2001:
L. Allison & T. I. Dix. A Bit-String Longest-Common-Subsequence Algorithm. Inf. Proc. Letters, Vol.23, pp.305-310, Dec 1986.
Some later literature refers to 'bit vectors' and/or applies them to edit distance (approximate matching) v. longest common subsequence (LCS, LCSS) problems but the technique remains very similar.
↑ © L. Allison, www.allisons.org/ll/ (or as otherwise indicated).
Created with "vi (Linux)", charset=iso-8859-1, fetched Monday, 12-Nov-2018 22:40:44 EST.
Free: Linux, Ubuntu operating-sys, OpenOffice office-suite, The GIMP ~photoshop, Firefox web-browser, FlashBlock flash on/off.