A BitString LongestCommonSubsequence Algorithm 

Inf. Proc. Lett., Vol.23, Dec' 1986, pp.305310, [doi:10.1016/00200190(86)900918].L. Allison^{*}, T.I. Dix^{*}, Department of Computer Science, University of Western Australia, Nedlands, Western Australia 6009.Abstract: A longestcommonsubsequence algorithm is described which operates in terms of bit or bitstring operations. It offers a speedup of the order of the wordlength on a conventional computer. Keywords: longest common subsequence, edit distance, bit string 1. IntroductionThe longestcommonsubsequence (LCS) problem is to find the maximum possible length of a common subsequence of two strings, 'a' of length a and 'b' of length b. Usually an actual LCS is also required. For example, using the alphabet A, C, G and T of genetic bases, an LCS of 'GCTAT' and 'CGATTA' is 'GTT' of length three. Here an algorithm which requires O(a*b) operations on single bits or O(ceiling(a/w)*b) operations on wbit computer words or O(b) operations on abit bitstrings, for a fixed finite alphabet, is presented. Although falling into the same complexity class as simple LCS algorithms, if w is greater than any additional multiplicative cost, this algorithm will be faster. If a<=w the algorithm is effectively linear in b. (An alphabet larger than b can be effectively reduced to b by sorting 'b' in time O(b*log(b)) and using index positions in the sorted string.) The LCS problem is related to the editdistance [11] or evolutionarydistance problem [9] [10] where the minimum cost of editing string 'a' to string 'b' must be found. The elementary edit operations are to insert or delete one character or to change one character. There is a cost function, d(.,.), which can be extended to strings in the obvious way. A common choice for the elementary editcosts is
The LCS problem and the editdistance problem find applications in computerscience and molecularbiology [8] [9] [10] [11] [12]. An LCS algorithm can compare two files and, by finding what they have in common, compute their differences. It can be used to correct spelling errors and to compare two genetic sequences for homology (similarity). Under plausible assumptions [1] [13], the worst case complexity of an LCS or editdistance algorithm on strings over an infinite alphabet must be O(a*b) time. Masek and Paterson [5] give an O(a*b/log(min(a,b))) editdistance algorithm for strings over a finite alphabet with modest restrictions on d(.,.). However this is faster than simple algorithms only for strings longer than about 200,000 characters. For special cases there are faster algorithms: for similar strings [3] [6], and for strings with few matching characters [4], but their worstcase running time is at least O(a*b). The algorithm presented here certainly has complexity O(a*b) for a finite alphabet, but asymptotically the improvement in speed is close to w. The algorithm is insensitive to the number of matches between strings and to their similarity. 1.1. Matrix LConventionally a matrix L_{ij} is defined:
This leads to simple dynamicprogramming algorithms [8] upon which this one is based. A reflected representation of matrix L is typically used; the above representation is chosen to make the reading of certain bitstrings easier. L has many interesting properties:
The rows and columns of L_{ij} contain ascending values for increasing i and j. This prompted Hunt and Szymanski [4] and Hirschberg [3] to "contour" L and develop fast algorithms for special cases. 2. BitString AlgorithmThe values in the rows of L, increase by at most one. This makes it possible to represent the information in L by a bitmatrix:
Row_{i} has either the same number of bits set or one more bit set than the previous row, row_{i1}. New bits are set towards the left of a row. The length of an LCS of 'a' and 'b' is the number of bits in the top row. The 1's in a particular row tend to drift to the right as we look (up) at the next row. This is because as more of 'b' is used, an LCS of a given length can be found using no more of, and possibly less of, 'a'. The 1's mark the contour lines of matrix L. 2.1 AlphabetStringsEach letter in the alphabet defines a bitstring when it is compared with the elements of 'a':
Precomputing these alphabetstrings contributes O(alphabet*ceiling(a/w)+a) to the time complexity. For a fixed alphabet this is O(a); for a nonfixed alphabet this could be O(a*b) at worst. If the alphabet is small, alphabet<<b, the contribution to the total time can be ignored. 2.2 Matrix MTo calculate row_{i}, we use the i^{th} character in string 'b', b_{i}, to select the b_{i}string from the set of alphabetstrings. The 1's in row_{i1} "cut" the b_{i}string into segments:
Each segment extends from the position of a 1 in row_{i1} rightward to the position to the left of the next 1. If the lefthand bit of row_{i1} is zero, the leftmost segment extends from the left of the b_{i}string to the position left of the first 1. Row_{i} is formed by setting only the rightmost 1 bits in each segment of the b_{i}string. If a segment is all zero, the bit defining the left end of the segment should be set in row_{i} (that is the segmenting bit in row_{i1}). This can be ensured by first oring row_{i1} into the b_{i}string.
It may be convenient to imagine an invisible 1 at position a+1 for a leftmost segment which is zero, but this is unnecessary for the algorithm. A 1 in row_{i1} marks the shortest piece of 'a' that can be used to make an LCS of a certain length with b_{1..i1}. Bringing b_{i} into use, the best that can be done to extend that LCS is to use the first b_{i} in 'a' to the left of the 1, if possible. This is marked by the rightmost 1 bit in the next segment to the left. In more detail, let x = row_{i1} Or b_{i}string.
Each segment of x can be read as a binary number. There is a "folktechnique" that decrementing a binary number changes the loworder bits [7] up to and including the leastsignificant 1. The correct decrement for each segment can be found by a logical leftshift of row_{i1} with a 1 carryin:
A nonequivalence of the result and x sets the changed bits:
Anding this with x gives the rightmost bits in each segment:
This is row_{i}, row_{11} in this example. In summary: row_{i} = x And ((x  (row_{i1} << 1)) != x) where x = row_{i1} Or b_{i}string Or And != bitstring operations << logical leftshift bitstring; righthand bit set  abit integer subtraction The bitstring operations can be done in units of a wordlength. The shift and subtraction can also be done one word at a time, taking care that the carryoutbit and the borrowbit are propagated. The number of bits set in a word can be counted in O(log_{2}(w)) time; this is attributed to D.Muller in 1954 in [7]. The number of 1's in the final row therefore can be calculated in O(ceiling(a/w)*log_{2}(w)). 2.3 An LCSAn LCS, as opposed to just its length, can be obtained in at least two ways. Hirschberg's recursive technique [2] can be used to find an LCS in linearspace at the cost of slowing the algorithm by a factor of two. Alternatively all rows of M can be kept, the space required is a*b bits, ceiling(a/w)*b words. This is quite practical for strings of the order of 1000 characters. Then an LCS can be recovered by finding the "corners" of the contours in L which stand out as patterns in M (see Fig 2). This takes O(a+b) time. The emboldened characters in Gig. 2 indicate an LCS. 3. Simulation Results
Both the simple O(a*b) algorithm [2] [8] and the bitstring algorithm to calculate the length of an LCS have been implemented in C and run on a Vax11/750 with a 32bit word. For random strings and an alphabet of size 4 the resulting speedups were:
For comparison, an alphabet of size 256 resulted in:
A similar number of operations are in the inner loop of each algorithm. The bitstring algorithm is well suited to optimization since it works across rows of matrix M and alphabetstrings. For small alphabets, the bitstring algorithm clearly provides a considerable speedup. 4. ConclusionsThe time complexity of the bitstring LCS algorithm on a computer with wbit words for a fixed finite alphabet is
This gives a speedup over simple O(a*b) algorithms. The time does not depend on properties of 'a' and 'b'. For random strings and moderate alphabets, the bitstring LCS algorithm will be faster than the special case algorithms. The algorithm might also be programmed on a vectorcomputer if carryout and borrow bits can be propagated from word to word in a vectorshift and a vectorsubtract operation. It could be "pipelined" in a diagonal fashion on a vector or parallel machine because the least significant bit (or word) of row_{i+1} can be calculated as soon as the least significant bit (or word) of row_{i} has been calculated. References
NB. The Ccode is [here]. * as of 2005:
* as of 2001:
L. Allison & T. I. Dix. A BitString LongestCommonSubsequence Algorithm. Inf. Proc. Letters, Vol.23, pp.305310, Dec 1986. Some later literature refers to 'bit vectors' and/or applies them to edit distance (approximate matching) v. longest common subsequence (LCS, LCSS) problems but the technique remains very similar. 

↑ © L. Allison, www.allisons.org/ll/ (or as otherwise indicated). Created with "vi (Linux)", charset=iso88591, fetched Monday, 12Nov2018 22:40:44 EST. Free: Linux, Ubuntu operatingsys, OpenOffice officesuite, The GIMP ~photoshop, Firefox webbrowser, FlashBlock flash on/off. 