Hirschberg's Algorithm 

EditDistance RevisitedAn earlier section described the dynamic programming algorithm (DPA) which calculates the editdistance of two strings s1 and s2. This algorithm takes O(s1*s2) time. An optimal alignment which displays an actual sequence of operations editing s1 into s2 can be recovered from the distance matrix `m' using O(s1*s2) space. Hirschberg (CACM 18(6) pp.341343 1975) showed that an optimal alignment can be recovered in O(s1*s2) time and only O(s2) space using binaryrecursion. If A contains zero characters or one character, an alignment is easily found. Otherwise the approach is to cut string s1 in the middle to form s1a and s1b and to find the corresponding place to cut string s2 into s2a and s2b. Note, s2a and s2b may have quite different lengths. The alignment problem is then solved recursively on s1a and s2a and on s1b and s2b. This is an example of the divide and conquer paradigm. Recall that one row of `m' can be calculated from just the previous row. This allows the last row of `m' to be calculated in O(s2) space. By adding a boundary row m(s1+1, ) and a boundary column m( ,s2+1), the DPA can also be run in reverse from m(s1+1,s2+1) to m(1,1). Note that D(reverse(s1), reverse(s2)) = D(s1,s2). This allows the DPA to be run both forwards and backwards. The DPA is run forward on s1a and on all of s2 with a step of +1 and in reverse on s1b and all of s2 with a step of 1. An optimal alignment of s1 and s2 could be partitioned between s1a and s1b. This would give s2a and s2b. The editdistances of s1a and all possible s2a's are contained in the result of comparing s1a and s2. Similarly, the editdistances of s1b and all possible s2b's are contained in the result of comparing s1b and s2. The last or bottom row for the forward calculation and last or top row for the reverse calculation are found in O(s2) space. Adding corresponding elements of these two rows together gives the total cost of editing s1 into s2 for each possible way of dividing s2 into s2a and s2b. Choosing a minimum sum gives an optimal s2a and s2b. The subproblems of aligning s1a with s2a and s1b with s2b are solved recursively.
ComplexityThe space used is dominated by the rows of the edit matrix and is O(s2). The time used to calculate a value for D(s1,s2) is O(s1*s2) but the recursion must be considered. The two, toplevel recursive calls each solve a smaller problem. Together these add up to half of the size of the original problem, and take O(s1*s2/2) time in total (think about the area of matrix). At the next recursive level, four problems in total one quarter of the original size are solved, and so on. Therefore the total time taken is O(s1*s2*(1+1/2+1/4+...)) which is O(2*s1*s2) which is O(s1*s2). This use of recursion reduces the space used from O(s1*s2) to O(s2) and roughly doubles the time taken. A decision to use the technique can only be made with regard to the length of typical strings and the memory size and speed of the computer. Strings from the molecularbiology application are typically a few hundred to a few thousand characters long which can make its use worthwhile.
© L. A.
Notes


↑ © L. Allison, www.allisons.org/ll/ (or as otherwise indicated). Created with "vi (Linux)", charset=iso88591, fetched Sunday, 12Jul2020 06:26:38 EDT. Free: Linux, Ubuntu operatingsys, OpenOffice officesuite, The GIMP ~photoshop, Firefox webbrowser, FlashBlock flash on/off. 