Strings

LA home
Computing
Algorithms
 glossary
 Strings
  Suffix array
  BWT
  Factors
  Suffix tree

also see
Bioinformatics

String Searching

The problem is to search for a pattern string, pat[1..m], in a text string txt[1..n]. Usually n>>m, and txt might be very long indeed, although this is not necessarily so. This problem occurs in text-editors and many other computer applications.

Naive Search

The naive string searching algorithm is to examine each position, i>=1, in txt, trying for equality of pat[1..m] with txt[i..i+m-1]. If there is inequality, position i+1 is tried, and so on.

The worst-case time-complexity of the naive algorithm is seen to be O(m*n),
e.g., pat=am-1b and txt=an-1b

Mercifully there are faster algorithms!

Demonstration

The HTML FORM below allows you to search for a pattern string pat in a text string txt. It uses three algorithms: the naive algorithm (above), Rabin's algorithm and the Boyer-Moore algorithm (below). Change the values of pat and txt, run the algorithms and experiment:

©
L
.
A
l
l
i
s
o
n
pat:
txt:
-trace

Rabin's Algorithm

Rabin's string searching algorithm uses a "rolling hash". Hashing is an idea borrowed from hash- [tables] but used here without the "table". In Rabin's algorithm, the hash function is defined to compute a number in [0..p-1] for a large prime number p. The integers modulo p, Zp, behave as a mathematical field. That is, they behave just as the usual unbounded integers do (see [intro']) with respect to addition and multiplication, and in particular there are no "divisors of zero", that is no non-zero values j & k such that j*k=0.

At each stage of the algorithm, hash(txt[i..i+|pat|-1]) is compared with hash(pat), e.g.,

patHash = (((ord(pat[1])*3+ord(pat[2]))*3+...+ord(pat[m])
          ) mod p


txtHash1= (((ord(txt[1])*3+ord(txt[2]))*3+...+ord(txt[m])
          ) mod p


txtHashi= ((txtHashi-1 - ord(txt[i-1])*power)*3
                       + ord(txt[i+m-1])
          ) mod p,                              where i>1


where power = 3m-1
and   ord(ch) is the character code of ch

If the values are equal then a match has almost certainly been found. A character by character check must be made to be absolutely sure.

Time Complexity

Rabin's algorithm is (almost always) fast, i.e. O(m+n) average-case time-complexity, because hash(txt[i..i+m-1]) can be computed in O(1) time - i.e. by two multiplications, a subtraction, an addition and a `mod' - given its predecessor hash(txt[i-1..i-1+m-1]).

The worst-case time-complexity does however remain at O(m*n) because of the possibility of false-positive matches on the basis of the hash numbers, although these are very rare indeed.

Space Complexity

Space complexity is O(1) for a few scalar variables.

Boyer-Moore Algorithm

The Boyer Moore algorithm is always fast having worst-case time-complexity O(m+n), but on natural-language text it actually gets faster as m increases to a certain extent, e.g., Boyer and Moore suggesting O(n/4)-time on average for m=5.

The first key idea, and it is a good one, is that if the mth character ch=txt[m] (numbered from `1' upwards) does not occur in pat at all then any instance of pat in txt must start at position m+1 or later.
E.g., if searching for pat=`freddy', in txt=`I floated lonely as a cloud', the 6th character of txt is `a' which is not in {d,e,f,r,y} and there is no need to consider positions 1 to 6 of txt any more! In this way, we can move along txt in steps of m positions, i.e. k=m, 2m, 3m, ..., provided we are lucky.

The second idea is that if the last occurrence of the character txt[k] in pat is delta1[ch] positions from the right hand end of pat, then pat can be slid that many positions to the right before we might get a match in txt. If ch=txt[k] does not occur in pat, set delta1[ch] equal to m.
E.g., delta1['e']=3, delta1['f']=5, delta1['d']=1, delta1['r']=4, delta1['y']=0.

In general if the last j characters match, i.e. txt[k-j..k]=pat[m-j..m], but txt[k-j-1]~=pat[m-j-1], then pat can be "slid" a certain distance along txt depending on where, if at all, there is an earlier instance of pat[m-j..m] within pat; e.g., consider pat=abracadabra or, e.g., pat=fababab. An array, delta2[1..m], is used to hold these precomputed distances.

Time Complexity

The worst-case time-complexity of the Boyer-Moore algorithm is O(m+n).

The algorithm often runs in O(n/m)-time on natural-language text for small values of m. Note that if txt is in slow, block-access, backing store, it is generally not possible to bring just every mth character into main memory, so the time-complexity is then O(n).

Space Complexity

The space-complexity is O(m + |alphabet|), for the arrays delta1[ ] and delta2[ ].

Preprocessing txt.

It is intuitively obvious that the worst-case for searching for an arbitrary pattern, pat, in a text, txt, must take at least O(n)-time, where n=|txt|. However, if some time is spent pre-processing txt, then individual searches can be made more quickly, e.g., in O(m)-time using a suffix-tree, where m=|pat|. It takes O(n)-time to build a suffix tree for txt.

Notes

  • R. S. Boyer and J. S. Moore. A Fast String Searching Algorithm. Comm. A.C.M. 20(10) p762-772 Oct 1977.
    The paper has a rather nice section on how the algorithm was developed, over quite some time, from the initial idea, with contributions from various people.
  • D. E. Knuth, J. H. Morris and V. R. Pratt. Fast Pattern Matching in Strings. SIAM Jrnl. Comput. 6(2) p323-350 Jun 1997.
  • R. M. Karp and M. O. Rabin. Efficient Randomized Pattern-Matching Algorithms. IBM Jrnl. Res. and Dev. 31(2) p249-260 Mar 1987.
  • G. de v. Smit. A Comparison of Three String Matching Algorithms. Software Practice and Experience 12 p57-66 Jan 1982.
    Compared (i) a straightforward algorithm, (ii) Knuth Morris Pratt algorithm, and (iii) the Boyer Moore algorithm. The last of these was usually best, by far.
  • There are slight programming variations depending on whether strings are indexed from zero or from one.
  • Beware: Some implementations of `m mod n' in some languages and systems are incorrect when m is negative which can cause a nasty surprise when implementing Rabin's algorithm!
  • The [Suffix Tree] data structure for txt can be built in O(|txt|)-time and can then be used to find a substring, pat, in O(|pat|)-time.
  • Also see the [edit-distance] problem, a.k.a. approximate string matching.
www #ad:

↑ © L. Allison, www.allisons.org/ll/   (or as otherwise indicated).
Created with "vi (Linux)",  charset=iso-8859-1,   fetched Tuesday, 19-Mar-2024 06:57:38 UTC.

Free: Linux, Ubuntu operating-sys, OpenOffice office-suite, The GIMP ~photoshop, Firefox web-browser, FlashBlock flash on/off.