6 de marzo de 2013

Karp-Rabin Algorithm

Information Theory and Coding Methods
Extra Assignment

The Karp-Rabin algorithm is a string searching algorithm created by Michael Rabin and Richard Karp in 1987 that uses hashing to find any one of a set of pattern strings in a text.

For text of length m and p patterns of combined length n, its average and best case running time is O(m+n), but its worst case time is O(mn).

Arithmetic replaces comparisons


Definition: For a test string $T$, let $T^n_{r}$ denote the n-length substring of $T$ starting at character r. Usually, n is known by context, and $T^n_{r}$ will be replaced by $T_{r}$.

Definition: For the binary pattern $P$, let
$$H(P) = \sum^{i=n}_{i=1} 2^{n-1}P(i)$$ Similarly, let
$$H(T_r) = \sum^{i=n}_{i=1} 2^{n-1}T(r+i-1)$$
That is, consider $P$ to be an n-bit binary number. Similarly, consider $T^n_{r}$ to be an n-bit binary number.

Theorem: There is an occurrence of $P$ staring at position $r$ of $T$ if and only if $H(P) = H(T_r)$.

Code


I implemented the Karp-Rabin algorithm using python. I try to strictly follow the mathematical method, but it some confusing because in the mathematical method we know that the index 1 of the substring is the first element of it, but thinking in how write code for this, we need to change some parts because normally the indexes of a string or list of elements start at 0, so as we can see in the code, some parts doesn't seems to be the same that the mathematical method, but it happens because I use the indexes starting at 0, like python do regularly.

In the first code I am using the definitions mentioned above only for get the result of $H(P)$ and $H(T_r)$. The program asks for the value of $r$ at the beginning and then calculate the result of both.

If the $r$ value is the start position of an occurrence in the complete string, we will get the same result in both definitions.



And the next code shows the complete implementation of the algorithm, where I calculate the hash for $H(P)$, then in the iteration I compare if $H(P)$ is equals to $H(T_r)$, if it is true, I add the index in a list, and at the end the matching positions of the substring in the complete string are printed.



References:
Algorithms on Strings, Trees, and Sequences, Dan Guesfiel, pp. 77-80
Rabin-Karp String Searching

1 comentario:

Nota: solo los miembros de este blog pueden publicar comentarios.