There are always those situations, when all of us are faced with the problem of finding a substring inside a string, or string in a text, or a set of pattern strings on a text (whatever flavour you like). Being this a common "problem" chances are your language of choice already has a very good implementation that you can use.
But, it might be the case, that you really have to implement it yourself or you just want to understand how such a thing might work. There are several alternatives such as Knuth-Morris-Pratt algorithm, Boyer-Moore algorithm, Rabin-Karp algorithm, just to name a few. The first two are probably more "famous" and so we will focus on the third one. We'll find out how it works because, although it has a slower worst case scenario, it's the algorithm of choice when we need multi-pattern search.
Let's think about a simple, naive solution. What we could do is go through our string comparing our substring, incrementing the initial comparisson position as we go, until we either found a match our we reached the end.
def naive_search(substring, string):for i in range(len(string)-len(substring)+1):if string[i:i+len(substring)] == substring:return ireturn -1
Quite simple and it would work fine for small strings, but with larger strings it wouldn't work so well. And now imagine if we had to search for multiple patterns and/or multiple times.
At the core of Rabin-Karp algorithm, is a hash function. For our purposes, what this means is that, we will have a function that takes a string and transforms that into a number. Cool, uh? We'll dive into that later. Let's take a look at a possible implementation for our Rabin-Karp algorithm.
def rabin_karp(substring, string):if substring == None or string == None:return -1if substring == "" or string == "":return -1if len(substring) > len(string):return -1hs = RollingHash(string, len(substring))hsub = RollingHash(substring, len(substring))hsub.update()for i in range(len(string)-len(substring)+1):if hs.digest() == hsub.digest():if hs.text() == substring:return ihs.update()return -1
So what's happening here? First, we do a few checks before we get our hands dirty. Then we create two hash objects for both our substring and string. The object is called RollinHash (there's a reason for that name; more on that later) and it has a few methods to help us:
- digest returns the hash value;
- text returns, as expected, the actual text.
By now you might be thinking: why the hell do we need the actual text, if we're using the "all mighty" hash? The thing is, our hash function suffers from collisions. What is that, you might ask? We'll let's analyse the code below and understand what's going on. We'll analyse how our hashing really works, what do we mean about collisions and why it's called RollinHash (where the magic happens).
class RollingHash:def __init__(self, string, size):self.str = stringself.hash = 0for i in xrange(0, size):self.hash += ord(self.str[i])self.init = 0self.end = sizedef update(self):if self.end <= len(self.str) -1:self.hash -= ord(self.str[self.init])self.hash += ord(self.str[self.end])self.init += 1self.end += 1def digest(self):return self.hashdef text(self):return self.str[self.init:self.end]
See that for loop in the init method? That's the real hashing. What we're doing is very simple: we get the ascii code of each letter and add them. You're already seeing where the collisions come from don't you? Well, words with the same letters will have the same hash value! Although that might be a problem, it really isn't for our purpose because, if that happens, we check the actual text (which doesn't happen often).
So, where does the Rolling part come from? Let's take a look ate the update method. If we just hashed as in the init method we would be following the naive mode. But because we're already dealing with a value we can improve that: subtract the ascii value of the first letter and add the ascii value of the next one. This way we don't have to compute the hole think. Confuse? Let's analyse this with a little more detail:
garbageLelogarbage -> hash value: 412g**arb**ageLelogarbage -> current_hash_value - ord('g') + ord('a')
garbageLelogarbage -> hash value: 406
Let's assume that we are searching for the substring Lelo. We will be considering substrings of size 4. Hashing the first 4 letters (garb) would yield the value 412, but we won't have a match (Lelo hash value: 396). In order to move forward we don't need to recompute everything because the next substring of size 4 will have 3 letters of the previous one. So, we subtract the value of 'g' of and add the value of 'a'. We're saving time and effort.
The trick here is the rolling part. This is at the centre of the Rabin-Karp algorithm, and although it's a simple "trick" it reveals itself as being quite powerful.