Understanding KMP (Knuth-Morris-Pratt) Algorithm

I was trying to look for a C# function yesterday that I can use to locate a substring inside a StringBuilder object, but it seems there is no native support for that, and I came across this StackOverflow accepted answer which uses a KMP table, which I’ve never known before. I struggled to understand it from the start. Some of the detailed explanations here, here and here are too technical for me. Luckily, I have come across a clearer explanation at the Computer Science Source blog. But even in that blog, the “table”, which I would call the “FailedCharBackshift” table, has not been discussed in detail, and even with Wikipedia pseudo-code, some parts of the function are hard to understand, specially the part where a sub-pattern is broken inside the pattern (the second if condition inside the while clause), so I had a step through the function itself how the table is generated, and had a few patterns experimentation. I think I know now the idea behind the table, and since I can’t possibly remember what I know now until forever, I have to write it down. Looking at the blog post I mentioned, it should have discussed the table at the start, not after discussing the KMP algorithm. So here is another explanation of the KMP table algorithm for those who are reading about this algorithm for the first time.

A little Background

So the KMP algorithm is an improvement over the Naive Approach in searching a pattern inside a text. The improvement in KMP is that, while matching each character of the pattern and the text, if there is any mismatch along the character sequence, the KMP algorithm uses a table to determine how many character positions to the right it has to shift, and does not automatically shift by 1 character as the Naive Approach does.

Before we look at the formula KMP uses to determine how many positions to move the pattern from its current location in the text, we need a few symbols, such as the following:

t = the text from which the pattern is to be searched

p = the pattern to be searched

j =  the character position within t starting from which we match for p

i =  the character position within p that we are currently matching

o = the number of characters of the onset that is repeated to the immediate left of the failed character. The onset is the sub-pattern found at the beginning of the pattern and repeated to the immediate left of the failed character.

T = the table containing how many characters to the left we need to backtrack inside p starting from i if i fails to match, for all the characters in p.

So the formula used by KMP to determine how many characters to shift from its current position is:

 j = j + iT[i] :   We can add parentheses in this formula to group the terms, such as:

 j = (j + i )T[i]  :  By adding j and i first, we are determining the number of character shift for the pattern using the failed character index on the text t as base.

j = j + (iT[i])    : By adding i and T[i], we are determining the number of character shift for the pattern using the current position of the pattern p as base.

The FailedCharBackshift table

So we must understand this table T first before looking at how KMP algorithm proceeds. You can always look at the mentioned blog’s table for “abracadabra”. I experimented with a few patterns and I will be using the pattern “aaacaaacaaaaabra” here to illustrate better the workings of the table.

image

To read the above table, the blue box represents the onset from the start of the pattern, which is repeated right next to the left of the character that failed, which I highlighted using a red box. The arrows indicate the number and direction of the shift from the index of the failed character. The dot means no shift.

As you can see on the above, the table’s main purpose is to indicate for each character position, how many character position it has to shift left, right, or no shift if the character in that position fails to match. A negative number indicates a shift to the right, 0 means no shift, and positive means shift to the left, starting from the index of the failed character.

At the pattern’s character position of 0, the value is “-1”, which means that if the 1st character fails to match, move 1 position to the right the whole pattern, which is arrived at by using the formula j = j + (iT[i]) , and by substitution j = j + (0T[0])  becoming j = j + (0(-1)) and j = j + 1.

If the pattern fails to match at the 2nd character, the whole pattern is moved 1 to the right as well: j = j + (iT[i]) , and by substitution j = j + (1T[1])  becoming j = j + (1(0)) and j = j + 1. These two positions’ values are fixed for the algorithm to work properly.

For all the other positions on the pattern, the values are determined by checking how many characters to its immediate left matches that many characters at the start of the pattern (the onset, which applies for the indexes 2,3,5,6,7,8,9,10 and 11 ), or a possible match within the reduced onset if the match is broken (for indexes 12 and 13). If the character on the immediate left of the failed character does not exist in the onset, then the shift is zero (for indexes 4, 14, 15).

To determine the onset, we need to keep track of the value of o, which is simply the number of characters of the onset, which is initially set to zero, incremented as each character is matched along inside the pattern. If the sub-pattern is broken, the o is changed to the backshift value of the last onset character that matched. So o’s value is not always zero at each index position and can go down and up, or just down if the character to the left can not be matched against an onset.The handling  of the broken matched pattern is a bit tricky to understand, especially when I was using the pattern “abracadabra”, so I had to play around with different patterns.

The algorithm for the FailedCharBackShift table

So here is the algorithm in C# with comments:

public static int[] KMPTable(string pattern)
       {
           int patternCharNumber = pattern.Length;

           //Create an array with elements that corresponds to each of the chars of the pattern
           int[] FailedCharBackshift = new int[patternCharNumber];
          
           //This is the value that the index must have so that if the
           //matching fails at this index, the pattern slides to the next char.           
           FailedCharBackshift[0] = -1;            
           FailedCharBackshift[1] = 0;
          
           // nucleus is the index inside the pattern under consideration,
           //which now starts at 2 after setting 0 and 1 above.
           int nucleus = 2;

           //onset is the char(s) at the start of the pattern repeated inside the pattern
           //onsetLastChar is the index of the last char of the onset to the
           //immediate left of nucleus, initially set at zero
           int onsetLastChar = 0; 

           //until the last char of the pattern
           while (nucleus < patternCharNumber)
           {
               //compare the char to the immediate left of the nucleus
               //with the last char of the onset.
               if (pattern[nucleus - 1] == pattern[onsetLastChar])
               {
                   //if the same, increment the number of onset chars
                   //assign it as the value of the nucleus
                   //and increment the nucleus
                   FailedCharBackshift[nucleus++] = ++onsetLastChar;
               }
                   //if not the same (the onset match is broken on the left of the nucleus),                    
               else if (onsetLastChar > 0)
               {             
                   //it recursively reduces the onset by resetting it
                   //to the backshift value of the last char of the revised onset.                   
                   //this is the tricky part to understand, so need to step through      
                   //the code to really understand it and experiment with different patterns
                   onsetLastChar = FailedCharBackshift[onsetLastChar];
               }
                   //if not the same and at the start of the loop
                   //or it reached the start of the onset (onsetLastChar = 0)                   
               else
               {
                   //then set it at zero
                   FailedCharBackshift[nucleus++] = 0;
               }
           }

           return FailedCharBackshift;
       }

As for the KMP algorithm itself, the blog Computer Science Source blog already discussed it clearly, I think.

Advertisements