C# Boyer-Moore

Arrow indicates movement

Boyer-Moore is a string searching algorithm. Compared to IndexOf, this is a faster way to search for substrings in a string. It avoids checking most positions in the source string. The implementation here uses constant characters.

IndexOf

Example

Note

First, this program demonstrates how to use the Boyer-Moore style algorithm. It uses a lookup table to determine the next possible location of a string to find. The implementation here is hard-coded.

Info:This enhances the method's performance, but detracts from its usefulness in programs.

The program has methods that search for the characters "ABCD" uppercase in that order in the source string and returns true or false. Contains1 is a method that implements parts of the Boyer-Moore string searching algorithm.

Program that uses simple Boyer-Moore search: C#

using System;

class Program
{
    static void Main()
    {
	const string input1 = "There were ABCD Perls";
	const string input2 = "ABCD string";
	const string input3 = "No ABC dee string";
	const string input4 = "Test BAC";

	Console.WriteLine(Contains1(input1));
	Console.WriteLine(Contains1(input2));
	Console.WriteLine(Contains1(input3));
	Console.WriteLine(Contains1(input4));

	Console.WriteLine(Contains2(input1));
	Console.WriteLine(Contains2(input2));
	Console.WriteLine(Contains2(input3));
	Console.WriteLine(Contains2(input4));
    }

    static bool Contains1(string value)
    {
	// Searches for 4-letter constant string using Boyer-Moore style algorithm.
	// ... Uses switch as lookup table.
	int i = 3; // First index to check.
	int length = value.Length;
	while (i < length)
	{
	    switch (value[i])
	    {
		case 'D':
		    // Last character in pattern found.
		    // ... Check for definite match.
		    if (value[i - 1] == 'C' &&
			value[i - 2] == 'B' &&
			value[i - 3] == 'A')
		    {
			return true;
		    }
		    // Must be at least 4 characters away.
		    i += 4;
		    continue;
		case 'C':
		    // Must be at least 1 character away.
		    i += 1;
		    continue;
		case 'B':
		    // Must be at least 2 characters away.
		    i += 2;
		    continue;
		case 'A':
		    // Must be at least 3 characters away.
		    i += 3;
		    continue;
		default:
		    // Must be at least 4 characters away.
		    i += 4;
		    continue;
	    }
	}
	// Nothing found.
	return false;
    }

    static bool Contains2(string value)
    {
	// Searches for 4-letter constant string with IndexOf.
	return value.IndexOf("ABCD", StringComparison.Ordinal) != -1;
    }
}

Output

True
True
False
False
True
True
False
False
Note

In this example, the input string is passed to the method, and the pattern string, which here contains four constant uppercase letters, is searched. The final character in the pattern (D) is searched for at each place.

Switch

And:If D is found, we use an if-statement to check the previous three characters for C, B and A.

IfLetters of the alphabet: ABC

If the character is any of ABCD, we advance the maximum number of places where the string might start (one to four). If the character is not one of ABCD, then we know that ABCD cannot be found unless we advance four places.

Tip:This special logic often reduces the number of character checks by 75% in this program.

Discussion

Warning: exclamation mark

Unfortunately, this constant character checking code is only useful in limited situations. If you have a program that must scan for a certain important word or name in many strings, then you could use this code to do that.

So:For example, you could use this code to check for values in HTTP headers, such as to detect browser user agents.

Framework: NET

However, the inflexibility of this code can be seen as an advantage. In the .NET Framework, loading a constant onto the evaluation stack is much faster than loading a variable reference or field.

Therefore:A more advanced method that is configurable would likely be slower because it would have to access memory more often.

Benchmark

Performance optimization

Let's look at the benchmark results for the Boyer-Moore style string searching algorithm versus the Contains2 method, which uses the IndexOf method. The Boyer-Moore style algorithm is about 86% faster than the IndexOf method.

Note:The Boyer-Moore algorithm is often faster than other direct character testing expressions in the C# language.

Method calls tested: separate loops

Contains1("This is an A example test of ABCD and more");
Contains2("This is an A example test of ABCD and more");

Benchmark results

Contains1:  23.60 ns
Contains2: 166.98 ns

Summary

The C# programming language

We implemented and demonstrated a Boyer-Moore string searching method that uses constant characters. We noted how the principles of the Boyer-Moore algorithm can be used to improve the performance of string searching for substrings.

Therefore:The method here could be adapted for fast searching of other constant substrings, but is limited in its uses.


C#: Method: Algorithm