C# Word Count

Count words in a way similar to Microsoft Word. Analyze the results for correctness.
Word count. Strings often contain multiple words. These words can be counted in a C# method. The total count should be similar to the one reported by Microsoft Word 2007. Words are delimited by punctuation, a space, or the start or end of the string.
Example. Here are two word counting methods, both of which yield results fairly similar to Microsoft Word from Microsoft Office 2007. The example program first calls the Regex-based word count function, which uses Regex.Matches, and then the loop-based one.

Static: This code is ideally contained in static methods because it doesn't maintain state or any data. You can think of it as an action, not an object.


CountWords1: This is shorter and simpler to maintain, and is also more accurate. The backslash-S metacharacter (\S) matches any character that is not whitespace.

So: CountWords1 considers each non-whitespace character, including punctuation, to be part of a word, similar to Microsoft Word.

CountWords2: This version of the code uses a for-loop. It counts a word wherever a whitespace character is followed by a letter, digit, or punctuation character.

C# program that counts words

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main()
        {
            const string t1 = "To be or not to be, that is the question.";
            Console.WriteLine(WordCounting.CountWords1(t1));
            Console.WriteLine(WordCounting.CountWords2(t1));

            const string t2 = "Mary had a little lamb.";
            Console.WriteLine(WordCounting.CountWords1(t2));
            Console.WriteLine(WordCounting.CountWords2(t2));
        }
    }

    /// <summary>
    /// Contains methods for counting words.
    /// </summary>
    public static class WordCounting
    {
        /// <summary>
        /// Count words with Regex.
        /// </summary>
        public static int CountWords1(string s)
        {
            // Each match is one run of non-whitespace characters.
            MatchCollection collection = Regex.Matches(s, @"[\S]+");
            return collection.Count;
        }

        /// <summary>
        /// Count words with loop and character tests.
        /// </summary>
        public static int CountWords2(string s)
        {
            int c = 0;
            for (int i = 1; i < s.Length; i++)
            {
                // A word starts where whitespace is followed by a letter,
                // digit, or punctuation character.
                if (char.IsWhiteSpace(s[i - 1]))
                {
                    if (char.IsLetterOrDigit(s[i]) || char.IsPunctuation(s[i]))
                    {
                        c++;
                    }
                }
            }
            // Count the first word. This assumes the string is longer than
            // two characters and does not begin with whitespace.
            if (s.Length > 2)
            {
                c++;
            }
            return c;
        }
    }

Output

    10
    10
    5
    5
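CountWords2 counts the first word with a length check, so very short strings and strings that begin with whitespace can be miscounted. As a minimal sketch, the loop can instead track whether it is currently inside a word. The names WordCountingSketch and CountWords3 are hypothetical and not part of the original program, and this variant counts runs of non-whitespace characters, so for typical text it should behave like the \S pattern rather than like CountWords2.

```csharp
using System;

public static class WordCountingSketch
{
    // Hypothetical variant (not from the original article): counts runs of
    // non-whitespace characters by tracking whether we are inside a word.
    public static int CountWords3(string s)
    {
        if (string.IsNullOrEmpty(s))
        {
            return 0;
        }
        int count = 0;
        bool inWord = false;
        foreach (char ch in s)
        {
            if (char.IsWhiteSpace(ch))
            {
                inWord = false;
            }
            else if (!inWord)
            {
                // First non-whitespace character after whitespace (or at
                // the start of the string) begins a new word.
                inWord = true;
                count++;
            }
        }
        return count;
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(WordCountingSketch.CountWords3("To be or not to be, that is the question.")); // 10
        Console.WriteLine(WordCountingSketch.CountWords3("  a  ")); // 1
        Console.WriteLine(WordCountingSketch.CountWords3(""));      // 0
    }
}
```

Because it no longer depends on the string length, this version returns 1 for a single-letter string and 0 for a string of spaces.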
Accuracy. Microsoft Office dominates the business world, so here are some figures comparing these two algorithms with Microsoft Word 2007 on five documents. In total, the Regex method's counts differ from Microsoft Word's by about 0.2% (22 words out of 9,095), while the loop method's differ by about 1% (89 words).
Accuracy of word counting methods

    Document   Microsoft Word   Regex method   Loop method
    A          4007             3990           3973
    B          1414             1414           1399
    C          462              463            459
    D          470              470            465
    E          2742             2738           2710

    The Regex method is the closest to Microsoft Word for every document.

Example input and output

    Input:      To be or not to be, that is the question.
                Mary had a little lamb.
    Word count: 10
                5
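To make the comparison concrete, the per-document differences can be summed and expressed as a percentage of the Microsoft Word total. This is only a sketch of that arithmetic using the counts listed above. The class and method names are hypothetical, and summing absolute differences is one reasonable choice of metric, not necessarily the one used for the original figures.

```csharp
using System;
using System.Globalization;

public static class AccuracyStats
{
    // Word counts copied from the accuracy figures: documents A through E.
    public static readonly int[] WordCounts  = { 4007, 1414, 462, 470, 2742 };
    public static readonly int[] RegexCounts = { 3990, 1414, 463, 470, 2738 };
    public static readonly int[] LoopCounts  = { 3973, 1399, 459, 465, 2710 };

    // Sum of absolute per-document differences, as a percentage of the
    // total reference count.
    public static double PercentDifference(int[] reference, int[] counts)
    {
        int total = 0;
        int difference = 0;
        for (int i = 0; i < reference.Length; i++)
        {
            total += reference[i];
            difference += Math.Abs(reference[i] - counts[i]);
        }
        return 100.0 * difference / total;
    }

    // Formats with two decimals, independent of the machine's culture.
    public static string Format(double percent)
    {
        return percent.ToString("F2", CultureInfo.InvariantCulture) + "%";
    }
}

class Program
{
    static void Main()
    {
        double regex = AccuracyStats.PercentDifference(AccuracyStats.WordCounts, AccuracyStats.RegexCounts);
        double loop = AccuracyStats.PercentDifference(AccuracyStats.WordCounts, AccuracyStats.LoopCounts);
        Console.WriteLine("Regex method: " + AccuracyStats.Format(regex)); // Regex method: 0.24%
        Console.WriteLine("Loop method:  " + AccuracyStats.Format(loop));  // Loop method:  0.98%
    }
}
```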
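Performance. The second method, which tests each character in a loop, is likely to be several times faster, although no benchmark is shown here. It makes a single pass over the string with a few character tests, while the Regex-based method runs the regular expression engine and creates a Match object for each word. Regular expressions are relatively slow.

However: Their greater ease of use and clarity are often more important. In scripting languages, regular expressions often perform better, because the regex engine typically runs as native code while a hand-written loop is interpreted.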
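One way to check the performance claim on your own machine is a rough Stopwatch comparison. The following is a minimal sketch, not a rigorous benchmark: the WordCountTiming class, the Measure helper, the synthetic input text, and the iteration count are all assumptions made for illustration, and the two counting methods are restated here only so the block is self-contained.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class WordCountTiming
{
    // Same pattern as the Regex-based method in the article.
    public static int CountWithRegex(string s)
    {
        return Regex.Matches(s, @"[\S]+").Count;
    }

    // Same logic as the loop-based method in the article.
    public static int CountWithLoop(string s)
    {
        int c = 0;
        for (int i = 1; i < s.Length; i++)
        {
            if (char.IsWhiteSpace(s[i - 1]) &&
                (char.IsLetterOrDigit(s[i]) || char.IsPunctuation(s[i])))
            {
                c++;
            }
        }
        if (s.Length > 2)
        {
            c++;
        }
        return c;
    }

    // Hypothetical helper: times repeated calls after one warm-up call.
    public static TimeSpan Measure(Func<string, int> counter, string text, int iterations)
    {
        counter(text); // Warm-up, so JIT compilation is not timed.
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            counter(text);
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}

class Program
{
    static void Main()
    {
        // Synthetic input: the sample sentence repeated 1000 times.
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < 1000; i++)
        {
            builder.Append("To be or not to be, that is the question. ");
        }
        string text = builder.ToString();
        const int iterations = 200;

        TimeSpan regexTime = WordCountTiming.Measure(WordCountTiming.CountWithRegex, text, iterations);
        TimeSpan loopTime = WordCountTiming.Measure(WordCountTiming.CountWithLoop, text, iterations);
        Console.WriteLine("Regex: " + regexTime.TotalMilliseconds + " ms");
        Console.WriteLine("Loop:  " + loopTime.TotalMilliseconds + " ms");
    }
}
```

Timings vary with the runtime, hardware, and input, so treat the numbers as a rough indication only. A dedicated benchmarking library would give more reliable results.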

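Tip: You can store the Regex object in a static readonly field of the class. WordCounting is a static class, so it cannot have instance members.

Then: You can call the instance Matches method on that object instead of the static Regex.Matches method. This avoids looking up or re-parsing the pattern on each call, which can improve speed.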
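A minimal sketch of this tip follows. The class name CachedWordCounting and the field name are hypothetical, and RegexOptions.Compiled is an optional extra that trades a slower startup for faster matching. Whether the change is measurable depends on the workload, since the static Regex methods already cache recently used patterns.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CachedWordCounting
{
    // Constructed once and reused by every call. The pattern matches
    // runs of non-whitespace characters.
    static readonly Regex _nonWhitespaceRun = new Regex(@"\S+", RegexOptions.Compiled);

    public static int CountWords(string s)
    {
        // Instance Matches method on the stored Regex object.
        return _nonWhitespaceRun.Matches(s).Count;
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(CachedWordCounting.CountWords("To be or not to be, that is the question.")); // 10
        Console.WriteLine(CachedWordCounting.CountWords("Mary had a little lamb."));                   // 5
    }
}
```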

Example 2. What should you do if you need to specify that a certain character, such as the pound sign (#), is also a word separator? In this addition to the article, we use character ranges to specify valid word characters.

Note: If you omit a character from the ranges, that character is considered a word separator.

Here: You can see that with this version of the Regex, the substring "is#the#question" is treated as three separate words.

Tip: This is because the pound sign is not included in the ranges of valid characters in the pattern.

And: With this form of the Regex pattern, you can more easily change which characters are valid and which are not.

C# program that uses modified Regex

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main()
        {
            const string t1 = "To be or not to be, that is#the#question.";
            Console.WriteLine(CountWordsModified(t1));
        }

        static int CountWordsModified(string s)
        {
            return Regex.Matches(s, @"[A-Za-z0-9]+").Count;
        }
    }

Output

    10
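One side effect of listing only letters and digits is that every other character becomes a separator, including the apostrophe, so "Don't" is counted as two words. As a sketch of an alternative, a negated character class can instead list the separators themselves. The class and method names here are hypothetical, and the pattern [^\s#]+ is an assumption about which separators you want, not part of the original program.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SeparatorSketch
{
    // Pattern from the article: only letters and digits are word characters.
    public static int CountLettersAndDigits(string s)
    {
        return Regex.Matches(s, @"[A-Za-z0-9]+").Count;
    }

    // Hypothetical alternative: everything except whitespace and '#'
    // is a word character, so apostrophes stay inside words.
    public static int CountExcludingSeparators(string s)
    {
        return Regex.Matches(s, @"[^\s#]+").Count;
    }
}

class Program
{
    static void Main()
    {
        const string t1 = "To be or not to be, that is#the#question.";
        Console.WriteLine(SeparatorSketch.CountLettersAndDigits(t1));    // 10
        Console.WriteLine(SeparatorSketch.CountExcludingSeparators(t1)); // 10

        const string t2 = "Don't stop";
        Console.WriteLine(SeparatorSketch.CountLettersAndDigits(t2));    // 3
        Console.WriteLine(SeparatorSketch.CountExcludingSeparators(t2)); // 2
    }
}
```

To add another separator with this form, append it inside the brackets after the pound sign.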
Summary. We saw two word count methods, both of which provide results similar to Microsoft Word 2007. The first method, which uses Regex, is considerably closer to Microsoft Word's results. However, a small difference remains: about 0.2% across the five test documents.
Dot Net Perls
© 2007-2020 Sam Allen.