C# Word Count
Count words using loops or Regex and analyze the results for correctness.
Word count. Strings in C# programs often contain multiple words. These words can be counted in a method. The total should be close to the counts reported by common tools such as Microsoft Word.
Word separators. Words are broken by punctuation, a space, or by being at the start or end of a string. We must detect these separators in C# code.
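The separator tests described above can be sketched with the static methods on char. This is only an illustration: the helper name Classify and the class SeparatorSketch are hypothetical and do not appear in the example program.

```csharp
using System;

public static class SeparatorSketch
{
    // Hypothetical helper: names the category that word-counting code cares about.
    public static string Classify(char ch)
    {
        if (char.IsWhiteSpace(ch)) return "whitespace";
        if (char.IsPunctuation(ch)) return "punctuation";
        if (char.IsLetterOrDigit(ch)) return "letter or digit";
        return "other";
    }
}

class Program
{
    static void Main()
    {
        // Classify a letter, a space, a comma and a period.
        foreach (char ch in "a ,.")
        {
            Console.WriteLine("'{0}' -> {1}", ch, SeparatorSketch.Classify(ch));
        }
    }
}
```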
Example code. First, here we see 2 word-counting methods. The example program first executes the Regex word count function, and then the loop-based one.
Static This code is ideally contained in static methods. It doesn't maintain state or any data.
CountWords1 This is shorter and simpler to maintain, and its counts are closer to Microsoft Word's. The "\S" means any character that is not whitespace, so tabs and newlines separate words as well as spaces.
CountWords2 This version of the code uses a for-loop, and tries to correctly count word breaking characters.
C# program that counts words
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string t1 = "Cat, bird and dog.";
        int res1 = WordCounting.CountWords1(t1);
        int res2 = WordCounting.CountWords2(t1);
        Console.WriteLine(res1);
        Console.WriteLine(res2);
    }
}

/// <summary>
/// Contains methods for counting words.
/// </summary>
public static class WordCounting
{
    /// <summary>
    /// Count words with Regex.
    /// </summary>
    public static int CountWords1(string s)
    {
        // Each run of non-whitespace characters is one word.
        MatchCollection collection = Regex.Matches(s, @"[\S]+");
        return collection.Count;
    }

    /// <summary>
    /// Count words with loop and character tests.
    /// </summary>
    public static int CountWords2(string s)
    {
        int c = 0;
        for (int i = 1; i < s.Length; i++)
        {
            // A word starts where whitespace is followed by a letter, digit or punctuation.
            if (char.IsWhiteSpace(s[i - 1]) == true)
            {
                if (char.IsLetterOrDigit(s[i]) == true ||
                    char.IsPunctuation(s[i]))
                {
                    c++;
                }
            }
        }
        // Count the first word, which has no whitespace before it.
        // This assumes any string longer than 2 characters starts with a word.
        if (s.Length > 2)
        {
            c++;
        }
        return c;
    }
}

4
4
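The loop-based method adds one to its count for any string longer than 2 characters, so very short inputs and inputs with leading whitespace can skew its result. A minimal sketch of an alternative loop, which is not part of the original program, counts each transition from whitespace to non-whitespace instead. The names WordCountingSketch and CountWordsByTransitions are hypothetical.

```csharp
using System;

public static class WordCountingSketch
{
    // Hypothetical variant: a word begins at every change from
    // whitespace (or the start of the string) to a non-whitespace character.
    public static int CountWordsByTransitions(string s)
    {
        int count = 0;
        bool inWord = false;
        foreach (char ch in s)
        {
            if (char.IsWhiteSpace(ch))
            {
                inWord = false;
            }
            else if (!inWord)
            {
                inWord = true;
                count++;
            }
        }
        return count;
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(WordCountingSketch.CountWordsByTransitions("Cat, bird and dog."));
        Console.WriteLine(WordCountingSketch.CountWordsByTransitions("a"));
        Console.WriteLine(WordCountingSketch.CountWordsByTransitions("  hi"));
    }
}
```

For these three inputs the sketch prints 4, 1 and 1. It has not been compared against Microsoft Word, so the accuracy figures in this article do not apply to it.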
Accuracy. Here I provide some statistics about these 2 algorithms versus Microsoft Word. Across the five documents below, the Regex method's total differs from Microsoft Word's by about 0.2% (9075 versus 9095 words).
Accuracy of word counting methods:
Document A
    Microsoft Word: 4007 words
    Regex method:   3990 words [closest]
    Loop method:    3973 words

Document B
    Microsoft Word: 1414 words
    Regex method:   1414 words [closest]
    Loop method:    1399 words

Document C
    Microsoft Word: 462 words
    Regex method:   463 words [closest]
    Loop method:    459 words

Document D
    Microsoft Word: 470 words
    Regex method:   470 words [closest]
    Loop method:    465 words

Document E
    Microsoft Word: 2742 words
    Regex method:   2738 words [closest]
    Loop method:    2710 words

Input:      To be or not to be, that is the question.
            Mary had a little lamb.
Word count: 10
            5
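One way to compare the two methods document by document is the relative difference from the Microsoft Word count. The sketch below uses only the figures from the accuracy table above. The helper PercentDifference and the class AccuracySketch are hypothetical names introduced for illustration.

```csharp
using System;

public static class AccuracySketch
{
    // Relative difference of a method's count from the reference count, in percent.
    public static double PercentDifference(int reference, int counted)
    {
        return Math.Abs(reference - counted) * 100.0 / reference;
    }
}

class Program
{
    static void Main()
    {
        // Columns: Microsoft Word, Regex method, loop method (copied from the table).
        int[,] rows =
        {
            { 4007, 3990, 3973 },
            { 1414, 1414, 1399 },
            { 462, 463, 459 },
            { 470, 470, 465 },
            { 2742, 2738, 2710 }
        };
        string names = "ABCDE";
        for (int i = 0; i < rows.GetLength(0); i++)
        {
            double regex = AccuracySketch.PercentDifference(rows[i, 0], rows[i, 1]);
            double loop = AccuracySketch.PercentDifference(rows[i, 0], rows[i, 2]);
            Console.WriteLine("Document {0}: Regex {1:F2}%, loop {2:F2}%", names[i], regex, loop);
        }
    }
}
```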
Example 2. Here we specify that a certain character (like "#") is also a word separator. We use character ranges to specify valid word characters.
Note If you omit a character from the ranges, that character is considered a word separator.
Here You can see that with this version of the Regex, the substring "is#the#question" is treated as 3 separate words.
Tip This is because the pound sign is not included in the ranges of valid characters in the pattern.
And With this form of the Regex pattern, you can more easily change which characters are valid and which are not.
C# program that uses modified Regex
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string t1 = "To be or not to be, that is#the#question.";
        Console.WriteLine(CountWordsModified(t1));
    }

    static int CountWordsModified(string s)
    {
        return Regex.Matches(s, @"[A-Za-z0-9]+").Count;
    }
}
10
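As a hedged illustration of changing which characters are valid, the sketch below adds the apostrophe to the character class so that a contraction counts as one word. The class and method names are hypothetical, and whether apostrophes should belong to words is an assumption that depends on your text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ModifiedRegexSketch
{
    // Same ranges as the modified Regex example: the apostrophe is a separator.
    public static int CountWithoutApostrophe(string s)
    {
        return Regex.Matches(s, @"[A-Za-z0-9]+").Count;
    }

    // Assumption of this sketch: the apostrophe is treated as a word character.
    public static int CountWithApostrophe(string s)
    {
        return Regex.Matches(s, @"[A-Za-z0-9']+").Count;
    }
}

class Program
{
    static void Main()
    {
        const string t1 = "Don't stop";
        Console.WriteLine(ModifiedRegexSketch.CountWithoutApostrophe(t1)); // 3: Don, t, stop
        Console.WriteLine(ModifiedRegexSketch.CountWithApostrophe(t1));    // 2: Don't, stop
    }
}
```

Note that with the second pattern a stray apostrophe standing on its own would also be counted as a word.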
Performance. The loop that tests each character should be faster. It does close to the minimum work per character, while the Regex-based method performs far more computation.
Tip You can store the Regex object as an instance member or field of the class.
Then You can simply call its instance Matches method instead of the static Regex.Matches method. This improves speed.
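A minimal sketch of that tip follows. It keeps the Regex in a static readonly field because the containing class is static, and it uses the pattern \S+, which matches the same text as [\S]+. The class name CachedWordCounter and the use of RegexOptions.Compiled are assumptions of this sketch, and no timings were measured for it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CachedWordCounter
{
    // Created once and reused on every call.
    // RegexOptions.Compiled is an optional choice made for this sketch.
    static readonly Regex _words = new Regex(@"\S+", RegexOptions.Compiled);

    public static int CountWords(string s)
    {
        // Instance Matches method instead of the static Regex.Matches.
        return _words.Matches(s).Count;
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(CachedWordCounter.CountWords("Cat, bird and dog."));
    }
}
```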
A summary. We saw 2 word count methods, both of which provide results similar to Microsoft Word. The Regex-based one is closer to Microsoft Word's results.
© 2007-2021 sam allen.