C# Word Count

A string often contains multiple words, and a C# method can count them. The goal here is a total close to the count reported by Microsoft Word 2007. Words are delimited by punctuation, by whitespace, or by the start or end of the string.

Accuracy of word counting methods

Document A
    Microsoft Word: 4007 words
    Regex method:   3990 words [closest]
    Loop method:    3973 words

Document B
    Microsoft Word: 1414 words
    Regex method:   1414 words [closest]
    Loop method:    1399 words

Document C
    Microsoft Word: 462 words
    Regex method:   463 words [closest]
    Loop method:    459 words

Document D
    Microsoft Word: 470 words
    Regex method:   470 words [closest]
    Loop method:    465 words

Document E
    Microsoft Word: 2742 words
    Regex method:   2738 words [closest]
    Loop method:    2710 words

Example input and output

Input:      To be or not to be, that is the question.
	    Mary had a little lamb.

Word count: 10
	    5

Example

Here are two word-counting methods, both of which yield results fairly close to those of Microsoft Word from Microsoft Office 2007. For each test string, the example program first calls the Regex-based method and then the loop-based one.

Program that counts words: C#

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
	const string t1 = "To be or not to be, that is the question.";
	Console.WriteLine(WordCounting.CountWords1(t1));
	Console.WriteLine(WordCounting.CountWords2(t1));

	const string t2 = "Mary had a little lamb.";
	Console.WriteLine(WordCounting.CountWords1(t2));
	Console.WriteLine(WordCounting.CountWords2(t2));
    }
}

/// <summary>
/// Contains methods for counting words.
/// </summary>
public static class WordCounting
{
    /// <summary>
    /// Count words with Regex.
    /// </summary>
    public static int CountWords1(string s)
    {
	MatchCollection collection = Regex.Matches(s, @"[\S]+");
	return collection.Count;
    }

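    /// <summary>
    /// Count words with a loop and character tests.
    /// </summary>
    public static int CountWords2(string s)
    {
	int c = 0;
	for (int i = 1; i < s.Length; i++)
	{
	    // A word starts where whitespace is followed by a letter,
	    // digit, or punctuation character.
	    if (char.IsWhiteSpace(s[i - 1]))
	    {
		if (char.IsLetterOrDigit(s[i]) ||
		    char.IsPunctuation(s[i]))
		{
		    c++;
		}
	    }
	}
	// The loop cannot see the first word, so count it here. This
	// assumes the text does not begin with whitespace; strings of
	// one or two characters are counted as zero words.
	if (s.Length > 2)
	{
	    c++;
	}
	return c;
    }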
}

Output

10
10
5
5

Both methods are static. Static methods suit this code because it maintains no state or data; you can think of it as an action, not an object. Each method receives a string.

Note: Both methods above receive a string and return an integer, the number of words counted.

CountWords1 is better in every way except perhaps performance. It is shorter and simpler to maintain, and it came closer to Microsoft Word's count in every test document. The pattern \S+ matches one or more consecutive characters that are not whitespace (spaces, tabs, newlines).

So: The first method treats every non-whitespace character, including punctuation, as part of a word, similar to Microsoft Word.
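
To see what this means in practice, the short sketch below lists each match that the \S+ pattern produces. The class name MatchListing and the sample sentence are illustrative choices, not part of the original program.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchListing
{
    // Returns every run of non-whitespace characters found in the input.
    public static string[] GetTokens(string s)
    {
        MatchCollection matches = Regex.Matches(s, @"\S+");
        string[] tokens = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            tokens[i] = matches[i].Value;
        }
        return tokens;
    }
}

class Program
{
    static void Main()
    {
        // Punctuation stays attached to the neighboring word, and a
        // standalone hyphen is matched as a token of its own.
        string text = "To be, or not - that is#the#question.";
        foreach (string token in MatchListing.GetTokens(text))
        {
            Console.WriteLine("[" + token + "]");
        }
        // Prints 7 tokens: [To] [be,] [or] [not] [-] [that] [is#the#question.]
    }
}
```

Note that "is#the#question." is a single match here, which is why a different pattern is needed when a character such as # should separate words.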

Accuracy. Microsoft Office dominates the business world, so the table near the top of this article compares the results of these two algorithms with Microsoft Word 2007. Summed over the five documents, the Regex method counts 9075 words against Word's 9095, a difference of about 0.2%. The loop method counts 9006, a difference of about 1%.
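
As a check on the aggregate figures, the following sketch sums the per-document counts from the accuracy table and reports how far each method's total is from Microsoft Word's total. The class and method names (AccuracyStats, PercentDifference) are made up for this illustration; only the numbers come from the table.

```csharp
using System;
using System.Globalization;

public static class AccuracyStats
{
    // Percentage by which the summed counts differ from the reference total.
    public static double PercentDifference(int[] reference, int[] measured)
    {
        int referenceTotal = 0;
        int measuredTotal = 0;
        for (int i = 0; i < reference.Length; i++)
        {
            referenceTotal += reference[i];
            measuredTotal += measured[i];
        }
        return 100.0 * Math.Abs(referenceTotal - measuredTotal) / referenceTotal;
    }
}

class Program
{
    static void Main()
    {
        // Counts for Documents A through E, copied from the accuracy table.
        int[] word = { 4007, 1414, 462, 470, 2742 };
        int[] regex = { 3990, 1414, 463, 470, 2738 };
        int[] loop = { 3973, 1399, 459, 465, 2710 };

        double regexDiff = AccuracyStats.PercentDifference(word, regex);
        double loopDiff = AccuracyStats.PercentDifference(word, loop);

        Console.WriteLine(regexDiff.ToString("F2", CultureInfo.InvariantCulture)); // 0.22
        Console.WriteLine(loopDiff.ToString("F2", CultureInfo.InvariantCulture));  // 0.98
    }
}
```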

Performance

The second method, which tests each character in a loop, would likely prove many times faster in a careful benchmark, although none is included here. It does little more than one pass over the characters, while the Regex-based method involves far more computation. Regular expressions are relatively slow.

However: The greater ease of use and clarity of regular expressions is often more important. In scripting languages, regular expressions often perform better than hand-written loops.
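
The article gives no timings, so the sketch below shows one simple way to measure the two approaches yourself with Stopwatch. The names WordCountTiming and Time, the iteration count, and the test string are assumptions for this example, and a Stopwatch loop like this is only a rough measurement, not a rigorous benchmark. No results are shown because they depend on the machine and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class WordCountTiming
{
    // Same logic as CountWords1.
    public static int CountWithRegex(string s)
    {
        return Regex.Matches(s, @"\S+").Count;
    }

    // Same logic as CountWords2.
    public static int CountWithLoop(string s)
    {
        int c = 0;
        for (int i = 1; i < s.Length; i++)
        {
            if (char.IsWhiteSpace(s[i - 1]) &&
                (char.IsLetterOrDigit(s[i]) || char.IsPunctuation(s[i])))
            {
                c++;
            }
        }
        if (s.Length > 2)
        {
            c++;
        }
        return c;
    }

    // Calls the counter repeatedly and returns the elapsed milliseconds.
    public static long Time(Func<string, int> counter, string text, int iterations)
    {
        counter(text); // Warm-up call so JIT compilation is not measured.
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            counter(text);
        }
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }
}

class Program
{
    static void Main()
    {
        const string text = "To be or not to be, that is the question.";
        const int iterations = 100000;

        Console.WriteLine("Regex: " + WordCountTiming.Time(WordCountTiming.CountWithRegex, text, iterations) + " ms");
        Console.WriteLine("Loop:  " + WordCountTiming.Time(WordCountTiming.CountWithLoop, text, iterations) + " ms");
    }
}
```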

Tip: You can store the Regex object the method uses in a field of the class. For a static class such as WordCounting, that means a static field.

Then: You can call that object's instance Matches method instead of the static Regex.Matches method. This can improve speed, since the Regex is constructed only once.
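
A minimal sketch of this tip follows. The class name CachedWordCounting and the field name are illustrative, and RegexOptions.Compiled is an optional extra that the article does not mention; it trades a higher startup cost for faster matching, so it is worth measuring before relying on it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CachedWordCounting
{
    // Constructed once when the class is first used, then reused on every call.
    private static readonly Regex _wordPattern =
        new Regex(@"\S+", RegexOptions.Compiled);

    public static int CountWords(string s)
    {
        // Instance Matches call on the stored Regex object.
        return _wordPattern.Matches(s).Count;
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(CachedWordCounting.CountWords("To be or not to be, that is the question.")); // 10
        Console.WriteLine(CachedWordCounting.CountWords("Mary had a little lamb.")); // 5
    }
}
```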

Example 2

What should you do if you need to specify that a certain character, such as the pound sign (#), is also a word separator? In this addition to the article, we look at how you can use character ranges to specify valid word characters.

Note: If you omit a character from the ranges, that character is considered a word separator.

Program with modified Regex: C#

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
	const string t1 = "To be or not to be, that is#the#question.";
	Console.WriteLine(CountWordsModified(t1));
    }

    static int CountWordsModified(string s)
    {
	return Regex.Matches(s, @"[A-Za-z0-9]+").Count;
    }
}

Output

10

You can see that with this version of the Regex, the substring "is#the#question" is treated as three separate words. This is because the pound sign is not included in the ranges of valid characters in the pattern.

Tip: With this form of the Regex pattern, you can more easily change which characters are valid and which are not.
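
As an illustration of changing the set of valid characters, the sketch below adds the apostrophe and the hyphen to the character class so that contractions and hyphenated words each count once. The extended pattern, the class name, and the sample sentence are assumptions for this example, not part of the original program.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConfigurableWordCounting
{
    // Pattern from the program above: letters and digits only.
    public const string BasicPattern = @"[A-Za-z0-9]+";

    // Hypothetical extension: also accept apostrophes and hyphens.
    // The hyphen is placed last in the class so it is read literally.
    public const string ExtendedPattern = @"[A-Za-z0-9'-]+";

    public static int CountWords(string s, string pattern)
    {
        return Regex.Matches(s, pattern).Count;
    }
}

class Program
{
    static void Main()
    {
        const string text = "Don't stop the well-known show";

        // "Don't" and "well-known" are each split in two.
        Console.WriteLine(ConfigurableWordCounting.CountWords(text, ConfigurableWordCounting.BasicPattern)); // 7

        // Both stay whole once ' and - are valid word characters.
        Console.WriteLine(ConfigurableWordCounting.CountWords(text, ConfigurableWordCounting.ExtendedPattern)); // 5
    }
}
```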

Summary

We saw two word count methods, both of which provide results similar to Microsoft Word 2007. The first method, the Regex-based one, is considerably closer to Microsoft Word's results. However, a small percentage difference remains.

Also: The algorithms here could be improved to offer even better compatibility with Microsoft Office.
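
One possible direction, offered only as a sketch: a loop that counts each transition from whitespace to non-whitespace. For ordinary text this should give the same totals as the \S+ pattern while keeping the single-pass structure of the loop method, and it also handles short strings and leading whitespace. The class name TransitionWordCounting is hypothetical, and this variant has not been compared against Microsoft Word.

```csharp
using System;

public static class TransitionWordCounting
{
    // Hypothetical variant: counts whitespace-to-non-whitespace transitions.
    public static int CountWords(string s)
    {
        int count = 0;
        bool inWord = false;
        foreach (char ch in s)
        {
            if (char.IsWhiteSpace(ch))
            {
                inWord = false;
            }
            else if (!inWord)
            {
                // First non-whitespace character after whitespace
                // (or at the start of the string) begins a new word.
                inWord = true;
                count++;
            }
        }
        return count;
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(TransitionWordCounting.CountWords("To be or not to be, that is the question.")); // 10
        Console.WriteLine(TransitionWordCounting.CountWords("  Hi"));       // 1
        Console.WriteLine(TransitionWordCounting.CountWords("$5 and $10")); // 3
        Console.WriteLine(TransitionWordCounting.CountWords(""));           // 0
    }
}
```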
