C# Regex



In text, we often discover, and must process, textual patterns. A regular expression describes such a pattern, and can drive a text-based transformation. The Regex class handles regular expressions. We specify patterns as string arguments.

Match


This program introduces the Regex type. We use its constructor and the Match method, and then handle the returned Match object. All of these types are found in the System.Text.RegularExpressions namespace.

Pattern: The Regex uses a pattern that indicates one or more digits. The characters "55" match this pattern.

Success: The returned Match object has a bool property called Success. If it equals true, we found a match.

Based on: .NET 4.5.1

Program that uses Match, Regex: C#

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // ... Match one or more digits.
        Regex regex = new Regex(@"\d+");
        Match match = regex.Match("Dot 55 Perls");
        if (match.Success)
        {
            // ... Write the matched text.
            Console.WriteLine(match.Value);
        }
    }
}

Output

55

Often, the Regex.Match method is useful. We describe its use with some example patterns. Match eliminates the need to search and extract strings, as with IndexOf and Substring. This makes code cleaner.
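Here is a minimal sketch of that comparison. The input string and the two helper names are made up for illustration. The first helper uses Regex.Match with a group. The second does roughly the same work by hand with IndexOf and Substring, and it only checks the first "id=" it finds.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: pulls the digits after "id=" with Regex.Match.
    public static string ExtractIdWithRegex(string input)
    {
        Match match = Regex.Match(input, @"id=(\d+)");
        // Groups[0] is the whole match. Groups[1] is the first parenthesized group.
        return match.Success ? match.Groups[1].Value : null;
    }

    // Hypothetical helper: similar extraction with IndexOf and Substring.
    public static string ExtractIdWithIndexOf(string input)
    {
        const string key = "id=";
        int keyIndex = input.IndexOf(key, StringComparison.Ordinal);
        if (keyIndex < 0)
        {
            return null;
        }
        // Scan forward over the digits that follow the key.
        int start = keyIndex + key.Length;
        int end = start;
        while (end < input.Length && char.IsDigit(input[end]))
        {
            end++;
        }
        return end > start ? input.Substring(start, end - start) : null;
    }

    static void Main()
    {
        string input = "user=sam; id=4521; role=admin";
        Console.WriteLine(ExtractIdWithRegex(input));   // 4521
        Console.WriteLine(ExtractIdWithIndexOf(input)); // 4521
    }
}
```

The Regex version states the shape of the text in one pattern. The manual version spells out every index.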

Match, Match: Groups, Matches, Matches: Quote, IsMatch

Star: Also known as a Kleene closure in language theory. It is important to know the difference between the star and the plus: the star matches zero or more repetitions, while the plus requires at least one.

Star
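A short sketch of the star and the plus side by side. The patterns and inputs are chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // "ab*c": an a, then zero or more b characters, then a c.
        Console.WriteLine(Regex.IsMatch("ac", "ab*c"));    // True
        Console.WriteLine(Regex.IsMatch("abbbc", "ab*c")); // True

        // "ab+c": an a, then one or more b characters, then a c.
        Console.WriteLine(Regex.IsMatch("ac", "ab+c"));    // False
        Console.WriteLine(Regex.IsMatch("abbbc", "ab+c")); // True
    }
}
```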

Words: With Regex we can count words in strings. We compare this method with Microsoft Word's implementation.

Word Count
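One possible sketch of word counting. It assumes a word is any run of non-whitespace characters, which may not match the rule Microsoft Word uses. The CountWords name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: counts runs of non-whitespace characters.
    public static int CountWords(string text)
    {
        return Regex.Matches(text, @"\S+").Count;
    }

    static void Main()
    {
        Console.WriteLine(CountWords("The quick  brown fox.")); // 4
        Console.WriteLine(CountWords("   "));                   // 0
    }
}
```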

Replace


Sometimes we need to replace a certain pattern of text with some other text. The Regex.Replace method, which has static and instance forms, helps with this task. We can replace patterns with a string, or with a value determined by a MatchEvaluator delegate.
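A small sketch of the MatchEvaluator form. The DoubleNumber and DoubleAllNumbers names and the input text are invented for this example. The evaluator runs once per match, and its return value replaces the matched text.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Called once per match. The returned string replaces the matched text.
    static string DoubleNumber(Match match)
    {
        int number = int.Parse(match.Value);
        return (number * 2).ToString();
    }

    // Hypothetical helper: doubles every run of digits in the input.
    public static string DoubleAllNumbers(string input)
    {
        return Regex.Replace(input, @"\d+", new MatchEvaluator(DoubleNumber));
    }

    static void Main()
    {
        Console.WriteLine(DoubleAllNumbers("Costs 10 and 25.")); // Costs 20 and 50.
    }
}
```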


Replace: We use the Replace method, with strings and MatchEvaluators, to replace text. We replace spaces, numbers and position-based parts.

Replace, Replace: End, Replace: Numbers, Replace: Spaces

Spaces: Whitespace isn't actually white. But it is often not needed for future processing of data.

Replace: Trim
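A sketch of two whitespace replacements, under the assumption that we want to trim both ends and collapse inner runs to one space. The helper names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: removes whitespace at the start and end.
    public static string TrimWithRegex(string input)
    {
        return Regex.Replace(input, @"^\s+|\s+$", "");
    }

    // Hypothetical helper: turns each run of whitespace into one space.
    public static string CollapseWhitespace(string input)
    {
        return Regex.Replace(input, @"\s+", " ");
    }

    static void Main()
    {
        string input = "  Dot   Net \t Perls  ";
        string result = CollapseWhitespace(TrimWithRegex(input));
        Console.WriteLine("[" + result + "]"); // [Dot Net Perls]
    }
}
```

For plain trimming, the string Trim method does the same job with less overhead.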

Here: We use the simplest form of Regex.Replace. The pattern "c.t" matches a lowercase c, then any one character, then a lowercase t. We replace each such three-character sequence, here the word "cot", with "cat".

Program that uses Replace: C#

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // ... Input string.
        string value = "I pet the cot.";

        // ... Change patterns matching "c.t" to "cat".
        string result = Regex.Replace(value, "c.t", "cat");
        Console.WriteLine(result);
    }
}

Output

I pet the cat.

Split


Do you need to extract substrings from your text that contain only certain characters, such as certain digits or letters? The Split method returns a string array. It contains the substrings found between the parts of the text that match the delimiter pattern.

Split

Numbers: We can handle certain character types, such as numbers, with the Split method. This is powerful. It handles many variations.

Split: Numbers
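A minimal sketch of extracting numbers with Regex.Split. We split on runs of non-digit characters, so the digits are what remains. The ExtractNumbers name and the sample text are assumptions. When the input starts or ends with a delimiter, Split returns an empty string there, so we skip empty entries.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: returns each run of digits in the input.
    public static List<string> ExtractNumbers(string input)
    {
        // ... Split on one or more non-digit characters.
        string[] parts = Regex.Split(input, @"\D+");
        List<string> numbers = new List<string>();
        foreach (string part in parts)
        {
            // ... Skip the empty strings produced at the edges.
            if (part.Length > 0)
            {
                numbers.Add(part);
            }
        }
        return numbers;
    }

    static void Main()
    {
        foreach (string number in ExtractNumbers("10 cats, 20 dogs"))
        {
            Console.WriteLine(number);
        }
    }
}
```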

The Split method in Regex is much more powerful than the one on the string type. But it may be slower in common cases. With splitting strings, as in many parts of life, there is a tradeoff.

String Split

Escape


The Escape method can change user input into a valid Regex pattern that matches that input literally. It assumes no metacharacters were intended, so the input string should contain only literal characters. I have not often needed the Escape and Unescape methods.

Note: With Escape, we don't get out of jail free, but we do change the representation of certain characters in a string.

Escape

Unescape: The term unescape means to do the reverse of escape. It returns character representations to a non-escaped form.

Unescape
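A short sketch of Escape and Unescape together. The input string is invented. It contains a plus and a question mark, which are metacharacters when left unescaped.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string input = "1+1=2?";
        string text = "Is 1+1=2? Yes.";

        // ... Escape puts a backslash before each metacharacter.
        string escaped = Regex.Escape(input);
        Console.WriteLine(escaped); // 1\+1=2\?

        // ... The escaped pattern matches the literal text.
        Console.WriteLine(Regex.IsMatch(text, escaped)); // True

        // ... The raw input means "one or more 1, then 1=, then an optional 2".
        Console.WriteLine(Regex.IsMatch(text, input)); // False

        // ... Unescape reverses the change.
        Console.WriteLine(Regex.Unescape(escaped) == input); // True
    }
}
```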

Files


You will often need to process text files from the disk. The Regex type and its methods are used for this. But you will need to combine a file input type, like StreamReader, with the Regex code.

StreamReader: Here we use StreamReader to stream a file in line-by-line. Then we process each line with the Regex type.

Regex: Files
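A minimal sketch of the StreamReader approach. To stay self-contained it writes a temporary file first. The file contents and the FindFirstNumbers name are assumptions. The helper accepts any TextReader, so a StreamReader works.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

class Program
{
    static readonly Regex _digits = new Regex(@"\d+");

    // Hypothetical helper: returns the first number on each line that has one.
    public static List<string> FindFirstNumbers(TextReader reader)
    {
        List<string> results = new List<string>();
        string line;
        // ... ReadLine returns null at the end of the input.
        while ((line = reader.ReadLine()) != null)
        {
            Match match = _digits.Match(line);
            if (match.Success)
            {
                results.Add(match.Value);
            }
        }
        return results;
    }

    static void Main()
    {
        // ... Create a small temporary file for the demonstration.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new string[] { "Dot 55 Perls", "no digits here", "Line 7 of 9" });
        try
        {
            using (StreamReader reader = new StreamReader(path))
            {
                foreach (string number in FindFirstNumbers(reader))
                {
                    Console.WriteLine(number);
                }
            }
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Reading line-by-line keeps only one line in memory at a time, which helps with large files.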

HTML: Regex can be used to process or extract parts of HTML strings. We focus on title and P elements.

Title: HTML, Paragraphs: HTML

Remove HTML: We also remove all HTML tags. Please be cautious with this article. It does not work on many HTML pages.

Remove HTML Tags
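A rough sketch of both HTML tasks, assuming small and well-formed markup. The patterns and helper names here are illustrative and are not taken from the linked articles. They will fail on comments, scripts, and attributes that contain angle brackets.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: returns the text inside the first title element.
    public static string GetTitle(string html)
    {
        Match match = Regex.Match(html, @"<title>\s*(.+?)\s*</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    // Hypothetical helper: removes anything that looks like a tag.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<.*?>", string.Empty);
    }

    static void Main()
    {
        string page = "<html><head><title>Dot Net Perls</title></head></html>";
        Console.WriteLine(GetTitle(page)); // Dot Net Perls

        Console.WriteLine(StripTags("<p>Hello <b>world</b></p>")); // Hello world
    }
}
```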

RegexOptions


There are so many options in life. At the grocery store, I can never figure out what to buy. With the Regex type, the RegexOptions enum is used to modify method behavior. Often I find the IgnoreCase value helpful.

IgnoreCase: Lowercase and uppercase letters are distinct in the Regex text language. IgnoreCase changes this.

IgnoreCase

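Multiline: You can change how the Regex type acts upon newlines using the RegexOptions enum. With Multiline, the ^ and $ anchors match at the start and end of each line, not just of the whole string. This is often useful.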

Multiline
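A brief sketch of both options. The inputs are made up, and the lines are separated with "\n" only.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // ... By default, case matters.
        Console.WriteLine(Regex.IsMatch("DOT NET", "dot"));                        // False
        Console.WriteLine(Regex.IsMatch("DOT NET", "dot", RegexOptions.IgnoreCase)); // True

        // ... Three lines of text.
        string lines = "one\ntwo\nthree";

        // ... Without Multiline, ^ and $ apply to the whole string.
        Console.WriteLine(Regex.Matches(lines, @"^\w+$").Count); // 0

        // ... With Multiline, ^ and $ apply to each line.
        Console.WriteLine(Regex.Matches(lines, @"^\w+$", RegexOptions.Multiline).Count); // 3
    }
}
```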

Performance


Is Regex fast? This question is a topic of great worldwide concern. Sadly Regex often results in slower code than imperative loops. But we can optimize Regex usage with some simple tricks.

1. Compile. Using the RegexOptions.Compiled argument when constructing a Regex instance will make it execute faster. This however has a startup penalty.

RegexOptions.Compiled, Regex Performance

2. Replace with loop. Some Regex method calls can be replaced with a loop. The loop is often much faster.

Regex vs. Loop

3. Use static fields. You can cache a Regex instance as a static field. This often improves performance.

Static Regex
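The three tricks can be sketched together. This example assumes the task is checking that a string holds only the digits 0 to 9. The helper names are hypothetical, and no timings are claimed here.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Tricks 1 and 3: a compiled Regex, created once and cached in a static field.
    static readonly Regex _digitsOnly = new Regex(@"\A[0-9]+\z", RegexOptions.Compiled);

    // Hypothetical helper: the Regex version of the check.
    public static bool IsDigitsRegex(string value)
    {
        return _digitsOnly.IsMatch(value);
    }

    // Trick 2: the same check written as a loop over chars.
    public static bool IsDigitsLoop(string value)
    {
        if (value.Length == 0)
        {
            return false;
        }
        foreach (char c in value)
        {
            if (c < '0' || c > '9')
            {
                return false;
            }
        }
        return true;
    }

    static void Main()
    {
        string[] inputs = { "12345", "12a45", "" };
        foreach (string input in inputs)
        {
            Console.WriteLine("{0} {1}", IsDigitsRegex(input), IsDigitsLoop(input));
        }
    }
}
```

Both methods should agree on every input. Measure in your own program before choosing one.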

Patterns


The study of regular expressions is fascinating. A regular expression can describe any "regular" language. These languages are ones that a machine with a finite number of states can recognize: it needs only a limited amount of memory.

Caution: Some languages, like HTML, are not regular languages. This means you cannot fully parse them with traditional regular expressions.

Automaton: A regular expression is based on finite state machines. These automata encode states and possible transitions to new states.

These expressions are commonly used to describe patterns. Regular expressions are built from single characters, using union, concatenation, and the Kleene closure, or any-number-of, operator. (Aho et al., p. 187)
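To make the three operators concrete, here is a sketch with an invented pattern: (a|b)*c. It uses union (a or b), the Kleene closure (the star), and concatenation (followed by c). Next to it is a hand-written two-state machine for the same language. This only illustrates the idea and does not show how the Regex class works internally.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Union (a|b), Kleene closure (*), then concatenation with c.
    static readonly Regex _pattern = new Regex(@"\A(a|b)*c\z");

    public static bool MatchesWithRegex(string input)
    {
        return _pattern.IsMatch(input);
    }

    // A finite state machine for the same language.
    // State 0: reading a or b. State 1: the final c was read (accepting).
    public static bool MatchesWithAutomaton(string input)
    {
        int state = 0;
        foreach (char c in input)
        {
            if (state == 0 && (c == 'a' || c == 'b'))
            {
                state = 0;
            }
            else if (state == 0 && c == 'c')
            {
                state = 1;
            }
            else
            {
                // ... No valid transition: reject.
                return false;
            }
        }
        return state == 1;
    }

    static void Main()
    {
        string[] inputs = { "c", "abbac", "abca", "abd", "" };
        foreach (string input in inputs)
        {
            Console.WriteLine("\"{0}\": {1} {2}", input,
                MatchesWithRegex(input), MatchesWithAutomaton(input));
        }
    }
}
```

The machine needs only one integer of memory, no matter how long the input is.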

Summary


Regular expressions are a concise way to process text data. This comes at a cost. For performance, we can rewrite Regex calls using low-level char methods. Regex is a high-level representation of the same logic.
