Example. We can extract the contents of the TITLE element from HTML with a regular expression. This is useful when checking pages, for example to confirm that each one has the title you expect. After the code, we look at the parts of the Regex in detail and at some limitations.
File.ReadAllText
First This console application reads the HTML file and gets its first TITLE element.
Then The program prints the title to the console. The HTML file named in the code must be present in the current directory.
Pattern This looks for a start tag and an end tag. It skips any whitespace between the tags and the title text, so the captured string has no leading or trailing whitespace.
C# program that gets TITLE element from HTML
using System;
using System.IO;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Read in an HTML file.
        string html = File.ReadAllText("Problem.html");
        // Get the title of the HTML.
        Console.WriteLine(GetTitle(html));
        // Wait for a key press before exiting.
        Console.ReadLine();
    }

    /// <summary>
    /// Get title from an HTML string.
    /// </summary>
    static string GetTitle(string file)
    {
        Match m = Regex.Match(file, @"<title>\s*(.+?)\s*</title>");
        if (m.Success)
        {
            return m.Groups[1].Value;
        }
        else
        {
            return "";
        }
    }
}
Title of the Page
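@          Starts a verbatim string literal, so backslashes need no escaping.
\s*        Matches 0 or more whitespace characters.
(.+?)      Captures 1 or more characters but isn't greedy. Stops as soon as the rest of the pattern can match.
\s*        Matches 0 or more whitespace characters.
Match      Object holding the result of a single regular expression match.
Groups[1]  First capture group in the pattern. Groups[0] is the whole match, so captures start at 1.
Value      String value of the Group.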
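The difference between Groups[0] and Groups[1] is easier to see in a small program. This is a minimal sketch using the same pattern as above. The class name GroupDemo, the helper Parts, and the sample input string are made up for illustration and are not part of the original program.

```csharp
using System;
using System.Text.RegularExpressions;

class GroupDemo
{
    // Hypothetical helper: returns the whole match and the first capture.
    public static string[] Parts(string html)
    {
        Match m = Regex.Match(html, @"<title>\s*(.+?)\s*</title>");
        if (!m.Success)
        {
            return new string[0];
        }
        // Groups[0] is the entire matched text, tags included.
        // Groups[1] is the text captured by the parentheses.
        return new[] { m.Groups[0].Value, m.Groups[1].Value };
    }

    static void Main()
    {
        // Sample input invented for this sketch.
        string html = "<head><title>  Title of the Page </title></head>";
        foreach (string part in Parts(html))
        {
            // Brackets make leading and trailing whitespace visible.
            Console.WriteLine("[" + part + "]");
        }
    }
}
```

The first line printed is the full match with its tags and padding. The second is only the captured text, with the surrounding whitespace consumed by the two \s* parts.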
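Errors. This code is not flexible enough for some HTML documents. For example, the pattern won't match when the TITLE start tag carries attributes, or when the title text spans several lines, because the dot does not match a newline by default. The code should still work for typical XHTML, where element names are lowercase and the TITLE tag is usually written without attributes.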
Also The pattern assumes the tags are lowercase, although this is easily changed by passing RegexOptions.IgnoreCase to Regex.Match.
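One way to loosen these restrictions is sketched below. It is a variant, not the original program: the class name FlexibleTitle and the sample input are hypothetical. The pattern allows attributes in the start tag, RegexOptions.IgnoreCase accepts uppercase tags, and RegexOptions.Singleline lets the dot match newlines so a title may span several lines. A regular expression still cannot handle every HTML document, so treat this as a best-effort approach.

```csharp
using System;
using System.Text.RegularExpressions;

class FlexibleTitle
{
    // Hypothetical variant of GetTitle with a more tolerant pattern.
    public static string GetTitle(string html)
    {
        Match m = Regex.Match(
            html,
            // <title, a word boundary, optional attributes up to the closing >.
            @"<title\b[^>]*>\s*(.+?)\s*</title\s*>",
            // IgnoreCase: accept <TITLE>. Singleline: dot also matches newlines.
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : "";
    }

    static void Main()
    {
        // Sample input invented for this sketch.
        string html = "<TITLE lang=\"en\">\n  My Page\n</TITLE>";
        Console.WriteLine(GetTitle(html)); // prints "My Page"
    }
}
```

For anything beyond simple extraction like this, an HTML parser is usually the safer choice.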
Paragraphs You can use regular expressions like these for reading important elements from your HTML.