C# Paragraph HTML Regex

HTML pages contain paragraph elements, and we can match them with Regex. This is useful for extracting summaries from many pages or articles. The method shown here extracts the first paragraph element in an HTML document, using the regular expression library included in the .NET Framework.

Example

Here we look at a method that scans an entire HTML file and extracts the text between the first paragraph opening tag and its closing tag (<p> and </p>). You can put this method, GetFirstParagraph, in a static utility class and reuse it in different projects.

This C# example program shows how to use Regex to get HTML paragraphs.

Program that matches paragraph from HTML [C#]

using System;
using System.IO;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Read in an HTML file.
        string html = File.ReadAllText("Problem.html");

        // Get the first paragraph.
        Console.Write(GetFirstParagraph(html));

        // Wait for input so the console window stays open.
        Console.ReadLine();
    }

    /// <summary>
    /// Get first paragraph between P tags.
    /// </summary>
    static string GetFirstParagraph(string file)
    {
        // Group 1 holds the text between <p> and </p>,
        // without the surrounding whitespace.
        Match m = Regex.Match(file, @"<p>\s*(.+?)\s*</p>");
        if (m.Success)
        {
            return m.Groups[1].Value;
        }
        else
        {
            return "";
        }
    }
}

Output

This is the first paragraph...

Description. The program defines the Main entry point and then calls into the GetFirstParagraph static method. Internally, the GetFirstParagraph method uses the static Regex.Match method declared in the System.Text.RegularExpressions namespace. The pattern used is described next.
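The earlier note suggests placing GetFirstParagraph in a static utility class. A minimal sketch of that arrangement follows. The class name HtmlUtil and the in-memory sample pages are assumptions made for illustration; in practice each string would come from File.ReadAllText as in the program above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical utility class; the name HtmlUtil is an assumption.
public static class HtmlUtil
{
    // Constructed once and shared by every call.
    static readonly Regex FirstParagraph = new Regex(@"<p>\s*(.+?)\s*</p>");

    public static string GetFirstParagraph(string html)
    {
        Match m = FirstParagraph.Match(html);
        return m.Success ? m.Groups[1].Value : "";
    }
}

class Demo
{
    static void Main()
    {
        // In-memory stand-ins for pages read from disk.
        string[] pages =
        {
            "<h1>A</h1><p>Summary of page A.</p><p>More.</p>",
            "<h1>B</h1><p> Summary of page B. </p>",
            "<h1>C</h1>"
        };

        foreach (string page in pages)
        {
            // A page without a paragraph yields an empty string.
            Console.WriteLine("[" + HtmlUtil.GetFirstParagraph(page) + "]");
        }
    }
}
```

Keeping the Regex in a static field avoids rebuilding the pattern on each call, which may matter when many pages are processed in a loop.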

Notes

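Understanding regular expressions can be difficult, but this one is fairly simple. It first looks for the literal opening tag <p>. It then skips zero or more whitespace characters after that tag with \s*. Next, the lazy group (.+?) captures the smallest run of characters, at least one, that lets the rest of the pattern match: optional whitespace followed by the closing tag </p>. Both tags must be found for the match to succeed. Two details follow from this. By default the dot does not match a newline, so the paragraph must sit on a single line unless RegexOptions.Singleline is passed. And because the pattern spells out <p> exactly, an opening tag with attributes, such as <p class="intro">, is not matched.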
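The behavior of each part of the pattern can be checked with a few short inputs. This is a small sketch that reuses the pattern from GetFirstParagraph. The class name PatternDemo and the sample strings are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class PatternDemo
{
    // Same pattern as GetFirstParagraph in the program above.
    const string Pattern = @"<p>\s*(.+?)\s*</p>";

    public static string FirstParagraph(string html)
    {
        Match m = Regex.Match(html, Pattern);
        return m.Success ? m.Groups[1].Value : "";
    }

    static void Main()
    {
        // Whitespace next to the tags stays outside the capture group.
        Console.WriteLine("[" + FirstParagraph("<p>  One  </p>") + "]");       // [One]

        // The lazy quantifier stops at the first closing tag.
        Console.WriteLine("[" + FirstParagraph("<p>One</p><p>Two</p>") + "]"); // [One]

        // Without a closing tag there is no match.
        Console.WriteLine("[" + FirstParagraph("<p>One") + "]");               // []

        // By default the dot does not match a newline, so this fails too.
        Console.WriteLine("[" + FirstParagraph("<p>One\nTwo</p>") + "]");      // []
    }
}
```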

Summary

We looked at how you can match the paragraph element in your HTML files using the C# language. This is useful code that I run several times a day, and it functions correctly. Note that it is not extremely flexible; it is hard to parse HTML correctly all the time without an HTML parser.
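To make the flexibility limits concrete, here is one possible looser variant. It is not the pattern from the article: the class name, the method name GetFirstParagraphTolerant, and the adjusted pattern are assumptions, sketched to accept attributes on the opening tag, uppercase tags, and paragraphs that span several lines.

```csharp
using System;
using System.Text.RegularExpressions;

static class TolerantParagraph
{
    // Hypothetical variant, not the original pattern:
    //   <p\b[^>]*>  opening tag with optional attributes (does not match <pre>)
    //   Singleline  lets the dot match newlines
    //   IgnoreCase  accepts <P> as well as <p>
    static readonly Regex FirstP = new Regex(
        @"<p\b[^>]*>\s*(.+?)\s*</p>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string GetFirstParagraphTolerant(string html)
    {
        Match m = FirstP.Match(html);
        return m.Success ? m.Groups[1].Value : "";
    }

    static void Main()
    {
        string html = "<P class=\"intro\">\nFirst line\nsecond line\n</P><p>Next</p>";

        // Prints the two inner lines of the first paragraph.
        Console.WriteLine(GetFirstParagraphTolerant(html));
    }
}
```

Even this version can be fooled by nested markup, comments, or an unclosed paragraph, so an HTML parser remains the safer choice when the input is not under your control.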
