C# Remove HTML Tags

A string contains HTML tags, and we want to remove them. This is useful for displaying HTML as plain text and for stripping formatting such as bold and italics. Only the tags are removed; the text between them is kept.

Caution: A Regex cannot handle all HTML documents. An iterative solution with a for-loop may be better in many cases. Always test the methods on your own input.

Example

First, here is a static class with three methods for removing HTML tags. Each method receives a string argument and returns a new string that has no HTML tags; the text between the tags is preserved.

Note: The methods have different performance characteristics. As a reminder, HTML tags start with < and end with >.

HtmlRemoval static class: C#

using System;
using System.Text.RegularExpressions;

/// <summary>
/// Methods to remove HTML from strings.
/// </summary>
public static class HtmlRemoval
{
    /// <summary>
    /// Remove HTML from string with Regex.
    /// </summary>
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    /// <summary>
    /// Compiled regular expression for performance.
    /// </summary>
    static readonly Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    /// <summary>
    /// Remove HTML from string with compiled Regex.
    /// </summary>
    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

    /// <summary>
    /// Remove HTML tags from string using char array.
    /// </summary>
    public static string StripTagsCharArray(string source)
    {
        // The output can never be longer than the input.
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;

        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                // Start of a tag: stop copying.
                inside = true;
                continue;
            }
            if (let == '>')
            {
                // End of a tag: resume copying.
                inside = false;
                continue;
            }
            if (!inside)
            {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}
The example is a public static class that keeps no per-call state. You call into it with code such as HtmlRemoval.StripTagsRegex(html). You can put this class in a separate file named HtmlRemoval.cs and reuse it across programs.

StripTagsRegex. This method uses a static call to Regex.Replace, so the expression is not compiled. It could be optimized by pulling the Regex out of the method, as the second method does.

Regex: The pattern <.*?> matches a <, then as few characters as possible (the lazy quantifier *?), then the next >. Every such match is replaced with an empty string.
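
To see why the lazy quantifier matters, here is a minimal sketch that contrasts it with a greedy pattern. The class and method names (LazyVersusGreedy, StripLazy, StripGreedy) are made up for this illustration and are not part of the HtmlRemoval class.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class LazyVersusGreedy
{
    // Lazy quantifier (*?): each tag is matched on its own, up to its own '>'.
    public static string StripLazy(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Greedy quantifier (*), shown only for contrast: a single match runs from
    // the first '<' to the last '>' on the line, so the text in between is lost.
    public static string StripGreedy(string source)
    {
        return Regex.Replace(source, "<.*>", string.Empty);
    }
}
```

For the input "<b>dog</b> food", StripLazy returns "dog food", while StripGreedy returns " food" because its one match swallows "<b>dog</b>" whole.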

StripTagsRegexCompiled. This method does the same thing as the previous method. Its regular expression is pulled out of the method call. The regular expression (Regex) object is stored in the static class.

Tip: I recommend this method for most programs, as it is very simple to inspect and considerably faster than the first method.

StripTagsCharArray. This method is a heavily optimized version of an approach that could instead use StringBuilder. In most benchmarks this method is faster, and it is appropriate when you need to strip many HTML files.

And: A detailed description of the method's body is available below. It was designed for performance.

Tests

We run these methods through a simple test. The three methods work identically on valid HTML. They differ on malformed input: the char array method strips everything that follows a < until it sees a >, even if no > ever appears, but the Regex methods require a closing > before they strip anything.

Program that tests HTML removal: C#

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string html = "<p>There was a <b>.NET</b> programmer " +
            "and he stripped the <i>HTML</i> tags.</p>";

        // All three methods should print the same plain text.
        Console.WriteLine(HtmlRemoval.StripTagsRegex(html));
        Console.WriteLine(HtmlRemoval.StripTagsRegexCompiled(html));
        Console.WriteLine(HtmlRemoval.StripTagsCharArray(html));
    }
}

Output

There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.
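
The difference on malformed input can be checked with a small sketch like the one below. It repeats the Regex pattern and the flag loop in a hypothetical class named UnclosedTagDemo so that it compiles on its own; the sample input is also an assumption chosen for this illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class; it mirrors the logic of HtmlRemoval.
public static class UnclosedTagDemo
{
    // Same pattern as StripTagsRegex: a match needs both '<' and '>'.
    public static string StripWithRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Same flag logic as StripTagsCharArray: a lone '<' starts skipping.
    public static string StripWithLoop(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        foreach (char let in source)
        {
            if (let == '<') { inside = true; continue; }
            if (let == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = let; }
        }
        return new string(array, 0, arrayIndex);
    }
}
```

With the input "1 < 2 is true", which has a < but no >, StripWithRegex returns the string unchanged, while StripWithLoop returns "1 " because everything after the < is treated as part of a tag.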

Benchmarks

First, regular expressions are usually not the fastest way to process text. I wrote an algorithm that uses a char array and the string constructor to strip HTML tags. It meets the requirement and often performs better.

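The benchmark for these methods stripped 10000 HTML files of around 8000 characters in tight loops. The text was read in with File.ReadAllText. The result was that the char array method was considerably faster: roughly 8 times faster than the uncompiled Regex and roughly 5 times faster than the compiled one, going by the timings below.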

And: This could be worthwhile to use if you have to strip many files in a script, such as one that preprocesses a large website.

Removing HTML tags from strings

Input:    <p>The <b>dog</b> is <i>cute</i>.</p>
Output:   The dog is cute.

Performance test for HTML removal

HtmlRemoval.StripTagsRegex:         2404 ms
HtmlRemoval.StripTagsRegexCompiled: 1366 ms
HtmlRemoval.StripTagsCharArray:      287 ms [fastest]

File length test for HTML removal

File length before:                 8085 chars
HtmlRemoval.StripTagsRegex:         4382 chars
HtmlRemoval.StripTagsRegexCompiled: 4382 chars
HtmlRemoval.StripTagsCharArray:     4382 chars
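
The original benchmark harness is not shown, so the following is only a sketch of how such timings could be collected with Stopwatch. The class name StripBenchmark, the method name Time, and the warm-up call are assumptions made for this illustration.

```csharp
using System;
using System.Diagnostics;

// Hypothetical timing helper, not the harness used for the numbers above.
public static class StripBenchmark
{
    // Calls strip(html) the given number of times and returns the
    // elapsed time in milliseconds.
    public static long Time(Func<string, string> strip, string html, int iterations)
    {
        // One untimed warm-up call, so JIT compilation and Regex
        // construction are not counted in the measurement.
        strip(html);

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            strip(html);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }
}
```

A call might look like StripBenchmark.Time(HtmlRemoval.StripTagsCharArray, File.ReadAllText("page.html"), 10000), where page.html is a placeholder file name. Absolute numbers will differ by machine and runtime, so compare the methods against each other on the same run.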

Char arrays

One method here uses char arrays. It is much faster than the other two methods. It iterates through all characters and flips a Boolean flag depending on whether the current position is inside a tag.

It adds a character to the array buffer only if that character is outside a tag. For performance, it uses a char array and the string constructor that accepts a char array, a start index, and a length. This is faster than using StringBuilder.
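
For comparison, the StringBuilder approach that the char array version replaces might look like the sketch below. It is not code from the benchmark; the class name HtmlRemovalStringBuilder and the method name StripTagsStringBuilder are invented for this illustration.

```csharp
using System.Text;

// Hypothetical StringBuilder variant, for comparison only.
public static class HtmlRemovalStringBuilder
{
    public static string StripTagsStringBuilder(string source)
    {
        // Pre-size the builder: the result is never longer than the input.
        StringBuilder builder = new StringBuilder(source.Length);
        bool inside = false;

        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }   // entering a tag
            if (c == '>') { inside = false; continue; }  // leaving a tag
            if (!inside) { builder.Append(c); }          // keep text outside tags
        }
        return builder.ToString();
    }
}
```

The logic is the same flag-flipping loop. The char array version avoids the per-character Append calls and builds the result string in one step.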

Compiled

Using RegexOptions.Compiled and a separate Regex instance results in better performance than using the static Regex method. But RegexOptions.Compiled has some drawbacks: it can increase startup time by as much as ten times in some cases.
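
One way to see the startup cost on your own machine is to time the construction and first use of the Regex with and without the option. This is a rough sketch; the class name RegexStartupCost and the method name TimeConstruction are assumptions for this illustration, and a single measurement like this is noisy.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical helper for measuring Regex construction cost.
public static class RegexStartupCost
{
    // Builds the tag-matching Regex with the given options, runs it once,
    // and returns how long construction plus first use took.
    public static TimeSpan TimeConstruction(RegexOptions options)
    {
        Stopwatch watch = Stopwatch.StartNew();
        Regex regex = new Regex("<.*?>", options);
        regex.IsMatch("<p>"); // force the first match so lazy setup work is included
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Comparing TimeConstruction(RegexOptions.None) with TimeConstruction(RegexOptions.Compiled) gives a rough idea of the trade-off. The compiled Regex pays more up front and is meant to earn it back over many calls.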

Tip: More material is available on making Regexes simpler and faster to run.

Self-closing

In XHTML, certain elements such as BR and IMG have no separate closing tag and instead use "/>" at the end of the tag. The test file includes these self-closing tags, and the methods handle them correctly.

Next: Here are some of the tag forms that are supported.
Invalid tags may not work in the Regex methods.

Supported tags

<img src="" />
<img src=""/>
<br />
<br/>
< div >
<!-- -->
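
The list above can be turned into a quick self-check. The sketch below runs each sample through the same pattern used by StripTagsRegex and confirms that nothing is left over; the class name SupportedTagCheck and its members are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical self-check built from the supported tag list.
public static class SupportedTagCheck
{
    public static readonly string[] Samples =
    {
        "<img src=\"\" />",
        "<img src=\"\"/>",
        "<br />",
        "<br/>",
        "< div >",
        "<!-- -->"
    };

    // Returns true if every sample is removed completely by the tag pattern.
    public static bool AllStripToEmpty()
    {
        foreach (string sample in Samples)
        {
            string result = Regex.Replace(sample, "<.*?>", string.Empty);
            if (result.Length != 0)
            {
                return false;
            }
        }
        return true;
    }
}
```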

Comments

The methods in this article may have problems removing some comments. Sometimes a comment contains invalid markup or a > character of its own. The methods then treat that > as the end of the tag, so the rest of the comment is left in the output. It might be necessary to scan for incorrect markup.
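
As an illustration of the problem, and of one possible workaround that is not part of the original methods, the sketch below first removes whole comments with a separate pattern and only then strips tags. The class name CommentAwareStrip, both method names, and the comment pattern are assumptions.

```csharp
using System.Text.RegularExpressions;

// Hypothetical comment-aware variant, for illustration only.
public static class CommentAwareStrip
{
    // The plain tag pattern: stops at the first '>' inside the comment.
    public static string StripTagsOnly(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Workaround sketch: remove "<!-- ... -->" as a unit first, then strip tags.
    // RegexOptions.Singleline lets '.' match newlines, so multi-line comments
    // are covered as well.
    public static string StripCommentsThenTags(string source)
    {
        string withoutComments = Regex.Replace(
            source, "<!--.*?-->", string.Empty, RegexOptions.Singleline);
        return Regex.Replace(withoutComments, "<.*?>", string.Empty);
    }
}
```

For the input "<!-- if a > b -->text", StripTagsOnly returns " b -->text" because the match ends at the first >, while StripCommentsThenTags returns "text". This still does not make the approach a full HTML parser.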

Caution: The methods shown cannot handle all HTML documents. Please be careful when using them.

Validate

There are several ways to validate XHTML using methods similar to the iterative method here. One simple way is to count the < and > characters and make sure the counts match.

Also: You can run the Regex methods and then look for < > characters that are still present.
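
The bracket-counting idea can be sketched in a few lines. The class name HtmlBracketCheck and the method name BracketsBalanced are made up for this example, and the check is deliberately coarse.

```csharp
// Hypothetical validation helper based on counting angle brackets.
public static class HtmlBracketCheck
{
    // Returns true when the number of '<' characters equals the number of '>'.
    public static bool BracketsBalanced(string source)
    {
        int open = 0;
        int close = 0;
        foreach (char c in source)
        {
            if (c == '<')
            {
                open++;
            }
            else if (c == '>')
            {
                close++;
            }
        }
        return open == close;
    }
}
```

BracketsBalanced("<p>ok</p>") returns true and BracketsBalanced("<p>broken</p") returns false. The check only compares totals, so input such as "><" still passes; it flags obviously damaged markup but proves nothing about validity.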

Further: There are ways to use more complete validation.
An HTML parser can be made very complex.

Summary

We looked at several methods that can strip HTML tags from strings or files. These methods have the same results on the input. But the iterative method is faster in the test here.

And: We checked the results both by measuring string length and by inspecting the output itself. This helps establish correct results.

