C# String Remove HTML Tags

First example. Here is a class that tests 3 ways of removing HTML tags and their contents. The methods process an HTML string and return new strings that have no HTML tags.

Info StripTagsRegex uses a static call to Regex.Replace, and therefore the expression is not compiled.

Regex.Replace

Next StripTagsRegexCompiled specifies that all sequences matching tag chars with any number of characters are removed.

RegexOptions.Compiled

Finally StripTagsCharArray is an optimized, iterative method. In most benchmarks, this method is faster than Regex.

char Array

using System; using System.Text.RegularExpressions; class Program { static void Main() { const string html = "Hello world!"; Console.WriteLine(StripTagsRegex(html)); Console.WriteLine(StripTagsRegexCompiled(html)); Console.WriteLine(StripTagsCharArray(html)); } /// <summary> /// Remove HTML from string with Regex. /// </summary> public static string StripTagsRegex(string source) { return Regex.Replace(source, "<.*?>", string.Empty); } /// <summary> /// Compiled regular expression for performance. /// </summary> static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled); /// <summary> /// Remove HTML from string with compiled Regex. /// </summary> public static string StripTagsRegexCompiled(string source) { return _htmlRegex.Replace(source, string.Empty); } /// <summary> /// Remove HTML tags from string using char array. /// </summary> public static string StripTagsCharArray(string source) { char[] array = new char[source.Length]; int arrayIndex = 0; bool inside = false; for (int i = 0; i < source.Length; i++) { char let = source[i]; if (let == '<') { inside = true; continue; } if (let == '>') { inside = false; continue; } if (!inside) { array[arrayIndex] = let; arrayIndex++; } } return new string(array, 0, arrayIndex); } }

Hello world! Hello world! Hello world!

A benchmark. Regular expressions are usually not the fastest way to process text. Char arrays and the string constructor can be used instead—this often performs better.

Version 1 This version of the code removes the HTML from the generated string returned by GetHtml().

Version 2 Here we do the same thing as version 2, but use a compiled regular expression for a performance boost.

Detail Here we use a char-array method that loops and tests characters, appending to a buffer as it goes along.

Result The char array method was considerably faster—in 2020, using a char array is still a good choice.

using System; using System.Diagnostics; using System.Linq; using System.Text.RegularExpressions; class Program { static void Main() { string html = GetHtml(); const int m = 10000; Stopwatch s1 = Stopwatch.StartNew(); // Version 1: use Regex. for (int i = 0; i < m; i++) { if (StripTagsRegex(html) == null) { return; } } s1.Stop(); Stopwatch s2 = Stopwatch.StartNew(); // Version 2: use Regex Compiled. for (int i = 0; i < m; i++) { if (StripTagsRegexCompiled(html) == null) { return; } } s2.Stop(); Stopwatch s3 = Stopwatch.StartNew(); // Version 3: use char array. for (int i = 0; i < m; i++) { if (StripTagsCharArray(html) == null) { return; } } s3.Stop(); Console.WriteLine(s1.ElapsedMilliseconds); Console.WriteLine(s2.ElapsedMilliseconds); Console.WriteLine(s3.ElapsedMilliseconds); } static string GetHtml() { var result = Enumerable.Repeat("Hello, friend, how are you?", 100); return string.Join("", result); } public static string StripTagsRegex(string source) { return Regex.Replace(source, "<.*?>", string.Empty); } static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled); public static string StripTagsRegexCompiled(string source) { return _htmlRegex.Replace(source, string.Empty); } public static string StripTagsCharArray(string source) { char[] array = new char[source.Length]; int arrayIndex = 0; bool inside = false; for (int i = 0; i < source.Length; i++) { char let = source[i]; if (let == '<') { inside = true; continue; } if (let == '>') { inside = false; continue; } if (!inside) { array[arrayIndex] = let; arrayIndex++; } } return new string(array, 0, arrayIndex); } }

1086 ms StripTagsRegex 694 ms StripTagsRegexCompiled 54 ms StripTagsCharArray

Validate HTML. Here is a way to validate XHTML using methods similar to StripTagsCharArray. We count the number of tag chars and make sure the counts match.

Also We can run the Regex methods and then look for tag bracket characters that are still present.

Further There are ways to use more complete validation. An HTML parser can be made very complex.

Important Because of how HTML works, having unescaped angle brackets is potentially very harmful to a website layout.

using System; class Program { static void Main() { // Test the IsValid method. Console.WriteLine(HtmlUtil.IsValid("<html><head></head></html>")); Console.WriteLine(HtmlUtil.IsValid("<html<head<head<html")); Console.WriteLine(HtmlUtil.IsValid("<a href=y>x</a>")); Console.WriteLine(HtmlUtil.IsValid("<<>>")); Console.WriteLine(HtmlUtil.IsValid("")); } } static class HtmlUtil { enum TagType { SmallerThan, // "<" GreaterThan // ">" } public static bool IsValid(string html) { TagType expected = TagType.SmallerThan; // Must start with "<" for (int i = 0; i < html.Length; i++) // Loop { bool smallerThan = html[i] == '<'; bool greaterThan = html[i] == '>'; if (!smallerThan && !greaterThan) // Common case { continue; } if (smallerThan && expected == TagType.SmallerThan) // If < and expected continue { expected = TagType.GreaterThan; continue; } if (greaterThan && expected == TagType.GreaterThan) // If > and expected continue { expected = TagType.SmallerThan; continue; } return false; // Disallow } return expected == TagType.SmallerThan; // Must expect "<" } }

True False True False True

Notes, char arrays. One method (StripTagsCharArray) uses char arrays. It is much faster than the other 2 methods. It uses an algorithm for parsing the HTML.

Detail It iterates through all characters, flipping a flag Boolean depending on whether it is inside a tag block.

And It only adds a char to the array if it is not a tag. It uses char arrays and the string constructor.

String Constructor

Detail Char arrays are faster than using StringBuilder. But StringBuilder can be used with similar results.