C# Remove HTML Tags

Strip or remove HTML tags from strings with Regex.Replace and char arrays.
Remove HTML tags. A string contains HTML tags. We want to remove those tags. This is useful for displaying HTML in plain text and stripping formatting like bold and italics.
A Regex cannot handle all HTML documents. An iterative solution, with a for-loop, may be best in many cases. Always test these methods on your own input.

Also: A simple for-loop can be used to validate HTML to see if it is mostly correct (whether its tags have correct syntax).

First example. Here is a class that tests 3 ways of removing HTML tags. The methods process an HTML string and return new strings that have no HTML tags. The text between the tags is kept.

StripTagsRegex: This uses a static call to Regex.Replace, and therefore the expression is not compiled.

Regex: The pattern <.*?> matches a <, then as few characters as possible, then the next >. Every such match is removed.

StripTagsRegexCompiled: The regular expression (Regex) object is created once with RegexOptions.Compiled, stored in a static field, and reused on every call.

StripTagsCharArray: This method is an optimized, iterative method. In most benchmarks, this method is faster than Regex.

C# program that removes HTML tags

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string html = "<p>Hello <b>world</b>!</p>";

        Console.WriteLine(StripTagsRegex(html));
        Console.WriteLine(StripTagsRegexCompiled(html));
        Console.WriteLine(StripTagsCharArray(html));
    }

    /// <summary>
    /// Remove HTML from string with Regex.
    /// </summary>
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    /// <summary>
    /// Compiled regular expression for performance.
    /// </summary>
    static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    /// <summary>
    /// Remove HTML from string with compiled Regex.
    /// </summary>
    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

    /// <summary>
    /// Remove HTML tags from string using char array.
    /// </summary>
    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;

        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}

Output

Hello world!
Hello world!
Hello world!
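Lazy quantifier. The pattern <.*?> depends on the ? after the star, which makes each match stop at the first > it finds. The sketch below contrasts it with a greedy variant to show why that matters. The class and method names (QuantifierDemo, StripLazy, StripGreedy) are made up for this illustration, and the greedy method exists only for contrast.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // Lazy pattern from the article: each match ends at the first '>'.
    public static string StripLazy(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Greedy variant, for contrast only: a single match can run from
    // the first '<' to the last '>' on the line, swallowing the text.
    public static string StripGreedy(string source)
    {
        return Regex.Replace(source, "<.*>", string.Empty);
    }

    static void Main()
    {
        const string html = "<p>Hello <b>world</b>!</p>";
        Console.WriteLine("[" + StripLazy(html) + "]");   // [Hello world!]
        Console.WriteLine("[" + StripGreedy(html) + "]"); // []
    }
}
```

With the greedy form the whole input is one match, so nothing is left. This is why the methods on this page use the lazy form.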
A benchmark. Regular expressions are usually not the fastest way to process text. Char arrays and the string constructor can be used instead, and this often performs better.

Version 1: This version of the code uses the static Regex.Replace call to remove the HTML from the generated string returned by GetHtml().

Version 2: Here we do the same thing as version 1, but use a compiled regular expression for a performance boost.

Version 3: Here we use a char-array method that loops and tests characters, appending to a buffer as it goes along.

Result: The char array method was considerably faster. In 2020, using a char array is still a good choice.

C# program that times HTML removal methods

using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string html = GetHtml();
        const int m = 10000;

        Stopwatch s1 = Stopwatch.StartNew();
        // Version 1: use Regex.
        for (int i = 0; i < m; i++)
        {
            if (StripTagsRegex(html) == null)
            {
                return;
            }
        }
        s1.Stop();

        Stopwatch s2 = Stopwatch.StartNew();
        // Version 2: use Regex Compiled.
        for (int i = 0; i < m; i++)
        {
            if (StripTagsRegexCompiled(html) == null)
            {
                return;
            }
        }
        s2.Stop();

        Stopwatch s3 = Stopwatch.StartNew();
        // Version 3: use char array.
        for (int i = 0; i < m; i++)
        {
            if (StripTagsCharArray(html) == null)
            {
                return;
            }
        }
        s3.Stop();

        Console.WriteLine(s1.ElapsedMilliseconds);
        Console.WriteLine(s2.ElapsedMilliseconds);
        Console.WriteLine(s3.ElapsedMilliseconds);
    }

    static string GetHtml()
    {
        var result = Enumerable.Repeat("<p><b>Hello, friend,</b> how are you?</p>", 100);
        return string.Join("", result);
    }

    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;

        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}

Output

1086 ms    StripTagsRegex
 694 ms    StripTagsRegexCompiled
  54 ms    StripTagsCharArray
Self-closing. In XHTML, some elements have no separate closing tag, and instead use the "/>" at the end of the first tag. The methods tested on this page correctly handle self-closing tags.

Next: Here are some HTML tags supported. Invalid tags may not work in the Regex methods.

Supported tags: <img src="" /> <img src=""/> <br /> <br/> < div > <!-- -->
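Tags across lines. One case not in the list above is a tag that is split across two lines. In .NET the dot does not match a newline unless RegexOptions.Singleline is set, so the pattern <.*?> skips such a tag, while the char-array loop is unaffected. The sketch below shows the difference. MultiLineTagDemo and its method names are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class MultiLineTagDemo
{
    // Pattern as used in the article: '.' stops at '\n'.
    public static string StripDefault(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Same pattern with Singleline, so '.' also matches '\n'.
    public static string StripSingleline(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty, RegexOptions.Singleline);
    }

    static void Main()
    {
        // The opening tag contains a line break between "a" and "href".
        const string html = "<a\nhref=\"x\">link</a>";

        // Newlines are shown as \n so the output stays on one line.
        Console.WriteLine(StripDefault(html).Replace("\n", "\\n"));    // <a\nhref="x">link
        Console.WriteLine(StripSingleline(html).Replace("\n", "\\n")); // link
    }
}
```

If the input may contain line breaks inside tags, adding RegexOptions.Singleline to the Regex methods is one possible adjustment. Test it on real input first.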
Validate HTML. Here is a way to validate XHTML using methods similar to StripTagsCharArray. We check that the < and > characters strictly alternate: each < must be followed by a > before the next < appears, and the string must not end inside a tag.

Also: We can run the Regex methods and then look for < > characters that are still present.

Further: There are ways to use more complete validation. An HTML parser can be made very complex.

Important: Because of how HTML works, having unescaped angle brackets is potentially very harmful to a website layout.

C# program that validates brackets

using System;

class Program
{
    static void Main()
    {
        // Test the IsValid method.
        Console.WriteLine(HtmlUtil.IsValid("<html><head></head></html>"));
        Console.WriteLine(HtmlUtil.IsValid("<html<head<head<html"));
        Console.WriteLine(HtmlUtil.IsValid("<a href=y>x</a>"));
        Console.WriteLine(HtmlUtil.IsValid("<<>>"));
        Console.WriteLine(HtmlUtil.IsValid(""));
    }
}

static class HtmlUtil
{
    enum TagType
    {
        SmallerThan, // <
        GreaterThan  // >
    }

    public static bool IsValid(string html)
    {
        TagType expected = TagType.SmallerThan; // Must start with <
        for (int i = 0; i < html.Length; i++) // Loop
        {
            bool smallerThan = html[i] == '<';
            bool greaterThan = html[i] == '>';
            if (!smallerThan && !greaterThan) // Common case
            {
                continue;
            }
            if (smallerThan && expected == TagType.SmallerThan) // If < and expected continue
            {
                expected = TagType.GreaterThan;
                continue;
            }
            if (greaterThan && expected == TagType.GreaterThan) // If > and expected continue
            {
                expected = TagType.SmallerThan;
                continue;
            }
            return false; // Disallow
        }
        return expected == TagType.SmallerThan; // Must expect <
    }
}

Output

True
False
True
False
True
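Leftover brackets. The other check mentioned above, running a Regex method and then looking for < or > characters that remain, could be sketched as follows. HasLeftoverBrackets is a made-up helper name, and the check is only a rough signal, not a full validator.

```csharp
using System;
using System.Text.RegularExpressions;

static class LeftoverCheck
{
    // Hypothetical helper: strip tags with the article's pattern, then
    // report whether any '<' or '>' is still present in the result.
    public static bool HasLeftoverBrackets(string html)
    {
        string stripped = Regex.Replace(html, "<.*?>", string.Empty);
        return stripped.IndexOfAny(new[] { '<', '>' }) >= 0;
    }

    static void Main()
    {
        Console.WriteLine(HasLeftoverBrackets("<p>Hello</p>"));  // False
        Console.WriteLine(HasLeftoverBrackets("<p>a > b</p>"));  // True: stray '>'
        Console.WriteLine(HasLeftoverBrackets("<p>Hello <b"));   // True: unclosed '<'
    }
}
```

A True result means the stripped text would still show a raw bracket, which is a hint that the input had unescaped or unbalanced markup.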
Note, comments. The methods in this article may have problems with removing some comments. Sometimes, comments contain invalid markup.

And: This may result in comments being incompletely removed. It might be necessary to scan for incorrect markup.

Caution: The methods shown cannot handle all HTML documents. Please be careful when using them.
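Comment example. A short sketch of the comment problem: when a comment contains a > character, the pattern <.*?> ends its match there and leaves the rest of the comment in the output. One possible workaround, assuming comments are well formed (they start with <!-- and end with -->), is to remove them in a separate pass first. CommentDemo and its method names are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommentDemo
{
    // The article's pattern applied directly.
    public static string StripTagsOnly(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Assumption: comments are well formed ("<!--" ... "-->").
    // Singleline lets a comment span several lines.
    public static string StripCommentsThenTags(string source)
    {
        string noComments = Regex.Replace(
            source, "<!--.*?-->", string.Empty, RegexOptions.Singleline);
        return Regex.Replace(noComments, "<.*?>", string.Empty);
    }

    static void Main()
    {
        const string html = "<p>Hi</p><!-- a > b -->";
        Console.WriteLine(StripTagsOnly(html));         // Hi b -->
        Console.WriteLine(StripCommentsThenTags(html)); // Hi
    }
}
```

The two-pass version still fails on malformed comments, so the caution above applies to it as well.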

Notes, char arrays. One method (StripTagsCharArray) uses char arrays. It is much faster than the other 2 methods. It uses an algorithm for parsing the HTML.

Algorithm: It iterates through all characters, flipping a Boolean flag depending on whether it is inside a tag block.

And: It only adds a char to the array if that char is not inside a tag. It uses char arrays and the string constructor.

StringBuilder: Char arrays are faster than using StringBuilder. But StringBuilder can be used with similar results.
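StringBuilder version. For comparison, here is a sketch of the same loop written with StringBuilder instead of a char array. The name StripTagsStringBuilder is made up for this illustration, and this version was not part of the benchmark on this page, so no timing is claimed for it.

```csharp
using System;
using System.Text;

static class StringBuilderDemo
{
    // Same flag-based loop as StripTagsCharArray, appending to a
    // StringBuilder sized to the input so it should not need to grow.
    public static string StripTagsStringBuilder(string source)
    {
        var builder = new StringBuilder(source.Length);
        bool inside = false;

        foreach (char c in source)
        {
            if (c == '<')
            {
                inside = true;
                continue;
            }
            if (c == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }

    static void Main()
    {
        Console.WriteLine(StripTagsStringBuilder("<p>Hello <b>world</b>!</p>")); // Hello world!
    }
}
```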

A summary. Several methods can strip HTML tags from strings or files. On the inputs tested here, these methods return the same results, but the iterative method is faster.
© 2007-2020 Sam Allen. Every person is special and unique. Send bug reports to info@dotnetperls.com.
Dot Net Perls