C# Remove HTML Tags, Validate HTML

Strip or remove HTML tags from strings with Regex.Replace and char arrays.
Remove HTML tags. A string contains HTML tags. We want to remove those tags. This is useful for displaying HTML in plain text and stripping formatting like bold and italics.
A Regex cannot handle all HTML documents. An iterative solution, with a for-loop, may be better in many cases. Always test the methods on your own input.

Also: A simple for-loop can be used to validate HTML to see if it is mostly correct (whether its tags have correct syntax).

First example. Here is a static class that tests 3 ways of removing HTML tags, including everything between each < and >. The methods process an HTML string and return new strings that have no HTML tags. The text between the tags is kept.

Tip: We can call into the class with a call such as HtmlRemoval.StripTagsRegex(). We can put this class in a separate file named HtmlRemoval.cs.

StripTagsRegex: This uses a static call to Regex.Replace, so the expression is not created with RegexOptions.Compiled.

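Regex: The pattern <.*?> matches a <, then as few characters as possible (the lazy quantifier), then the next >. Every such sequence is removed. By default the dot does not match a newline, so a tag that spans two lines is not matched.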
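Example: The effect of the lazy quantifier can be seen by comparing it with a greedy pattern. This is a minimal sketch. The class name LazyQuantifierDemo, the greedy variant, and the RegexOptions.Singleline variant are illustrative assumptions and are not part of the HtmlRemoval class.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyQuantifierDemo
{
    // The article's pattern: the lazy ".*?" stops at the first '>' after each '<'.
    public static string StripLazy(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Greedy variant, shown only for comparison: ".*" runs to the last '>' on the line.
    public static string StripGreedy(string source)
    {
        return Regex.Replace(source, "<.*>", string.Empty);
    }

    // Assumption, not in the article: Singleline lets '.' match '\n',
    // so a tag that spans two lines is removed as well.
    public static string StripLazySingleline(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty, RegexOptions.Singleline);
    }
}
```

For the input "<b>bold</b> text", StripLazy returns "bold text", while StripGreedy returns " text" because the greedy match swallows the text between the tags. For "<a\nhref=x>link</a>", StripLazy leaves the first tag in place and StripLazySingleline returns "link".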

StripTagsRegexCompiled: The regular expression (Regex) object is created once with RegexOptions.Compiled and stored in a static field of the class, so each call reuses it.

StripTagsCharArray: This is an optimized, iterative method. In most benchmarks, it is faster than Regex.

HtmlRemoval static class: C#

    using System;
    using System.Text.RegularExpressions;

    /// <summary>
    /// Methods to remove HTML from strings.
    /// </summary>
    public static class HtmlRemoval
    {
        /// <summary>
        /// Remove HTML from string with Regex.
        /// </summary>
        public static string StripTagsRegex(string source)
        {
            return Regex.Replace(source, "<.*?>", string.Empty);
        }

        /// <summary>
        /// Compiled regular expression for performance.
        /// </summary>
        static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

        /// <summary>
        /// Remove HTML from string with compiled Regex.
        /// </summary>
        public static string StripTagsRegexCompiled(string source)
        {
            return _htmlRegex.Replace(source, string.Empty);
        }

        /// <summary>
        /// Remove HTML tags from string using char array.
        /// </summary>
        public static string StripTagsCharArray(string source)
        {
            char[] array = new char[source.Length];
            int arrayIndex = 0;
            bool inside = false;

            for (int i = 0; i < source.Length; i++)
            {
                char let = source[i];
                if (let == '<')
                {
                    inside = true;
                    continue;
                }
                if (let == '>')
                {
                    inside = false;
                    continue;
                }
                if (!inside)
                {
                    array[arrayIndex] = let;
                    arrayIndex++;
                }
            }
            return new string(array, 0, arrayIndex);
        }
    }

Notes, preferred method. I recommend the StripTagsRegexCompiled method for most programs. It is simple to inspect. And it is considerably faster than the non-compiled Regex method.
Tests. We run these methods through a simple test. The char array method will strip everything that follows a <, even when no > follows, but the Regex methods require a closing > before they strip the tag.
C# program that tests HTML removal

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main()
        {
            const string html = "<p>There was a <b>.NET</b> programmer " +
                "and he stripped the <i>HTML</i> tags.</p>";

            Console.WriteLine(HtmlRemoval.StripTagsRegex(html));
            Console.WriteLine(HtmlRemoval.StripTagsRegexCompiled(html));
            Console.WriteLine(HtmlRemoval.StripTagsCharArray(html));
        }
    }

Output

    There was a .NET programmer and he stripped the HTML tags.
    There was a .NET programmer and he stripped the HTML tags.
    There was a .NET programmer and he stripped the HTML tags.

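Example: The difference between the char array and Regex approaches on an unclosed < can be shown with a short sketch. The class name UnclosedBracketDemo and the input string are illustrative assumptions. The two methods repeat the logic of StripTagsRegex and StripTagsCharArray so the block compiles on its own.

```csharp
using System;
using System.Text.RegularExpressions;

static class UnclosedBracketDemo
{
    // Same pattern as StripTagsRegex: a match needs both '<' and a later '>'.
    public static string StripRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Same loop as StripTagsCharArray: a '<' starts skipping immediately,
    // whether or not a '>' ever follows.
    public static string StripCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = c; }
        }
        return new string(array, 0, arrayIndex);
    }
}
```

For the input "x < y", StripRegex returns the string unchanged because there is no > to complete a match, while StripCharArray returns "x " and drops everything after the <.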
Benchmarks. Regular expressions are usually not the fastest way to process text. Char arrays and the string constructor can be used instead, and this often performs better.

Note: The benchmark stripped an HTML file of around 8000 characters 10000 times in a tight loop for each method. The file was read in with File.ReadAllText.

Result: The char array method was considerably faster. It is a good choice when performance is critical.

Removing HTML tags from strings:

    Input:  <p>The <b>dog</b> is <i>cute</i>.</p>
    Output: The dog is cute.

Performance test for HTML removal:

    HtmlRemoval.StripTagsRegex:         2404 ms
    HtmlRemoval.StripTagsRegexCompiled: 1366 ms
    HtmlRemoval.StripTagsCharArray:      287 ms [fastest]

File length test for HTML removal:

    File length before:                 8085 chars
    HtmlRemoval.StripTagsRegex:         4382 chars
    HtmlRemoval.StripTagsRegexCompiled: 4382 chars
    HtmlRemoval.StripTagsCharArray:     4382 chars

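Example: A timing harness along these lines could be used to repeat the comparison. This is only a sketch and not the original benchmark code. The names StripBenchmarkSketch, Time and Run are hypothetical, and the html string and iteration count are supplied by the caller (for example from File.ReadAllText). Timings depend on the machine and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class StripBenchmarkSketch
{
    static readonly Regex CompiledRegex = new Regex("<.*?>", RegexOptions.Compiled);

    // Same loop as StripTagsCharArray, repeated so the block is self-contained.
    public static string StripCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = c; }
        }
        return new string(array, 0, arrayIndex);
    }

    // Runs strip(html) the given number of times and returns elapsed milliseconds.
    public static long Time(Func<string, string> strip, string html, int iterations)
    {
        strip(html); // Warm-up call so one-time setup (JIT, Regex construction) is not timed.
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            strip(html);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    // Prints one timing line per approach.
    public static void Run(string html, int iterations)
    {
        Console.WriteLine("Regex static:   {0} ms",
            Time(s => Regex.Replace(s, "<.*?>", string.Empty), html, iterations));
        Console.WriteLine("Regex compiled: {0} ms",
            Time(s => CompiledRegex.Replace(s, string.Empty), html, iterations));
        Console.WriteLine("Char array:     {0} ms",
            Time(StripCharArray, html, iterations));
    }
}
```

A call such as StripBenchmarkSketch.Run(File.ReadAllText(path), 10000) would mirror the setup described in the note, where path is a placeholder for an HTML file on disk.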
Notes, char arrays. One method (StripTagsCharArray) uses char arrays. It is much faster than the other 2 methods. It scans the HTML character by character instead of matching a pattern.

Algorithm: It iterates through all characters, flipping a Boolean flag depending on whether it is inside a tag block.

And: It only adds a char to the array if it is not inside a tag. The result is built with the char array and the string constructor.

StringBuilder: Char arrays are faster than using StringBuilder. But StringBuilder can be used with similar results.
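Example: A StringBuilder version of the same loop might look like the sketch below. The class and method names (HtmlRemovalStringBuilder, StripTagsStringBuilder) are hypothetical and not part of the HtmlRemoval class. The logic mirrors StripTagsCharArray, with Append taking the place of the array write.

```csharp
using System;
using System.Text;

static class HtmlRemovalStringBuilder
{
    // Same flag-based scan as StripTagsCharArray, but appending to a StringBuilder.
    public static string StripTagsStringBuilder(string source)
    {
        // The capacity is set to the input length, the largest possible result.
        StringBuilder builder = new StringBuilder(source.Length);
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) { builder.Append(c); }
        }
        return builder.ToString();
    }
}
```

For the input "<p>The <b>dog</b> is <i>cute</i>.</p>" this returns "The dog is cute.", the same result as the char array method.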

Compiled. Using RegexOptions.Compiled and a separate Regex results in better performance than using the Regex static method. But RegexOptions.Compiled has some drawbacks, such as a higher startup cost when the Regex is constructed.
Self-closing. In XHTML, some elements have no separate closing tag, and instead use the "/>" at the end of the first tag. The methods tested on this page correctly handle self-closing tags.

Next: Here are some tag forms that the methods handle. Invalid tags may not work in the Regex methods.

Supported tags:

    <img src="" />
    <img src=""/>
    <br />
    <br/>
    < div >
    <!-- -->

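Example: The listed tag forms can be checked with a small sketch. The class name SupportedTagsCheck and the "a" and "b" marker characters around each tag are illustrative assumptions. The two private methods repeat the Regex and char array logic from HtmlRemoval.

```csharp
using System;
using System.Text.RegularExpressions;

static class SupportedTagsCheck
{
    // The tag forms listed in the article.
    public static readonly string[] Tags =
    {
        "<img src=\"\" />", "<img src=\"\"/>", "<br />", "<br/>", "< div >", "<!-- -->"
    };

    static string StripRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    static string StripCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = c; }
        }
        return new string(array, 0, arrayIndex);
    }

    // Returns true when both approaches reduce "a" + tag + "b" to "ab" for every listed tag.
    public static bool AllStripped()
    {
        foreach (string tag in Tags)
        {
            string input = "a" + tag + "b";
            if (StripRegex(input) != "ab" || StripCharArray(input) != "ab")
            {
                return false;
            }
        }
        return true;
    }
}
```

Each listed form has exactly one < at the start and one > at the end, so both approaches remove it completely and AllStripped returns true.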
Note, comments. The methods in this article may have problems with removing some comments. Sometimes, comments contain invalid markup or a > character, and the methods treat the first > as the end of the tag.

And: This may result in comments being incompletely removed. It might be necessary to scan for incorrect markup.
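Example: The comment problem can be reproduced with a comment that contains a > character. The sketch below also shows one possible workaround, which is an assumption and not a method from this article: remove whole comments with a separate pattern first, then strip the remaining tags. The names CommentStrippingSketch and StripCommentsThenTags are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommentStrippingSketch
{
    // Same pattern as StripTagsRegex: the match ends at the first '>'.
    public static string StripTags(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Assumed workaround: remove "<!-- ... -->" spans first (Singleline so that
    // comments may span lines), then strip the remaining tags.
    public static string StripCommentsThenTags(string source)
    {
        string withoutComments = Regex.Replace(
            source, "<!--.*?-->", string.Empty, RegexOptions.Singleline);
        return StripTags(withoutComments);
    }
}
```

For the input "<!-- a > b -->text", StripTags returns " b -->text" because the match stops at the first >, while StripCommentsThenTags returns "text". The workaround still assumes that every comment is properly closed with -->.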

Caution: The methods shown cannot handle all HTML documents. Please be careful when using them.

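Validate HTML. Here is a way to validate the bracket syntax of HTML using a loop similar to StripTagsCharArray. We check that the < and > characters strictly alternate: each < must be followed by a > before the next <, and the string must not end inside a tag.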

Also: We can run the Regex methods and then look for < > characters that are still present.

Further: There are ways to use more complete validation. An HTML parser can be made very complex.

Important: Because of how HTML works, having unescaped angle brackets is potentially very harmful to a website layout.

C# program that validates brackets

    using System;

    class Program
    {
        static void Main()
        {
            // Test the IsValid method.
            Console.WriteLine(HtmlUtil.IsValid("<html><head></head></html>"));
            Console.WriteLine(HtmlUtil.IsValid("<html<head<head<html"));
            Console.WriteLine(HtmlUtil.IsValid("<a href=y>x</a>"));
            Console.WriteLine(HtmlUtil.IsValid("<<>>"));
            Console.WriteLine(HtmlUtil.IsValid(""));
        }
    }

    static class HtmlUtil
    {
        enum TagType
        {
            SmallerThan, // <
            GreaterThan // >
        }

        public static bool IsValid(string html)
        {
            TagType expected = TagType.SmallerThan; // Must start with <
            for (int i = 0; i < html.Length; i++) // Loop
            {
                bool smallerThan = html[i] == '<';
                bool greaterThan = html[i] == '>';
                if (!smallerThan && !greaterThan) // Common case
                {
                    continue;
                }
                if (smallerThan && expected == TagType.SmallerThan) // If < and expected continue
                {
                    expected = TagType.GreaterThan;
                    continue;
                }
                if (greaterThan && expected == TagType.GreaterThan) // If > and expected continue
                {
                    expected = TagType.SmallerThan;
                    continue;
                }
                return false; // Disallow
            }
            return expected == TagType.SmallerThan; // Must expect <
        }
    }

Output

    True
    False
    True
    False
    True

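Example: The alternative check mentioned above, running a Regex method and then looking for leftover < or > characters, could be sketched as follows. The names LeftoverBracketCheck and HasLeftoverBrackets are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class LeftoverBracketCheck
{
    // Strips tags with the article's pattern, then reports whether any
    // '<' or '>' character survived the stripping.
    public static bool HasLeftoverBrackets(string html)
    {
        string stripped = Regex.Replace(html, "<.*?>", string.Empty);
        return stripped.IndexOfAny(new[] { '<', '>' }) >= 0;
    }
}
```

HasLeftoverBrackets returns false for "<p>ok</p>" and true for "x < y", "a > b" and "<html<head". The check is not complete: for "<p>1 < 2</p>" the Regex removes "< 2</p>" as if it were one tag, so nothing is left over and the stray < goes unnoticed.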
A summary. Several methods can strip HTML tags from strings or files. The methods shown here give the same results on the test input. But the iterative method is faster in the test here.
© 2007-2019 Sam Allen. Every person is special and unique. Send bug reports to info@dotnetperls.com.