Remove HTML markup from Strings using a Regex-based method that does not always work.
Remove HTML. A String contains HTML markup. It is possible to remove this markup with a VB.NET Function. We develop a custom Function based on the Regex type. It uses a regular expression to strip HTML markup tags.
Example. To begin, this program imports the System.Text.RegularExpressions namespace. Next it introduces the StripTags Function, which performs the HTML removal by calling the Regex.Replace function.
StripTags: Here all text matching the pattern <.*?> is replaced with an empty string. The pattern matches a <, then as few characters as possible (possibly none), then the nearest >.
VB.NET program that removes HTML markup from String

Imports System.Text.RegularExpressions

Module Module1

    Sub Main()
        ' Input string.
        Dim html As String = "<p>There was a <b>.NET</b> programmer " +
            "and he stripped the <i>HTML</i> tags.</p>"
        ' Call Function.
        Dim tagless As String = StripTags(html)
        ' Write.
        Console.WriteLine(tagless)
    End Sub

    ''' Strip HTML tags.
    Function StripTags(ByVal html As String) As String
        ' Remove HTML tags.
        Return Regex.Replace(html, "<.*?>", "")
    End Function

End Module

Output

There was a .NET programmer and he stripped the HTML tags.
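The question mark in the pattern is what keeps each match confined to a single tag. As a small sketch of that point, the module below compares the lazy pattern from StripTags with a greedy variant. The names QuantifierDemo, StripLazy and StripGreedy are hypothetical and exist only for this illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module QuantifierDemo

    ' Lazy pattern from StripTags: each match ends at the nearest ">".
    Function StripLazy(ByVal html As String) As String
        Return Regex.Replace(html, "<.*?>", "")
    End Function

    ' Greedy variant, shown only for contrast: one match can run
    ' from the first "<" to the last ">" on the line.
    Function StripGreedy(ByVal html As String) As String
        Return Regex.Replace(html, "<.*>", "")
    End Function

    Sub Main()
        Dim html As String = "<b>bold</b> and <i>italic</i>"
        Console.WriteLine("Lazy:   [" & StripLazy(html) & "]")
        Console.WriteLine("Greedy: [" & StripGreedy(html) & "]")
    End Sub

End Module
```

With this input the lazy version leaves "bold and italic" between the brackets. The greedy version leaves nothing, because its single match swallows the text between the tags as well.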
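Comments. If your HTML markup is malformed in any way, or contains comments, this method will cause you grief. The pattern stops at the first > it finds, so a comment that itself contains a > is only partly removed. And because . does not match a newline by default, a tag split across two lines is not removed at all. You may wish to validate the markup first. A simple parser that checks that every < has a matching > can do this.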
Alternatively: You can build a more advanced parser that handles the incorrect markup you encounter.
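A simple check of this kind might look like the sketch below. It is one possible reading of "matches < and >", not a full HTML validator: it only confirms that angle brackets alternate correctly. The names MarkupCheck and HasBalancedBrackets are hypothetical.

```vbnet
Imports System

Module MarkupCheck

    ' Returns True when every "<" is closed by a ">" before the next "<",
    ' and no ">" appears outside a tag. This is an assumption about what
    ' "valid" means here; it knows nothing about tag names or nesting.
    Function HasBalancedBrackets(ByVal html As String) As Boolean
        Dim inside As Boolean = False
        For Each c As Char In html
            If c = "<"c Then
                If inside Then Return False
                inside = True
            ElseIf c = ">"c Then
                If Not inside Then Return False
                inside = False
            End If
        Next
        ' A tag still open at the end of the string is also invalid.
        Return Not inside
    End Function

    Sub Main()
        Console.WriteLine(HasBalancedBrackets("<p>Fine</p>"))
        Console.WriteLine(HasBalancedBrackets("<p>Broken</p"))
        Console.WriteLine(HasBalancedBrackets("<!-- a > b -->"))
    End Sub

End Module
```

The first call returns True and the other two return False. The check is deliberately strict: plain text containing a bare > is rejected as well, so you would call StripTags only when the check passes and handle the rejected input separately.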
Summary. The easiest way to strip HTML tags from your String data is to use the Regex type. Other methods that scan the String and copy characters into a Char array can be more efficient, but they also make your program more complicated.
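For comparison, a Char array version could be sketched as follows. This is an illustrative assumption about what such a method looks like, not a tested replacement for StripTags, and the names CharArrayStrip and StripTagsCharArray are hypothetical. No speed measurements are claimed here.

```vbnet
Imports System

Module CharArrayStrip

    ' Copies every character that is outside a tag into a buffer.
    ' Assumes html is not Nothing.
    Function StripTagsCharArray(ByVal html As String) As String
        ' The result can never be longer than the input.
        Dim buffer(html.Length - 1) As Char
        Dim count As Integer = 0
        Dim inside As Boolean = False

        For Each c As Char In html
            If c = "<"c Then
                inside = True
            ElseIf c = ">"c Then
                inside = False
            ElseIf Not inside Then
                buffer(count) = c
                count += 1
            End If
        Next

        Return New String(buffer, 0, count)
    End Function

    Sub Main()
        Dim html As String = "<p>There was a <b>.NET</b> programmer " +
            "and he stripped the <i>HTML</i> tags.</p>"
        Console.WriteLine(StripTagsCharArray(html))
    End Sub

End Module
```

On well-formed input this produces the same text as the Regex version. The two differ on malformed input: this loop drops a stray > and discards everything after an unclosed <, while the Regex version leaves both in place.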