
An HTML syntax error is sometimes a major problem. It can prevent information from being indexed. It can prevent ads from being shown. One way you can prevent syntax errors is by using simple validation routines in your C# code.
This C# article demonstrates an HTML bracket counting method.
The method shown here does not implement a complete validation routine for HTML; a complete HTML validator for your specific case might be hard to develop quickly. Instead, this method simply makes sure that the angle brackets alternate: after each < character, the next angle bracket must be a >, and after each > character, the next angle bracket must be a <.
The first angle bracket must be a < and the last must be a >. With this algorithm, you can gain some confidence that your HTML files are not heavily corrupted. They still might not be correct, but they are at least more likely to be.
Program that validates brackets [C#]
using System;

class Program
{
    static void Main()
    {
        // Test the IsValid method.
        Console.WriteLine(HtmlUtil.IsValid("<html><head></head></html>"));
        Console.WriteLine(HtmlUtil.IsValid("<html<head<head<html"));
        Console.WriteLine(HtmlUtil.IsValid("<a href=y>x</a>"));
        Console.WriteLine(HtmlUtil.IsValid("<<>>"));
        Console.WriteLine(HtmlUtil.IsValid(""));
    }
}

static class HtmlUtil
{
    enum TagType
    {
        SmallerThan, // <
        GreaterThan  // >
    }

    public static bool IsValid(string html)
    {
        TagType expected = TagType.SmallerThan; // Must start with <
        for (int i = 0; i < html.Length; i++) // Loop over every character
        {
            bool smallerThan = html[i] == '<';
            bool greaterThan = html[i] == '>';
            if (!smallerThan && !greaterThan) // Common case: not a bracket
            {
                continue;
            }
            if (smallerThan && expected == TagType.SmallerThan) // Found < as expected
            {
                expected = TagType.GreaterThan;
                continue;
            }
            if (greaterThan && expected == TagType.GreaterThan) // Found > as expected
            {
                expected = TagType.SmallerThan;
                continue;
            }
            return false; // Bracket out of order: disallow
        }
        return expected == TagType.SmallerThan; // Must end expecting <
    }
}
Output
True
False
True
False
True
Results. The program shows that three of the inputs are valid, and two are not. The method will detect some encoding errors in HTML pages: for example, if you have an unencoded > or < symbol in your text, it usually breaks the alternation of brackets and the method returns false.
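To illustrate the encoding case, here is a minimal sketch. It repeats the bracket check in a compact form so that it compiles on its own, and it uses WebUtility.HtmlEncode from System.Net to escape the text. The sample strings are assumptions made for this illustration, not taken from a real page.

```csharp
using System;
using System.Net;

static class HtmlUtil
{
    // Compact restatement of the bracket-alternation check.
    public static bool IsValid(string html)
    {
        bool expectClose = false; // false: next bracket must be <, true: must be >
        foreach (char c in html)
        {
            if (c != '<' && c != '>')
            {
                continue; // Not a bracket
            }
            if ((c == '>') != expectClose)
            {
                return false; // Bracket out of order
            }
            expectClose = !expectClose;
        }
        return !expectClose; // Must not end inside an open tag
    }
}

class Program
{
    static void Main()
    {
        string text = "1 < 2"; // Hypothetical page text containing a raw <
        string raw = "<p>" + text + "</p>";
        string encoded = "<p>" + WebUtility.HtmlEncode(text) + "</p>";

        Console.WriteLine(HtmlUtil.IsValid(raw));     // False: the raw < breaks the alternation
        Console.WriteLine(HtmlUtil.IsValid(encoded)); // True: the < became &lt;
    }
}
```

Note that a stray pair such as "a < b > c" in text still alternates correctly, so the check will not catch every unencoded bracket.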

Because I had nothing better to do, I tried to optimize this method. It is executed tens of thousands of times each day on my computer, so I thought a small improvement could be beneficial. I changed the logic to simplify some of the branches. Here's the optimized method.
Optimized IsValid method [C#]
public static bool IsValidFast(string html)
{
    // False = SmallerThan [<]
    // True = GreaterThan [>]
    bool expected = false; // Must start with < [SmallerThan]
    for (int i = 0; i < html.Length; i++) // Loop over every character
    {
        char letter = html[i];
        // Common case: not a bracket.
        if (letter != '>' &&
            letter != '<')
        {
            continue;
        }
        // False = SmallerThan [<]
        // True = GreaterThan [>]
        bool found = letter == '>';
        // If we found what we expected, expect the opposite next.
        if (found == expected)
        {
            expected = !expected;
        }
        else
        {
            // Disallow.
            return false;
        }
    }
    // Return true if expected is false [we expect < SmallerThan]
    return !expected;
}
Performance results
IsValid: 353.33 ns
IsValidFast: 207.63 ns
Notes on benchmark. To do the benchmark, I used the standard benchmark code and tested the five calls to IsValid in a tight loop. The five calls are shown in the top example code. In this test, the IsValidFast version took about 41% less time than IsValid.
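The benchmark code itself is not shown here, so the following is only a sketch of how such a timing could be done, assuming a Stopwatch-based loop. The Bench class, the Sink field, and the iteration count are names and values invented for this illustration. It reports the mean time per pass over the five inputs, which may not match the units of the figures above.

```csharp
using System;
using System.Diagnostics;

static class HtmlUtil
{
    enum TagType { SmallerThan, GreaterThan }

    // The original method from this article.
    public static bool IsValid(string html)
    {
        TagType expected = TagType.SmallerThan;
        for (int i = 0; i < html.Length; i++)
        {
            bool smallerThan = html[i] == '<';
            bool greaterThan = html[i] == '>';
            if (!smallerThan && !greaterThan) continue;
            if (smallerThan && expected == TagType.SmallerThan)
            {
                expected = TagType.GreaterThan;
                continue;
            }
            if (greaterThan && expected == TagType.GreaterThan)
            {
                expected = TagType.SmallerThan;
                continue;
            }
            return false;
        }
        return expected == TagType.SmallerThan;
    }

    // The optimized method from this article, in condensed form.
    public static bool IsValidFast(string html)
    {
        bool expected = false; // false = expect <, true = expect >
        for (int i = 0; i < html.Length; i++)
        {
            char letter = html[i];
            if (letter != '>' && letter != '<') continue;
            if ((letter == '>') != expected) return false;
            expected = !expected;
        }
        return !expected;
    }
}

static class Bench
{
    // The five inputs from the first example program.
    public static readonly string[] Inputs =
    {
        "<html><head></head></html>",
        "<html<head<head<html",
        "<a href=y>x</a>",
        "<<>>",
        ""
    };

    public static int Sink; // Receives the result count so the calls have an observable effect

    // Returns the mean number of nanoseconds for one pass over all inputs.
    public static double NanosecondsPerPass(Func<string, bool> check, int iterations)
    {
        int trueCount = 0;
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            foreach (string s in Inputs)
            {
                if (check(s)) trueCount++;
            }
        }
        sw.Stop();
        Sink = trueCount;
        return sw.Elapsed.TotalMilliseconds * 1000000.0 / iterations;
    }
}

class Program
{
    static void Main()
    {
        const int iterations = 1000000; // Assumed value; adjust for your machine
        // Short warm-up runs so that JIT compilation is not part of the timing.
        Bench.NanosecondsPerPass(HtmlUtil.IsValid, 1000);
        Bench.NanosecondsPerPass(HtmlUtil.IsValidFast, 1000);

        Console.WriteLine("IsValid:     {0:0.00} ns", Bench.NanosecondsPerPass(HtmlUtil.IsValid, iterations));
        Console.WriteLine("IsValidFast: {0:0.00} ns", Bench.NanosecondsPerPass(HtmlUtil.IsValidFast, iterations));
    }
}
```

Both methods are called through the same delegate type, so the call overhead is the same for each. Absolute numbers will vary by machine and runtime; run it in a Release build without a debugger attached.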
In my experience, a really sophisticated HTML validator is hard to build, hard to use, and not really useful in many cases. To ensure that no obvious errors are present, a simple method like this one is often more practical. Also, because of how HTML works, unescaped angle brackets are potentially very harmful to your website layout: they can cause missing page elements or indexing errors.
It is possible to develop a simple looping method that catches some structural errors in HTML based on the arrangement of the angle brackets. This sort of algorithm will not prove a document's correctness, but it can help ensure a higher standard of markup quality when used throughout a web site's lifetime. Fewer errors in your HTML pages may lead to better results overall on your web site.
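One way to apply the check across a whole site is to run it over every generated page. The following is a rough sketch under stated assumptions: the SiteCheck class, the FindSuspectFiles method, and the "*.html" file pattern are all invented for this illustration, and the bracket check is repeated in compact form so the program stands alone.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class SiteCheck
{
    // Compact form of the bracket-alternation check.
    public static bool IsValid(string html)
    {
        bool expectClose = false; // false: next bracket must be <
        foreach (char c in html)
        {
            if (c != '<' && c != '>') continue;
            if ((c == '>') != expectClose) return false;
            expectClose = !expectClose;
        }
        return !expectClose;
    }

    // Returns the paths of all .html files under root that fail the check.
    public static List<string> FindSuspectFiles(string root)
    {
        var suspects = new List<string>();
        foreach (string path in Directory.EnumerateFiles(root, "*.html", SearchOption.AllDirectories))
        {
            if (!IsValid(File.ReadAllText(path)))
            {
                suspects.Add(path);
            }
        }
        return suspects;
    }
}

class Program
{
    static int Main(string[] args)
    {
        // Hypothetical usage: pass the site's output folder as the first argument.
        string root = args.Length > 0 ? args[0] : ".";
        List<string> suspects = SiteCheck.FindSuspectFiles(root);
        foreach (string path in suspects)
        {
            Console.WriteLine("Suspect brackets: " + path);
        }
        return suspects.Count == 0 ? 0 : 1; // Nonzero exit code if any file failed
    }
}
```

Pages that contain inline scripts or comments with raw < or > characters will be flagged too, so treat the output as a list of files to inspect rather than a list of confirmed errors.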