C# Validate XHTML

A website has invalid XHTML markup on many pages. We need to find markup errors in hundreds of XHTML fragments. The aim is not to fully validate the documents, but to search for likely errors and use the results as quality control on the site.

HTML: The HTML may be simple and used for text and informational content. It includes tables, bold tags, italic tags, links, images, lists, and line breaks.

Parser

The parser is stack-based and enforces that the "markup tree" in the HTML is valid. It therefore finds every tag that is not closed. It also finds extra closing tags, and stray characters that must be HTML-encoded to form a valid document. It enforces several properties.

Quoted attributes: The parser understands double-quoted attributes, and will flag errors if an attribute isn't quoted. Its support isn't elaborate or precise, but these errors will be noted.

Invalid characters: Valid HTML does not allow the ampersand (&) in the source code unless it is part of an entity. This is a common problem with lists of people's names or with programming sites.

Lowercase markup: The parser demands that all markup be lowercased. Lowercase markup compresses better, and good style usually requires lowercase tags.

Solo tags: In XHTML, some tags (such as img and br) must be closed in the same tag in which they are declared. The parser I developed enforces this. It will flag errors where these tags are not closed.

Nested tags: HTML markup must be nested correctly. Child elements must be closed before their parent tags. My parser doesn't deal with complicated inline elements, such as when an "a" element overlaps with "i".
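
Example: For the invalid-characters rule, one way to repair a flagged fragment is to encode the text before it is placed in the markup. The following is a minimal sketch, not part of the validator. It assumes the text is plain content (no tags) and uses System.Net.WebUtility.HtmlEncode from the .NET base class library. The class name AmpersandExample is made up for illustration.

```csharp
using System;
using System.Net;

static class AmpersandExample
{
    // Encodes plain text so that characters such as & become entities.
    // Only suitable for text content: passing markup would encode the tags too.
    public static string Encode(string text)
    {
        return WebUtility.HtmlEncode(text);
    }

    static void Main()
    {
        Console.WriteLine(Encode("Sally & Me")); // Sally &amp; Me
    }
}
```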

Implementation

The parser uses the generic Stack collection, and also uses look-ahead to read in tags. When it encounters the start of a tag, it "looks ahead" to read the whole tag. A start tag is then pushed onto the stack, and a closing tag must match the tag on top of the stack. The parser also has rules for quotes, spaces, and other significant characters in HTML.
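
Example: The stack idea can be shown in isolation. The sketch below is a simplified illustration, not the article's parser. It assumes the tags have already been split into tokens, where a leading slash marks a closing tag. The names StackSketch and Match are hypothetical. The messages reuse the "closing", "!match" and "extra" prefixes from the CheckHtml method.

```csharp
using System;
using System.Collections.Generic;

static class StackSketch
{
    // Each token is a tag name; a leading '/' marks a closing tag.
    // Returns an empty string when every tag is matched, else a short message.
    public static string Match(IEnumerable<string> tokens)
    {
        var open = new Stack<string>();
        foreach (string token in tokens)
        {
            if (!token.StartsWith("/", StringComparison.Ordinal))
            {
                open.Push(token); // start tag: remember it
                continue;
            }

            string name = token.Substring(1);
            if (open.Count == 0) return "closing: " + name;     // nothing to close
            if (open.Peek() != name) return "!match: " + name;  // wrong nesting
            open.Pop();
        }

        // Anything still on the stack was never closed.
        return open.Count == 0 ? "" : "extra: " + string.Join(" ", open);
    }

    static void Main()
    {
        Console.WriteLine(Match(new[] { "p", "b", "/b", "/p" })); // prints an empty line
        Console.WriteLine(Match(new[] { "a", "b", "/a", "/b" })); // !match: a
        Console.WriteLine(Match(new[] { "p", "i", "/i" }));       // extra: p
    }
}
```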

Validator.CheckHtml has two parameters. The first is the HTML itself, and the second is an out parameter that returns a string containing errors. This method iterates over each character, and also manages the tag stack.

Also: It calls the look-ahead method when it finds a tag start (<).

CheckHtml method implementation: C#

/// <summary>
/// Checks whether the HTML is likely valid. The error parameter will be
/// empty if no errors were found.
/// </summary>
static public void CheckHtml(string html, out string error)
{
    //
    // Store our tags in a stack
    //
    Stack<string> tags = new Stack<string>();

    //
    // Initialize out parameter to empty
    //
    error = string.Empty;

    //
    // Counts of left and right parentheses
    //
    int parenthesisR = 0;
    int parenthesisL = 0;

    //
    // Traverse entire HTML
    //
    for (int i = 0; i < html.Length; i++)
    {
	char c = html[i];
	if (c == '<')
	{
	    bool isClose;
	    bool isSolo;

	    //
	    // Look ahead at this tag
	    //
	    string tag = LookAhead(html, i, out isClose, out isSolo);

	    //
	    // Make sure tag is lowercase
	    //
	    if (tag.ToLower() != tag)
	    {
		error = "upper: " + tag;
		return;
	    }

	    //
	    // Make sure solo tags are parsed as solo tags
	    //
	    if (_soloTags.ContainsKey(tag))
	    {
		if (!isSolo)
		{
		    error = "!solo: " + tag;
		    return;
		}
	    }
	    else
	    {
		//
		// We are on a regular end or start tag
		//
		if (isClose)
		{
		    //
		    // We can't close a tag that isn't on the stack
		    //
		    if (tags.Count == 0)
		    {
			error = "closing: " + tag;
			return;
		    }

		    //
		    // Tag on stack must be equal to this closing tag
		    //
		    if (tags.Peek() == tag)
		    {
			//
			// Remove the start tag from the stack
			//
			tags.Pop();
		    }
		    else
		    {
			//
			// Mismatched closing tag
			//
			error = "!match: " + tag;
			return;
		    }
		}
		else
		{
		    //
		    // Add tag to stack
		    //
		    tags.Push(tag);
		}
	    }
	    i += tag.Length;
	}
	else if (c == '&')
	{
	    //
	    // & must never be followed by space or other &
	    //
	    if ((i + 1) < html.Length)
	    {
		char next = html[i + 1];

		if (char.IsWhiteSpace(next) ||
		    next == '&')
		{
		    error = "ampersand";
		    return;
		}
	    }
	}
	else if (c == '\t')
	{
	    error = "tab";
	    return;
	}
	else if (c == '(')
	{
	    parenthesisL++;
	}
	else if (c == ')')
	{
	    parenthesisR++;
	}
    }

    //
    // If we have tags in the stack, write them to error
    //
    foreach (string tagName in tags)
    {
	error += "extra:" + tagName + " ";
    }

    //
    // Require equal numbers of ( and ); append so open-tag errors are kept
    //
    if (parenthesisL != parenthesisR)
    {
	error += "!even ";
    }
}

The method shown above calls the following method, LookAhead, which reads a single tag. It detects "solo" (self-closing) tags by recording the position of the last slash found outside quotes, and it tracks double quotes so that slashes inside attribute values are ignored. After the method there is a static Dictionary that is used to tell if a tag must be a solo tag.

LookAhead method implementation: C#

/// <summary>
/// Called at the start of an html tag. We look forward and record information
/// about our tag. Handles start tags, close tags, and solo tags. 'Collects'
/// an entire tag.
/// </summary>
/// <returns>Tag name.</returns>
static private string LookAhead(string html, int start, out bool isClose,
    out bool isSolo)
{
    isClose = false;
    isSolo = false;

    StringBuilder tagName = new StringBuilder();

    //
    // Stores the position of the final slash
    //
    int slashPos = -1;

    //
    // Whether we have encountered a space
    //
    bool space = false;

    //
    // Whether we are in a quote
    //
    bool quote = false;

    //
    // Begin scanning the tag
    //
    int i;
    for (i = 0; ; i++)
    {
	//
	// Get the position in main html
	//
	int pos = start + i;

	//
	// Don't go outside the html
	//
	if (pos >= html.Length)
	{
	    return "x";
	}

	//
	// The character we are looking at
	//
	char c = html[pos];

	//
	// See if a space has been encountered
	//
	if (char.IsWhiteSpace(c))
	{
	    space = true;
	}

	//
	// Add to our tag name if none of these are present
	//
	if (space == false &&
	    c != '<' &&
	    c != '>' &&
	    c != '/')
	{
	    tagName.Append(c);
	}

	//
	// Record position of slash if not inside a quoted area
	//
	if (c == '/' &&
	    quote == false)
	{
	    slashPos = i;
	}

	//
	// End at the > bracket
	//
	if (c == '>')
	{
	    break;
	}

	//
	// Record whether we are in a quoted area
	//
	if (c == '\"')
	{
	    quote = !quote;
	}
    }

    //
    // Determine if this is a solo or closing tag
    //
    if (slashPos != -1)
    {
	//
	// If the slash is directly before the >, this is a solo tag
	//
	if (slashPos + 1 == i)
	{
	    isSolo = true;
	}
	else
	{
	    isClose = true;
	}
    }

    //
    // Return the name of the tag collected
    //
    string name = tagName.ToString();
    if (name.Length == 0)
    {
	return "empty";
    }
    else
    {
	return name;
    }
}

/// <summary>
/// Tags that must be self-closed in the start tag
/// </summary>
static Dictionary<string, bool> _soloTags = new Dictionary<string, bool>()
{
    {"img", true},
    {"br", true}
};
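
Example: To check hundreds of fragments, the method can be called in a loop. The following is a rough sketch of such a driver. The names BatchCheck, HtmlChecker and FindErrors are hypothetical, and the delegate is only assumed to have the same shape as CheckHtml. A stand-in checker that looks for tabs is included so that the sketch runs on its own. In practice you would pass Validator.CheckHtml, assuming the methods above are placed in a class named Validator and made accessible.

```csharp
using System;
using System.Collections.Generic;

static class BatchCheck
{
    // Same shape as the CheckHtml method: HTML in, error text out.
    public delegate void HtmlChecker(string html, out string error);

    // Runs the checker over every fragment.
    // Returns fragment name -> error text, for failing fragments only.
    public static Dictionary<string, string> FindErrors(
        IDictionary<string, string> fragments, HtmlChecker check)
    {
        var failures = new Dictionary<string, string>();
        foreach (KeyValuePair<string, string> pair in fragments)
        {
            string error;
            check(pair.Value, out error);
            if (error.Length > 0)
            {
                failures[pair.Key] = error;
            }
        }
        return failures;
    }

    // Stand-in checker (an assumption for this sketch): flags tabs only.
    public static void TabOnlyChecker(string html, out string error)
    {
        error = html.IndexOf('\t') >= 0 ? "tab" : string.Empty;
    }

    static void Main()
    {
        var fragments = new Dictionary<string, string>
        {
            { "about", "<p>Hello</p>" },
            { "contact", "<p>\tIndented</p>" }
        };

        foreach (KeyValuePair<string, string> failure in FindErrors(fragments, TabOnlyChecker))
        {
            Console.WriteLine(failure.Key + " -> " + failure.Value); // contact -> tab
        }
    }
}
```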

Error reports

The legend here shows how the program reports errors. In most cases the parser reports only the first error it encounters, but for tags left open at the end of the document it lists every one. Its characteristics are described in the following "manual" page from the program.

HTML markup must use double quotes.
    <a href="..."></a>

Tabs not allowed

XHTML style br and img (solo tags)
    <br />

Tags must all be lowercase

Markup must be properly nested
Program checks that each opened tag is closed
Hierarchy is enforced
    <a><b></a></b> // ERROR
    <a><i></a> // ERROR

Ampersand (&) must not be alone
May find some non-encoded ampersands (not all invalid ones)
    Sally & Me // ERROR

Must have equal numbers of parentheses ( and )
Could conflict with some valid text (some code, numbers in text)

Error legend:
    upper: tag isn't lowercase
    !solo: img or br tag isn't self-closed (solo)
    closing: extra closing tag
    ampersand: unencoded & (should be &amp;)
    tab: tab character found
    !match: invalid nesting (closing tag doesn't match the open tag)
    !even: unequal numbers of parentheses ( and )
    extra: tags left open by end of document

Parentheses

The validator can also match parentheses in the content, just as it matches tags. This feature is not always useful. But when I forced my text to have matching "(" and ")", I found about ten errors where I hadn't closed a parenthesis.

Note: This will not be universally helpful.
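
Example: The check in CheckHtml only compares the total counts of "(" and ")", so text such as ")(" passes. A slightly stricter variant, sketched below, tracks nesting depth and also rejects a ")" that appears before its "(". This is an alternative for illustration, not the method used in the article, and the names ParenCheck and IsBalanced are hypothetical.

```csharp
using System;

static class ParenCheck
{
    // Returns true when every ')' closes an earlier '(' and none stay open.
    public static bool IsBalanced(string text)
    {
        int depth = 0;
        foreach (char c in text)
        {
            if (c == '(')
            {
                depth++;
            }
            else if (c == ')')
            {
                depth--;
                if (depth < 0)
                {
                    return false; // closing parenthesis with nothing open
                }
            }
        }
        return depth == 0;
    }

    static void Main()
    {
        Console.WriteLine(IsBalanced("(a) and (b)")); // True
        Console.WriteLine(IsBalanced(") oops ("));    // False
        Console.WriteLine(IsBalanced("(unclosed"));   // False
    }
}
```

Like the original check, this can conflict with valid text such as numbered items written as "1)".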

Results

What did this code do for my ASP.NET site? I consider my HTML to be fairly tight and well written. Even so, I fixed problems in over 100 documents, each with at least one error. At least 10 of those errors caused documents to display incorrectly.

And: I found one spot where I had forgotten to delete half a sentence, which made no sense when read.

Summary

Validating your documents may not matter from every angle, but it can quickly help you pinpoint problem areas, and methods like the ones here impose no performance penalty on your site. The approach is useful for quality analysis, and can quickly help you find errors in hundreds of documents.

