Scraping HTML

Scraping HTML lets us extract important page elements. In the C# language, the WebClient type downloads a page and the Regex type pulls the parts we want out of its markup.

We cannot easily parse HTML with regular expressions, but we can extract links and other parts of strings with them fairly well.

Example code

We will scrape HTML links from Wikipedia. Wikipedia's text is licensed under Creative Commons Attribution-ShareAlike (not the GPL), and this small demonstration is intended as fair use.

Here we see code that downloads the English Wikipedia main page. It creates a WebClient and downloads the content at the specified URL as a string.

Info The foreach-loop iterates over each link and its text. It relies on the LinkItem and LinkFinder types, which are defined in the second example below on this page.

using System.Diagnostics;
using System.Net;

class Program
{
    static void Main()
    {
        // URL: http://en.wikipedia.org/wiki/Main_Page
        // Dispose the WebClient when the download is finished.
        string s;
        using (WebClient w = new WebClient())
        {
            s = w.DownloadString("http://en.wikipedia.org/wiki/Main_Page");
        }

        // Write each link and its text to the debug output.
        foreach (LinkItem i in LinkFinder.Find(s))
        {
            Debug.WriteLine(i);
        }
    }
}
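
On newer .NET versions WebClient is marked obsolete, and HttpClient is the suggested replacement. The sketch below shows one way the download step could look with HttpClient. The PageDownloader class, its method names, and the User-Agent value are assumptions made for illustration; they are not part of the original program.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

static class PageDownloader
{
    // Hypothetical helper: builds a client that identifies itself to the server.
    public static HttpClient CreateClient()
    {
        HttpClient client = new HttpClient();
        // Placeholder User-Agent value; replace it with one that describes your tool.
        client.DefaultRequestHeaders.UserAgent.ParseAdd("LinkScraperDemo/1.0");
        client.Timeout = TimeSpan.FromSeconds(30);
        return client;
    }

    // Downloads the page body as a string (performs a network request).
    public static async Task<string> DownloadAsync(HttpClient client, string url)
    {
        return await client.GetStringAsync(url);
    }
}
```

The string returned by DownloadAsync can be passed to LinkFinder.Find in the same way as the result of DownloadString.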

Example 2

This class receives the HTML string and extracts all the links and their text into structs. It is fairly fast, but I offer some optimization tips in the Performance section.

Step 1 This example first finds all hyperlink tags. We store all the complete A tags in a MatchCollection.

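Step 2 The code loops over each matched hyperlink tag and examines the full text of that A tag.

Step 3 The code reads the href attribute, which points to another web resource. This part is not failsafe (it only recognizes double-quoted values), but it almost always works.

Step 4 The remaining tags are stripped from the match, which leaves the anchor text.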

Return The method returns the List of LinkItem objects. This list can then be used in the foreach-loop from the first C# example.

Detail To match multiline links, we require RegexOptions.Singleline. This option makes the dot match newline characters, so a link that spans several lines is still found.
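
To see what RegexOptions.Singleline changes, the sketch below runs one anchor pattern with and without the option on a link whose text spans two lines. The SinglelineDemo class and the sample HTML are made up for illustration.

```csharp
using System.Text.RegularExpressions;

static class SinglelineDemo
{
    // Sample input (hypothetical): an anchor whose text contains a line break.
    public const string Html = "<a href=\"/wiki/Example\">Example\nlink</a>";

    // A lazy anchor pattern like the one LinkFinder uses.
    const string Pattern = @"(<a.*?>.*?</a>)";

    // Counts the anchors found in the sample input with the given options.
    public static int CountLinks(RegexOptions options)
    {
        return Regex.Matches(Html, Pattern, options).Count;
    }
}
```

CountLinks(RegexOptions.None) returns 0 because the dot stops at the newline, while CountLinks(RegexOptions.Singleline) returns 1.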

using System.Collections.Generic;
using System.Text.RegularExpressions;

public struct LinkItem
{
    public string Href;
    public string Text;

    public override string ToString()
    {
        return Href + "\n\t" + Text;
    }
}

static class LinkFinder
{
    public static List<LinkItem> Find(string file)
    {
        List<LinkItem> list = new List<LinkItem>();

        // Find all A tags in file; \b keeps tags such as <abbr> from matching.
        MatchCollection m1 = Regex.Matches(file, @"(<a\b.*?>.*?</a>)", RegexOptions.Singleline);

        // Loop over each match.
        foreach (Match m in m1)
        {
            string value = m.Groups[1].Value;
            LinkItem i = new LinkItem();

            // Get href attribute (double-quoted values only).
            Match m2 = Regex.Match(value, @"href=""(.*?)""", RegexOptions.Singleline);
            if (m2.Success)
            {
                i.Href = m2.Groups[1].Value;
            }

            // Remove inner tags from text.
            string t = Regex.Replace(value, @"\s*<.*?>\s*", "", RegexOptions.Singleline);
            i.Text = t;

            list.Add(i);
        }
        return list;
    }
}
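
The href pattern above only recognizes double-quoted attribute values. As a sketch of how that step might be made a little more tolerant, the hypothetical HrefReader helper below accepts either quote style and ignores letter case. It is an illustration under those assumptions and is still not a substitute for a real HTML parser.

```csharp
using System.Text.RegularExpressions;

static class HrefReader
{
    // Matches href="..." or href='...'; the backreference \1 requires
    // the closing quote to be the same character as the opening one.
    const string Pattern = @"href\s*=\s*([""'])(.*?)\1";

    // Returns the href value of an anchor tag, or null when none is found.
    public static string GetHref(string anchorTag)
    {
        Match m = Regex.Match(anchorTag, Pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[2].Value : null;
    }
}
```

Inside LinkFinder.Find, a call such as HrefReader.GetHref(value) could stand in for the m2 match, with a null check in place of m2.Success.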

Program output

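Test the program on your own website. It writes each match with Debug.WriteLine, so the results appear in the debugger's output window rather than the console. Here we see part of the results for the Wikipedia home page at the time of writing.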

Info The original HTML (shown after the output) reveals where the links were extracted from. Each link is contained in an LI tag.

Note The program successfully extracted both the anchor text and the HREF value.

#column-one
    navigation
#searchInput
    search
/wiki/Wikipedia
    Wikipedia
/wiki/Free_content
    free
/wiki/Encyclopedia
    encyclopedia
/wiki/Wikipedia:Introduction
    anyone can edit
/wiki/Special:Statistics
    2,617,101
/wiki/English_language
    English
/wiki/Portal:Arts
    Arts
/wiki/Portal:Biography
    Biography
/wiki/Portal:Geography
    Geography
/wiki/Portal:History
    History
/wiki/Portal:Mathematics
    Mathematics
/wiki/Portal:Science
    Science
/wiki/Portal:Society
    Society
/wiki/Portal:Technology_and_applied_sciences
    Technology

<ul>
<li><a href="/wiki/Portal:Arts" title="Portal:Arts">Arts</a></li>
<li><a href="/wiki/Portal:Biography" title="Portal:Biography">Biography</a></li>
<li><a href="/wiki/Portal:Geography" title="Portal:Geography">Geography</a></li>
</ul>

Performance

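You can improve the performance of the regular expressions by specifying RegexOptions.Compiled, which trades a slower startup for faster matching. You can also create instance Regex objects once and reuse them, instead of calling the static methods each time.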
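
One way to apply both tips is to keep the patterns in static readonly Regex fields constructed with RegexOptions.Compiled, so each pattern is built once and reused for every page. The CompiledLinkPatterns class below is an illustrative sketch with assumed names. Whether it is faster in practice depends on how many pages are processed, so measure before adopting it.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CompiledLinkPatterns
{
    // Built once when the class is first used, then reused for every call.
    static readonly Regex Anchor = new Regex(
        @"<a.*?>.*?</a>", RegexOptions.Singleline | RegexOptions.Compiled);
    static readonly Regex Href = new Regex(
        @"href=""(.*?)""", RegexOptions.Singleline | RegexOptions.Compiled);

    // Returns the double-quoted href value of every anchor in the HTML.
    public static List<string> FindHrefs(string html)
    {
        List<string> hrefs = new List<string>();
        foreach (Match anchor in Anchor.Matches(html))
        {
            Match href = Href.Match(anchor.Value);
            if (href.Success)
            {
                hrefs.Add(href.Groups[1].Value);
            }
        }
        return hrefs;
    }
}
```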

We scraped HTML content from the Internet. Using three regular expressions, you can extract HTML links into objects with a fair degree of accuracy.