Python HTML: HTMLParser, Read Markup
Use the html.parser module. Import HTMLParser and implement a simple class.
HTML. In HTML, we find tags, attributes and data. We could write custom methods to parse these. But in Python we can instead use the HTMLParser class from the html.parser module. We derive a class from HTMLParser and override its handler methods to act on what the parser finds.
Example. This example implements a class that derives from HTMLParser. It uses the inheritance syntax. This TagParser class is not fully effective on all HTML documents. It works on simple, non-nested tags, like title tags. With nested elements, it only remembers the most recent start tag, so text that follows an inner element is labeled with the wrong tag.
Methods. In the class, we specify 2 methods: handle_starttag and handle_data. Other methods, such as handle_endtag and handle_comment, can be specified too.
Here: In handle_starttag, we set a field "tag" to the name of the current start tag.
And: When we encounter data, in handle_data, we use the most recent start tag name to label that data.
Caution: This approach is not ideal, but if you are just searching for simple tags, like title or h1 elements, it works.
Feed. We call the feed method on the TagParser instance. With feed, we pass string data to the parser. It scans the characters in the string and calls the handler methods we defined whenever it finds the matching elements.
Tip: You can place any Python statements inside the handler methods of your class that derives from HTMLParser.
And: This makes it possible to develop a custom HTML parser. It removes the need to handle tedious HTML syntax in custom code.
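The feed method can also be called more than once. Incomplete markup is buffered until more data arrives or close() is called. Here is a minimal sketch of that behavior. The ChunkParser class and its events list are hypothetical names, used only for illustration.

```python
from html.parser import HTMLParser

class ChunkParser(HTMLParser):
    # Hypothetical class name, used only for this sketch.
    def __init__(self):
        super().__init__()
        # Records each callback in the order it fires.
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

chunk_parser = ChunkParser()
# The first chunk ends in the middle of the closing h1 tag.
chunk_parser.feed("<h1>Python</h")
# The second chunk completes that tag and adds a paragraph.
chunk_parser.feed("1><p>Is cool.</p>")
# Process anything still held in the buffer.
chunk_parser.close()

for event in chunk_parser.events:
    print(event)
```

Both start tags are reported even though the markup was split inside a tag. This can help when the HTML is read in pieces, such as from a file or a network response.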
Python program that uses html.parser
from html.parser import HTMLParser

# A class that inherits from HTMLParser.
# ... It implements two methods.
class TagParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Set "tag" field to the name of the opened tag.
        self.tag = tag

    def handle_data(self, data):
        # Print data within currently-open tag.
        print(self.tag + ":", data)

parser = TagParser()
parser.feed("<h1>Python</h1>" + "<p>Is cool.</p>")

h1: Python
p: Is cool.
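Nested elements. One possible way to handle nested elements is to keep a stack of open tags and pop it in handle_endtag. This is only a sketch, not a complete solution. The NestedTagParser class and its open_tags and found fields are hypothetical names introduced here.

```python
from html.parser import HTMLParser

class NestedTagParser(HTMLParser):
    # Hypothetical variant of TagParser that tracks every open tag.
    def __init__(self):
        super().__init__()
        self.open_tags = []  # Stack of currently open tag names.
        self.found = []      # (innermost open tag, text) pairs.

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching start tag; ignore stray end tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Skip text outside any tag and whitespace-only text.
        if self.open_tags and data.strip():
            self.found.append((self.open_tags[-1], data))

nested = NestedTagParser()
nested.feed("<p>Python <b>is</b> cool.</p>")
for tag, text in nested.found:
    print(tag + ":", repr(text))
```

Here the text " cool." is labeled with p, not b, because the b tag was popped when it closed. Void elements like br have no end tag, so this sketch would still mislabel the text that follows them.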
Methods. There are many methods on HTMLParser that you can override. Attributes are received as attrs in the handle_starttag method: this is a list of tuples, and each tuple is a (name, value) pair. More detailed examples for attributes (and comments) are available in the html.parser documentation on the Python site.
Tip: You can loop over the attributes (attrs) list like any other list. The for-loop is ideal.
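Attributes example. The loop over attrs could look like the sketch below. The AttrParser class, its seen list, and the example.com URL are placeholders chosen for illustration.

```python
from html.parser import HTMLParser

class AttrParser(HTMLParser):
    # Hypothetical class name, used only for this sketch.
    def __init__(self):
        super().__init__()
        self.seen = []  # (tag, attribute name, attribute value) triples.

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        for name, value in attrs:
            self.seen.append((tag, name, value))
            print(tag, name, "=", value)

attr_parser = AttrParser()
# The URL is a placeholder.
attr_parser.feed('<a href="https://example.com" class="link">Site</a>')
```

Each tuple is unpacked into a name and a value in the for-loop. The quotes around the values are removed by the parser.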
Summary. HTML markup is far from trivial to parse. Because HTML is so common, many edge cases have emerged, and few parsers can handle them all. Using a prebuilt class, like HTMLParser, makes building a special parser in Python easier.
© 2007-2021 sam allen.