Python Remove HTML Tags
Remove HTML tags from strings. HTML comments are removed separately.
dot net perls
Remove HTML tags. HTML is used extensively on the Internet. But HTML tags themselves are sometimes not helpful when processing text.
Python method. We can remove HTML tags, and HTML comments, with Python and the re.sub method. The code does not handle every possible case, so use it with caution.
Example code. This program imports the re module for regular expression use. This code is not versatile or robust, but it does work on simple inputs.
Part 1 Our string has some HTML tags, including nested tags. Closing tags are also included.
Part 2 We call re.sub with a special pattern as the first argument. Matches are replaced with an empty string (removed).
Tip In the pattern, the question mark is important. It means to match as few characters as possible.
So With the question mark, the entire string is not treated as one huge HTML tag.
Python program that removes HTML with re.sub
import re

# Part 1: this string contains HTML.
v = "<p id=1>Sometimes, <b>simpler</b> is better, but <i>not</i> always.</p>"

# Part 2: replace HTML tags with an empty string.
result = re.sub("<.*?>", "", v)
print(result)
Sometimes, simpler is better, but not always.
<      Less-than sign (matches HTML bracket).
.*?    Match zero or more chars. Match as few as possible.
>      Greater-than (matches HTML bracket).
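Greedy versus lazy. To see why the question mark matters, here is a minimal sketch that runs both patterns on the same string. The names lazy and greedy are only for illustration.

```python
import re

v = "<p id=1>Sometimes, <b>simpler</b> is better, but <i>not</i> always.</p>"

# Lazy pattern: each tag is matched on its own and removed.
lazy = re.sub(r"<.*?>", "", v)

# Greedy pattern: one match runs from the first "<" to the last ">".
# This string starts and ends with a tag, so everything is removed.
greedy = re.sub(r"<.*>", "", v)

print(repr(lazy))    # 'Sometimes, simpler is better, but not always.'
print(repr(greedy))  # ''
```

Without the question mark, the whole string is treated as one tag and nothing is left.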
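Comments. HTML pages often contain comments. These can contain almost any text, including HTML tags and text that looks like another comment. This code removes comments, but it does not handle all possible cases.
Note This code is expected to fail when a comment contains what looks like another comment, because the match stops at the first closing marker. And because the dot does not match newlines by default, comments that span several lines are left in place.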
But On simple pages, this code can be used to strip out HTML comments, reducing page size and increasing rendering performance.
Python program that removes HTML comments
import re

# This HTML string contains two comments.
v = """<p>Welcome to my <!-- awesome --> website<!-- bro --></p>"""

# Remove HTML comments.
result = re.sub("<!--.*?-->", "", v)
print(v)
print(result)
<p>Welcome to my <!-- awesome --> website<!-- bro --></p>
<p>Welcome to my  website</p>
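Multiple lines. The comment pattern can miss some inputs. This sketch shows two of them, with made-up test strings. The re.DOTALL flag is a standard option that lets the dot match newlines. It helps with the first case but not the second.

```python
import re

pattern = r"<!--.*?-->"

# Case 1: a comment that spans two lines.
multi = """<p>Welcome<!-- first line,
second line --> home</p>"""

# By default the dot does not match a newline, so nothing is removed.
default_result = re.sub(pattern, "", multi)

# With re.DOTALL the dot also matches newlines.
dotall_result = re.sub(pattern, "", multi, flags=re.DOTALL)

# Case 2: a comment that appears to contain another comment.
# The lazy match stops at the first "-->", so text is left behind.
nested = "x<!-- a <!-- b --> c -->y"
nested_result = re.sub(pattern, "", nested)

print(default_result == multi)  # True
print(dotall_result)            # <p>Welcome home</p>
print(nested_result)            # x c -->y
```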
A discussion. For web browsers, advanced parsers with error correction are used. This makes them more compatible on real web pages, but implementing that logic is challenging.
Instead These simple methods can be used to process pages that contain no errors or unexpected markup.
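Parser. When the markup is less predictable, a parser may be a safer choice than a pattern. This is a minimal sketch that uses the html.parser module from the standard library instead of re.sub. It is a different technique from the one in this article. The names TextExtractor and strip_tags are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and ignores tags and comments."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags. Comments go to handle_comment,
        # which does nothing by default, so they are dropped.
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


simple = "<p id=1>Sometimes, <b>simpler</b> is better.<!-- note --></p>"
print(strip_tags(simple))  # Sometimes, simpler is better.

# An attribute value that contains ">" confuses the simple pattern.
tricky = '<a title="a > b">link</a>'
print(repr(re.sub(r"<.*?>", "", tricky)))  # ' b">link'
print(repr(strip_tags(tricky)))            # 'link'
```

The parser reads quoted attribute values as a unit, so a bracket inside a quote does not end the tag early.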
A summary. With the re.sub method, we remove certain parts of strings. The regular expression argument can be used to match HTML tags, or HTML comments, in a fairly accurate way.
© 2007-2021 sam allen. see site info on the changelog