Remove HTML Tags. HTML is used extensively on the Internet, but the tags themselves are often unhelpful when processing text. We can remove HTML tags, and HTML comments, with Python and the re.sub method.

Example. This program imports the re module for regular expression use. The string "v" has some HTML tags, including nested tags. We call re.sub with a special pattern as the first argument. Matches are replaced with an empty string (removed).

Python program that removes HTML with re.sub

    import re

    # This string contains HTML.
    v = """<p id=1>Sometimes, <b>simpler</b> is better, but <i>not</i> always.</p>"""

    # Replace HTML tags with an empty string.
    result = re.sub("<.*?>", "", v)
    print(result)

Output

    Sometimes, simpler is better, but not always.

Pattern details

    <      Less-than sign (matches HTML bracket).
    .*?    Match zero or more chars.
           Match as few as possible.
    >      Greater-than sign (matches HTML bracket).

In the pattern, the question mark is important. It means to match as few characters as possible.

So: With the question mark, the entire string is not treated as one huge HTML tag.

Comments. This is a bonus. HTML pages often contain comments. These can contain any text, including HTML tags and text that looks like other comments. This program removes comments, but it does not handle all possible cases.

Python program that removes HTML comments

    import re

    # This HTML string contains two comments.
    v = """<p>Welcome to my <!-- awesome --> website<!-- bro --></p>"""

    # Remove HTML comments.
    result = re.sub("<!--.*?-->", "", v)

    print(v)
    print(result)

Output

    <p>Welcome to my <!-- awesome --> website<!-- bro --></p>
    <p>Welcome to my  website</p>

The spaces on both sides of the first comment remain, so the result has two spaces between "my" and "website". This code is expected to mess up when comment markers are nested: the match stops at the first closing marker and leaves the rest behind. It also misses comments that span several lines, because the dot does not match a newline by default.

But: On simple pages, this code can be used to strip out HTML comments, reducing page size and possibly improving rendering performance.

Discussion. These are not perfect methods. Web browsers use advanced parsers with error correction. This makes them more compatible with real web pages, but implementing that logic is challenging.

Instead: These simple methods can be used to process pages that contain no errors or unexpected markup.

Summary. With the re.sub method, we remove certain parts of strings. The regular expression argument can be used to match HTML tags, or HTML comments, in a fairly accurate way. A new string, containing just text, is returned.
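To see why the question mark in "<.*?>" matters, it can help to run the same substitution with and without it. The following is a minimal sketch using the article's sample string; the variable names lazy and greedy are chosen here for illustration only.

```python
import re

# Same sample string as in the tag-removal example.
v = """<p id=1>Sometimes, <b>simpler</b> is better, but <i>not</i> always.</p>"""

# Lazy quantifier: each match stops at the first ">" it reaches,
# so every tag is matched and removed separately.
lazy = re.sub("<.*?>", "", v)

# Greedy quantifier: a single match runs from the first "<" to the
# last ">", which here covers the entire string.
greedy = re.sub("<.*>", "", v)

print(repr(lazy))
print(repr(greedy))
```

Because the sample string begins with "<" and ends with ">", the greedy version treats all of it as one tag and nothing is left.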
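One case the comment pattern "<!--.*?-->" does not handle is a comment that spans more than one line, because by default the dot does not match a newline. The sketch below uses a made-up input string (not from the article) and shows one possible adjustment, the re.DOTALL flag, which lets the dot match newlines as well.

```python
import re

# Hypothetical input: a comment that continues onto a second line.
v = "<p>Welcome<!-- a comment\nthat spans two lines --> home</p>"

# Default behavior: "." stops at the newline, so nothing matches
# and the string comes back unchanged.
plain = re.sub("<!--.*?-->", "", v)

# With re.DOTALL, "." also matches "\n", so the comment is removed.
dotall = re.sub("<!--.*?-->", "", v, flags=re.DOTALL)

print(repr(plain))
print(repr(dotall))
```

This covers multi-line comments only. Nested comment markers would still leave text behind.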
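For pages where the regular expressions are not accurate enough, a parser-based approach is one alternative. This is not the method used in this article. It is a minimal sketch built on the standard library's html.parser module, and the names TextExtractor and strip_markup are hypothetical helpers introduced here for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text, ignores tags and comments."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags. Tags and comments go to other
        # handler methods, which do nothing unless overridden.
        self.parts.append(data)


def strip_markup(html):
    # Feed the whole document to the parser and join the text chunks.
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hypothetical input with a multi-line comment that contains a tag.
v = "<p>Welcome to my <!--\n awesome <b>bold</b> --> website<!-- bro --></p>"
print(repr(strip_markup(v)))
```

The parser tracks where tags and comments begin and end, so it is less sensitive to line breaks and to markup inside comments than a single pattern. It is still much simpler than the error-correcting parsers found in browsers.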