42.7. htmllib and HTMLParser

Python provides the htmllib module for parsing HTML content, which is often useful when dealing with web resources. Python also has an HTMLParser module, which handles both XHTML and HTML and provides a slightly lower-level view of the content. HTMLParser is also slightly simpler to use, since htmllib uses sgmllib and thus understands many of the complexities of SGML.

HTMLParser provides a class that the user subclasses, defining methods that are called as tags are found in the input. The example below is a very basic HTML parser that uses the HTMLParser.HTMLParser class to print out tags as they are encountered:

    from HTMLParser import HTMLParser

    class MyHTMLParser(HTMLParser):

        # Called once for each opening tag, such as <p>
        def handle_starttag(self, tag, attrs):
            print "Encountered the beginning of a %s tag" % tag

        # Called once for each closing tag, such as </p>
        def handle_endtag(self, tag):
            print "Encountered the end of a %s tag" % tag

-- DJPH

Copyright © 2003 O'Reilly & Associates. All rights reserved.
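The class shown in this section only defines handlers; nothing is parsed until HTML text is passed to the parser's feed() method. The following is a minimal sketch of that step, not part of the original example. The names TagLogger and events are hypothetical, and the handlers collect tags in a list rather than printing them so the result is easy to inspect. The fallback import is an assumption added so the sketch also runs on Python 3, where the same class lives in html.parser.

```python
try:
    from HTMLParser import HTMLParser      # Python 2 module name, as used in the text
except ImportError:
    from html.parser import HTMLParser     # Python 3 location of the same class


class TagLogger(HTMLParser):
    """Hypothetical subclass that records tag events instead of printing them."""

    def __init__(self):
        # Explicit base-class call so the sketch works on Python 2 and 3
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


parser = TagLogger()
parser.feed('<p class="intro">Hello, <b>world</b></p>')  # handlers fire during feed()
parser.close()                                           # flush any buffered input

for event in parser.events:
    print(event)
```

feed() may be called repeatedly with successive chunks of a document, which is convenient when the HTML arrives piece by piece from a network connection.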