Chapter 22. Structured Text: HTML
Most documents on the Web use HTML,
the HyperText Markup Language. Markup is the insertion of special
tokens, known as tags, in a text document to
give structure to the text. HTML is an application of the large,
general standard known as SGML, the Standard Generalized Markup Language.
In practice, many of the Web's documents use HTML in
sloppy or incorrect ways. Browsers have evolved many practical
heuristics over the years to try to compensate for this, but even
so, a browser still often displays an incorrect web page in some
unexpected way.
Moreover, HTML was never suitable for much more than presenting
documents on a screen. Complete and precise extraction of the
information in the document, working backward from the document's
presentation, is often infeasible. To tighten things up again, HTML
has evolved into a more rigorous standard called XHTML. XHTML is very
similar to traditional HTML, but it is defined in terms of XML and
more precisely than HTML. You can handle XHTML with the tools covered
in Chapter 23.
Despite the difficulties, it's often possible to
extract at least some useful information from HTML documents. Python
supplies the sgmllib, htmllib, and HTMLParser modules for parsing
HTML documents, whether to present the documents or, more typically,
to extract information from them. Generating HTML and embedding Python in HTML
are also frequent tasks. No standard Python library module supports
HTML generation or embedding directly, but you can use normal Python
string manipulation, and third-party modules can also help.
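
As a rough illustration of both tasks, the following sketch pulls the
link targets out of a deliberately sloppy HTML fragment with the
HTMLParser module, then builds a small HTML list from them with plain
string manipulation. The names LinkCollector, escape, and
links_to_html are hypothetical helpers introduced only for this
example, and the import fallback assumes that on Python 3 the same
class is found in html.parser.

```python
try:
    from HTMLParser import HTMLParser      # module name used in the text
except ImportError:
    from html.parser import HTMLParser     # assumed location on Python 3


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a> start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser passes the tag and attribute names in lowercase;
        # attrs is a list of (name, value) pairs.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def escape(text):
    """Hypothetical helper: escape characters that are special in HTML."""
    return (text.replace('&', '&amp;').replace('<', '&lt;')
                .replace('>', '&gt;').replace('"', '&quot;'))


def links_to_html(links):
    """Hypothetical helper: build an unordered list of links as a string."""
    items = ['<li><a href="%s">%s</a></li>' % (escape(u), escape(u))
             for u in links]
    return '<ul>\n%s\n</ul>' % '\n'.join(items)


# Uppercase tag names, an unquoted attribute, and unclosed <a> tags:
# the kind of sloppy markup often found in practice.
sloppy = '<p>See <A HREF="http://example.com/one">one<br><a href=two.html>two</p>'

collector = LinkCollector()
collector.feed(sloppy)
collector.close()

print(collector.links)
print(links_to_html(collector.links))
```

The parser here reports only start tags, so the missing closing tags
in the fragment do not prevent the links from being collected.
Escaping by hand keeps the generation step to ordinary string
operations; a third-party templating module would normally take over
that work in a larger program.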