10.2. Deleting Images

Instead of altering nodes or extracting data from them, it's common to want to just delete them. For example, consider that we have the task of taking normally complex and image-rich web pages and making unadorned text-only versions of them, such as one would print out or paste into email. Each document in question has one big table with three rows, like this:
The simplified version of such a page should omit all images, as well as all elements of the classes top_button_bar, bottom_button_bar, left_geegaws, and right_geegaws. This can be implemented with a simple call to look_down:
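A minimal sketch of such a call, assuming HTML::TreeBuilder is installed, could look like the following. The sample page in $html, the main class, the image filenames, and the %unwanted lookup hash are hypothetical and exist only for illustration; a real program would more likely parse the page from a file.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page: one table with three rows, as described above.
my $html = <<'END_HTML';
<html><head><title>Sample</title></head><body>
<table>
<tr class="top_button_bar"><td colspan="3"><img src="home.gif" alt="Home"></td></tr>
<tr>
<td class="left_geegaws"><img src="ad1.gif" alt="Ad"></td>
<td class="main"><p>Story text. <img src="photo.jpg" alt="Photo"></p></td>
<td class="right_geegaws">Links</td>
</tr>
<tr class="bottom_button_bar"><td colspan="3">Footer buttons</td></tr>
</table>
</body></html>
END_HTML

# Classes whose elements should be removed (names taken from the text).
my %unwanted = map { $_ => 1 }
  qw(top_button_bar bottom_button_bar left_geegaws right_geegaws);

my $root = HTML::TreeBuilder->new;
$root->parse($html);
$root->eof;    # signal end of input so the tree is complete

foreach my $d ($root->look_down(
  sub {
    my $node = shift;
    return 1 if $node->tag eq 'img';    # every image matches
    my $class = $node->attr('class');
    return 0 unless defined $class;     # no class means ignore it
    return $unwanted{$class} ? 1 : 0;   # match only the unwanted classes
  }
)) {
  $d->delete;    # removes the node and everything beneath it
}

my $text_only = $root->as_HTML(undef, '  ');    # two-space indent
print $text_only;
$root->delete;    # done with the tree, so free it
```

The criterion is passed to look_down as a code reference, which makes it easy to combine the "is it an image?" test and the "is it one of these classes?" test in a single pass over the tree.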
The call to $d->delete detaches the node in $d from its parent, then destroys it along with all its descendant nodes. The resulting file looks like this:
One pragmatic point here: the list returned by the look_down( ) call will contain the two tr elements and the two td elements, any images they contain, and also images elsewhere in the document. When we delete one of those tr or td nodes, we are also implicitly deleting every one of its descendant nodes, including some img elements that we are about to hit in a subsequent iteration through look_down( )'s return list. This isn't a problem in this case, because deleting an already deleted node is a harmless no-operation.

The larger point here is that when look_down( ) finds a matching node (as with a left_geegaws td node, in our example), that doesn't stop it from looking below that node for more matches. If you need it to stop at the first match on each branch, you'll need to implement that in your own traverser, as discussed in Chapter 9, "HTML Processing with Trees".
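As an illustration of that last point, here is a hedged sketch of a traverser that stops descending as soon as it finds a match. The function name topmost_matches, the $wanted criterion, and the sample markup are all hypothetical; this is one simple recursive approach built on content_list, not necessarily the one Chapter 9 uses.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Return matching nodes, but do not look below a node that already matched.
sub topmost_matches {
  my ($node, $is_match) = @_;
  return $node if $is_match->($node);    # match: stop descending here
  my @found;
  foreach my $child ($node->content_list) {
    next unless ref $child;              # skip plain text segments
    push @found, topmost_matches($child, $is_match);
  }
  return @found;
}

# Hypothetical sample: one image inside a left_geegaws cell, one outside.
my $root = HTML::TreeBuilder->new;
$root->parse(
  '<table><tr><td class="left_geegaws"><img src="ad.gif"></td>'
  . '<td><img src="photo.jpg"></td></tr></table>'
);
$root->eof;

my $wanted = sub {
  my $node = shift;
  return 1 if $node->tag eq 'img';
  my $class = $node->attr('class');
  return (defined $class and $class eq 'left_geegaws') ? 1 : 0;
};

my @all = $root->look_down($wanted);          # includes the nested img
my @top = topmost_matches($root, $wanted);    # skips the nested img

printf "look_down found %d nodes; topmost_matches found %d\n",
  scalar(@all), scalar(@top);
$root->delete;
```

With this sample, look_down( ) reports the td, the image inside it, and the image in the other cell, whereas topmost_matches reports only the td and the second image, so no node in its result list is a descendant of another.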
Copyright © 2002 O'Reilly & Associates. All rights reserved.