8.3. Inspecting the HTML

The first step to getting some code working is to save a file locally. This is first so you can look at the source in an editor, and second so you can initially test your data extractor on that local file. It may take a good deal of hit-and-miss before you get it working right, and there's no point in making each trial run go and get the same page across the network, especially from Fresh Air's occasionally quite busy server.

Saving the above URL as fresh1.html gives us a 12K file. While there's only about 1K of text shown on the screen, the other 11K are mostly whitespace that indents the HTML, some JavaScript, plus all the table code needed to make the navigation bar on the left and the search form on the right. We can completely ignore all that code and just try to figure out how to extract the "Listen..." links.

Sifting through the HTML source, we see that those links are represented with this code (note that most lines begin with at least two spaces):

  ...
  <A HREF="http://www.npr.org/ramfiles/fa/20010702.fa.ram">
    <FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#FFCC00" SIZE="2">
    Listen to <B>Monday - July 2, 2001</B>
    </FONT>
  </A>
  ...
  <A HREF="http://www.npr.org/ramfiles/fa/20010702.fa.01.ram">Listen to
  <FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="3">
  <B> Editor and writer Walter Kirn </B>
  </FONT></A>
  <BR>
  <FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="2">
  <BLOCKQUOTE>Editor and writer <A
  HREF="http://freshair.npr.org/guestInfoFA.cfm?name=walterkirn">Walter
  Kirn</A>'s new novel <I>Up in the Air</I> (Doubleday) is about ...
  </BLOCKQUOTE></FONT>
  <BR>
  <A HREF="http://www.npr.org/ramfiles/fa/20010702.fa.02.ram">Listen to
  <FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="3">
  <B> Casting director and actress Joanna Merlin </B>
  </FONT></A>
  <BR>
  <FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="2">
  <BLOCKQUOTE>Casting director and actress <A
  HREF="http://freshair.npr.org/guestInfoFA.cfm?name=joannamerlin">Joanna
  Merlin</A> has written a new guide for actors, <I>Auditioning: An ...
  </BLOCKQUOTE></FONT>
  <BR>
  ...

Copyright © 2002 O'Reilly & Associates. All rights reserved.
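To make the observation about the "Listen..." links concrete, here is a minimal sketch of one way to pull them out of the saved page. It is written in Python with only the standard library's html.parser, which may not be the language or module the rest of this text uses. The names SAMPLE, ListenLinkParser, and extract_listen_links are hypothetical, and the rule "an A element whose HREF ends in .ram is a Listen link" is an assumption based only on the three links quoted above. The sample string is an abbreviated copy of that markup so the sketch runs without the local file.

```python
from html.parser import HTMLParser

# Abbreviated copy of the markup quoted above (illustrative only); a real
# trial run would read the locally saved fresh1.html instead.
SAMPLE = """
  <A HREF="http://www.npr.org/ramfiles/fa/20010702.fa.ram">
    <FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#FFCC00" SIZE="2">
    Listen to <B>Monday - July 2, 2001</B>
    </FONT>
  </A>
  <A HREF="http://www.npr.org/ramfiles/fa/20010702.fa.01.ram">Listen to
  <FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="3">
  <B> Editor and writer Walter Kirn </B>
  </FONT></A>
  <BLOCKQUOTE>Editor and writer <A
  HREF="http://freshair.npr.org/guestInfoFA.cfm?name=walterkirn">Walter
  Kirn</A>'s new novel</BLOCKQUOTE>
"""


class ListenLinkParser(HTMLParser):
    """Collect (url, text) pairs for <A> elements whose HREF ends in .ram."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (url, text) pairs
        self._href = None   # HREF of the .ram link we are inside, if any
        self._text = []     # text chunks seen inside that link

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us.
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Assumption: only the audio links end in ".ram".
            if href.lower().endswith(".ram"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse the indentation and newlines into single spaces.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_listen_links(html):
    """Return a list of (url, link text) pairs for the .ram links in html."""
    parser = ListenLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    for url, text in extract_listen_links(SAMPLE):
        print(url, "|", text)
```

To try this against the saved copy rather than the sample, read fresh1.html into a string and pass it to extract_listen_links. The guest-information link inside the BLOCKQUOTE is skipped because its HREF does not end in .ram.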