8.3. Inspecting the HTML
The first step to getting some code working is to save the page to a
local file. This lets you look at the source in an editor, and it also
lets you test your data extractor on that local file at first. It may
take a good deal of trial and error before you get it working right,
and there's no point in making each trial run fetch the same page
across the network, especially from Fresh Air's occasionally quite
busy server. Saving the above URL as fresh1.html gives us a 12K file.
While only about 1K of text is shown on the screen, the other 11K is
mostly whitespace that indents the HTML, some JavaScript, and all the
table code needed to make the navigation bar on the left and the
search form on the right. We can completely ignore all that code and
just try to figure out how to extract the "Listen..." links. Sifting
through the HTML source, we see that those links are represented with
this code (note that most lines begin with at least two spaces):
...
<A HREF="http://www.npr.org/ramfiles/fa/20010702.fa.ram">
<FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#FFCC00" SIZE="2">
Listen to <B>Monday - July 2, 2001</B>
</FONT>
</A>
...
<A HREF="http://www.npr.org/ramfiles/fa/20010702.fa.01.ram">Listen to
<FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="3">
<B> Editor and writer Walter Kirn </B>
</FONT></A>
<BR>
<FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="2">
<BLOCKQUOTE>Editor and writer <A
HREF="http://freshair.npr.org/guestInfoFA.cfm?name=walterkirn">Walter
Kirn</A>'s new novel <I>Up in the Air</I> (Doubleday) is about
...
</BLOCKQUOTE></FONT>
<BR>
<A HREF="http://www.npr.org/ramfiles/fa/20010702.fa.02.ram">Listen to
<FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="3">
<B> Casting director and actress Joanna Merlin </B>
</FONT></A>
<BR>
<FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="2">
<BLOCKQUOTE>Casting director and actress <A
HREF="http://freshair.npr.org/guestInfoFA.cfm?name=joannamerlin">Joanna
Merlin</A> has written a new guide for actors, <I>Auditioning: An
...
</BLOCKQUOTE></FONT>
<BR>
...
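The workflow described above (fetch the page once, then experiment against the local copy) can be sketched roughly as follows. This is only an illustration in Python, not the approach this chapter goes on to develop. The names cached_copy, ListenLinkFinder, and listen_links are hypothetical helpers invented for this sketch, and the rule "keep any link whose text starts with Listen" is an assumption based on the sample HTML shown here.

```python
import os
import urllib.request
from html.parser import HTMLParser


def cached_copy(url, path):
    """Download url to path only if path does not exist yet; return path.

    Hypothetical helper: repeated trial runs reuse the local file instead
    of fetching the same page from the server again.
    """
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path


class ListenLinkFinder(HTMLParser):
    """Collect (href, text) for every <A> whose text starts with 'Listen'."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> matches.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Text inside nested tags such as <FONT> and <B> also arrives here.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse the indentation and newlines into single spaces.
            text = " ".join("".join(self._text).split())
            if text.startswith("Listen"):  # assumed rule, see lead-in
                self.links.append((self._href, text))
            self._href = None


def listen_links(html):
    """Return a list of (href, link text) pairs for the 'Listen...' links."""
    finder = ListenLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.links
```

With the page saved as fresh1.html, reading that file into a string and passing it to listen_links should yield one pair per "Listen..." link, while links such as the guestInfoFA.cfm ones are skipped because their text does not start with "Listen". The file's character encoding is not stated here, so whatever encoding is used when reading it is a guess to be checked against the real file.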
Copyright © 2002 O'Reilly & Associates. All rights reserved.