11.4. An HTTP Authentication Example: The Unicode Mailing Archive

Most password-protected sites (whether protected via HTTP Basic Authentication or otherwise) are that way because the sites' owners don't want just anyone to look at the content. And it would be a bit odd if I gave away such a username and password by mentioning it in this book! However, there is one well-known site whose content is password protected without being secret: the archive of the Unicode mailing lists.

In an effort to keep email-harvesting bots from finding the Unicode mailing list archive while spidering the Web for fresh email addresses, the Unicode.org sysadmins have put a password on that part of their site. But to allow people (actual not-bot humans) to access the site, the site administrators publicly state the password on an unprotected page, at http://www.unicode.org/mail-arch/, which links to the protected part and also states the username and password you should use.

The main Unicode mailing list (called unicode) once in a while has a thread that is really very interesting and that you really must read, but it's buried in a thousand other messages that are not even worth downloading, even in digest form. Luckily, this problem meets a tidy solution with LWP: I've written a short program that, on the first of every month, downloads the index of all the previous month's messages and reports the number of messages that have each topic as their subject.

The trick is that the web pages that list this information are password protected. Moreover, the URL for the index of last month's posts is different every month, but in a fairly obvious way. The URL for March 2002, for example, is:

http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/

Deducing the URL for the month that has just ended is simple enough:
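One way to deduce that URL is sketched below. It is a minimal sketch, not necessarily the program's original listing, and it assumes the program really is run on the first of the month, so that the moment 24 hours earlier falls in the month we want. The helper name last_month_url is my own invention for illustration.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: given an epoch time that falls on the first of a
# month, return the archive index URL for the month that just ended.
sub last_month_url {
    my ($now) = @_;
    # One day (24*60*60 seconds) before the first is in the previous month.
    my $stamp = strftime('y%Y-m%m', localtime($now - 24 * 60 * 60));
    return "http://www.unicode.org/mail-arch/unicode-ml/$stamp/";
}

print last_month_url(time), "\n";
```

Run on any other day of the month, this simply names the month that yesterday belongs to, so the once-a-month schedule is what makes the trick work.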
But getting the contents of that URL involves first providing the username, the password, and the realm name. The Unicode web site doesn't publicly declare the realm name, because it's an irrelevant detail for users with interactive browsers, but we need to know it for our call to the credentials method. To find out the realm name, try accessing the URL in an interactive browser. The realm will be shown in the authentication dialog box, as shown in Figure 11-1. In this case, it's "Unicode-MailList-Archives", which is all we needed to make our request:
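A request along these lines might look like the following sketch. The host-and-port string and the realm come from the discussion above; the helper name archive_browser is hypothetical, and the username and password are deliberately not hardcoded here: the sketch assumes you pass the values published at http://www.unicode.org/mail-arch/ as two command-line arguments.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: a user agent that will answer the archive's
# Basic Authentication challenge with the given username and password.
sub archive_browser {
    my ($user, $pass) = @_;
    my $browser = LWP::UserAgent->new;
    $browser->credentials(
        'www.unicode.org:80',           # host:port the credentials apply to
        'Unicode-MailList-Archives',    # realm, as shown in the browser dialog
        $user, $pass,
    );
    return $browser;
}

# Fetch only when a username and password were given on the command line.
if (@ARGV == 2) {
    my $url      = 'http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/';
    my $browser  = archive_browser(@ARGV);
    my $response = $browser->get($url);
    die "Error getting $url: ", $response->status_line
        if $response->is_error;
    print length($response->content), " bytes retrieved from $url\n";
}
```

Because credentials are keyed on both host:port and realm, LWP sends them only when the server's challenge names that exact realm, which is why the realm string has to match character for character.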
If this fails (if the Unicode site's admins have changed the username or password or even the realm name), the program will die with an error message like this:

Error getting http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/: 401 Authorization Required at unicode_list001.pl line 21.

But assuming the authorization data is correct, the page is retrieved as if it were a normal, unprotected page. From there, counting the topics and noting the absolute URL of the first message of each thread is a matter of extracting data from the HTML source and reporting it concisely.
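The extraction step could be sketched as follows. The markup of the real index page isn't shown here, so this sketch assumes each message appears as a plain link of the form <a href="...">subject</a>; the helper name count_topics and the sample HTML are made up for illustration, and the pattern would need adjusting to the archive's actual HTML.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: count messages per subject in an index page, and
# note the absolute URL of the first message seen for each subject.
# ASSUMPTION: every message is listed as <a href="...">subject</a>.
sub count_topics {
    my ($html, $base_url) = @_;
    my (%count, %first_url);
    while ($html =~ m{<a\s+href="([^"]+)"[^>]*>(.*?)</a>}gis) {
        my ($href, $subject) = ($1, $2);
        $subject =~ s/\s+/ /g;              # collapse runs of whitespace
        $subject =~ s/^ //;
        $subject =~ s/ $//;
        $subject =~ s/^(?:re:\s*)+//i;      # fold replies into their thread
        next unless length $subject;
        $count{$subject}++;
        $first_url{$subject} ||= URI->new_abs($href, $base_url)->as_string;
    }
    return (\%count, \%first_url);
}

# Made-up sample input, standing in for the real index page.
my $sample = <<'END_HTML';
<li><a href="0001.html">Unicode and fonts</a></li>
<li><a href="0002.html">Re: Unicode and fonts</a></li>
<li><a href="0003.html">Collation question</a></li>
END_HTML

my $base = 'http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/';
my ($count, $first_url) = count_topics($sample, $base);

# Report the busiest threads first, with the link to each thread's start.
for my $subject (sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count) {
    printf "%4d  %s\n      %s\n",
        $count->{$subject}, $subject, $first_url->{$subject};
}
```

Resolving each href against the page's own URL with URI->new_abs is what turns the relative links in the index into absolute URLs you can paste straight into a browser.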
Typical output starts out like this:
This continues for a few pages.
Copyright © 2002 O'Reilly & Associates. All rights reserved.