Reading and Writing RSS Files (Perl Cookbook, 2nd Edition)

22.9.3. Discussion

There are at least four variations of RSS extant: 0.9, 0.91, 1.0, and 2.0. At the time of this writing, XML::RSS understood all but RSS 2.0. Each version has different capabilities, so the methods and parameters available depend on which version of RSS you're using. For example, RSS 1.0 supports RDF and uses the Dublin Core metadata (http://dublincore.org/). Consult the XML::RSS documentation to see which methods and parameters each version supports.
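As a rough illustration of how the version affects what you can pass, the sketch below builds a small RSS 1.0 feed and attaches Dublin Core elements through the dc parameter described in the XML::RSS documentation. The build_feed helper, the channel values, and the example.com URLs are placeholders invented for this sketch, not part of any real feed.

```perl
#!/usr/bin/perl
# rss10-dc -- sketch: build an RSS 1.0 feed carrying Dublin Core metadata
use strict;
use warnings;
use XML::RSS;

# Build a feed in the requested RSS version and return it as a string.
# All channel and item values are placeholders (assumptions of this sketch).
sub build_feed {
    my ($version) = @_;
    my $rss = XML::RSS->new(version => $version);
    $rss->channel(
        title       => 'Example Channel',
        link        => 'http://www.example.com/',
        description => 'A placeholder channel',
        dc          => {                    # Dublin Core elements
            creator  => 'editor@example.com',
            language => 'en-us',
        },
    );
    $rss->add_item(
        title => 'First item',
        link  => 'http://www.example.com/first.html',
        dc    => { subject => 'Examples' },
    );
    return $rss->as_string;
}

print build_feed('1.0');
```

Passing '0.91' instead of '1.0' to the hypothetical build_feed is a quick way to compare what each version emits for the same data.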

XML::RSS uses XML::Parser to parse the RSS, and XML::Parser raises an exception when it meets XML that isn't well-formed. Unfortunately, not all RSS files are well-formed XML, let alone valid. The XML::RSSLite module on CPAN offers a looser approach to parsing RSS: it uses regular expressions and is much more forgiving of incorrect XML.
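One way to use both parsers is to try the strict one first and fall back to the forgiving one only when it fails. The sketch below is a minimal version of that idea. The feed_items helper is hypothetical, and the hash key that holds XML::RSSLite's items is an assumption: this recipe's examples use items, so the code also checks item in case your installed version differs.

```perl
#!/usr/bin/perl
# rss-fallback -- sketch: strict parse first, forgiving parse second
use strict;
use warnings;
use XML::RSS;
use XML::RSSLite;

# Return the feed's items as a list of hash references.
sub feed_items {
    my ($content) = @_;

    my $rss = XML::RSS->new;
    # XML::Parser dies on XML that is not well-formed, so trap that
    if (eval { $rss->parse($content); 1 }) {
        return @{ $rss->{items} };
    }
    warn "strict parse failed, falling back to XML::RSSLite\n";

    my %result;
    parseRSS(\%result, \$content);
    # key name is an assumption; check the documentation for your version
    return @{ $result{items} || $result{item} || [] };
}
```

A caller would then write something like `print "$_->{title}\n" for feed_items($content);` without caring which parser produced the list.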

Example 22-13 uses XML::RSSLite and LWP::Simple to download The Guardian's RSS feed and print the items whose titles contain the keywords we're interested in.

Example 22-13. rss-parser

#!/usr/bin/perl -w
# guardian-list -- list Guardian articles matching keyword

use XML::RSSLite;
use LWP::Simple;
use strict;

# list of keywords we want
my @keywords = qw(perl internet porn iraq bush);

# get the RSS
my $URL = 'http://www.guardian.co.uk/rss/1,,,00.xml';
my $content = get($URL);
die "Couldn't fetch $URL\n" unless defined $content;

# parse the RSS
my %result;
parseRSS(\%result, \$content);

# build the regex from keywords
my $re = join "|", @keywords;
$re = qr/\b(?:$re)\b/i;

# print report of matching items
foreach my $item (@{ $result{items} }) {
  my $title = $item->{title};
  # collapse runs of whitespace, then trim both ends
  $title =~ s{\s+}{ }g;  $title =~ s{^\s+}{}; $title =~ s{\s+$}{};

  if ($title =~ /$re/) {
    print "$title\n\t$item->{link}\n\n";
  }
}

The following is sample output from Example 22-13:

UK troops to lead Iraq peace force
        http://www.guardian.co.uk/Iraq/Story/0,2763,989318,00.html?=rss

Shia cleric challenges Bush plan for Iraq
        http://www.guardian.co.uk/Iraq/Story/0,2763,989364,00.html?=rss
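
The keyword list in Example 22-13 holds only plain words, so interpolating it straight into the pattern is safe. If a keyword could contain regex metacharacters, one option is to escape each keyword with quotemeta before joining. The sketch below shows that variation together with the title cleanup. The helper names keyword_regex and clean_title are invented for illustration, and it needs no modules, so it can be tried without a network connection.

```perl
#!/usr/bin/perl
# keyword-match -- sketch: build the keyword regex and tidy titles
use strict;
use warnings;

# Build one case-insensitive regex that matches any keyword as a whole word.
# quotemeta escapes regex metacharacters that a keyword might contain.
sub keyword_regex {
    my @keywords = @_;
    my $alternation = join "|", map { quotemeta } @keywords;
    return qr/\b(?:$alternation)\b/i;
}

# Collapse runs of whitespace into single spaces and trim both ends.
sub clean_title {
    my ($title) = @_;
    $title =~ s{\s+}{ }g;
    $title =~ s{^\s+}{};
    $title =~ s{\s+$}{};
    return $title;
}

my $re = keyword_regex(qw(perl iraq bush));
for my $raw ("  UK troops to lead\n Iraq peace force ", "Bushfire season begins") {
    my $title = clean_title($raw);
    print "$title\n" if $title =~ $re;   # only the first title matches
}
```

The second title is skipped because the \b anchors require bush to stand as a whole word.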

We can combine the filtering in Example 22-13 with XML::RSS to generate a new RSS feed from the matching items. It would be easier, of course, to do it all with XML::RSS, but this way you get to see both modules in action. Example 22-14 shows the finished program.

Example 22-14. rss-filter

#!/usr/bin/perl -w
# guardian-filter -- filter the Guardian's RSS feed by keyword
use XML::RSSLite;
use XML::RSS;
use LWP::Simple;
use strict;

# list of keywords we want
my @keywords = qw(perl internet porn iraq bush);

# get the RSS
my $URL = 'http://www.guardian.co.uk/rss/1,,,00.xml';
my $content = get($URL);
die "Couldn't fetch $URL\n" unless defined $content;

# parse the RSS
my %result;
parseRSS(\%result, \$content);

# build the regex from keywords
my $re = join "|", @keywords;
$re = qr/\b(?:$re)\b/i;

# make new RSS feed
my $rss = XML::RSS->new(version => '0.91');
$rss->channel(title       => $result{title},
              link        => $result{link},
              description => $result{description});

foreach my $item (@{ $result{items} }) {
  my $title = $item->{title};
  # collapse runs of whitespace, then trim both ends
  $title =~ s{\s+}{ }g;  $title =~ s{^\s+}{}; $title =~ s{\s+$}{};

  if ($title =~ /$re/) {
    $rss->add_item(title => $title, link => $item->{link});
  }
}
print $rss->as_string;

Here's an example of the RSS feed it produces:

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
            "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">

<channel>
<title>Guardian Unlimited</title>
<link>http://www.guardian.co.uk</link>
<description>Intelligent news and comment throughout the day from The Guardian 
newspaper</description>

<item>
<title>UK troops to lead Iraq peace force</title>
<link>http://www.guardian.co.uk/Iraq/Story/0,2763,989318,00.html?=rss</link>
</item>

<item>
<title>Shia cleric challenges Bush plan for Iraq</title>
<link>http://www.guardian.co.uk/Iraq/Story/0,2763,989364,00.html?=rss</link>
</item>

</channel>
</rss>
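
To write the filtered feed to a file rather than standard output, XML::RSS also provides a save method, and parsefile reads a feed back in. The following is only a sketch: the filename filtered.rss is a hypothetical choice, and the channel and item values are copied from the sample output above rather than fetched live.

```perl
#!/usr/bin/perl
# rss-save -- sketch: save a feed to disk, then read it back
use strict;
use warnings;
use XML::RSS;

my $file = 'filtered.rss';    # hypothetical output filename

# build a one-item feed using values from the sample output
my $rss = XML::RSS->new(version => '0.91');
$rss->channel(title       => 'Guardian Unlimited',
              link        => 'http://www.guardian.co.uk',
              description => 'Intelligent news and comment');
$rss->add_item(title => 'UK troops to lead Iraq peace force',
               link  => 'http://www.guardian.co.uk/Iraq/Story/0,2763,989318,00.html?=rss');
$rss->save($file);

# read the file back and list the item titles
my $copy = XML::RSS->new;
$copy->parsefile($file);
print "$_->{title}\n" for @{ $copy->{items} };
```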
