Sample Manifest File Scripts

This appendix contains information that you can use to automate the creation of manifest files for your web site.

This appendix contains the following sections:

• Overview
• Installing PERL on Your Workstation
• Obtaining the Scripts
• Listing Web Site Content Using the Spider Script
• Selecting Live and Pre-position Content Using the Manifest Script
• Creating a Rules File for the Spider and Manifest Scripts
• Spider Script Source
• Manifest Script Source

Overview

Two sample scripts are provided in this appendix:

• The Spider script, which crawls an origin server and records the URLs of content found there in a database file

• The Manifest script, which reads that database file and generates a manifest file identifying the content to be pre-positioned on, or served live from, a hosted domain

These scripts shipped with your CDN software and can serve as the basis for your own automation scripts.

Installing PERL on Your Workstation

You need to have PERL installed on your workstation prior to working with or running the Spider or Manifest scripts. It may also be useful to have a PERL compiler available. PERL is open source software and can be downloaded for free from a variety of locations on the Internet. Refer to the Comprehensive PERL Archive Network (CPAN) at http://www.cpan.org, or http://www.perl.com.
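
The scripts rely on several standard CPAN modules; the source listings in this appendix reference Getopt::Long, URI, LWP::UserAgent, HTML::LinkExtor, XML::Parser, and IO::File. A quick way to confirm that these modules are installed (this one-liner is only a convenience check, not part of the CDN software) is:

    perl -MGetopt::Long -MURI -MLWP::UserAgent -MHTML::LinkExtor -MXML::Parser -MIO::File -e 'print "required modules found\n"'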

Obtaining the Scripts

The Spider and Manifest scripts can be obtained from Cisco.com using the same procedure that is used to obtain updated versions of the Cisco Internet CDN Software.

To obtain the Manifest and Spider scripts from Cisco.com:


Step 1   Launch your preferred web browser and point it to:

http://www.cisco.com/cgi-bin/tablebuild.pl/cdn-sp

Step 2   When prompted, log in to Cisco.com using your designated Cisco.com username and password.

The Cisco Internet CDN Software download page appears, listing the available software updates for the Cisco Internet CDN Software product.

Step 3   Locate the file named manifest-tools.zip. This is a ZIP archive containing both the Manifest and Spider PERL scripts.

Step 4   Click the link for the manifest-tools.zip file. The download page appears.

Step 5   Click the Software License Agreement link. A new browser window will open displaying the license agreement.

Step 6   After you have read the license agreement, close the browser window displaying the agreement and return to the Software Download page.

Step 7   Click the filename link labeled Download.

Step 8   Click Save to file and then choose a location on your workstation to temporarily store the zip file containing the scripts.

Step 9   Use your preferred unzip program to unpack the scripts to a location on your workstation or your network.

After you have unzipped the scripts, you are ready to begin using them to build manifest files for your website. See the "Listing Web Site Content Using the Spider Script" section and the "Selecting Live and Pre-position Content Using the Manifest Script" section for instructions on running the scripts.


Listing Web Site Content Using the Spider Script

This section contains information on the following topics:

• Limiting Scope
• Broadening Scope
• Re-spidering Servers
• Spider Script Syntax Guidelines
• Combining Spider Data
• Customizing the Spider Script

In the simplest scenario, the spider is pointed to the address of an origin server and given the name of a database (.db) file into which it will place any valid URLs it discovers on that site. For example, if you wanted to analyze the contents of www.cisco.com for content that might be pre-positioned, you would issue the following command:

spider --start=www.cisco.com --db=ciscocontent.db

Limiting Scope

But spidering the entirety of www.cisco.com might take hours and produce much more information than you are interested in. What if you want to limit your review of an origin server to just a particular part of that server? The Spider script contains a variety of tools that enable you to limit as well as broaden the scope of a spider's action.

For example, to limit the spider's search of www.cisco.com to just that part of the server containing product-related support information, you could enter the following command:

spider --start=www.cisco.com/public/support/ --db=ciscocontent.db

Broadening Scope

Or, to ask the spider to follow links from www.cisco.com to the Cisco networking professionals forum, you could enter the following spider command:

spider --start=www.cisco.com --allow=forums.cisco.com --db=ciscocontent.db

Re-spidering Servers

In addition to covering new origin servers, the Spider script can also be run on sites that have already been analyzed and that contain links into the CDN. When spidering a server that has already been analyzed, you use the --hd keyword to specify the name of the hosted domain on which content from the origin server will be hosted, and the --map keyword to provide mapping information between URLs on the origin server and URLs on the Internet CDN.

For example, the following command traces the content mapped to the /support area on the hosted domain www.hosted.cisco.com back to its origin in the support area of www.cisco.com:

spider --start=http://www.cisco.com/public/support/tac/home.html --hd=www.hosted.cisco.com --map=http://www.cisco.com/public/support/tac/=/support --db=ciscocontent.db

In each of these examples, the Spider analyzes the URL of each piece of content on the origin server (or in the targeted area of the origin server) and applies filters, built from the parameters supplied when the Spider was run, that identify candidates for pre-positioning or live streaming. If a URL matches an accept pattern, it is recorded in the database the Spider is building. If it does not, the content is rejected and the Spider moves on.
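
Internally, the accept and reject decisions are made by a simple first-match filter list (see the "Spider Script Source" section). The following minimal sketch illustrates the idea using the --allow and --reject values from the examples above; the patterns and test URL are illustrative only:

    #!/usr/bin/perl -w
    use strict;

    # Each filter is a [regular_expression, accept_flag] pair; the first
    # pattern that matches a URL decides whether that URL is kept.
    my @filters = (
        ["forums\\.cisco\\.com",                      1],  # --allow=forums.cisco.com
        ["cgi-bin",                                   0],  # --reject=cgi-bin
        ["^http://www\\.cisco\\.com/public/support/", 1],  # prefixes taken from --start/--prefix
        [".",                                         0],  # reject anything else
    );

    sub filter {
        my ($filters, $url) = @_;
        for my $f (@$filters) {
            return $f->[1] if $url =~ $f->[0];
        }
        return 0;
    }

    print filter(\@filters, "http://www.cisco.com/public/support/tac/home.html")
        ? "accepted\n" : "rejected\n";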

Spider Script Syntax Guidelines

The Spider script accepts the following syntax:

spider {--start=origin_server_url [--allow=allowed_url | --depth=number | --file=filename | {--hd=hosted_domain_name --map={origin_server_url_prefix=cdn_prefix}} | --limit=number | --prefix=url_prefix | --reject=disallowed_url] --db=database_name.db}


Table A-1: Spider Script Keywords
Keyword Description Syntax

--start

Names the location (URL) of the origin server that will be analyzed.

--start=www.cisco.com

--db

Names the database file in which content URLs from the origin server and any allowed locations will be placed.

--db=ciscocontent.db

--allow (optional)

Names a location other than that specified using the start keyword that will be accepted when it is found in URLs.

--allow=forums.cisco.com

--depth (optional)

Causes the Spider script to stop after following links a specified number of levels deep on the origin server.

--depth=6

--file (optional)

Causes the Spider script to read its commands from a specified rules file, one line at a time.

--file=cisco-rules.cfg

--hd (optional)

Identifies a hosted domain on your CDN as the hosted domain for the content being spidered. Used with the --map keyword for mapping content from the CDN back to the origin server.

--hd=www.hosted.cisco.com

--map (optional)

Causes the Spider script to substitute the second URL prefix (appearing after the second =) for the first in any URLs from the origin server, or to substitute the first prefix for the second when re-spidering content on an origin server.

--map=http://www.cisco.com/public/support/tac/=/support

--limit (optional)

Causes the Spider script to stop after retrieving a specified number of pages from the origin server. The default is 100. Specifying 0 sets no limit for the number of pages retrieved.

--limit=1000

--prefix (optional)

Specifies a URL prefix which, when it is encountered, will be accepted by the Spider.

--prefix=http://www.cisco.com/partners/CDN/

--reject (optional)

Names a location that will be rejected when it is found in URLs.

--reject=cgi-bin
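
These keywords can be combined freely. For example, the following command (the depth and limit values are illustrative) restricts a crawl of the Cisco support area to three levels of links and 500 pages while skipping CGI URLs:

spider --start=www.cisco.com/public/support/ --reject=cgi-bin --depth=3 --limit=500 --db=ciscocontent.db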

Combining Spider Data

What if you ran the Spider script on two separate locations on an origin server, but would like to combine the content into one database from which a manifest file will be generated?

The data output by the Spider can easily be combined—just open the *.db file containing the data you want to move, select that data, and copy it. Next, open the *.db file you want to serve as the merged file, locate the end of the file, and paste the data you copied into it.

The Manifest script can now be run on the merged data.
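
For reference, each record in a Spider database file is a short block of headers that begins with a URL line and ends with a blank line (the fields written are Content-Type, Content-Length, Last-Modified, and CONTAINS, as the "Spider Script Source" section shows), which is why records pasted from one file onto the end of another remain valid. A sample record, with illustrative values, looks like this:

    URL: http://www.cisco.com/public/support/tac/home.html
    Content-Type: text/html
    Content-Length: 18342
    Last-Modified: Mon, 23 Sep 2002 17:01:00 GMT
    CONTAINS: http://www.cisco.com/images/logo.gif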

Customizing the Spider Script

Because the Spider script anticipates certain platforms and scenarios that might not correspond to your own web site configuration, Cisco provides you with the PERL source code for the Spider script, which you can modify to suit your own needs.

See the "Spider Script Source" section to review the source code for the Spider script.

Selecting Live and Pre-position Content Using the Manifest Script

Whereas the Spider script is used to gather a list of potential hosted content from an origin server, the Manifest script is where you cull through the information gathered by the Spider and decide which content you will actually import to the CDN for placement on a hosted domain.

This section contains information on the following topics:

• Pre-Positioned Versus Live Content
• Manifest Script Syntax Guidelines
• Customizing the Manifest Script

Pre-Positioned Versus Live Content

The Manifest script distinguishes between content that needs to be pre-positioned and live, streamed content that, by definition, cannot be pre-positioned.

Using the --prepos keyword, you identify and pre-position all content that meets criteria you specify. For example, to pre-position all image files from cisco.com larger than one megabyte, you would enter the following command:

manifest --prepos='type(image/*) and size > 1000k' --db=ciscocontent.db --xml=cisco.xml

Using the --live keyword, you identify the URLs of live content. Unlike pre-positioned content, live content cannot be identified by information stored in its headers, so you need to devise a method of locating live content based solely on information contained in its URL. For example, you might identify streamed content with the following command:

manifest --live='match(rtsp://*)'
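
The --prepos and --live keywords can be supplied in the same run, together with the --db and --xml keywords, so that a single command produces one manifest file covering both kinds of content. For example, reusing the filenames from the earlier examples:

manifest --prepos='type(image/*) and size > 1000k' --live='match(rtsp://*)' --db=ciscocontent.db --xml=cisco.xml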

Manifest Script Syntax Guidelines

The Manifest script accepts the following syntax:

manifest {[--file=filename | --live='keyword_comparison' | --prepos='keyword_comparison' | --set='attribute=value : keyword_comparison' | --playservertable=filename | --map={origin_server_url_prefix=cdn_prefix}] --db=database_name.db --xml=manifest_file_name.xml}


Table A-2: Manifest Script Keywords
Keyword Description Syntax

--file

Causes the Manifest script to read its commands from a specified rules file, one line at a time.

--file=ciscocontent.cfg

--live

Marks content URLs in the database file that match the terms of the keyword comparison as live (type="live") content in the manifest file.

--live='match(rtsp://*)'

--prepos

Marks content URLs in the database file that match the terms of the keyword comparison as pre-positioned content (type='prepos') in the manifest file.

--prepos='type(image/jpg) and size > 1000k'

--set

Sets the specified attribute to the value provided for all content items with URLs in the database file that match the keyword comparison.

--set='ttl=10000 : match(*/urgent/*)'

--playservertable

Adds the playserver table in the specified file to the manifest file. Playserver tables map MIME content types and filename extensions to specific server types to use (for example, "real" or "wmt") for the content in a specific hosted domain.

See the "Manifest File Structure and Syntax" section for more information on the <playServerTable> attributes.

--playservertable=info.txt

--map

Causes the Manifest script to substitute the second URL prefix (appearing after the second =) for the first in any URLs from the origin server.

--map=http://www.cisco.com/public/support/tac/=/support

--db

Names the database file in which content URLs from the origin server and any allowed locations are located. This file provides the data that the Manifest script analyzes.

--db=ciscocontent.db

--xml

Names the manifest file that is generated by the Manifest script.

--xml=ciscomanifest.xml

match

A comparison keyword that matches content URLs against a supplied pattern; the * and ? wildcards are supported.

--prepos='match(http://forums.cisco.com/*)'

size

A comparison keyword that identifies content named in the database file according to its file size. Values are in bytes unless followed by k, m, or g (kilobytes, megabytes, or gigabytes).

--prepos='size >= 1000k'

time

A comparison keyword that identifies content named in the database file according to the time since the content was last modified (in hours).

--prepos='time < 72 hours'

type

A comparison keyword that identifies content named in the database file according to its MIME type (text, application, image, and so on).

--prepos='type(image/gif)'
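
Internally, the Manifest script rewrites each keyword comparison into a small Perl test that is evaluated against the headers recorded for a URL in the database (see parse_sub in the "Manifest Script Source" section). As a rough sketch, the comparison 'type(image/*) and size > 1000k' behaves like the following function; the header values in the usage line are illustrative only:

    #!/usr/bin/perl -w
    use strict;

    # %H holds the headers recorded for one URL in the Spider database.
    # type(image/*) -> Content-Type matched against the glob, converted to a regex
    # size > 1000k  -> Content-Length compared in bytes (the k suffix multiplies by 1024)
    sub matches {
        my %H = @_;
        return (exists $H{'Content-Type'} && $H{'Content-Type'} =~ m@^image/.*$@)
            && ($H{'Content-Length'} || 0) > 1000 * 1024;
    }

    print matches('Content-Type' => 'image/jpeg', 'Content-Length' => 2_400_000)
        ? "prepositioned\n" : "skipped\n";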

Customizing the Manifest Script

Because the Manifest script anticipates certain platforms and scenarios that might not correspond to your own web site configuration, Cisco provides you with the PERL source code for the Manifest script, which you can modify to suit your own needs.

See the "Manifest Script Source" section to review the source code for the Manifest script.

Creating a Rules File for the Spider and Manifest Scripts

When you use the Spider and Manifest scripts on a large web server, the parameters and rules you set may become numerous and complex. In that case, it often makes more sense to create a file containing all of your instructions to the scripts, and simply point the scripts at that file, than to type a long series of commands time and again.

Using a rules file makes it easy to re-run the Spider and Manifest scripts, and be confident that the scripts are receiving identical commands each time. In addition, the same commands file can be read by both the Manifest and Spider scripts without generating incorrect output; the Spider script simply ignores commands for the Manifest script, and vice versa.

To create a rules file for the Spider and Manifest scripts to use:


Step 1   Open your preferred text editor.

Step 2   Enter your commands one per line. Each line of your rules file is sent to the scripts as a single argument.

For example, a rules file for the Cisco web site might read:

    --start=www.cisco.com
    --allow=forums.cisco.com
    --reject=cgi-bin
    --limit=0
    --db=ciscocontent.db
    --prepos='type(image/gif) and size > 1000k'
    --xml=ciscomanifest.xml

Step 3   Save your file in a location relative to the Spider and Manifest scripts.

Step 4   Use the --file keyword to run each script with your rules file. For example:

    spider --file=cisco-rules.cfg
    manifest --file=cisco-rules.cfg

Spider Script Source

#!/usr/bin/perl -w use strict; my @todo = (); # Array of urls we still have to fetch my %seen = (); # Hash of urls we've fetched use Getopt::Long; my $limit = 100; # Maximum number of URLs we might fetch. my $depth = 0; # Spidering depth (0 == infinite) my @prefix = (); my @filters = (); # A filter is a regexp and a bool. # (["mit\.edu", 1], [".", 0]) means accept mit.edu urls, reject all others my @start =(); # URLs to start spidering my $db = ""; # The filename to write the database to. my $proxy = ""; # The proxy to use when making HTTP requests # These allow us be intelligent about spidering sites that have already # been rewritten to contain links to the hosted domain. my @map = (); # origin to cdn-url mappings my $hd = ""; # The hosted domain my $debug = 0; # Print extra debugging info? # Return an array containing each line from a file. # Used by the --file option to allow stuffing @ARGV with args from a file. # '#' until end of line is a comment character (ie it is not returned) # whitespace is stripped from the beginning and end of lines # empty lines (or just comments and/or whitespace) are ignored sub lines ( $ ) { my ($filename) = @_; open (F, "< $filename") or die "$filename: $!\n"; my @lines = map { s/\#.*//g; s/\s*(\S*)\s*/$1/; $_ || (); } <F>; close F or die $!; return @lines; } # We want spider and manifest to be runnable from a single config file, so # each take all of the arguments of the other. Naturally, these arguments # are ignored if they are irrelevent. When running this way, it's # important to use the "--start" option to name urls in spider, and the # "--db" option to name databases in manifest. my $junk; GetOptions("limit=n" => \$limit, "depth=n" => \$depth, "prefix=s" => \@prefix, "accept=s" => sub {my ($opt, $val) = @_; push @filters, [$val,1]}, "reject=s" => sub {my ($opt, $val) = @_; push @filters, [$val,0]}, "hd|rd=s" => \$hd, "map=s" => \@map, "db=s" => \$db, "proxy=s" => \$proxy, "start=s" => \@start, "<>" => sub { push @start, $_[0]; }, # Arguments that all scripts take "file=s" => sub {my ($opt, $val) = @_; unshift @ARGV, lines($val)}, "debug!" => \$debug, # Arguments that are really only for 'manifest' or 'rewrite' "prepos=s" => \$junk, "live=s" => \$junk, "set=s" => \$junk, "recursive!" => \$junk, "playservertable=s" => \$junk, "xml=s" => \$junk, "file-map=s" => \$junk, "index=s" => \$junk, "od|origin=s" => \$junk, "always-rewrite=s" => \$junk, ) or die "Bad argument syntax\n"; my %rmap; # Reverse map for my $map (@map) { my ($origin, $cdn) = split('=', $map); $rmap{$cdn} = $origin; } # Allow crawling to any --prefix specified paths. They can be comma separated. @prefix = split(/,/,join(',',@prefix)); my %prefix; # Use a hash to avoid dupicates for (@prefix) { $prefix{$_} = 1; } # Given a url, extract the "prefix". That is, everything up to and # including the last '/'. sub prefix ( $ ) { my ($prefix) = @_; $prefix =~ s|(.*/).*|$1|; return $prefix; } use URI; # The reason to do this at all is so rtsp and mms urls have methods like # host(). 
my $http_impl= URI::implementor('http'); URI::implementor('rtsp', $http_impl); URI::implementor('mms', $http_impl); push @todo, map { s|^|http://| unless /:/; URI->new($_)->canonical } @start; for my $uri (@todo) { next if $seen{$uri}++; $prefix{prefix($uri)} = 1; } unshift @todo, $depth if $depth; # Integers in the todo list limit depth my $depth_left = 1; # Used to stop getting links if in last round my $prefix_re = "^(".join('|', map {quotemeta($_)} keys %prefix).")"; #warn "$prefix_re\n"; push @filters, [$prefix_re, 1]; # Accept appropriate prefixes push @filters, [".",0]; # Reject anything that gets to the end # Filter debugging #for my $f (@filters) { # warn "$f->[0] $f->[1]\n"; #} my %extractors = ("text/html" => \&html_extract, # Real Networks formats "application/smil" => \&smil_extract, "image/vnd.rn-realpix" => \&rp_extract, "text/vnd.rn-realtext" => \&rt_extract, "audio/x-pn-realaudio" => \&list_extract, "audio/x-pn-realaudio-plugin" => \&list_extract, # Microsoft formats "video/x-ms-asf" => \&asx_extract, "audio/x-ms-wax" => \&asx_extract, "video/x-ms-wvx" => \&asx_extract, # Flash "application/x-shockwave-flash" => \&swf_extract, # JavaScript "application/x-javascript" => \&js_extract, # .m3u files aren't really standardized... "audio/x-m3u" => \&list_extract, "audio/m3u" => \&list_extract, "audio/x-mpegurl" => \&list_extract, ); # Web servers are often stupid. Try to guess an extractor based on these # extensions if mime type doesn't work. my %ext_extractors = (# Real networks "smi" => \&smil_extract, "rp" => \&rp_extract, "rt" => \&rt_extract, "ram" => \&list_extract, "rpm" => \&list_extract, # Microsoft "asf" => \&asx_extract, "wax" => \&asx_extract, "wvx" => \&asx_extract, # Flash "swf" => \&swf_extract, # JavaScript "js" => \&js_extract, # And for good measure "m3u" => \&list_extract); # Given a URI and a mime type, return the appropriate extractor if it is a # container type, else 0; sub extractor ( $$ ) { my ($uri, $type) = @_; my $ext = lc($uri); $ext =~ s/(.*\.)//; # Remove everything up to the last . # Sleezy hack, but blame Real. They have code to differentiate .ram # files from .rm and .ra files instead of separate mime types. I really # don't want to suck down a multimegabyte binary file thinking it's a ram # file, so bail now. return 0 if $ext =~ /^r[ma]$/; return $extractors{lc($type)} if exists $extractors{lc($type)}; # Might want to use extention only for text/plain... return $ext_extractors{$ext} if exists $ext_extractors{$ext}; return 0; } # HTML extractor # The following hash is taken from HTML::LinkEtor. I've commented out all # the places where links appear but they don't seem to be necessary to view # the page, leaving things that should be considered embedded. # http://www.w3.org/TR/html4/ was used to determine what things meant. # Applet and object are supported poorly -- that is the 'base' attributes # don't work yet, nor does 'archive'. 
my %emb = ( # a => 'href', applet => [qw(code)], #archive codebase)], unsupported for now # area => 'href', # base => 'href', bgsound => 'src', # blockquote => 'cite', body => [qw(background)], # del => 'cite', # embed is not in w3c spec - described at # http://home.netscape.com/assist/net_sites/embed_tag.html embed => [qw(src pluginspage)], # form => 'action', frame => [qw(src longdesc)], iframe => [qw(src longdesc)], ilayer => [qw(background)], img => [qw(src lowsrc longdesc)], # usemap)], usemap is a local anchor input => [qw(src)], #usemap)], usemap is a local anchor # ins => 'cite', # isindex => 'action', # head => 'profile', layer => [qw(background src)], #'link' => 'href', object => [qw(classid data)], #codebase archive usemap)], unsupported for now 'q' => [qw(cite)], script => [qw(src)], #for)], "for" is not in w3c spec. Unsure what it means table => [qw(background)], td => [qw(background)], th => [qw(background)], # xmp => 'href', Deprecated, and I doubt an href is "embedded" anyway ); use HTML::LinkExtor; my $ex = HTML::LinkExtor->new(); sub html_extract ( $ ) { my ($content) = @_; my (@refs,@embs); $ex->parse($content); for my $link ($ex->links) { my ($tag, %attr) = @$link; KEY: while (my ($key, $val) = each(%attr)) { if (exists $emb{$tag}) { for my $attr (@{$emb{$tag}}) { if ($attr eq $key) { push @embs, $val; next KEY; } } } push @refs, $val # If it's not embedded, it must be a ref } } # Hackish. Since js_extract is lame anyway, we're not even bothering to # extract the JavaScript, just let js_extract look at the whole thing. my ($js_refs, $js_embs) = js_extract($content); push @refs, @$js_refs; push @embs, @$js_embs; (\@refs,\@embs); } # Trivial list format extractor. Assumes that URLs must be absolute # because these formats are usually used in a way that precludes their # interpretters from knowing the context, thus they must be absolute. This # has the advantage of being able to ignore "noise lines" like --stop-- in # Real files. sub list_extract ( $ ) { my ($content) = @_; my @embs = grep { m|^\s*[a-zA-Z]+://| } split("\n", $content); ([],\@embs); } # JavaScript extractor can't be perfect, but we can at least check out the # first argument to any window.open calls. If it's a constant (enclosed by # quotes), assume it's a url. # Furthermore, this won't extract corrcetly if there is a comma *inside* # the first argument. sub js_extract ( $ ) { my ($content) = @_; my @refs; while ($content =~ m/window\.open\s*\(\s*([^\)]+)\)/g) { my @args = split(/,/,$1); my $first = $args[0]; push @refs, $1 if $first =~ /^\'([^\']*)\'$/; push @refs, $1 if $first =~ /^\"([^\"]*)\"$/; } (\@refs,[]); } # XML base extractors (smil, rp, rt, asx) use XML::Parser; my $xp = XML::Parser->new(); sub smil_extract ( $ ) { my ($content) = @_; my (@refs,@embs); my @links = (); $xp->setHandlers('Start' => sub { shift; my $elt = shift; my %attrs = @_; push @refs, $attrs{href} if exists $attrs{href}; push @embs, $attrs{src} if exists $attrs{src}; }); $xp->parse($content); (\@refs,\@embs); } # Real has a command syntax that can be used in RealText files as something # to do when a link is clicked. One common use is to open a link in a new # window. command() either returns its argument, or if it's argument is # such a command, returns the URL that it mentions. If it's a command that # does not mention a URL, return an empty list. 
sub command ( $ ) { my ($command) = @_; return $command unless $command =~ /command:/; return () unless $command =~ /command:openwindow/; return () unless $command =~ /,([^,\)]*)[,\)]/; # Put second argument into $1 my $url = $1; $url =~ s/\s*(\S*)\s*/$1/; # Trim whitespace return $url; } sub rp_extract ( $ ) { my ($content) = @_; my (@refs,@embs); $xp->setHandlers('Start' => sub { shift; my $elt = shift; my %attrs = @_; push @refs, command($attrs{url}) if exists $attrs{url}; push @embs, $attrs{name} if $elt eq "image"; }); $xp->parse($content); (\@refs,\@embs); } sub rt_extract ( $ ) { my ($content) = @_; my @refs; $xp->setHandlers('Start' => sub { shift; my $elt = shift; my %attrs = @_; push @refs, command($attrs{href}) if exists $attrs{href}; }); $xp->parse($content); (\@refs,[]); } sub asx_extract ( $ ) { my ($content) = @_; my (@refs,@embs); $xp->setHandlers('Start' => sub { shift; my $elt = shift; my %attrs = @_; push @embs, $attrs{href} if exists $attrs{href}; }); $xp->parse($content); (\@refs,\@embs); } sub bin ( $ ) { my ($str) = @_; my $num = 0; while ($str ne "") { $num *= 2; $num += substr($str,0,1); substr($str,0,1) = ""; } return $num; } sub swf_extract ( $ ) { # For format info, see http://www.openswf.org/SWFfilereference.html my ($content) = @_; my (@refs,@embs); my $ndx = 8; # Start after sig, ver and length # Skip a RECT. See http://www.openswf.org/SWFfilereference.html#RECT my $bits = substr($content, $ndx, 1); $ndx += 1; $bits = bin(unpack("B5", $bits)); my $bytes = int(((5 + (4*$bits))+1)/8); $ndx += $bytes; $ndx += 4; # skip frame rate and count while ($ndx < length($content)) { my $buf = substr($content, $ndx, 2); $ndx += 2; $buf = unpack("S", $buf); my $tag = $buf >> 6; my $len = $buf & 0x3F; if ($len == 0x3f) { $len = substr($content, $ndx, 4); $ndx += 4; $len = unpack("L", $len); } if ($tag == 12) { # DoAction my $action; while ($len) { my $action = substr($content, $ndx, 1); $ndx += 1; $len--; $action = unpack("C", $action); if ($action & 0x80) { my $sublen = substr($content, $ndx, 2); $ndx += 2; $len -= 2; $sublen = unpack("S", $sublen); $buf = substr($content, $ndx, $sublen); $ndx += $sublen; $len -= $sublen; if ($action == 0x83) { # Get URL $buf =~ m/^([^\000]+)/; push @embs, $1; } } } } $ndx += $len; } (\@refs,\@embs); } use LWP::UserAgent; #use LWP::Debug qw(+); my $ua = new LWP::UserAgent; $ua->proxy('http', $proxy); sub fetch ( $ ) { my ($uri) = @_; warn "Retreiving $uri\n" if $debug; my $req = HTTP::Request->new(HEAD => $uri); $uri->scheme =~ /http|ftp|file/ or # Someday it would be nice to DESCRIBE return HTTP::Response->new(200); # rtsp urls. For now, act like we got it. my $res = $ua->request($req); # Check the outcome of the response if (!$res->is_success) { warn "Unable to HEAD $uri: ".$res->status_line."\n" if $debug; # This is bit cheesy, but since some servers barf on HEAD requests, # we do a GET on a hard failure. if ($res->code == 500) { $req = HTTP::Request->new(GET => $uri); $res = $ua->request($req); warn "Unable to GET $uri: ".$res->status_line."\n" unless $res->is_success; } return $res unless $res->is_success; } # If we can parse it then we should actually GET it, so we can spider off # links. Also, no need to retreive if we are going no deeper. 
return $res unless extractor($uri, $res->content_type) && $depth_left; unless ($req->method eq 'GET') { # Don't bother if we already had to GET $req = HTTP::Request->new(GET => $uri); $res = $ua->request($req); warn "Unable to GET $uri: ".$res->status_line."\n" unless $res->is_success; } # Insert Content-Length if it's not there $res->headers->header("Content-Length", length($res->content)) unless $res->headers->header("Content-Length"); return $res; } # Make sure a URI points to its origin, not the cdn. sub originify ( $ ) { my ($uri) = @_; # Look for URLs the content providers have already rewritten. We # will revert them to their original form (to find their origin # location) and spider that instead. (We don't want to spider the cdn.) return $uri unless defined $uri->host && $uri->host eq $hd; # It's a link to the hd. We have to reverse it or throw it away. my @path = $uri->path_segments; shift @path; # First segment is always nothing; if (@path and $path[0] =~ '^cdn-') { my @tail; shift @path; # Remove cdn-* tag unshift @path, ""; # Put that first segment back. while (@path) { my $path = join('/', @path); if (exists $rmap{$path}) { my $new = URI->new(join('/', $rmap{$path}, @tail))->canonical; return $new; } unshift @tail, pop @path; } } warn "Unreversable cdn link: $uri\n"; return 0; } # Convert an ARL back to it's original URL. Works only on one type of ARL. # Maybe it should return 0 if it sees an ARL it doesn't understand? sub deakamize ( $ ) { my ($arl) = @_; # 7 is hard coded because that is the typecode for this kind of ARL if ($arl =~ m@http://[^/]*akamai(?:tech)?.net/7/\d+/\d+/[\dabcdef]+/(.*)@) { return URI->new("http://$1"); } else { return $arl; } } # Filter out schemes we don't understand and queries, then convert the rest # to a standard form - pointing into the origin instead of cdn. sub canonicalize { my ($base, @urls) = @_; @urls = map { s/\#.*//; $_; } @urls; # Get rid of fragments @urls = map { URI->new_abs($_, $base)->canonical } @urls; # Standard form @urls = grep { $_->scheme =~ /http|ftp|file|rtsp|mms/ && # Filter for "normal" schemes ! $_->query; # and non-queries } @urls; return map { deakamize(originify($_)); } @urls # To origin } # return true if $url is ACCEPTed by filters. sub filter ( $$ ) { my ($filters, $url) = @_; for my $filter (@$filters) { return $filter->[1] if "$url" =~ $filter->[0]; } } my %catalog = (); # Maps URIs to response headers # Main spidering loop my $fetched = 0; while (my $uri = shift @todo) { # Stop if we are at max --depth if (!ref($uri)) { # A non-ref must be an integer, used for DEPTH if ($uri) { # Non-zero means we keep going push @todo, $uri-1; $depth_left = $uri-1; next; } my $left = @todo; warn "Stopping with $left urls left because --depth=$depth\n"; last; # Hit a zero, meaning stop } my $res = fetch($uri); next unless $res->is_success; $catalog{$uri} = $res->headers; $fetched++; if (my $extract = extractor($uri, $res->content_type)) { my ($refs, $embs) = &$extract($res->content); # Get urls into standard form @$refs = canonicalize($res->base, @$refs); @$embs = canonicalize($res->base, @$embs); # Get rid of urls we don't care about @$embs = grep { filter(\@filters, $_) } @$embs; @$refs = grep { filter(\@filters, $_) } @$refs; # Remove duplicate embs before saving. 
# Dup removal is unnecessary for correctness, but it avoids big CONTAINS my %dup = (); @$embs = grep { !$dup{$_}++ } @$embs; $catalog{$uri}->header(CONTAINS=>"@$embs"); # Add unseen urls to the todo list push @todo, grep { !$seen{$_}++ } (@$refs, @$embs); } # Stop if we have hit our --limit if ($fetched == $limit) { my $left = @todo; warn "Stopping with $left urls left because --limit=$limit\n"; last; } if ($debug and $fetched % 100 == 0) { warn sprintf "%d fetched, %d todo\n", $fetched, scalar @todo; } } # Output loop. The only reason this loop isn't built into the input loop is # so the output can be sorted. Maybe we don't really care about. If not, # it should be moved into the input loop so that output will continue as # progress is made and memory for the catalog will not be required. # datapoint: 20k URLs from hgtv cause 70Mb process and take > 11hrs use IO::File; my $DB = IO::Handle->new_from_fd(fileno(STDOUT),"w"); $DB = IO::File->new("> $db") or die "$db: $!\n" if $db; my @headers = qw/Content-Type Content-Length Last-Modified CONTAINS/; for my $key (sort keys %catalog) { print $DB "URL: $key\n"; for my $h (@headers) { print $DB "$h: ".$catalog{$key}->header($h)."\n" if $catalog{$key}->header($h); } print $DB "\n"; } close $DB or die $! if $db;

Manifest Script Source

#!/usr/bin/perl -w use strict; use Getopt::Long; my @db = (); # URL databases to read in my $xml = ""; # XML filename to write manifest to my @setters = (); # General attribute setters my $playservertable = ""; # File that contains the PlayServerTable my @map = (); # origin to cdn-url mappings my $debug = 0; # Print extra debugging info? my $recursive = 1; # Should prepos containers prepos their kids? # Return an array containing each line from a file. # Used by the --file option to allow stuffing @ARGV with args from a file. # '#' until end of line is a comment character (ie it is not returned) # whitespace is stripped from the beginning and end of lines # empty lines (or just comments and/or whitespace) are ignored sub lines ( $ ) { my ($filename) = @_; open (F, "< $filename") or die "$filename: $!\n"; my @lines = map { s/\#.*//g; s/\s*(\S*)\s*/$1/; $_ || (); } <F>; close F or die $!; return @lines; } # Convert a glob pattern to a regular expression. # Assumes that the glob matches only if it matches the entire string. sub glob2regex ( $ ) { # Note: This does not allow the writer of glob patterns to escape them. # * and ? are always special, [,],and- are always passed through. my ($glob) = @_; $glob = quotemeta($glob); # First, escape everything $glob =~ s/\\\*/.*/g; # Convert * to .* $glob =~ s/\\\?/./g; # Convert ? to . $glob =~ s/\\(\[|\-|\])/$1/g; # Reconstruct things like [a-z]. return "^$glob\$"; } sub process_type ( $$ ) { my ($opt, $val) = @_; push @setters, parse_setter("type=$opt:$val"); } # We want all scripts to be runnable from a single config file, so each # take all of the arguments of the others. Naturally, these arguments are # ignored if they are irrelevent. If you want to use identical command # lines for all three scripts, be sure to use the --start, --db, and --xml # options. my $junk; GetOptions("prepos=s" => \&process_type, "live=s" => \&process_type, "set=s" => sub {my $val = $_[1]; push @setters, parse_setter($val)}, "recursive!" => \$recursive, "playservertable=s" => $playservertable, "xml=s" => \$xml, "map=s" => \@map, "db=s" => \@db, "<>" => sub { push @db, $_[0]; }, # Arguments that all scripts take "file=s" => sub {my ($opt, $val) = @_; unshift @ARGV, lines($val)}, "debug!" 
=> \$debug, # Arguments that are really only for 'spider' or 'rewrite' "limit=n" => \$junk, "depth=n" => \$junk, "prefix=s" => \$junk, "accept=s" => \$junk, "reject=s" => \$junk, "hd|rd=s" => \$junk, "start=s" => \$junk, "file-map=s" => \$junk, "index=s" => \$junk, "od|origin=s" => \$junk, "always-rewrite=s" => \$junk, ) or die "Bad argument syntax\n"; sub parse_setter ( $ ) { my ($setter) = @_; my ($settings, $sub) = split ':', $setter, 2; my @settings = split ' ', $settings; my %settings = (); for my $setting (@settings) { my ($key, $val) = split '=', $setting; $settings{$key} = $val; } return [parse_sub($sub), \%settings]; } sub parse_sub ( $ ) { local ($_) = @_; warn "Changing \Q$_'\n" if $debug; # Allow comparisons to size, including shortcuts like 10k s/\bsize\b/\$H{'Content-Length'}/gio; s/\b(\d+)GB?\b/($1*1024M)/gio; s/\b(\d+)MB?\b/($1*1024K)/gio; s/\b(\d+)KB?\b/($1*1024)/gio; # Allow matching on URL s+\bmatch\(([^\)]*)\)+"m\@".glob2regex($1)."\@i"+gioe; # Allow matching on type s+\btype\(([^\)]*)\)+"(exists \$H{'Content-Type'} && \$H{'Content-Type'} =~ m@".glob2regex($1)."@)"+gioe; warn " into \Q$_'\n" if $debug; my $sub = eval "sub { my (\$i) = \@_; $_ }"; $sub or die "Unable to understand \Q$_'\n"; } my %map; for my $map (@map) { my ($origin, $cdn) = split('=', $map); $map{$origin} = $cdn; } # Convert one URI to another according to map. Always return a new URI, # even if contents are unchanged. sub translate ( $$ ) { my ($uri, $map) = @_; # Try each prefix, longer ones first for my $prefix (sort { length($b) <=> length($a) } keys %$map) { if (index($uri, $prefix) == 0) { my $t = "$uri"; substr($t, 0, length($prefix)) = $map->{$prefix}; return URI->new($t) } } return $uri->clone; } my @backups = grep { /=/ } @ARGV; # args with = in them are backup specifiers @ARGV = grep { ! /=/ } @ARGV; use URI; my %servers = (); my %items = (); # Catalog of all URLs my @items = (); # Same as %items, but sorted. my $depth = 0; my %default = (); # Current attributes in effect because of group my @chain = (); # Undo chain as groups are closed sub push_params ( % ) { my %hash = @_; my $str = params(%hash); my $changes = {}; while (my ($key,$val) = each %hash) { $changes->{$key} = $default{$key}; $default{$key} = $val; } push @chain, $changes; return $str; } sub pop_params () { my $changes = pop @chain; while (my ($key,$val) = each %{$changes}) { $default{$key} = $val; } } my %xml_ent = ('&' => 'amp', '<' => 'lt', '>' => 'gt', '"' => 'quot'); sub xml_attr ( $ ) { my ($val) = @_; $val =~ s/([\&\<\>\"])/&$xml_ent{$1};/; return "\"$val\""; } sub params ( % ) { my %hash = @_; my $str = ""; while (my ($key,$val) = each %hash) { $str .= " $key=" . xml_attr($val) unless defined $default{$key} && $default{$key} eq $val; } return $str; } sub open_server ( @ ) { return (" " x $depth++) . "<server".params(@_).">\n"; } sub close_server ( ) { die "More close_server()s than open_server()s!" unless $depth; return (" " x --$depth) . "</server>\n"; } sub host ( % ) { return (" " x $depth) . "<host".params(@_)."/>"; } sub open_group ( @ ) { return (" " x $depth++) . "<item-group".push_params(@_).">\n"; } sub close_group ( ) { die "More close_group()s than open_group()s!" unless $depth; pop_params(); return (" " x --$depth) . "</item-group>\n"; } sub item ( $ ) { my ($item) = @_; my $str = (" " x $depth) . "<item".params(%{$item->{attrs}}); if (! 
@{$item->{contains}}) { return "$str/>"; } $str .= ">\n"; $depth++; for my $contained (@{$item->{contains}}) { if ($contained->{type} eq 'prepos') { my $path = translate($contained->{uri}, \%map)->path; $str .= (" " x $depth) . "<contains".params('cdn-url'=>$path)."/>\n"; } } $depth--; $str .= (" " x $depth) . "</item>"; return $str; } sub header () { return <<HEADER <?xml version="1.0" standalone="no"?> <!DOCTYPE CdnManifest SYSTEM "CdnManifest.dtd"> <CdnManifest> HEADER } sub footer () { return "</CdnManifest>\n"; } sub set ( $$ ) { my ($setter, $i) = @_; my ($pred, $attrs) = @$setter; use vars "%H"; local ($_, %H) = ($i->{uri}, %{$i->{hdrs}}); if (&$pred($i)) { for my $key (keys %$attrs) { $i->{attrs}->{$key} = $attrs->{$key}; } return 1; } return 0; } sub print_items { my $server = ""; for my $i (@_) { my $uri = $i->{uri}; if ($uri->host ne $server) { print close_group() if $server; print open_group(server=>$uri->host); $server = $uri->host; } $i->{attrs}->{src} = $uri->path; my $t = translate($uri, \%map); $i->{attrs}->{'cdn-url'} = $t if $t ne $uri; # Only include if needed print item($i) . "\n"; } print close_group() if $server; } sub preposition_contents { my ($item) = @_; for my $contained (@{$item->{contains}}) { next if $contained->{type}; # Already set, or live warn "Forcing prepos of ".$contained->{uri}."\n" if $debug; $contained->{type} = $contained->{attrs}->{type} = 'prepos'; preposition_contents($contained); } } # Input loop. Read in all the URLs from the databases @ARGV = @db; while (<>) { my $item = { uri => undef, # Original url hdrs => {}, # Headers from database attrs => {}, contains => [], # items that this one contains type => "" # Convenience. It's just attrs->{type} }; { do { # do {} while IS NOT A LOOP, the extra braces allow "last" to work chomp; last unless $_; my ($header, $val) = split(": ", $_, 2); $item->{hdrs}->{$header} = $val; } while (<>); } die "Headers without a URL!\n" unless exists $item->{hdrs}->{URL}; my $uri = $item->{uri} = URI->new($item->{hdrs}->{URL}); push @items, $items{$uri} = $item; $servers{$uri->host} = 1; } # Figure out what contains what my %missing = (); # Tracks URI that have been reported missing for my $item (@items) { if (exists $item->{hdrs}->{CONTAINS}) { my @contains = (); my @missing = (); for my $c (split ' ', $item->{hdrs}->{CONTAINS}) { my $contained = $items{$c}; if ($contained) { push @contains, $contained; } else { # Only consider $c missing if it has not yet been reported push @missing, $c unless $missing{$c}++; } } warn $item->{uri}." 
contains missing urls:\n ".join("\n ", @missing)."\n" if @missing; $item->{contains} = \@contains; } } # Report URLs that were missing multiple times my $intro = 0; while (my ($uri, $times) = each(%missing)) { if ($times > 1) { warn "Some URLs are missing multiple times:\n" unless $intro++; warn sprintf " %3d %s\n", $times, $uri; } } warn "\n" if $intro; # Run all the command line setters for my $item (@items) { for my $setter (@setters) { set($setter, $item); } $item->{type} = $item->{attrs}->{type} if exists $item->{attrs}->{type}; } # Recursively preposition anything that is contained in a prepositioned item if ($recursive) { for my $item (@items) { next unless $item->{type} eq 'prepos'; preposition_contents($item); } } # Spit out the manifest file if ($xml) { open XML, "> $xml" or die "$xml: $!\n"; select XML; } print header(); for my $s (keys %servers) { print open_server(name=>$s); print host(name=>$s, proto=>'http')."\n"; print close_server(); } print "\n<!-- Prepositioned Items -->\n"; print open_group(type=>"prepos"); print_items(grep {$_->{type} eq 'prepos'} @items); print close_group(); print "\n<!-- Live Items -->\n"; print open_group(type=>"live"); print_items(grep {$_->{type} eq 'live'} @items); print close_group(); if ($playservertable) { open TABLE, "< $playservertable" or die "$playservertable: $!\n"; print <TABLE>; close TABLE or die $!; } print footer(); close XML or die $! if $xml; select STDOUT; # Collect and print some simple statistics my %space = (); # Amount of space used by various types my $space; my %num = (); # Number of pieces of each type pf content my $num = 0; for my $item (grep {$_->{type} eq 'prepos'} @items) { if (! exists $item->{hdrs}->{'Content-Type'}) { warn $item->{uri} . " has no content type.\n"; next; } my $type = $item->{hdrs}->{'Content-Type'}; $type =~ s/[\s,;].*$//; $num{$type} ||= 0; $num{$type}++; $num++; $space{$type} ||= 0; if (exists $item->{hdrs}->{'Content-Length'}) { $space{$type} += $item->{hdrs}->{'Content-Length'}; $space += $item->{hdrs}->{'Content-Length'}; } } my $k = 1024; my $m = 1024*$k; my $g = 1024*$m; sub abbrev ( $ ) { my ($num) = @_; return 0 unless defined $num; return ($num/$g,"G") if $num > $g; return ($num/$m,"M") if $num > $m; return ($num/$k,"K") if $num > $k; return ($num,"b"); } for my $type (sort { $space{$a} <=> $space{$b} } keys %num) { warn sprintf "%22s %5d %4d%s\n", $type, $num{$type}, abbrev($space{$type}); } warn sprintf "%22s %5d %4d%s\n", "Total", $num, abbrev($space);
