Sample Manifest File Scripts

This appendix contains information that you can use to automate the creation of manifest files for your web site.

This appendix contains the following sections:

• Overview
• Installing PERL on Your Workstation
• Obtaining the Scripts
• Listing Web Site Content Using the Spider Script
• Selecting Live and Pre-position Content Using the Manifest Script
• Creating a Rules File for the Spider and Manifest Scripts
• Spider Script Source
• Manifest Script Source

Overview

Two sample scripts are provided in this appendix:

• The Spider script, which crawls an origin server and records the URLs of content found there in a database file

• The Manifest script, which reads that database file and generates a manifest file identifying the content to be pre-positioned on, or served live from, a hosted domain

These scripts shipped with your CDN software and can serve as the basis for your own automation scripts.

Installing PERL on Your Workstation

You need to have PERL installed on your workstation prior to working with or running the Spider or Manifest scripts. It may also be useful to have a PERL compiler available. PERL is open source software and can be downloaded for free from a variety of locations on the Internet. Refer to the Comprehensive PERL Archive Network (CPAN) at http://www.cpan.org, or http://www.perl.com.
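
The scripts rely on several standard CPAN modules; the source listings in this appendix reference Getopt::Long, URI, LWP::UserAgent, HTML::LinkExtor, XML::Parser, and IO::File. A quick way to confirm that these modules are installed (this one-liner is only a convenience check, not part of the CDN software) is:

    perl -MGetopt::Long -MURI -MLWP::UserAgent -MHTML::LinkExtor -MXML::Parser -MIO::File -e 'print "required modules found\n"'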

Obtaining the Scripts

The Spider and Manifest scripts can be obtained from Cisco.com using the same procedure that is used to obtain updated versions of the Cisco Internet CDN Software.

To obtain the Manifest and Spider scripts from Cisco.com:


Step 1   Launch your preferred web browser and point it to:

http://www.cisco.com/cgi-bin/tablebuild.pl/cdn-sp

Step 2   When prompted, log in to Cisco.com using your designated Cisco.com username and password.

The Cisco Internet CDN Software download page appears, listing the available software updates for the Cisco Internet CDN Software product.

Step 3   Locate the file named manifest-tools.zip. This is a ZIP archive containing both the Manifest and Spider PERL scripts.

Step 4   Click the link for the manifest-tools.zip file. The download page appears.

Step 5   Click the Software License Agreement link. A new browser window will open displaying the license agreement.

Step 6   After you have read the license agreement, close the browser window displaying the agreement and return to the Software Download page.

Step 7   Click the filename link labeled Download.

Step 8   Click Save to file and then choose a location on your workstation to temporarily store the zip file containing the scripts.

Step 9   Use your preferred unzip program to unpack the scripts to a location on your workstation or your network.

After you have unzipped the scripts, you are ready to begin using them to build manifest files for your website. See the "Listing Web Site Content Using the Spider Script" section and the "Selecting Live and Pre-position Content Using the Manifest Script" section for instructions on running the scripts.


Listing Web Site Content Using the Spider Script

This section contains information on the following topics:

• Limiting Scope
• Broadening Scope
• Re-spidering Servers
• Spider Script Syntax Guidelines
• Combining Spider Data
• Customizing the Spider Script

In the simplest scenario, the spider is pointed to the address of an origin server and given the name of a database (.db) file into which it will place any valid URLs it discovers on that site. For example, if you wanted to analyze the contents of www.cisco.com for content that might be pre-positioned, you would issue the following command:

spider --start=www.cisco.com --db=ciscocontent.db

Limiting Scope

But spidering the entirety of www.cisco.com might take hours and produce much more information than you are interested in. What if you want to limit your review of an origin server to just a particular part of that server? The Spider script contains a variety of tools that enable you to limit as well as broaden the scope of a spider's action.

For example, to limit the spider's search of www.cisco.com to just that part of the server containing product-related support information, you could enter the following command:

spider --start=www.cisco.com/public/support/ --db=ciscocontent.db

Broadening Scope

Or, to ask the spider to follow links from www.cisco.com to the Cisco networking professionals forum, you could enter the following spider command:

spider --start=www.cisco.com --allow=forums.cisco.com --db=ciscocontent.db

Re-spidering Servers

In addition to covering new origin servers, the Spider script can also be run on sites that have already been analyzed and that contain links into the CDN. When spidering a server that has already been analyzed, you use the --hd keyword to specify the name of the hosted domain on which content from the origin server will be hosted, and the --map keyword to provide mapping information between URLs on the origin server and URLs on the Internet CDN.

For example, the following command traces the content mapped to the /support area on the hosted domain www.hosted.cisco.com back to its origin in the support area of www.cisco.com:

spider --start=http://www.cisco.com/public/support/tac/home.html --hd=www.hosted.cisco.com --map=http://www.cisco.com/public/support/tac/=/support --db=ciscocontent.db

In each of these examples, the Spider analyzes the URL of each piece of content on the origin server (or in the targeted area of the origin server) and applies filters, built from the parameters supplied when the Spider was run, that identify candidates for pre-positioning or live streaming. If a URL matches an accept pattern, it is recorded in the database the Spider is building. If it does not, the content is rejected and the Spider moves on.
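
Internally, the accept and reject decisions are made by a simple first-match filter list (see the "Spider Script Source" section). The following minimal sketch illustrates the idea using the --allow and --reject values from the examples above; the patterns and test URL are illustrative only:

    #!/usr/bin/perl -w
    use strict;

    # Each filter is a [regular_expression, accept_flag] pair; the first
    # pattern that matches a URL decides whether that URL is kept.
    my @filters = (
        ["forums\\.cisco\\.com",                      1],  # --allow=forums.cisco.com
        ["cgi-bin",                                   0],  # --reject=cgi-bin
        ["^http://www\\.cisco\\.com/public/support/", 1],  # prefixes taken from --start/--prefix
        [".",                                         0],  # reject anything else
    );

    sub filter {
        my ($filters, $url) = @_;
        for my $f (@$filters) {
            return $f->[1] if $url =~ $f->[0];
        }
        return 0;
    }

    print filter(\@filters, "http://www.cisco.com/public/support/tac/home.html")
        ? "accepted\n" : "rejected\n";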

Spider Script Syntax Guidelines

The Spider script accepts the following syntax:

spider {--start=origin_server_url [--allow=allowed_url | --depth=number | --file=filename | {--hd=hosted_domain_name --map={origin_server_url_prefix=cdn_prefix}} | --limit=number | --prefix=url_prefix | --reject=disallowed_url] --db=database_name.db}


Table A-1: Spider Script Keywords
Keyword Description Syntax

--start

Names the location (URL) of the origin server that will be analyzed.

--start=www.cisco.com

--db

Names the database file in which content URLs from the origin server and any allowed locations will be placed.

--db=ciscocontent.db

--allow (optional)

Names a location other than that specified using the start keyword that will be accepted when it is found in URLs.

--allow=forums.cisco.com

--depth (optional)

Causes the Spider script to stop after following links a specified number of levels deep on the origin server.

--depth=6

--file (optional)

Causes the Spider script to read its commands from a specified rules file, one line at a time.

--file=cisco-rules.cfg

--hd (optional)

Identifies a hosted domain on your CDN as the hosted domain for the content being spidered. Used with the --map keyword for mapping content from the CDN back to the origin server.

--hd=www.hosted.cisco.com

--map (optional)

Causes the Spider script to substitute the second URL prefix (appearing after the second =) for the first in any URLs from the origin server, or to substitute the first prefix for the second when re-spidering content on an origin server.

--map=http://www.cisco.com/public/support/tac/=/support

--limit (optional)

Causes the Spider script to stop after retrieving a specified number of pages from the origin server. The default is 100. Specifying 0 sets no limit for the number of pages retrieved.

--limit=1000

--prefix (optional)

Specifies a URL prefix which, when it is encountered, will be accepted by the Spider.

--prefix=http://www.cisco.com/partners/CDN/

--reject (optional)

Names a location that will be rejected when it is found in URLs.

--reject=cgi-bin
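
These keywords can be combined freely. For example, the following command (the depth and limit values are illustrative) restricts a crawl of the Cisco support area to three levels of links and 500 pages while skipping CGI URLs:

spider --start=www.cisco.com/public/support/ --reject=cgi-bin --depth=3 --limit=500 --db=ciscocontent.db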

Combining Spider Data

What if you ran the Spider script on two separate locations on an origin server, but would like to combine the content into one database from which a manifest file will be generated?

The data output by the Spider can easily be combined—just open the *.db file containing the data you want to move, select that data, and copy it. Next, open the *.db file you want to serve as the merged file, locate the end of the file, and paste the data you copied into it.

The Manifest script can now be run on the merged data.
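
For reference, each record in a Spider database file is a short block of headers that begins with a URL line and ends with a blank line (the fields written are Content-Type, Content-Length, Last-Modified, and CONTAINS, as the "Spider Script Source" section shows), which is why records pasted from one file onto the end of another remain valid. A sample record, with illustrative values, looks like this:

    URL: http://www.cisco.com/public/support/tac/home.html
    Content-Type: text/html
    Content-Length: 18342
    Last-Modified: Mon, 23 Sep 2002 17:01:00 GMT
    CONTAINS: http://www.cisco.com/images/logo.gif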

Customizing the Spider Script

Because the Spider script anticipates certain platforms and scenarios that might not correspond to your own web site configuration, Cisco provides you with the PERL source code for the Spider script, which you can modify to suit your own needs.

See the "Spider Script Source" section to review the source code for the Spider script.

Selecting Live and Pre-position Content Using the Manifest Script

Whereas the Spider script is used to gather a list of potential hosted content from an origin server, the Manifest script is where you cull through the information gathered by the Spider and decide which content you will actually import to the CDN for placement on a hosted domain.

This section contains information on the following topics:

• Pre-Positioned Versus Live Content
• Manifest Script Syntax Guidelines
• Customizing the Manifest Script

Pre-Positioned Versus Live Content

The Manifest script distinguishes between content that needs to be pre-positioned and live, streamed content that, by definition, cannot be pre-positioned.

Using the --prepos keyword, you identify and pre-position all content that meets criteria you specify. For example, to pre-position all image files from cisco.com larger than one megabyte, you would enter the following command:

manifest --prepos='type(image/*) and size > 1000k' --db=ciscocontent.db --xml=cisco.xml

Using the --live keyword, you identify the URLs of live content. Unlike pre-positioned content, live content cannot be identified by information stored in its headers, so you need to devise a method of locating live content based solely on information contained in its URL. For example, you might identify streamed content with the following command:

manifest --live='match(rtsp://*)'
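
The --prepos and --live keywords can be supplied in the same run, together with the --db and --xml keywords, so that a single command produces one manifest file covering both kinds of content. For example, reusing the filenames from the earlier examples:

manifest --prepos='type(image/*) and size > 1000k' --live='match(rtsp://*)' --db=ciscocontent.db --xml=cisco.xml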

Manifest Script Syntax Guidelines

The Manifest script accepts the following syntax:

manifest {[--file=filename | --live='keyword_comparison' | --prepos='keyword_comparison' | --set='attribute=value : keyword_comparison' | --playservertable=filename | --map={origin_server_url_prefix=cdn_prefix}] --db=database_name.db --xml=manifest_file_name.xml}


Table A-2: Manifest Script Keywords
Keyword Description Syntax

--file

Causes the Manifest script to read its commands from a specified rules file, one line at a time.

--file=ciscocontent.cfg

--live

Marks content URLs in the database file that match the terms of the keyword comparison as live (type="live") content in the manifest file.

--live='match(rtsp://*)'

--prepos

Marks content URLs in the database file that match the terms of the keyword comparison as pre-positioned content (type='prepos') in the manifest file.

--prepos='type(image/jpg) and size > 1000k'

--set

Sets the specified attribute to the value provided for all content items with URLs in the database file that match the keyword comparison.

--set='ttl=10000 : match(*/urgent/*)'

--playservertable

Adds the playserver table in the specified file to the manifest file. Playserver tables map MIME content types and filename extensions to specific server types to use (for example, "real" or "wmt") for the content in a specific hosted domain.

See the "Manifest File Structure and Syntax" section for more information on the <playServerTable> attributes.

--playservertable=info.txt

--map

Causes the Manifest script to substitute the second URL prefix (appearing after the second =) for the first in any URLs from the origin server.

--map=http://www.cisco.com/public/support/tac/=/support

--db

Names the database file in which content URLs from the origin server and any allowed locations are located. This file provides the data that the Manifest script analyzes.

--db=ciscocontent.db

--xml

Names the manifest file that is generated by the Manifest script.

--xml=ciscomanifest.xml

match

A comparison keyword that matches content URLs against a supplied pattern; the * and ? wildcards are supported.

--prepos='match(http://forums.cisco.com/*)'

size

A comparison keyword that identifies content named in the database file according to its file size. Values are in bytes unless followed by k, m, or g (kilobytes, megabytes, or gigabytes).

--prepos='size >= 1000k'

time

A comparison keyword that identifies content named in the database file according to the time since the content was last modified (in hours).

--prepos='time < 72 hours'

type

A comparison keyword that identifies content named in the database file according to its MIME type (text, application, image, and so on).

--prepos='type(image/gif)'
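
Internally, the Manifest script rewrites each keyword comparison into a small Perl test that is evaluated against the headers recorded for a URL in the database (see parse_sub in the "Manifest Script Source" section). As a rough sketch, the comparison 'type(image/*) and size > 1000k' behaves like the following function; the header values in the usage line are illustrative only:

    #!/usr/bin/perl -w
    use strict;

    # %H holds the headers recorded for one URL in the Spider database.
    # type(image/*) -> Content-Type matched against the glob, converted to a regex
    # size > 1000k  -> Content-Length compared in bytes (the k suffix multiplies by 1024)
    sub matches {
        my %H = @_;
        return (exists $H{'Content-Type'} && $H{'Content-Type'} =~ m@^image/.*$@)
            && ($H{'Content-Length'} || 0) > 1000 * 1024;
    }

    print matches('Content-Type' => 'image/jpeg', 'Content-Length' => 2_400_000)
        ? "prepositioned\n" : "skipped\n";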

Customizing the Manifest Script

Because the Manifest script anticipates certain platforms and scenarios that might not correspond to your own web site configuration, Cisco provides you with the PERL source code for the Manifest script, which you can modify to suit your own needs.

See the "Manifest Script Source" section to review the source code for the Manifest script.

Creating a Rules File for the Spider and Manifest Scripts

When you use the Spider and Manifest scripts on a large web server, the parameters and rules you set may become numerous and complex. In that case, it often makes more sense to create a file containing all of your instructions to the scripts, and simply point the scripts at that file, than to type a long series of commands time and again.

Using a rules file makes it easy to re-run the Spider and Manifest scripts, and be confident that the scripts are receiving identical commands each time. In addition, the same commands file can be read by both the Manifest and Spider scripts without generating incorrect output; the Spider script simply ignores commands for the Manifest script, and vice versa.

To create a rules file for the Spider and Manifest scripts to use:


Step 1   Open your preferred text editor.

Step 2   Enter your commands one per line. Each line of your rules file is sent to the scripts as a single argument.

For example, a rules file for the Cisco web site might read:

    --start=www.cisco.com
    --allow=forums.cisco.com
    --reject=cgi-bin
    --limit=0
    --db=ciscocontent.db
    --prepos='type(image/gif) and size > 1000k'
    --xml=ciscomanifest.xml

Step 3   Save your file in a location relative to the Spider and Manifest scripts.

Step 4   Use the --file keyword to run each script with your rules file. For example:

    spider --file=cisco-rules.cfg
    manifest --file=cisco-rules.cfg

Spider Script Source

#!/usr/bin/perl -w use strict; my @todo = (); # Array of urls we still have to fetch my %seen = (); # Hash of urls we've fetched use Getopt::Long; my $limit = 100; # Maximum number of URLs we might fetch. my $depth = 0; # Spidering depth (0 == infinite) my @prefix = (); my @filters = (); # A filter is a regexp and a bool. # (["mit\.edu", 1], [".", 0]) means accept mit.edu urls, reject all others my @start =(); # URLs to start spidering my $db = ""; # The filename to write the database to. my $proxy = ""; # The proxy to use when making HTTP requests # These allow us be intelligent about spidering sites that have already # been rewritten to contain links to the hosted domain. my @map = (); # origin to cdn-url mappings my $hd = ""; # The hosted domain my $debug = 0; # Print extra debugging info? # Return an array containing each line from a file. # Used by the --file option to allow stuffing @ARGV with args from a file. # '#' until end of line is a comment character (ie it is not returned) # whitespace is stripped from the beginning and end of lines # empty lines (or just comments and/or whitespace) are ignored sub lines ( $ ) { my ($filename) = @_; open (F, "< $filename") or die "$filename: $!\n"; my @lines = map { s/\#.*//g; s/\s*(\S*)\s*/$1/; $_ || (); } <F>; close F or die $!; return @lines; } # We want spider and manifest to be runnable from a single config file, so # each take all of the arguments of the other. Naturally, these arguments # are ignored if they are irrelevent. When running this way, it's # important to use the "--start" option to name urls in spider, and the # "--db" option to name databases in manifest. my $junk; GetOptions("limit=n" => \$limit, "depth=n" => \$depth, "prefix=s" => \@prefix, "accept=s" => sub {my ($opt, $val) = @_; push @filters, [$val,1]}, "reject=s" => sub {my ($opt, $val) = @_; push @filters, [$val,0]}, "hd|rd=s" => \$hd, "map=s" => \@map, "db=s" => \$db, "proxy=s" => \$proxy, "start=s" => \@start, "<>" => sub { push @start, $_[0]; }, # Arguments that all scripts take "file=s" => sub {my ($opt, $val) = @_; unshift @ARGV, lines($val)}, "debug!" => \$debug, # Arguments that are really only for 'manifest' or 'rewrite' "prepos=s" => \$junk, "live=s" => \$junk, "set=s" => \$junk, "recursive!" => \$junk, "playservertable=s" => \$junk, "xml=s" => \$junk, "file-map=s" => \$junk, "index=s" => \$junk, "od|origin=s" => \$junk, "always-rewrite=s" => \$junk, ) or die "Bad argument syntax\n"; my %rmap; # Reverse map for my $map (@map) { my ($origin, $cdn) = split('=', $map); $rmap{$cdn} = $origin; } # Allow crawling to any --prefix specified paths. They can be comma separated. @prefix = split(/,/,join(',',@prefix)); my %prefix; # Use a hash to avoid dupicates for (@prefix) { $prefix{$_} = 1; } # Given a url, extract the "prefix". That is, everything up to and # including the last '/'. sub prefix ( $ ) { my ($prefix) = @_; $prefix =~ s|(.*/).*|$1|; return $prefix; } use URI; # The reason to do this at all is so rtsp and mms urls have methods like # host(). 
my $http_impl= URI::implementor('http'); URI::implementor('rtsp', $http_impl); URI::implementor('mms', $http_impl); push @todo, map { s|^|http://| unless /:/; URI->new($_)->canonical } @start; for my $uri (@todo) { next if $seen{$uri}++; $prefix{prefix($uri)} = 1; } unshift @todo, $depth if $depth; # Integers in the todo list limit depth my $depth_left = 1; # Used to stop getting links if in last round my $prefix_re = "^(".join('|', map {quotemeta($_)} keys %prefix).")"; #warn "$prefix_re\n"; push @filters, [$prefix_re, 1]; # Accept appropriate prefixes push @filters, [".",0]; # Reject anything that gets to the end # Filter debugging #for my $f (@filters) { # warn "$f->[0] $f->[1]\n"; #} my %extractors = ("text/html" => \&html_extract, # Real Networks formats "application/smil" => \&smil_extract, "image/vnd.rn-realpix" => \&rp_extract, "text/vnd.rn-realtext" => \&rt_extract, "audio/x-pn-realaudio" => \&list_extract, "audio/x-pn-realaudio-plugin" => \&list_extract, # Microsoft formats "video/x-ms-asf" => \&asx_extract, "audio/x-ms-wax" => \&asx_extract, "video/x-ms-wvx" => \&asx_extract, # Flash "application/x-shockwave-flash" => \&swf_extract, # JavaScript "application/x-javascript" => \&js_extract, # .m3u files aren't really standardized... "audio/x-m3u" => \&list_extract, "audio/m3u" => \&list_extract, "audio/x-mpegurl" => \&list_extract, ); # Web servers are often stupid. Try to guess an extractor based on these # extensions if mime type doesn't work. my %ext_extractors = (# Real networks "smi" => \&smil_extract, "rp" => \&rp_extract, "rt" => \&rt_extract, "ram" => \&list_extract, "rpm" => \&list_extract, # Microsoft "asf" => \&asx_extract, "wax" => \&asx_extract, "wvx" => \&asx_extract, # Flash "swf" => \&swf_extract, # JavaScript "js" => \&js_extract, # And for good measure "m3u" => \&list_extract); # Given a URI and a mime type, return the appropriate extractor if it is a # container type, else 0; sub extractor ( $$ ) { my ($uri, $type) = @_; my $ext = lc($uri); $ext =~ s/(.*\.)//; # Remove everything up to the last . # Sleezy hack, but blame Real. They have code to differentiate .ram # files from .rm and .ra files instead of separate mime types. I really # don't want to suck down a multimegabyte binary file thinking it's a ram # file, so bail now. return 0 if $ext =~ /^r[ma]$/; return $extractors{lc($type)} if exists $extractors{lc($type)}; # Might want to use extention only for text/plain... return $ext_extractors{$ext} if exists $ext_extractors{$ext}; return 0; } # HTML extractor # The following hash is taken from HTML::LinkEtor. I've commented out all # the places where links appear but they don't seem to be necessary to view # the page, leaving things that should be considered embedded. # http://www.w3.org/TR/html4/ was used to determine what things meant. # Applet and object are supported poorly -- that is the 'base' attributes # don't work yet, nor does 'archive'. 
my %emb = ( # a => 'href', applet => [qw(code)], #archive codebase)], unsupported for now # area => 'href', # base => 'href', bgsound => 'src', # blockquote => 'cite', body => [qw(background)], # del => 'cite', # embed is not in w3c spec - described at # http://home.netscape.com/assist/net_sites/embed_tag.html embed => [qw(src pluginspage)], # form => 'action', frame => [qw(src longdesc)], iframe => [qw(src longdesc)], ilayer => [qw(background)], img => [qw(src lowsrc longdesc)], # usemap)], usemap is a local anchor input => [qw(src)], #usemap)], usemap is a local anchor # ins => 'cite', # isindex => 'action', # head => 'profile', layer => [qw(background src)], #'link' => 'href', object => [qw(classid data)], #codebase archive usemap)], unsupported for now 'q' => [qw(cite)], script => [qw(src)], #for)], "for" is not in w3c spec. Unsure what it means table => [qw(background)], td => [qw(background)], th => [qw(background)], # xmp => 'href', Deprecated, and I doubt an href is "embedded" anyway ); use HTML::LinkExtor; my $ex = HTML::LinkExtor->new(); sub html_extract ( $ ) { my ($content) = @_; my (@refs,@embs); $ex->parse($content); for my $link ($ex->links) { my ($tag, %attr) = @$link; KEY: while (my ($key, $val) = each(%attr)) { if (exists $emb{$tag}) { for my $attr (@{$emb{$tag}}) { if ($attr eq $key) { push @embs, $val; next KEY; } } } push @refs, $val # If it's not embedded, it must be a ref } } # Hackish. Since js_extract is lame anyway, we're not even bothering to # extract the JavaScript, just let js_extract look at the whole thing. my ($js_refs, $js_embs) = js_extract($content); push @refs, @$js_refs; push @embs, @$js_embs; (\@refs,\@embs); } # Trivial list format extractor. Assumes that URLs must be absolute # because these formats are usually used in a way that precludes their # interpretters from knowing the context, thus they must be absolute. This # has the advantage of being able to ignore "noise lines" like --stop-- in # Real files. sub list_extract ( $ ) { my ($content) = @_; my @embs = grep { m|^\s*[a-zA-Z]+://| } split("\n", $content); ([],\@embs); } # JavaScript extractor can't be perfect, but we can at least check out the # first argument to any window.open calls. If it's a constant (enclosed by # quotes), assume it's a url. # Furthermore, this won't extract corrcetly if there is a comma *inside* # the first argument. sub js_extract ( $ ) { my ($content) = @_; my @refs; while ($content =~ m/window\.open\s*\(\s*([^\)]+)\)/g) { my @args = split(/,/,$1); my $first = $args[0]; push @refs, $1 if $first =~ /^\'([^\']*)\'$/; push @refs, $1 if $first =~ /^\"([^\"]*)\"$/; } (\@refs,[]); } # XML base extractors (smil, rp, rt, asx) use XML::Parser; my $xp = XML::Parser->new(); sub smil_extract ( $ ) { my ($content) = @_; my (@refs,@embs); my @links = (); $xp->setHandlers('Start' => sub { shift; my $elt = shift; my %attrs = @_; push @refs, $attrs{href} if exists $attrs{href}; push @embs, $attrs{src} if exists $attrs{src}; }); $xp->parse($content); (\@refs,\@embs); } # Real has a command syntax that can be used in RealText files as something # to do when a link is clicked. One common use is to open a link in a new # window. command() either returns its argument, or if it's argument is # such a command, returns the URL that it mentions. If it's a command that # does not mention a URL, return an empty list. 
sub command ( $ ) { my ($command) = @_; return $command unless $command =~ /command:/; return () unless $command =~ /command:openwindow/; return () unless $command =~ /,([^,\)]*)[,\)]/; # Put second argument into $1 my $url = $1; $url =~ s/\s*(\S*)\s*/$1/; # Trim whitespace return $url; } sub rp_extract ( $ ) { my ($content) = @_; my (@refs,@embs); $xp->setHandlers('Start' => sub { shift; my $elt = shift; my %attrs = @_; push @refs, command($attrs{url}) if exists $attrs{url}; push @embs, $attrs{name} if $elt eq "image"; }); $xp->parse($content); (\@refs,\@embs); } sub rt_extract ( $ ) { my ($content) = @_; my @refs; $xp->setHandlers('Start' => sub { shift; my $elt = shift; my %attrs = @_; push @refs, command($attrs{href}) if exists $attrs{href}; }); $xp->parse($content); (\@refs,[]); } sub asx_extract ( $ ) { my ($content) = @_; my (@refs,@embs); $xp->setHandlers('Start' => sub { shift; my $elt = shift; my %attrs = @_; push @embs, $attrs{href} if exists $attrs{href}; }); $xp->parse($content); (\@refs,\@embs); } sub bin ( $ ) { my ($str) = @_; my $num = 0; while ($str ne "") { $num *= 2; $num += substr($str,0,1); substr($str,0,1) = ""; } return $num; } sub swf_extract ( $ ) { # For format info, see http://www.openswf.org/SWFfilereference.html my ($content) = @_; my (@refs,@embs); my $ndx = 8; # Start after sig, ver and length # Skip a RECT. See http://www.openswf.org/SWFfilereference.html#RECT my $bits = substr($content, $ndx, 1); $ndx += 1; $bits = bin(unpack("B5", $bits)); my $bytes = int(((5 + (4*$bits))+1)/8); $ndx += $bytes; $ndx += 4; # skip frame rate and count while ($ndx < length($content)) { my $buf = substr($content, $ndx, 2); $ndx += 2; $buf = unpack("S", $buf); my $tag = $buf >> 6; my $len = $buf & 0x3F; if ($len == 0x3f) { $len = substr($content, $ndx, 4); $ndx += 4; $len = unpack("L", $len); } if ($tag == 12) { # DoAction my $action; while ($len) { my $action = substr($content, $ndx, 1); $ndx += 1; $len--; $action = unpack("C", $action); if ($action & 0x80) { my $sublen = substr($content, $ndx, 2); $ndx += 2; $len -= 2; $sublen = unpack("S", $sublen); $buf = substr($content, $ndx, $sublen); $ndx += $sublen; $len -= $sublen; if ($action == 0x83) { # Get URL $buf =~ m/^([^\000]+)/; push @embs, $1; } } } } $ndx += $len; } (\@refs,\@embs); } use LWP::UserAgent; #use LWP::Debug qw(+); my $ua = new LWP::UserAgent; $ua->proxy('http', $proxy); sub fetch ( $ ) { my ($uri) = @_; warn "Retreiving $uri\n" if $debug; my $req = HTTP::Request->new(HEAD => $uri); $uri->scheme =~ /http|ftp|file/ or # Someday it would be nice to DESCRIBE return HTTP::Response->new(200); # rtsp urls. For now, act like we got it. my $res = $ua->request($req); # Check the outcome of the response if (!$res->is_success) { warn "Unable to HEAD $uri: ".$res->status_line."\n" if $debug; # This is bit cheesy, but since some servers barf on HEAD requests, # we do a GET on a hard failure. if ($res->code == 500) { $req = HTTP::Request->new(GET => $uri); $res = $ua->request($req); warn "Unable to GET $uri: ".$res->status_line."\n" unless $res->is_success; } return $res unless $res->is_success; } # If we can parse it then we should actually GET it, so we can spider off # links. Also, no need to retreive if we are going no deeper. 
return $res unless extractor($uri, $res->content_type) && $depth_left; unless ($req->method eq 'GET') { # Don't bother if we already had to GET $req = HTTP::Request->new(GET => $uri); $res = $ua->request($req); warn "Unable to GET $uri: ".$res->status_line."\n" unless $res->is_success; } # Insert Content-Length if it's not there $res->headers->header("Content-Length", length($res->content)) unless $res->headers->header("Content-Length"); return $res; } # Make sure a URI points to its origin, not the cdn. sub originify ( $ ) { my ($uri) = @_; # Look for URLs the content providers have already rewritten. We # will revert them to their original form (to find their origin # location) and spider that instead. (We don't want to spider the cdn.) return $uri unless defined $uri->host && $uri->host eq $hd; # It's a link to the hd. We have to reverse it or throw it away. my @path = $uri->path_segments; shift @path; # First segment is always nothing; if (@path and $path[0] =~ '^cdn-') { my @tail; shift @path; # Remove cdn-* tag unshift @path, ""; # Put that first segment back. while (@path) { my $path = join('/', @path); if (exists $rmap{$path}) { my $new = URI->new(join('/', $rmap{$path}, @tail))->canonical; return $new; } unshift @tail, pop @path; } } warn "Unreversable cdn link: $uri\n"; return 0; } # Convert an ARL back to it's original URL. Works only on one type of ARL. # Maybe it should return 0 if it sees an ARL it doesn't understand? sub deakamize ( $ ) { my ($arl) = @_; # 7 is hard coded because that is the typecode for this kind of ARL if ($arl =~ m@http://[^/]*akamai(?:tech)?.net/7/\d+/\d+/[\dabcdef]+/(.*)@) { return URI->new("http://$1"); } else { return $arl; } } # Filter out schemes we don't understand and queries, then convert the rest # to a standard form - pointing into the origin instead of cdn. sub canonicalize { my ($base, @urls) = @_; @urls = map { s/\#.*//; $_; } @urls; # Get rid of fragments @urls = map { URI->new_abs($_, $base)->canonical } @urls; # Standard form @urls = grep { $_->scheme =~ /http|ftp|file|rtsp|mms/ && # Filter for "normal" schemes ! $_->query; # and non-queries } @urls; return map { deakamize(originify($_)); } @urls # To origin } # return true if $url is ACCEPTed by filters. sub filter ( $$ ) { my ($filters, $url) = @_; for my $filter (@$filters) { return $filter->[1] if "$url" =~ $filter->[0]; } } my %catalog = (); # Maps URIs to response headers # Main spidering loop my $fetched = 0; while (my $uri = shift @todo) { # Stop if we are at max --depth if (!ref($uri)) { # A non-ref must be an integer, used for DEPTH if ($uri) { # Non-zero means we keep going push @todo, $uri-1; $depth_left = $uri-1; next; } my $left = @todo; warn "Stopping with $left urls left because --depth=$depth\n"; last; # Hit a zero, meaning stop } my $res = fetch($uri); next unless $res->is_success; $catalog{$uri} = $res->headers; $fetched++; if (my $extract = extractor($uri, $res->content_type)) { my ($refs, $embs) = &$extract($res->content); # Get urls into standard form @$refs = canonicalize($res->base, @$refs); @$embs = canonicalize($res->base, @$embs); # Get rid of urls we don't care about @$embs = grep { filter(\@filters, $_) } @$embs; @$refs = grep { filter(\@filters, $_) } @$refs; # Remove duplicate embs before saving. 
# Dup removal is unnecessary for correctness, but it avoids big CONTAINS my %dup = (); @$embs = grep { !$dup{$_}++ } @$embs; $catalog{$uri}->header(CONTAINS=>"@$embs"); # Add unseen urls to the todo list push @todo, grep { !$seen{$_}++ } (@$refs, @$embs); } # Stop if we have hit our --limit if ($fetched == $limit) { my $left = @todo; warn "Stopping with $left urls left because --limit=$limit\n"; last; } if ($debug and $fetched % 100 == 0) { warn sprintf "%d fetched, %d todo\n", $fetched, scalar @todo; } } # Output loop. The only reason this loop isn't built into the input loop is # so the output can be sorted. Maybe we don't really care about. If not, # it should be moved into the input loop so that output will continue as # progress is made and memory for the catalog will not be required. # datapoint: 20k URLs from hgtv cause 70Mb process and take > 11hrs use IO::File; my $DB = IO::Handle->new_from_fd(fileno(STDOUT),"w"); $DB = IO::File->new("> $db") or die "$db: $!\n" if $db; my @headers = qw/Content-Type Content-Length Last-Modified CONTAINS/; for my $key (sort keys %catalog) { print $DB "URL: $key\n"; for my $h (@headers) { print $DB "$h: ".$catalog{$key}->header($h)."\n" if $catalog{$key}->header($h); } print $DB "\n"; } close $DB or die $! if $db;

Manifest Script Source

#!/usr/bin/perl -w use strict; use Getopt::Long; my @db = (); # URL databases to read in my $xml = ""; # XML filename to write manifest to my @setters = (); # General attribute setters my $playservertable = ""; # File that contains the PlayServerTable my @map = (); # origin to cdn-url mappings my $debug = 0; # Print extra debugging info? my $recursive = 1; # Should prepos containers prepos their kids? # Return an array containing each line from a file. # Used by the --file option to allow stuffing @ARGV with args from a file. # '#' until end of line is a comment character (ie it is not returned) # whitespace is stripped from the beginning and end of lines # empty lines (or just comments and/or whitespace) are ignored sub lines ( $ ) { my ($filename) = @_; open (F, "< $filename") or die "$filename: $!\n"; my @lines = map { s/\#.*//g; s/\s*(\S*)\s*/$1/; $_ || (); } <F>; close F or die $!; return @lines; } # Convert a glob pattern to a regular expression. # Assumes that the glob matches only if it matches the entire string. sub glob2regex ( $ ) { # Note: This does not allow the writer of glob patterns to escape them. # * and ? are always special, [,],and- are always passed through. my ($glob) = @_; $glob = quotemeta($glob); # First, escape everything $glob =~ s/\\\*/.*/g; # Convert * to .* $glob =~ s/\\\?/./g; # Convert ? to . $glob =~ s/\\(\[|\-|\])/$1/g; # Reconstruct things like [a-z]. return "^$glob\$"; } sub process_type ( $$ ) { my ($opt, $val) = @_; push @setters, parse_setter("type=$opt:$val"); } # We want all scripts to be runnable from a single config file, so each # take all of the arguments of the others. Naturally, these arguments are # ignored if they are irrelevent. If you want to use identical command # lines for all three scripts, be sure to use the --start, --db, and --xml # options. my $junk; GetOptions("prepos=s" => \&process_type, "live=s" => \&process_type, "set=s" => sub {my $val = $_[1]; push @setters, parse_setter($val)}, "recursive!" => \$recursive, "playservertable=s" => $playservertable, "xml=s" => \$xml, "map=s" => \@map, "db=s" => \@db, "<>" => sub { push @db, $_[0]; }, # Arguments that all scripts take "file=s" => sub {my ($opt, $val) = @_; unshift @ARGV, lines($val)}, "debug!" 
=> \$debug, # Arguments that are really only for 'spider' or 'rewrite' "limit=n" => \$junk, "depth=n" => \$junk, "prefix=s" => \$junk, "accept=s" => \$junk, "reject=s" => \$junk, "hd|rd=s" => \$junk, "start=s" => \$junk, "file-map=s" => \$junk, "index=s" => \$junk, "od|origin=s" => \$junk, "always-rewrite=s" => \$junk, ) or die "Bad argument syntax\n"; sub parse_setter ( $ ) { my ($setter) = @_; my ($settings, $sub) = split ':', $setter, 2; my @settings = split ' ', $settings; my %settings = (); for my $setting (@settings) { my ($key, $val) = split '=', $setting; $settings{$key} = $val; } return [parse_sub($sub), \%settings]; } sub parse_sub ( $ ) { local ($_) = @_; warn "Changing \Q$_'\n" if $debug; # Allow comparisons to size, including shortcuts like 10k s/\bsize\b/\$H{'Content-Length'}/gio; s/\b(\d+)GB?\b/($1*1024M)/gio; s/\b(\d+)MB?\b/($1*1024K)/gio; s/\b(\d+)KB?\b/($1*1024)/gio; # Allow matching on URL s+\bmatch\(([^\)]*)\)+"m\@".glob2regex($1)."\@i"+gioe; # Allow matching on type s+\btype\(([^\)]*)\)+"(exists \$H{'Content-Type'} && \$H{'Content-Type'} =~ m@".glob2regex($1)."@)"+gioe; warn " into \Q$_'\n" if $debug; my $sub = eval "sub { my (\$i) = \@_; $_ }"; $sub or die "Unable to understand \Q$_'\n"; } my %map; for my $map (@map) { my ($origin, $cdn) = split('=', $map); $map{$origin} = $cdn; } # Convert one URI to another according to map. Always return a new URI, # even if contents are unchanged. sub translate ( $$ ) { my ($uri, $map) = @_; # Try each prefix, longer ones first for my $prefix (sort { length($b) <=> length($a) } keys %$map) { if (index($uri, $prefix) == 0) { my $t = "$uri"; substr($t, 0, length($prefix)) = $map->{$prefix}; return URI->new($t) } } return $uri->clone; } my @backups = grep { /=/ } @ARGV; # args with = in them are backup specifiers @ARGV = grep { ! /=/ } @ARGV; use URI; my %servers = (); my %items = (); # Catalog of all URLs my @items = (); # Same as %items, but sorted. my $depth = 0; my %default = (); # Current attributes in effect because of group my @chain = (); # Undo chain as groups are closed sub push_params ( % ) { my %hash = @_; my $str = params(%hash); my $changes = {}; while (my ($key,$val) = each %hash) { $changes->{$key} = $default{$key}; $default{$key} = $val; } push @chain, $changes; return $str; } sub pop_params () { my $changes = pop @chain; while (my ($key,$val) = each %{$changes}) { $default{$key} = $val; } } my %xml_ent = ('&' => 'amp', '<' => 'lt', '>' => 'gt', '"' => 'quot'); sub xml_attr ( $ ) { my ($val) = @_; $val =~ s/([\&\<\>\"])/&$xml_ent{$1};/; return "\"$val\""; } sub params ( % ) { my %hash = @_; my $str = ""; while (my ($key,$val) = each %hash) { $str .= " $key=" . xml_attr($val) unless defined $default{$key} && $default{$key} eq $val; } return $str; } sub open_server ( @ ) { return (" " x $depth++) . "<server".params(@_).">\n"; } sub close_server ( ) { die "More close_server()s than open_server()s!" unless $depth; return (" " x --$depth) . "</server>\n"; } sub host ( % ) { return (" " x $depth) . "<host".params(@_)."/>"; } sub open_group ( @ ) { return (" " x $depth++) . "<item-group".push_params(@_).">\n"; } sub close_group ( ) { die "More close_group()s than open_group()s!" unless $depth; pop_params(); return (" " x --$depth) . "</item-group>\n"; } sub item ( $ ) { my ($item) = @_; my $str = (" " x $depth) . "<item".params(%{$item->{attrs}}); if (! 
@{$item->{contains}}) { return "$str/>"; } $str .= ">\n"; $depth++; for my $contained (@{$item->{contains}}) { if ($contained->{type} eq 'prepos') { my $path = translate($contained->{uri}, \%map)->path; $str .= (" " x $depth) . "<contains".params('cdn-url'=>$path)."/>\n"; } } $depth--; $str .= (" " x $depth) . "</item>"; return $str; } sub header () { return <<HEADER <?xml version="1.0" standalone="no"?> <!DOCTYPE CdnManifest SYSTEM "CdnManifest.dtd"> <CdnManifest> HEADER } sub footer () { return "</CdnManifest>\n"; } sub set ( $$ ) { my ($setter, $i) = @_; my ($pred, $attrs) = @$setter; use vars "%H"; local ($_, %H) = ($i->{uri}, %{$i->{hdrs}}); if (&$pred($i)) { for my $key (keys %$attrs) { $i->{attrs}->{$key} = $attrs->{$key}; } return 1; } return 0; } sub print_items { my $server = ""; for my $i (@_) { my $uri = $i->{uri}; if ($uri->host ne $server) { print close_group() if $server; print open_group(server=>$uri->host); $server = $uri->host; } $i->{attrs}->{src} = $uri->path; my $t = translate($uri, \%map); $i->{attrs}->{'cdn-url'} = $t if $t ne $uri; # Only include if needed print item($i) . "\n"; } print close_group() if $server; } sub preposition_contents { my ($item) = @_; for my $contained (@{$item->{contains}}) { next if $contained->{type}; # Already set, or live warn "Forcing prepos of ".$contained->{uri}."\n" if $debug; $contained->{type} = $contained->{attrs}->{type} = 'prepos'; preposition_contents($contained); } } # Input loop. Read in all the URLs from the databases @ARGV = @db; while (<>) { my $item = { uri => undef, # Original url hdrs => {}, # Headers from database attrs => {}, contains => [], # items that this one contains type => "" # Convenience. It's just attrs->{type} }; { do { # do {} while IS NOT A LOOP, the extra braces allow "last" to work chomp; last unless $_; my ($header, $val) = split(": ", $_, 2); $item->{hdrs}->{$header} = $val; } while (<>); } die "Headers without a URL!\n" unless exists $item->{hdrs}->{URL}; my $uri = $item->{uri} = URI->new($item->{hdrs}->{URL}); push @items, $items{$uri} = $item; $servers{$uri->host} = 1; } # Figure out what contains what my %missing = (); # Tracks URI that have been reported missing for my $item (@items) { if (exists $item->{hdrs}->{CONTAINS}) { my @contains = (); my @missing = (); for my $c (split ' ', $item->{hdrs}->{CONTAINS}) { my $contained = $items{$c}; if ($contained) { push @contains, $contained; } else { # Only consider $c missing if it has not yet been reported push @missing, $c unless $missing{$c}++; } } warn $item->{uri}." 
contains missing urls:\n ".join("\n ", @missing)."\n" if @missing; $item->{contains} = \@contains; } } # Report URLs that were missing multiple times my $intro = 0; while (my ($uri, $times) = each(%missing)) { if ($times > 1) { warn "Some URLs are missing multiple times:\n" unless $intro++; warn sprintf " %3d %s\n", $times, $uri; } } warn "\n" if $intro; # Run all the command line setters for my $item (@items) { for my $setter (@setters) { set($setter, $item); } $item->{type} = $item->{attrs}->{type} if exists $item->{attrs}->{type}; } # Recursively preposition anything that is contained in a prepositioned item if ($recursive) { for my $item (@items) { next unless $item->{type} eq 'prepos'; preposition_contents($item); } } # Spit out the manifest file if ($xml) { open XML, "> $xml" or die "$xml: $!\n"; select XML; } print header(); for my $s (keys %servers) { print open_server(name=>$s); print host(name=>$s, proto=>'http')."\n"; print close_server(); } print "\n<!-- Prepositioned Items -->\n"; print open_group(type=>"prepos"); print_items(grep {$_->{type} eq 'prepos'} @items); print close_group(); print "\n<!-- Live Items -->\n"; print open_group(type=>"live"); print_items(grep {$_->{type} eq 'live'} @items); print close_group(); if ($playservertable) { open TABLE, "< $playservertable" or die "$playservertable: $!\n"; print <TABLE>; close TABLE or die $!; } print footer(); close XML or die $! if $xml; select STDOUT; # Collect and print some simple statistics my %space = (); # Amount of space used by various types my $space; my %num = (); # Number of pieces of each type pf content my $num = 0; for my $item (grep {$_->{type} eq 'prepos'} @items) { if (! exists $item->{hdrs}->{'Content-Type'}) { warn $item->{uri} . " has no content type.\n"; next; } my $type = $item->{hdrs}->{'Content-Type'}; $type =~ s/[\s,;].*$//; $num{$type} ||= 0; $num{$type}++; $num++; $space{$type} ||= 0; if (exists $item->{hdrs}->{'Content-Length'}) { $space{$type} += $item->{hdrs}->{'Content-Length'}; $space += $item->{hdrs}->{'Content-Length'}; } } my $k = 1024; my $m = 1024*$k; my $g = 1024*$m; sub abbrev ( $ ) { my ($num) = @_; return 0 unless defined $num; return ($num/$g,"G") if $num > $g; return ($num/$m,"M") if $num > $m; return ($num/$k,"K") if $num > $k; return ($num,"b"); } for my $type (sort { $space{$a} <=> $space{$b} } keys %num) { warn sprintf "%22s %5d %4d%s\n", $type, $num{$type}, abbrev($space{$type}); } warn sprintf "%22s %5d %4d%s\n", "Total", $num, abbrev($space);
