Processing Files Larger Than Available Memory (Perl Cookbook, 2nd Edition)

22.8. Processing Files Larger Than Available Memory

22.8.1. Problem

You want to work with a large XML file, but you can't read it into memory to form a DOM or other kind of tree because it's too big.

22.8.2. Solution

Use SAX (as described in Recipe 22.3) to process events instead of building a tree.

Alternatively, use XML::Twig to build trees only for the parts of the document you want to work with (as specified by XPath expressions):

use XML::Twig;

my $twig = XML::Twig->new( twig_handlers => {
                               $XPATH_EXPRESSION => \&HANDLER,
                               # ...
                            });
$twig->parsefile($FILENAME);
$twig->flush( );

You can call a lot of DOM-like functions from within a handler, but only the elements identified by the XPath expression (and whatever those elements enclose) go into a tree.
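As a rough illustration of the skeleton above, the sketch below streams a document through a handler, touches each matched element with a couple of DOM-like calls, and flushes as it goes so that finished elements are written out and released. The element names book and title are assumed from the books.xml example used in the Discussion; the helper name tag_books and the checked attribute are invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: copy the XML in $in_file to the filehandle $out,
# stamping each <book> that has a title with an illustrative "checked"
# attribute. Returns the number of <book> elements seen.
sub tag_books {
    my ($in_file, $out) = @_;
    my $count = 0;
    my $twig = XML::Twig->new(
        twig_handlers => {
            'book' => sub {
                my ($t, $book) = @_;
                # DOM-like calls work on the subtree built for this element.
                my $title = $book->first_child_text('title');
                $book->set_att(checked => 'yes') if length $title;
                $count++;
                # Print everything parsed so far to $out, then free it.
                $t->flush($out);
            },
        },
    );
    $twig->parsefile($in_file);
    $twig->flush($out);    # emit whatever follows the last <book>
    return $count;
}
```

A call such as tag_books("books.xml", \*STDOUT) would write the modified document to standard output. If you only read from the document and never print it, purge is the usual substitute for flush.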

22.8.3. Discussion

DOM modules turn the entire document into a tree, regardless of whether you use all of it. SAX modules build no tree at all, so if your task depends on document structure, you must keep track of that structure yourself. A happy middle ground is XML::Twig, which creates DOM trees only for the bits of the file that you're interested in. Because you work with the file a piece at a time, you can cope with very large files by processing pieces that fit in memory and discarding each one (with flush or purge) when you're done with it.

For example, to print the titles of books in books.xml (Example 22-1), you could write:

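use XML::Twig;

my $twig = XML::Twig->new( twig_roots => { '/books/book' => \&do_book });
$twig->parsefile("books.xml");
$twig->purge( );

# Called once per book element, with the twig and the element as arguments.
sub do_book {
  my($t, $book) = @_;
  my($title) = $book->find_nodes("title");
  print $title->text, "\n" if $title;
  $t->purge( );   # release this book before the next one is parsed
}

For each book element, XML::Twig calls do_book with the twig and the element (the element is also available in $_). That subroutine finds the title node, prints its text, and then purges the twig so that the finished book is released. Rather than having the entire file parsed into a DOM structure, we keep only one book element at a time.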
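For comparison, the SAX route mentioned in the Solution might look something like the sketch below. No tree is built, so the handler keeps its own stack of open element names to know when it is inside a book's title. This assumes XML::SAX is installed with at least one parser and that the document is laid out as books/book/title; the names TitleHandler, in_title, and sax_titles are made up for the example.

```perl
package TitleHandler;
use strict;
use warnings;
use base 'XML::SAX::Base';

# Reset our bookkeeping at the start of every parse.
sub start_document {
    my ($self) = @_;
    $self->{path}   = [];    # names of the currently open elements
    $self->{titles} = [];    # collected title strings
}

sub start_element {
    my ($self, $el) = @_;
    push @{ $self->{path} }, $el->{Name};
    $self->{text} = '' if $self->in_title;
}

# Text may arrive in several chunks, so accumulate it.
sub characters {
    my ($self, $chars) = @_;
    $self->{text} .= $chars->{Data} if $self->in_title;
}

sub end_element {
    my ($self, $el) = @_;
    push @{ $self->{titles} }, $self->{text} if $self->in_title;
    pop @{ $self->{path} };
}

# True only while we are directly inside /books/book/title.
sub in_title {
    my ($self) = @_;
    return join('/', @{ $self->{path} }) eq 'books/book/title';
}

package main;
use XML::SAX::ParserFactory;

# Hypothetical helper: return the title of every book in $file.
sub sax_titles {
    my ($file) = @_;
    my $handler = TitleHandler->new;
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
    $parser->parse_uri($file);
    return @{ $handler->{titles} };
}
```

Calling print "$_\n" for sax_titles("books.xml") would then print one title per line. Memory use stays flat however large the file is, but the path bookkeeping that XML::Twig does for you is now your job.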

Consult the XML::Twig manpages for details on how much DOM and XPath the module supports; it's not complete, but it's growing all the time. XML::Twig uses XML::Parser for its XML parsing, and as a result the functions available on nodes are slightly different from those provided by XML::LibXML's DOM parsing.
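To give a feel for how the node methods differ, the sketch below performs the same lookup with XML::Twig and with XML::LibXML's DOM. It parses a small in-memory string purely to compare method names, so it is not itself a large-file technique. The id attribute on book and both sub names are assumptions made for the example.

```perl
use strict;
use warnings;
use XML::Twig;
use XML::LibXML;

# Both subs return one "id: title" string per <book> in $xml.

sub titles_with_twig {
    my ($xml) = @_;
    my $twig = XML::Twig->new;
    $twig->parse($xml);
    # XML::Twig: att() for attributes, first_child_text() for child text.
    return map { $_->att('id') . ': ' . $_->first_child_text('title') }
           $twig->root->children('book');
}

sub titles_with_libxml {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    # XML::LibXML: getAttribute() and XPath-based findvalue().
    return map { $_->getAttribute('id') . ': ' . $_->findvalue('title') }
           $doc->findnodes('/books/book');
}
```

The two subs should return the same list for well-formed input; only the spelling of the calls changes.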

22.8.4. See Also

Recipe 22.6; the documentation for the module XML::Twig