21.9. Splitting Files at Fixed Points: split

Most versions of Unix come with a program called split whose purpose is to split large files into smaller files for tasks such as editing them in an editor that cannot handle large files, or mailing them if they are so big that some mailers will refuse to deal with them. For example, let's say you have a really big text file that you want to mail to someone:

    % ls -l bigfile
    -r--r--r--  1 jik        139070 Oct 15 21:02 bigfile

Running split on that file will (by default, with most versions of split) break it up into pieces that are each no more than 1000 lines long (for wc, see Section 16.6):

    % split bigfile
    % ls -l
    total 283
    -r--r--r--  1 jik        139070 Oct 15 21:02 bigfile
    -rw-rw-r--  1 jik         46444 Oct 15 21:04 xaa
    -rw-rw-r--  1 jik         51619 Oct 15 21:04 xab
    -rw-rw-r--  1 jik         41007 Oct 15 21:04 xac
    % wc -l x*
        1000 xaa
        1000 xab
         932 xac
        2932 total

Note the default naming scheme, which is to append "aa", "ab", "ac", etc., to the letter "x" for each subsequent filename. It is possible to modify the default behavior. For example, you can make split create files that are 1500 lines long instead of 1000:

    % rm x??
    % split -1500 bigfile
    % ls -l
    total 288
    -r--r--r--  1 jik        139070 Oct 15 21:02 bigfile
    -rw-rw-r--  1 jik         74016 Oct 15 21:06 xaa
    -rw-rw-r--  1 jik         65054 Oct 15 21:06 xab

You can also get it to use a name prefix other than "x":

    % rm x??
    % split -1500 bigfile bigfile.split.
    % ls -l
    total 288
    -r--r--r--  1 jik        139070 Oct 15 21:02 bigfile
    -rw-rw-r--  1 jik         74016 Oct 15 21:07 bigfile.split.aa
    -rw-rw-r--  1 jik         65054 Oct 15 21:07 bigfile.split.ab

Although the simple behavior described above tends to be relatively universal, there are differences in the functionality of split on different Unix systems. There are four basic variants of split as shipped with various implementations of Unix:
The only way to tell which version you've got is to read the manual page for it on your system, which will also tell you the exact syntax for using it.

The problem with the third variant is that although it tries to be smart and automatically do the right thing with both text and nontext files, it sometimes guesses wrong and splits a text file as a nontext file or vice versa, with completely unsatisfactory results. Therefore, if the variant on your system is (3), you probably want to get your hands on one of the many split clones out there that is closer to one of the other variants (see below).

Variants (1) and (2) listed above are OK as far as they go, but they aren't adequate if your environment provides only one of them rather than both. If you find yourself needing to split a nontext file when you have only a text split, or needing to split a text file when you have only bsplit, you need to get one of the clones that will perform the function you need.

Variant (4) is the most reliable and versatile of the four listed, and it is therefore what you should go with if you find it necessary to get a clone and install it on your system. There are several such clones in the various source archives, including the free BSD Unix version. GNU split is on the CD-ROM [see http://examples.oreilly.com/upt3]. (Go to http://examples.oreilly.com/upt3 for more information on split.)

Alternatively, if you have installed perl (Section 41.1), it is quite easy to write a simple split clone in perl, and you don't have to worry about compiling a C program to do it; this is an especially significant advantage if you need to run your split on multiple architectures that would need separate binaries.
The Perl code for a binary split program follows:

    #!/usr/bin/perl -w --
    # Split text or binary files; jjohn 2/2002

    use strict;
    use Getopt::Std;

    my %opts;
    getopts("?b:f:hp:ts:", \%opts);

    # Show the help screen if asked, or if no usable input file was named
    if( $opts{'?'} || $opts{'h'} || !defined $opts{'f'} || !-e $opts{'f'} ){
      print <<USAGE;
    $0 - split files in smaller ones

    USAGE: $0 -b 1500 -f big_file -p part.

    OPTIONS:

       -?       print this screen
       -h       print this screen
       -b <INT> split file into given byte size parts
       -f <TXT> the file to be split
       -p <TXT> each new file to begin with given text
       -s <INT> split file into given number of parts
    USAGE
      exit;
    }

    my $infile;
    open($infile, '<', $opts{'f'}) or die "can't open $opts{'f'}: $!\n";
    binmode($infile);

    my $infile_size = (stat $opts{'f'})[7];
    my $block_size = 1;

    if( $block_size = $opts{'b'} ){
      # chunk file into blocks
    }elsif( my $total_parts = $opts{'s'} ){
      # chunk file into N parts
      $block_size = int ( $infile_size / $total_parts) + 1;
    }else{
      die "Please indicate how to split file with -b or -s\n";
    }

    my $outfile_base = $opts{'p'} || 'part.';
    my $outfile_ext  = "aa";
    my $offset       = 0;

    while( $offset < $infile_size ){
      my $buf;
      my $got = read($infile, $buf, $block_size);
      die "read error on $opts{'f'}: $!\n" unless defined $got;
      last unless $got;            # nothing left to read; avoid looping forever
      $offset += $got;
      write_file($outfile_base, $outfile_ext++, \$buf);
    }

    #--- subs ---#
    sub write_file {
      my($fname, $ext, $buf) = @_;

      my $outfile;
      open($outfile, '>', "$fname$ext") or die "can't open $fname$ext\n";
      binmode($outfile);
      my $wrote = syswrite $outfile, $$buf;
      my $size  = length($$buf);
      warn "WARN: wrote $wrote bytes instead of $size to $fname$ext\n"
        unless $wrote == $size;
      close($outfile);
    }

(If you copy the script from this page, remove the four-space indent so that the closing USAGE marker starts in the first column.)

Although it may seem somewhat complex at first glance, this small Perl script is cross-platform and has its own small help screen to describe its options. Briefly, it can split files into N-sized blocks (given the -b option) or, with -s, divide the original file into N segments (occasionally fewer, because the block size is rounded up). For a better introduction to Perl, see Chapter 42.
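Whichever splitter you end up with, it is worth confirming that the pieces go back together before you mail them or delete the original. Here is a minimal sketch of that check. It assumes your split accepts the POSIX -b (byte count) option, which not every variant described above does; the names sample.txt and piece. are made up for illustration.

```shell
# Work in a scratch directory so nothing else matches the piece.* glob.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Build a test file of 5000 short lines (hypothetical name).
awk 'BEGIN { for (i = 1; i <= 5000; i++) print "line", i }' > sample.txt

# Split by byte count; assumes a split that understands -b.
split -b 20000 sample.txt piece.

pieces=$(ls piece.* | wc -l | tr -d ' ')
echo "pieces: $pieces"

# The shell expands piece.* in sorted order (aa, ab, ac, ...), so a plain
# cat rebuilds the file; cmp -s is silent and reports only via exit status.
if cat piece.* | cmp -s - sample.txt; then
    echo "reassembled copy matches"
else
    echo "MISMATCH" >&2
fi
```

The same cat-and-cmp check works on the output of the Perl script, since its part.aa, part.ab, ... names sort the same way.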
If you need to split a nontext file and don't feel like going to all of the trouble of finding a split clone to handle it, one standard Unix tool you can use to do the splitting is dd (Section 21.6). For example, if bigfile above were a nontext file and you wanted to split it into 20,000-byte pieces, you could do something like this (for the for loop, see Section 35.21; for the > prompt, see Section 28.12):

    $ ls -l bigfile
    -r--r--r--  1 jik        139070 Oct 23 08:58 bigfile
    $ for i in 1 2 3 4 5 6 7 #[60]
    > do
    >       dd of=x$i bs=20000 count=1 2>/dev/null #[61]
    > done < bigfile
    $ ls -l
    total 279
    -r--r--r--  1 jik        139070 Oct 23 08:58 bigfile
    -rw-rw-r--  1 jik         20000 Oct 23 09:00 x1
    -rw-rw-r--  1 jik         20000 Oct 23 09:00 x2
    -rw-rw-r--  1 jik         20000 Oct 23 09:00 x3
    -rw-rw-r--  1 jik         20000 Oct 23 09:00 x4
    -rw-rw-r--  1 jik         20000 Oct 23 09:00 x5
    -rw-rw-r--  1 jik         20000 Oct 23 09:00 x6
    -rw-rw-r--  1 jik         19070 Oct 23 09:00 x7
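The piece count in the loop above was worked out by hand. As a rough sketch, the shell can do that arithmetic itself (divide the file size by the block size and round up) and then check the result by concatenating the pieces. This assumes a POSIX shell with $(( )) arithmetic; the names raw.txt, data.bin, and chunk are placeholders, and the loop relies on each dd picking up where the previous one stopped in the redirected input, as in the example above.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Placeholder input trimmed to 139,070 bytes, the size of bigfile in the text.
awk 'BEGIN { for (i = 1; i <= 20000; i++) print "line", i }' > raw.txt
dd if=raw.txt of=data.bin bs=139070 count=1 2>/dev/null

bs=20000
size=$(wc -c < data.bin | tr -d ' ')
count=$(( (size + bs - 1) / bs ))    # round up: 139070 / 20000 -> 7

# Each dd reads one block from the shared standard input.
i=1
while [ "$i" -le "$count" ]; do
    dd of="chunk$i" bs="$bs" count=1 2>/dev/null
    i=$(( i + 1 ))
done < data.bin

# Reassemble in numeric order (a chunk* glob would put chunk10 before chunk2).
i=1
while [ "$i" -le "$count" ]; do
    cat "chunk$i"
    i=$(( i + 1 ))
done | cmp -s - data.bin && echo "pieces reassemble correctly"
```

Numbered suffixes are easy to generate in a loop, but they do not sort correctly once there are ten or more pieces, which is why the reassembly step counts rather than globs.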
[60] To figure out how many numbers to count up to, divide the total size of the file by the block size you want and add one if there's a remainder. The jot program can help here.

--JIK and JJ

Copyright © 2003 O'Reilly & Associates. All rights reserved.