Splitting Files at Fixed Points: split (Unix Power Tools, 3rd Edition)

21.9. Splitting Files at Fixed Points: split

Most versions of Unix come with a program called split, which breaks a large file into smaller ones. This is useful when your editor cannot handle large files, or when a file is so big that some mailers will refuse to deal with it. Suppose you have a really big text file that you want to mail to someone:

% ls -l bigfile
-r--r--r--  1 jik        139070 Oct 15 21:02 bigfile

Running split on that file will (by default, with most versions of split) break it up into pieces that are each no more than 1000 lines long:

(For wc, see Section 16.6.)

% split bigfile
% ls -l
total 283
-r--r--r--  1 jik        139070 Oct 15 21:02 bigfile
-rw-rw-r--  1 jik         46444 Oct 15 21:04 xaa
-rw-rw-r--  1 jik         51619 Oct 15 21:04 xab
-rw-rw-r--  1 jik         41007 Oct 15 21:04 xac
% wc -l x*
    1000 xaa
    1000 xab
     932 xac
    2932 total

Note the default naming scheme, which is to append "aa", "ab", "ac", etc., to the letter "x" for each subsequent filename. It is possible to modify the default behavior. For example, you can make split create files that are 1500 lines long instead of 1000:

% rm x??
% split -1500 bigfile
% ls -l
total 288
-r--r--r--  1 jik        139070 Oct 15 21:02 bigfile
-rw-rw-r--  1 jik         74016 Oct 15 21:06 xaa
-rw-rw-r--  1 jik         65054 Oct 15 21:06 xab

You can also get it to use a name prefix other than "x":

% rm x??
% split -1500 bigfile bigfile.split.
% ls -l
total 288
-r--r--r--  1 jik        139070 Oct 15 21:02 bigfile
-rw-rw-r--  1 jik         74016 Oct 15 21:07 bigfile.split.aa
-rw-rw-r--  1 jik         65054 Oct 15 21:07 bigfile.split.ab
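
On many current systems the line count is given with the -l option, and the bare -1500 form shown above may not be accepted everywhere. The following is a minimal sketch, assuming a POSIX-style split and the widely available (but not universal) mktemp -d; the scratch directory and the generated sample file are stand-ins for a real bigfile. It also puts the pieces back together with cat to confirm that nothing was lost:

```shell
# Work in a scratch directory so the pieces don't clutter anything real.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Stand-in for bigfile: 2932 numbered lines (an assumption, to mirror the example).
awk 'BEGIN { for (i = 1; i <= 2932; i++) print "line " i }' > bigfile

# -l gives the line count per piece; the last argument is the name prefix.
split -l 1500 bigfile bigfile.split.

wc -l bigfile.split.*

# The suffixes sort alphabetically, so a glob lists the pieces in order.
cat bigfile.split.* | cmp - bigfile && echo "pieces reassemble to the original"
```

Because the suffixes run aa, ab, ac, and so on, the recipient of a mailed set of pieces can rebuild the file with the same cat command.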

Although the simple behavior described above tends to be relatively universal, there are differences in the functionality of split on different Unix systems. There are four basic variants of split as shipped with various implementations of Unix:

1. A split that understands only how to deal with splitting text files into chunks of n lines or less each.
2. A split, usually called bsplit, that understands only how to deal with splitting nontext files into n-character chunks.
3. A split that splits text files into n-line chunks, or nontext files into n-character chunks, and tries to figure out automatically whether it's working on a text file or a nontext file.
4. A split that does either text files or nontext files but needs to be told explicitly when it is working on a nontext file.

The only way to tell which version you've got is to read the manual page for it on your system, which will also tell you the exact syntax for using it.
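
Where the manual page leaves you unsure, a quick experiment in a scratch directory can show how the local split behaves. This is a minimal sketch, assuming that -b (a byte count) is how your split is told to treat its input as nontext, and that mktemp -d is available; the file and prefix names are made up for the test. On a system with a different variant the command simply fails, which is itself an answer:

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# 50,000 bytes with no newlines at all: a text-only split has no lines to count.
dd if=/dev/zero of=sample bs=1000 count=50 2>/dev/null

# Ask for 20,000-byte pieces and report whether this split understood the request.
if split -b 20000 sample piece. 2>/dev/null; then
    ls -l piece.*
    echo "this split accepts -b"
else
    echo "this split does not accept -b; check the manual page"
fi
```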

The problem with the third variant is that although it tries to be smart and automatically do the right thing with both text and nontext files, it sometimes guesses wrong and splits a text file as a nontext file or vice versa, with completely unsatisfactory results. Therefore, if the variant on your system is (3), you probably want to get your hands on one of the many split clones out there that is closer to one of the other variants (see below).

Variants (1) and (2) listed above are OK as far as they go, but they aren't adequate if your environment provides only one of them rather than both. If you find yourself needing to split a nontext file when you have only a text split, or needing to split a text file when you have only bsplit, you need to get one of the clones that will perform the function you need.


Variant (4) is the most reliable and versatile of the four, so it is the one to choose if you need to get a clone and install it on your system. There are several such clones in the various source archives, including the free BSD Unix version. GNU split is on the CD-ROM [see http://examples.oreilly.com/upt3]. Alternatively, if you have installed Perl (Section 41.1), it is quite easy to write a simple split clone in Perl, with no C program to compile; this is an especially significant advantage if you need to run your split on multiple architectures that would otherwise need separate binaries. The Perl code for a binary split program follows:

#!/usr/bin/perl -w --
# Split text or binary files; jjohn 2/2002
use strict;
use Getopt::Std;

my %opts;
getopts("?b:f:hp:ts:", \%opts);

if( $opts{'?'} || $opts{'h'} || !defined $opts{'f'} || !-e $opts{'f'} ){
  print <<USAGE;
$0 - split files into smaller ones

USAGE:
    $0 -b 1500 -f big_file -p part.

OPTIONS:

    -?       print this screen
    -h       print this screen
    -b <INT> split file into given byte size parts
    -f <TXT> the file to be split
    -p <TXT> each new file to begin with given text
    -s <INT> split file into given number of parts
USAGE
   exit;
}

my $infile;
open($infile, '<', $opts{'f'}) or die "can't open $opts{'f'}: $!\n";
binmode($infile);
my $infile_size = (stat $opts{'f'})[7];

my $block_size = 1;
if( $block_size = $opts{'b'} ){
  # chunk file into blocks

}elsif( my $total_parts = $opts{'s'} ){
  # chunk file into N parts; round up so an evenly divisible
  # file still yields exactly N parts
  $block_size = int( ($infile_size + $total_parts - 1) / $total_parts );

}else{
  die "Please indicate how to split file with -b or -s\n";
}

my $outfile_base = $opts{'p'} || 'part.';
my $outfile_ext = "aa";

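my $offset = 0;
while( $offset < $infile_size ){
  my $buf;
  my $got = read $infile, $buf, $block_size;
  die "read error on $opts{'f'}: $!\n" unless defined $got;
  last unless $got;   # unexpected end of file; stop rather than loop forever
  $offset += $got;
  write_file($outfile_base, $outfile_ext++, \$buf);
}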

#--- subs ---#
sub write_file {
  my($fname, $ext, $buf) = @_;

  my $outfile;
  open($outfile, '>', "$fname$ext") or die "can't open $fname$ext: $!\n";
  binmode($outfile);
  my $wrote = syswrite $outfile, $$buf;
  $wrote = 0 unless defined $wrote;   # syswrite returns undef on error
  my $size  = length($$buf);
  warn "WARN: wrote $wrote bytes instead of $size to $fname$ext\n"
    unless $wrote == $size;
  close($outfile) or die "can't close $fname$ext: $!\n";
}

Although it may seem somewhat complex at first glance, this small Perl script is cross-platform and has its own small help screen to describe its options. Briefly, it can split a file into blocks of N bytes (with the -b option) or, with -s, into N segments. For a better introduction to Perl, see Chapter 41.
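
The arithmetic behind an N-way split is easy to try from the shell. The sketch below is not the Perl script itself; it assumes a split that accepts -b and the availability of mktemp -d, and it uses a rounding-up division so that the pieces never outnumber the requested count. The five-way split and the generated sample file are hypothetical:

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
awk 'BEGIN { for (i = 1; i <= 2932; i++) print "line " i }' > bigfile

parts=5
size=$(wc -c < bigfile)

# Round up, so a remainder enlarges each piece instead of adding a sixth one.
block=$(( (size + parts - 1) / parts ))

split -b "$block" bigfile part.
ls -l part.*
```

The last piece comes out slightly shorter than the others whenever the file size is not an exact multiple of the part count.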

If you need to split a nontext file and don't feel like going to all of the trouble of finding a split clone to handle it, one standard Unix tool you can use to do the splitting is dd (Section 21.6). For example, if bigfile above were a nontext file and you wanted to split it into 20,000-byte pieces, you could do something like this:

(For the for loop, see Section 35.21; for the > secondary prompt, see Section 28.12.)

$ ls -l bigfile
-r--r--r--  1 jik        139070 Oct 23 08:58 bigfile
$ for i in 1 2 3 4 5 6 7   #[60]
> do
>       dd of=x$i bs=20000 count=1 2>/dev/null  #[61]
> done < bigfile
$ ls -l
total 279
-r--r--r--  1 jik        139070 Oct 23 08:58 bigfile
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x1
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x2
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x3
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x4
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x5
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x6
-rw-rw-r--  1 jik         19070 Oct 23 09:00 x7

[60] To figure out how many numbers to count up to, divide the total size of the file by the block size you want and add one if there's a remainder. The jot program can help here.

[61] The output file size I want is denoted by the bs or "block size" parameter to dd. The 2>/dev/null gets rid of dd's diagnostic output, which isn't useful here and takes up space.
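
The count described in footnote 60 can also be computed rather than typed. This is a minimal sketch, assuming a shell with POSIX arithmetic expansion and the availability of mktemp -d. It uses dd's skip= operand to select each block instead of reading from a shared standard input, which is a substitution for the loop shown above; the sample file and the 5000-byte block size are stand-ins:

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
awk 'BEGIN { for (i = 1; i <= 2932; i++) print "line " i }' > bigfile

bs=5000
size=$(wc -c < bigfile)

# Divide the size by the block size, adding one if there is a remainder.
count=$(( size / bs ))
if [ $(( size % bs )) -ne 0 ]; then
    count=$(( count + 1 ))
fi

i=1
while [ "$i" -le "$count" ]
do
    # skip=N passes over N input blocks, so piece i starts at (i-1)*bs bytes.
    dd if=bigfile of="x$i" bs="$bs" skip=$(( i - 1 )) count=1 2>/dev/null
    i=$(( i + 1 ))
done
ls -l x*
```

With ten or more pieces, x10 sorts before x2, so reassemble the pieces in loop order rather than with a glob.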

--JIK and JJ