Splitting Files by Context: csplit (Unix Power Tools, 3rd Edition)

21.10. Splitting Files by Context: csplit

Go to http://examples.oreilly.com/upt3 for more information on: csplit

Like split (Section 21.9), csplit lets you break a file into smaller pieces, but csplit (context split) also allows the file to be broken into different-sized pieces, according to context. With csplit, you give the locations (line numbers or search patterns) at which to break each section. csplit comes with System V, but there are also free versions available.

Let's look at search patterns first. Suppose you have an outline consisting of three main sections that start on lines with the Roman numerals I., II., and III.. You could create a separate file for each section by typing:

% csplit outline /I./ /II./ /III./
28      number of characters in each file
415                   .
372                   .
554                   .
% ls
outline
xx00     outline title, etc.
xx01     Section I
xx02     Section II
xx03     Section III

This command creates four new files (outline remains intact). csplit displays the character counts for each file. Note that the first file (xx00) contains any text up to but not including the first pattern, and xx01 contains the first section, as you'd expect. This is why the naming scheme begins with 00. (If outline had begun immediately with a I., xx01 would still contain Section I, but in this case xx00 would be empty.)
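To see this numbering for yourself, here's a minimal sketch in POSIX sh (rather than the C shell prompt used in this article). The scratch directory and the seven-line sample outline are invented for illustration, and the patterns are anchored and escaped so that they match only the section headings:

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical outline: one title line, then three short sections.
printf '%s\n' 'My Outline' 'I. First' 'a' 'II. Second' 'b' 'III. Third' 'c' > outline

# Split before each line that matches a pattern; -s hides the byte counts.
csplit -s outline '/^I\./' '/^II\./' '/^III\./'

# Show the first line of every piece.
for f in xx*; do
    printf '%s: %s\n' "$f" "$(head -n 1 "$f")"
done
```

If all goes well, the loop pairs xx00 with the title line and xx01 through xx03 with the three section headings.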

If you don't want to save the text that occurs before a specified pattern, use a percent sign as the pattern delimiter:

% csplit outline %I.% /II./ /III./
415
372
554
% ls
outline
xx00         Section I
xx01         Section II
xx02         Section III

The preliminary text file has been suppressed, and the created files now begin where the actual outline starts. Note, however, that the file numbers no longer line up with the section numbers: xx00 now holds Section I.
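The same experiment with a percent-delimited first pattern might look like the sketch below. It is written for POSIX sh, and the sample outline is again made up for the occasion:

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical outline: one title line, then three short sections.
printf '%s\n' 'My Outline' 'I. First' 'a' 'II. Second' 'b' 'III. Third' 'c' > outline

# %regex% skips ahead to the match without writing a file for the skipped text.
csplit -s outline '%^I\.%' '/^II\./' '/^III\./'

# Show the first line of every piece.
for f in xx*; do
    printf '%s: %s\n' "$f" "$(head -n 1 "$f")"
done
```

Only three pieces should appear this time, and the title line should not turn up in any of them.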

Let's make some further refinements. We'll use the -s option to suppress the display of the character counts, and we'll use the -f option to specify a file prefix other than the conventional xx:

% csplit -s -f part. outline /I./ /II./ /III./
% ls
outline
part.00
part.01
part.02
part.03

There's still a slight problem, though. In search patterns, a period is a metacharacter (Section 32.21) that matches any single character, so the pattern /I./ may inadvertently match words like Introduction. We need to escape the period with a backslash; however, the backslash has meaning both to the pattern and to the shell, so in fact, we need either to use a double backslash or to surround the pattern in quotes (Section 27.12). A subtlety, yes, but one that can drive you crazy if you don't remember it. Our command line becomes:

% csplit -s -f part. outline "/I\./" "/II\./" "/III\./"
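A quick way to convince yourself of the difference is to run both forms against a file that starts with the word Introduction. This is only a sketch: the sample file and the loose. and strict. prefixes are invented, and it uses POSIX sh, where single quotes are enough to protect the backslash:

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical outline whose first line is not a section heading.
printf '%s\n' 'Introduction' 'some notes' 'I. First' 'body' > outline

# Unescaped: the dot matches any character, so "In" on line 1 satisfies /I./.
csplit -s -f loose. outline '/I./'

# Escaped and anchored: only a literal "I." at the start of a line matches.
csplit -s -f strict. outline '/^I\./'

# Compare where the second piece of each run begins.
head -n 1 loose.01 strict.01
```

The loose run breaks at Introduction, which is not what we wanted; the strict run breaks at the Section I heading.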

You can also break a file at repeated occurrences of the same pattern. Let's say you have a file that describes 50 ways to cook a chicken, and you want each method stored in a separate file. The sections begin with headings WAY #1, WAY #2, and so on. To divide the file, use csplit's repeat argument:

% csplit -s -f cook. fifty_ways /^WAY/ "{49}"

This command splits the file at the first line that begins with WAY, and the number in braces tells csplit to repeat the split 49 more times. Note that a caret (^) (Section 32.5) is used to match the beginning of the line and the C shell requires quotes around the braces (Section 28.4). As before, the first file (cook.00) holds whatever precedes the first heading (it is empty if the file starts right at WAY #1), so the 50 methods land in cook.01 through cook.50:

% ls cook.*
cook.00
cook.01
  ...
cook.49
cook.50
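Here's a scaled-down sketch of the repeat argument, with three ways instead of fifty. The sample file is invented, and the commands are written for POSIX sh. Because the sample has a title line ahead of the first heading, that title gets a piece of its own:

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical recipe file: a title line, then three headed sections.
printf '%s\n' 'Ways to Cook a Chicken' \
    'WAY #1' 'fry' 'WAY #2' 'bake' 'WAY #3' 'grill' > ways

# Split at the first /^WAY/, then repeat that split 2 more times.
csplit -s -f cook. ways '/^WAY/' '{2}'

# Show the first line of every piece.
for f in cook.*; do
    printf '%s: %s\n' "$f" "$(head -n 1 "$f")"
done
```

Three splits give four pieces: the text before the first heading, plus one piece per heading.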

Quite often, when you want to split a file repeatedly, you don't know or don't care how many files will be created; you just want to make sure that the necessary number of splits takes place. In this case, it makes sense to specify a repeat count that is slightly higher than what you need (the maximum is 99). Unfortunately, if the repeat count asks for more splits than the file can supply, csplit reports an "out of range" error. Worse, when csplit encounters an error, it removes any files it created along the way before exiting. (A bug, if you ask me.) This is where the -k option comes in. Specify -k to keep the files around, even when the "out of range" message occurs.
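The sketch below tries an oversized repeat count with and without -k. It assumes POSIX sh and an invented three-section sample file; the wording of the error message varies between versions, so it is simply discarded here. The last command assumes GNU csplit, which (unlike the System V version described in this article) accepts '{*}' to mean "repeat as many times as possible":

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical recipe file: a title line, then three headed sections.
printf '%s\n' 'Ways to Cook a Chicken' \
    'WAY #1' 'fry' 'WAY #2' 'bake' 'WAY #3' 'grill' > ways

# Too many repeats, no -k: csplit fails and cleans up after itself.
csplit -s -f nokeep. ways '/^WAY/' '{9}' 2>/dev/null || true
ls nokeep.* 2>/dev/null || echo 'no nokeep.* files left'

# Too many repeats with -k: csplit still fails, but the pieces survive.
csplit -s -k -f cook. ways '/^WAY/' '{9}' 2>/dev/null || true
ls cook.*

# GNU extension (an assumption about your csplit): repeat until no match is left.
csplit -s -f star. ways '/^WAY/' '{*}'
ls star.*
```

With GNU csplit the '{*}' form avoids the guessing game altogether, since running out of matches is not treated as an error.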

csplit allows you to break a file at some number of lines above or below a given search pattern. For example, to break a file at the line that is five lines below the one containing Sincerely, you could type:

% csplit -s -f letter. all_letters /Sincerely/+5

This situation might arise if you have a series of business letters strung together in one file. Each letter begins differently, but each one begins five lines after the previous letter's Sincerely line. Here's another example, adapted from AT&T's Unix User's Reference Manual:

% csplit -s -k -f routine. prog.c '%main(%' '/^}/+1' '{99}'

The idea is that the file prog.c contains a group of C routines, and we want to place each one in a separate file (routine.00, routine.01, etc.). The first pattern uses % because we want to discard anything before main. The next argument says, "Look for a closing brace at the beginning of a line (the conventional end of a routine) and split on the following line (the assumed beginning of the next routine)." Repeat this split up to 99 times, using -k to preserve the created files.[62]

[62]In this case, the repeat can actually occur only 98 times, since we've already specified two arguments and the maximum number is 100.
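To try the routine-splitting command on something small, you could use a sketch like this one. The two-function prog.c is invented, and the commands are written for POSIX sh. Since the repeat count is far larger than the number of routines, csplit is expected to exit with an error once it runs out of closing braces, which is why -k matters:

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical C source: a preamble, then two routines.
cat > prog.c <<'EOF'
#include <stdio.h>

int main(void)
{
    return 0;
}
int helper(void)
{
    return 1;
}
EOF

# Skip everything before main, then break on the line after each
# closing brace in column one. The error from the unused repeats is
# discarded, and -k keeps the pieces written before it.
csplit -s -k -f routine. prog.c '%main(%' '/^}/+1' '{99}' 2>/dev/null || true

# Each piece should start with the first line of a routine.
head -n 1 routine.00 routine.01
```

Some versions may also leave an empty trailing piece behind once the repeats run out; GNU csplit has a -z option that removes empty output files.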

The csplit command takes line-number arguments in addition to patterns. You can say:

% csplit stuff 50 373 955

to create files split at some arbitrary line numbers. In that example, the new file xx00 will have lines 1-49 (49 lines total), xx01 will have lines 50-372 (323 lines total), xx02 will have lines 373-954 (582 lines total), and xx03 will hold the rest of stuff.
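You can check that arithmetic with a throwaway file. In this sketch (POSIX sh), stuff is assumed to be 1000 lines long, with line N containing just the number N:

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical 1000-line input; line N holds the number N.
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }' > stuff

# Each line-number argument starts a new piece at that line.
csplit -s stuff 50 373 955

# Line counts per piece, then the first line of the second piece.
wc -l xx00 xx01 xx02 xx03
head -n 1 xx01
```

The counts should come out as 49, 323, 582, and (for this 1000-line sample) 46, and xx01 should begin with line 50.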

csplit works like split if you repeat the argument. The command:

% csplit top_ten_list 10 "{18}"

breaks the list into 19 segments of 10 lines each.[63]

[63]Not really. The first file contains only nine lines (1-9); the rest contain 10. In this case, you're better off saying split -10 top_ten_list.
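A side-by-side sketch makes the off-by-one visible. It assumes a 190-line top_ten_list (19 lists of 10 lines, generated here with awk), uses POSIX sh, and calls split with -l 10, the standard spelling of the older -10 form; the c. and s. prefixes are arbitrary:

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical input: 190 numbered lines.
awk 'BEGIN { for (i = 1; i <= 190; i++) print i }' > top_ten_list

# csplit breaks *at* lines 10, 20, ..., 190, so the first piece has 9 lines.
csplit -s -f c. top_ten_list 10 '{18}'

# split simply counts off 10 lines at a time.
split -l 10 top_ten_list s.

echo "csplit pieces: $(ls c.* | wc -l | tr -d ' ')"
echo "split pieces:  $(ls s.* | wc -l | tr -d ' ')"
echo "lines in c.00: $(wc -l < c.00 | tr -d ' ')"
echo "lines in s.aa: $(wc -l < s.aa | tr -d ' ')"
```

With this sample, csplit ends up with one more piece than split, because the leftover line 190 gets a file of its own.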

-- DG