[Chapter 35] 35.10 Splitting Files by Context: csplit

Like split (35.9), csplit lets you break a file into smaller pieces, but csplit (context split) also allows the file to be broken into different-sized pieces, according to context. With csplit, you give the locations (line numbers or search patterns) at which to break each section. csplit comes with System V, but there are also freely available versions.

Let's look at search patterns first. Suppose you have an outline consisting of three main sections. You could create a separate file for each section by typing:

% csplit outline /I./ /II./ /III./
28        number of characters in each file
415                   .
372                   .
554                   .
% ls
outline
xx00      outline title, etc.
xx01      Section I
xx02      Section II
xx03      Section III

This command creates four new files (outline remains intact). csplit displays the character counts for each file. Note that the first file (xx00) contains any text up to but not including the first pattern, and that xx01 contains the first section, as you'd expect. This is why the naming scheme begins with 00. (Even if outline had begun immediately with a I., xx01 would still contain Section I, but xx00 would be empty in this case.)
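To check the numbering for yourself, here's a minimal sketch. It is written in Bourne shell syntax rather than as a C shell session, the scratch directory and the sample outline text are made up for the illustration, and the behavior was reasoned through against GNU csplit, so treat it as an assumption that other versions act identically.

```shell
# Work in a scratch directory so nothing of yours is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# A made-up outline: two title lines, then three short sections.
printf '%s\n' 'Outline title' 'by the author' \
    'I. First section' 'text one' \
    'II. Second section' 'text two' \
    'III. Third section' 'text three' > outline

# Split before each matching line; csplit reports the size of each piece.
csplit outline /I./ /II./ /III./

# Print the first line of every piece to see where each one starts.
for piece in xx00 xx01 xx02 xx03; do
    printf '%s begins: %s\n' "$piece" "$(head -n 1 "$piece")"
done
```

The loop should show xx00 starting at the title line and xx01 starting at the I. heading.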

If you don't want to save the text that occurs before a specified pattern, use a percent sign as the pattern delimiter:

% csplit outline %I.% /II./ /III./
415
372
554
% ls
outline
xx00      Section I
xx01      Section II
xx02      Section III

The preliminary text file has been suppressed, and the created files now begin where the actual outline starts. The file numbering is off by one, however: xx00 now holds Section I.
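The effect of the percent-sign delimiters can be seen with a small sketch. As before, this is Bourne shell syntax, the outline text and scratch directory are invented for the example, and GNU csplit behavior is assumed.

```shell
# Work in a scratch directory so nothing of yours is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# A made-up outline: two title lines, then three short sections.
printf '%s\n' 'Outline title' 'by the author' \
    'I. First section' 'text one' \
    'II. Second section' 'text two' \
    'III. Third section' 'text three' > outline

# %I.% skips everything before the first matching line instead of saving it.
csplit outline %I.% /II./ /III./

# Only three pieces exist now, and numbering still starts at 00.
for piece in xx00 xx01 xx02; do
    printf '%s begins: %s\n' "$piece" "$(head -n 1 "$piece")"
done
```

The two title lines are not written anywhere, so xx00 starts at the I. heading and no xx03 is created.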

Let's make some further refinements. We'll use the -s option to suppress the display of the character counts, and we'll use the -f option to specify a file prefix other than the conventional xx:

% csplit -s -f part. outline /I./ /II./ /III./
% ls
outline
part.00
part.01
part.02
part.03

There's still a slight problem though. In search patterns, a period is a metacharacter (26.10) that matches any single character, so the pattern /I./ may inadvertently match words like Introduction. We need to escape the period with a backslash; however, the backslash has meaning both to the pattern and to the shell, so in fact, we need either to use a double backslash or to surround the pattern in quotes (8.14). A subtlety, yes, but one that can drive you crazy if you don't remember it. Our command line becomes:

% csplit -s -f part. outline "/I\./" /II./ /III./
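A short sketch makes the difference visible. The file contents, the scratch directory, and the loose. and strict. prefixes are hypothetical, the syntax is Bourne shell, and GNU csplit is assumed.

```shell
# Work in a scratch directory so nothing of yours is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Line 2 contains the word "Introduction"; the real heading is on line 3.
printf '%s\n' 'Outline title' 'Introduction to the plan' \
    'I. First section' 'text one' > outline

# Unescaped period: "I." also matches the "In" of "Introduction".
csplit -s -f loose. outline /I./

# Escaped period inside quotes: only a literal "I." matches.
csplit -s -f strict. outline "/I\./"

echo "loose split starts at:  $(head -n 1 loose.01)"
echo "strict split starts at: $(head -n 1 strict.01)"
```

The loose pattern breaks one line too early, at the Introduction line; the escaped one breaks at the heading.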

You can also break a file at repeated occurrences of the same pattern. Let's say you have a file that describes 50 ways to cook a chicken, and you want each method stored in a separate file. Each section begins with a heading: WAY #1, WAY #2, and so on. To divide the file, use csplit's repeat argument:

% csplit -s -f cook. fifty_ways /^WAY/ "{49}"

This command splits the file at the first occurrence of WAY, and the number in braces tells csplit to repeat the split 49 more times. Note that a caret is used to match the beginning of the line and that the C shell requires quotes around the braces (9.5). Fifty splits produce 51 files: cook.00 holds whatever precedes the first heading (it is empty if the file starts with WAY #1), and cook.01 through cook.50 hold the 50 methods:

% ls cook.*
cook.00
cook.01
  ...
cook.49
cook.50
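Here is a scaled-down sketch of the repeat argument, with five methods instead of 50. The file name five_ways and its contents are invented, the syntax is Bourne shell, and GNU csplit is assumed. One thing worth watching: the piece numbered 00 always holds whatever comes before the first match, so a pattern applied five times yields six pieces.

```shell
# Work in a scratch directory so nothing of yours is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# A title line followed by five two-line methods.
{
    echo 'Five Ways to Cook a Chicken'
    for n in 1 2 3 4 5; do
        printf 'WAY #%s\nsteps for method %s\n' "$n" "$n"
    done
} > five_ways

# Split at the first WAY heading, then repeat four more times.
csplit -s -f cook. five_ways /^WAY/ "{4}"

# Count the pieces and show where the first and last methods landed.
pieces=$(find . -name 'cook.*' | wc -l | tr -d ' ')
echo "pieces: $pieces"
echo "cook.01 begins: $(head -n 1 cook.01)"
echo "cook.05 begins: $(head -n 1 cook.05)"
```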

Quite often, when you want to split a file repeatedly, you don't know or don't care how many files will be created; you just want to make sure that the necessary number of splits takes place. In this case, it makes sense to specify a repeat count that is slightly higher than what you need (maximum is 99). Unfortunately, if you tell csplit to create more files than it's able to, this produces an "out of range" error. Furthermore, when csplit encounters an error, it exits by removing any files it created along the way. (A bug, if you ask me.) This is where the -k option comes in. Specify -k to keep the files around, even when the "out of range" message occurs.
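The following sketch compares the two behaviors side by side. The file three_ways, the lost. and kept. prefixes, and the scratch directory are all hypothetical; the syntax is Bourne shell and GNU csplit is assumed. The wording of the error message differs between versions, so the sketch throws it away and looks only at the exit status and the files left behind.

```shell
# Work in a scratch directory so nothing of yours is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# A title line followed by three two-line methods.
{
    echo 'Three Ways to Cook a Chicken'
    for n in 1 2 3; do
        printf 'WAY #%s\nsteps for method %s\n' "$n" "$n"
    done
} > three_ways

# Ask for ten splits when only three headings exist, without -k.
lost_status=0
csplit -s -f lost. three_ways /^WAY/ "{9}" 2>/dev/null || lost_status=$?
lost_count=$(find . -name 'lost.*' | wc -l | tr -d ' ')

# The same request with -k keeps whatever was written before the error.
kept_status=0
csplit -s -k -f kept. three_ways /^WAY/ "{9}" 2>/dev/null || kept_status=$?
kept_count=$(find . -name 'kept.*' | wc -l | tr -d ' ')

echo "without -k: $lost_count pieces left (exit status $lost_status)"
echo "with -k:    $kept_count pieces left (exit status $kept_status)"
```

Both runs fail, but only the second one leaves its pieces on disk.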

csplit allows you to break a file at some number of lines above or below a given search pattern. For example, to break a file at the line that is five lines below the one containing Sincerely, you could type:

% csplit -s -f letter. all_letters /Sincerely/+5

This situation might arise if you have a series of business letters strung together in one file. Each letter begins differently, but each one begins five lines after the previous letter's Sincerely line. Here's another example, adapted from AT&T's UNIX User's Reference Manual:
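To see how the offset is counted, here is a sketch with two invented letters in which the next letter starts exactly five lines after each Sincerely line. The letter text and scratch directory are assumptions, the syntax is Bourne shell, and GNU csplit is assumed.

```shell
# Work in a scratch directory so nothing of yours is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Two made-up letters; "Sincerely," sits on line 3, the next letter on line 8.
printf '%s\n' 'Dear Ann,' 'Thanks for the order.' \
    'Sincerely,' '' 'Pat' 'Sales' '' \
    'Dear Bob,' 'Your invoice is attached.' \
    'Sincerely,' '' 'Pat' 'Sales' '' > all_letters

# Break five lines below the first line that contains "Sincerely".
csplit -s -f letter. all_letters /Sincerely/+5

# Each piece should hold one complete seven-line letter.
wc -l letter.00 letter.01
echo "letter.01 begins: $(head -n 1 letter.01)"
```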

% csplit -s -k -f routine. prog.c '%main(%' '/^}/+1' '{99}'

The idea is that the file prog.c contains a group of C routines, and we want to place each one in a separate file (routine.00, routine.01, etc.). The first pattern uses % because we want to discard anything before main. The next argument says, "Look for a closing brace at the beginning of a line (the conventional end of a routine) and split on the following line (the assumed beginning of the next routine)." Repeat this split up to 99 times, using -k to preserve the created files. [4]

[4] In this case, the repeat can actually occur only 98 times, since we've already specified two arguments and the maximum number is 100.
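A toy version of prog.c shows the command at work. The three routines below are invented, the syntax is Bourne shell, and GNU csplit is assumed. Because the repeat count is far too high for this file, csplit ends with an error; the sketch discards the message and relies on -k. Depending on the version, an empty final piece may also be left behind.

```shell
# Work in a scratch directory so nothing of yours is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# A made-up source file: a preamble, then three tiny routines.
printf '%s\n' '#include <stdio.h>' '' \
    'int main(void)' '{' '    return 0;' '}' \
    'void helper(void)' '{' '}' \
    'void other(void)' '{' '}' > prog.c

# Skip to main, then break after every closing brace in column one.
csplit -s -k -f routine. prog.c '%main(%' '/^}/+1' '{99}' 2>/dev/null || true

# Show where each routine's file begins.
for piece in routine.00 routine.01 routine.02; do
    printf '%s begins: %s\n' "$piece" "$(head -n 1 "$piece")"
done
```

The #include line comes before main, so it should not appear in any of the pieces.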

The csplit command takes line-number arguments in addition to patterns. You can say:

% csplit stuff 50 373 955

to create files split at some arbitrary line numbers. In that example, the new file xx00 will have lines 1-49 (49 lines total), xx01 will have lines 50-372 (323 lines total), xx02 will have lines 373-954 (582 lines total), and xx03 will hold the rest of stuff.
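The arithmetic is easy to confirm with a sketch. Here a generated 1000-line file stands in for stuff; the file length and its contents are assumptions, the syntax is Bourne shell, and GNU csplit is assumed.

```shell
# Work in a scratch directory so nothing of yours is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Generate 1000 numbered lines to play the part of "stuff".
awk 'BEGIN { for (i = 1; i <= 1000; i++) print "line " i }' > stuff

# Each number names the line that starts the next piece.
csplit -s stuff 50 373 955

# Line counts: 49, 323, 582, and whatever is left (46 lines here).
wc -l xx00 xx01 xx02 xx03
echo "xx01 begins: $(head -n 1 xx01)"
```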

csplit works like split if you repeat the argument. The command:

% csplit top_ten_list 10 "{18}"

breaks the list into 19 segments of 10 lines each. [5]

[5] Not really. The first file contains only nine lines (1-9); the rest contain 10. In this case, you're better off saying split -10 top_ten_list (or split -l 10 top_ten_list, the form POSIX specifies).
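A side-by-side sketch shows the nine-line first piece. The 40-line stand-in for top_ten_list, the byline. and even. prefixes, and the smaller repeat count are all made up for the illustration; the syntax is Bourne shell and GNU csplit is assumed.

```shell
# Work in a scratch directory so nothing of yours is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Forty numbered items stand in for the list.
awk 'BEGIN { for (i = 1; i <= 40; i++) print "item " i }' > top_ten_list

# csplit breaks before lines 10, 20, and 30, so the first piece is short
# and the last piece picks up the leftover line.
csplit -s -f byline. top_ten_list 10 "{2}"

# split simply counts off ten lines at a time.
split -l 10 top_ten_list even.

wc -l byline.00 byline.01 byline.03
wc -l even.aa even.ad
```

The csplit pieces come out as 9, 10, 10, and 11 lines, while every split piece has exactly 10.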

- DG