43.21 Preprocessing troff Input with sed

On a typewriter-like device (including a CRT), an em-dash is typed as a pair of hyphens (--). [2] In typesetting, it is printed as a single, long dash. troff provides a special character name for the em-dash, but it is inconvenient to type \(em, and the escape sequence is also inappropriate for use with nroff.

[2] Typists often use three hyphens (---) for an em-dash, and two (--) for the shorter en-dash.

Similarly, a typesetter provides "curly" quotation marks (“ and ”) as opposed to a typewriter's straight quotes ("). In standard troff, you can substitute two backquote characters (``) for an open quote and two frontquote characters ('') for a closed quote; these characters appear as “ and ”. But it would be much better if we could just continue to type " and have the computer do the dirty work.

A peculiarity of troff is that it generates the space before each word in the font used at the beginning of that word. This means that when we mix a constant-width font such as Courier within text, we get a noticeably large space before each word, which can be distracting for readers. For example: The following text is in Courier; note the spaces. The fix for this is to force troff to generate the space in the previous font by inserting a no-space character (\&) before each constant-width font change. Done by hand, this can turn into a large undertaking.

The solution for each of these problems is to preprocess troff input with sed (34.24). This is an application that shows sed in its role as a true stream editor, making edits in a pipeline: edits that are never written back into a file.

We almost never invoke troff directly. Instead, we invoke it with a script that strings together a pipeline including the standard preprocessors (when appropriate) as well as doing this special preprocessing with sed.
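
As a rough sketch of what such a wrapper might look like (not the script we actually use), the function below runs a file through a sed script and pipes the result into a formatter command. The names fmt_with_sed, fixes.sed, and chapter.ms are hypothetical, chosen only for illustration.

```shell
# Hypothetical wrapper: preprocess FILE with the edits in SCRIPT, then pipe the
# result into the formatter command given as the remaining arguments.
# The input file itself is never modified.
fmt_with_sed() {
    script=$1
    file=$2
    shift 2
    sed -f "$script" "$file" | "$@"
}

# Example call (assumes tbl and troff are installed and that fixes.sed exists):
#   fmt_with_sed fixes.sed chapter.ms sh -c 'tbl | troff -ms'
```

Because sed writes only to the pipe, the source file keeps its plain hyphens and straight quotes.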

The sed commands themselves are fairly simple.

The following command changes two consecutive dashes into an em-dash:
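
A minimal sketch of such a command, wrapped in a shell function so it can be tried on a sample line. The function name emdash is hypothetical; the substitution is what matters.

```shell
# emdash is a hypothetical wrapper name used only for this sketch.
emdash() {
    # In the replacement, \\ produces one literal backslash,
    # so every "--" becomes the troff escape \(em.
    sed 's/--/\\(em/g'
}

printf '%s\n' 'A pause--like this--stands out.' | emdash
# prints: A pause\(emlike this\(emstands out.
```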


We double the backslash in the replacement string for \(em, since the backslash has a special meaning to sed.

However, there may be cases in which we don't want this substitution command to be applied. What if someone is using hyphens to draw a horizontal line? We can refine the script to exclude lines containing three or more consecutive hyphens. To do this, we use the ! address modifier (34.19):
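
One way to write the refined command, again wrapped in a hypothetical function (emdash_safe) so it can be run against a line of text and a ruled line:

```shell
# emdash_safe is a hypothetical wrapper name used only for this sketch.
emdash_safe() {
    # /---/ selects lines containing three consecutive hyphens;
    # the ! applies the substitution to every line NOT selected.
    sed '/---/!s/--/\\(em/g'
}

printf '%s\n' 'one--two' '----------' | emdash_safe
# prints:
# one\(emtwo
# ----------
```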


It may take a moment to penetrate this syntax. What's different is that we use a pattern address to restrict the lines that are affected by the substitute command, and we use ! to reverse the sense of the pattern match. It says, simply, "If you find a line containing three consecutive hyphens, don't apply the edit." On all other lines, the substitute command will be applied.

Similarly, to deal with the font change problem, we can use sed to search for all strings matching \f(CW, \f(CI, and \f(CB, and insert \& before them. This can be written as follows:
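
A sketch of one such substitution, using a bracket expression to cover all three fonts at once. The function name nospace is hypothetical.

```shell
# nospace is a hypothetical wrapper name used only for this sketch.
nospace() {
    # Pattern:      \\f(C[WIB]  matches \f(CW, \f(CI, or \f(CB
    #               (an unescaped "(" is an ordinary character here).
    # Replacement:  \\ is a literal backslash, \& is a literal ampersand,
    #               and the final & stands for the matched text.
    sed 's/\\f(C[WIB]/\\\&&/g'
}

printf '%s\n' 'Type \f(CWls\fP to list files.' | nospace
# prints: Type \&\f(CWls\fP to list files.
```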


To deal with the open and closed quote problem, the script needs to be more involved because there are many separate cases that must be accounted for. You need to make sed smart enough to change double quotes to open quotes only at the beginning of words and to change them to closed quotes only at the end of words. Such a script might look like the one below, which obviously could be shortened by judicious application of \([...]\) (34.10) regular expression syntax, but it is shown in its long form for effect.

s/"? /''? /g
s/ "/ ``/g
s/" /'' /g
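
To try rules like these on real text, they can be put in a script file and run with sed -f. The sketch below uses the three rules above plus two assumed additions for quotes at the very start and end of a line; the function name curly_quotes is hypothetical, and many cases (quotes next to tabs, brackets, or punctuation) are still not covered.

```shell
# curly_quotes is a hypothetical wrapper; it writes the rules to a temporary
# file so the backquotes and apostrophes need no shell quoting.
curly_quotes() {
    script=$(mktemp) || return 1
    cat > "$script" <<'EOF'
s/^"/``/
s/"$/''/
s/"? /''? /g
s/ "/ ``/g
s/" /'' /g
EOF
    sed -f "$script"
    status=$?
    rm -f "$script"
    return $status
}

printf '%s\n' 'He said "hello" to me.' | curly_quotes
# prints: He said ``hello'' to me.
```

The first two rules (the ^ and $ cases) are assumptions added for this sketch.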

The preceding code shows the kind of contortions you need to go through to capture all the possible situations in which quotation marks appear. The solution to the other problems mentioned earlier in the article is left for your imagination. If you prefer, a more complete "typesetting preprocessor" script written in sed, and suitable for integration into a troff environment (perhaps with a bit of tweaking), can be found on the disc.

In addition to the changes described above, that script tightens up the spacing of ellipses (...) and leaves the text between certain pairs of troff macros (34.19) untouched.
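
The "hands off" behavior can be sketched by combining an address range with the ! modifier. The pair .DS/.DE (the ms display macros) is an assumption chosen for illustration and may not be the pair the script on the disc protects; the function name emdash_outside_displays is hypothetical.

```shell
# emdash_outside_displays is a hypothetical wrapper name.
emdash_outside_displays() {
    # The range /^\.DS/,/^\.DE/ selects each display, including its macro
    # lines; the ! applies the substitution only outside that range.
    sed '/^\.DS/,/^\.DE/!s/--/\\(em/g'
}

printf '%s\n' 'text--here' '.DS' 'cmd --flag' '.DE' 'more--text' |
    emdash_outside_displays
# prints:
# text\(emhere
# .DS
# cmd --flag
# .DE
# more\(emtext
```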

- TOR,

