String Functions (sed & awk, Second Edition)

A loop supplies numbers from 97 to 122, which produce ASCII characters from a to z.

That leaves us with three basic built-in string functions to discuss: index(), substr(), and length().

9.2.1. Substrings

The index() and substr() functions both deal with substrings. Given a string s, index(s,t) returns the leftmost position where string t is found in s. The beginning of the string is position 1 (which is different from the C language, where the first character in a string is at position 0). Look at the following example:

pos = index("Mississippi", "is")

The value of pos is 2. If the substring is not found, the index() function returns 0.
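A quick way to confirm the 1-based position and the not-found result is to call index() from a BEGIN rule. The following is only a sketch; the shell function name show_index is our own invention for illustration and is not part of awk.

```shell
# show_index is a hypothetical helper: it prints the position of $2 within $1.
show_index() {
    awk -v s="$1" -v t="$2" 'BEGIN { print index(s, t) }'
}

show_index "Mississippi" "is"    # leftmost match starts at position 2
show_index "Mississippi" "ss"    # position 3
show_index "Mississippi" "xyz"   # substring not found, so 0
```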

Given a string s, substr(s,p) returns the characters beginning at position p. The following example creates a phone number without an area code.

phone = substr("707-555-1111", 5)

You can also supply a third argument, which is the number of characters to return. The next example returns just the area code:

area_code = substr("707-555-1111", 1, 3)
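To see the two-argument and three-argument forms side by side, here is a minimal sketch that pulls the same phone number apart. The function name split_phone and the variable names are assumptions made for this illustration.

```shell
# split_phone is a hypothetical helper that takes a number such as 707-555-1111.
split_phone() {
    awk -v phone="$1" 'BEGIN {
        area   = substr(phone, 1, 3)   # three characters starting at position 1
        prefix = substr(phone, 5, 3)   # three characters starting at position 5
        line   = substr(phone, 9)      # everything from position 9 to the end
        print area, prefix, line
    }'
}

split_phone "707-555-1111"    # prints: 707 555 1111
```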

The two functions can be and often are used together, as in the next example. This example capitalizes the first letter of the first word for each input record.

awk '# caps - capitalize 1st letter of 1st word
# initialize strings
BEGIN { upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        lower = "abcdefghijklmnopqrstuvwxyz" 
}

# for each input line
{
# get first character of first word
	FIRSTCHAR = substr($1, 1, 1)
# get position of FIRSTCHAR in lowercase string; if 0, ignore
	if (CHAR = index(lower, FIRSTCHAR)) 
		# change $1, using position to retrieve
		# uppercase character 
		$1 = substr(upper, CHAR, 1) substr($1, 2)
# print record
	print $0
}'

This script creates two variables, upper and lower, consisting of uppercase and lowercase letters. Any character that we find in lower can be found at the same position in upper. The first statement of the main procedure extracts a single character, the first one, from the first field. The conditional statement uses the index() function to test whether that character can be found in lower. If CHAR is not 0, then CHAR can be used to extract the uppercase character from upper. There are two substr() function calls: the first retrieves the capitalized letter, and the second gets the rest of the first field, extracting all characters beginning with the second character. The values returned by both substr() calls are concatenated and assigned to $1.

Making an assignment to a field as we do here is a new twist, but it has the added benefit that the record can be output normally. (If the assignment were made to a variable, you'd have to output the variable and then output the record's remaining fields.) The print statement prints the changed record. Let's see it in action:

$ caps
root user
Root user
dale
Dale
Tom
Tom

In a little bit, we'll see how to revise this program to change all characters in a string from lower- to uppercase or vice versa.

9.2.2. String Length

When presenting the awkro program in the previous chapter, we noted that the program was likely to produce lines that exceed 80 characters. After all, the descriptions are quite long. We can find out how many characters are in a string using the built-in function length(). For instance, to evaluate the length of the current input record, we specify length($0). (As it happens, if length() is called without an argument, it returns the length of $0.)

The length() function is often used to find the length of the current input record, in order to determine if we need to break the line.

One way to handle the line break, perhaps more efficiently, is to use the length() function to get the length of each field. By accumulating those lengths, we could specify a line break when a new field causes the total to exceed a certain number.
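The field-by-field approach might be sketched as follows. This is not the script from Chapter 13; the function name wrap_fields and its optional width argument are assumptions for illustration. Note that because the line is rebuilt from fields, runs of whitespace collapse to a single space and empty input lines are dropped.

```shell
# wrap_fields is a hypothetical helper; the maximum width defaults to 80 columns.
wrap_fields() {
    awk -v max="${1:-80}" '{
        total = 0                      # columns used on the current output line
        line = ""
        for (i = 1; i <= NF; i++) {
            need = length($i) + (line == "" ? 0 : 1)   # +1 for the joining space
            if (line != "" && total + need > max) {
                print line             # this field would overflow: break here
                line = $i
                total = length($i)
            } else {
                line = (line == "" ? $i : line " " $i)
                total += need
            }
        }
        if (line != "") print line     # flush whatever is left
    }'
}

printf 'aaa bbb ccc ddd\n' | wrap_fields 7
```

With a width of 7, the sample line comes out as "aaa bbb" followed by "ccc ddd". A single field longer than the width is still printed whole on its own line.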

Chapter 13, "A Miscellany of Scripts", contains a script that uses the length() function to break lines greater than 80 columns wide.

9.2.3. Substitution Functions

Awk provides two substitution functions: sub() and gsub(). The difference between them is that gsub() performs its substitution globally on the input string whereas sub() makes only the first possible substitution. This makes gsub() equivalent to the sed substitution command with the g (global) flag.

Both functions take at least two arguments. The first is a regular expression (surrounded by slashes) that matches a pattern and the second argument is a string that replaces what the pattern matches. The regular expression can be supplied by a variable, in which case the slashes are omitted. An optional third argument specifies the string that is the target of the substitution. If there is no third argument, the substitution is made for the current input record ($0).
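Here is a small sketch of both options at once: the regular expression arrives in a variable (so no slashes), and the third argument limits the substitution to the first field. The function name edit_first_field and the variable names re and repl are hypothetical. Keep in mind that awk processes escape sequences in values passed with -v, so a pattern containing backslashes would need them doubled.

```shell
# edit_first_field is a hypothetical helper: $1 is a regex, $2 its replacement.
edit_first_field() {
    awk -v re="$1" -v repl="$2" '{
        gsub(re, repl, $1)   # re is a variable, so no slashes; target is field 1
        print
    }'
}

printf 'aaa aaa\n' | edit_first_field 'a' 'b'    # prints: bbb aaa
```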

The substitution functions change the specified string directly. You might expect, given the way functions work, that the function returns the new string created when the substitution is made. The substitution functions actually return the number of substitutions made. sub() will always return 1 if successful; both return 0 if not successful. Thus, you can test the result to see if a substitution was made.
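The difference between the return value and the modified string is easy to see in a short sketch. The function name count_subs is made up for this illustration; sub() works on a copy so that gsub() still sees the original record.

```shell
# count_subs is a hypothetical helper that reports substitution counts.
count_subs() {
    awk '{
        copy = $0
        n_first = sub(/s/, "S", copy)   # at most one substitution, made in copy
        n_all   = gsub(/s/, "S")        # every match, made in $0
        print n_first, copy
        print n_all, $0
    }'
}

echo "Mississippi" | count_subs
```

For "Mississippi" this prints "1 MiSsissippi" and then "4 MiSSiSSippi". For an input with no "s", both counts are 0 and the strings are unchanged.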

The following example uses gsub() to replace all occurrences of "UNIX" with "POSIX":

if (gsub(/UNIX/, "POSIX"))
	print

The conditional statement tests the return value of gsub() such that the current input line is printed only if a change is made.

As with sed, if an "&" appears in the substitution string, it will be replaced by the string matched by the regular expression. Use "\&" to output an ampersand. (Remember that to get a literal "\" into a string, you have to type two of them.) Also, note that awk does not "remember" the previous regular expression, as does sed, so you cannot use the syntax "//" to refer to the last regular expression.

The following example surrounds any occurrence of "UNIX" with the troff font-change escape sequences.

gsub(/UNIX/, "\\fB&\\fR")

If the input is "the UNIX operating system", the output is "the \fBUNIX\fR operating system".
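To contrast "&" with an escaped ampersand, consider this sketch. The function name amp_demo and the variable names are assumptions; the angle brackets are arbitrary markers chosen to make the matched text visible.

```shell
# amp_demo is a hypothetical helper showing & versus an escaped ampersand.
amp_demo() {
    awk '{
        matched = $0
        literal = $0
        gsub(/UNIX/, "<&>", matched)    # & stands for the matched text
        gsub(/UNIX/, "\\&", literal)    # \\& in the awk source yields a literal &
        print matched
        print literal
    }'
}

echo "the UNIX operating system" | amp_demo
```

The first line of output is "the <UNIX> operating system" and the second is "the & operating system".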

In Chapter 4, "Writing sed Scripts", we presented the following sed script named do.outline (each "•" stands for a literal tab character):

sed -n '
s/"//g
s/^\.Se /Chapter /p
s/^\.Ah /•A. /p
s/^\.Bh /••B.  /p' $*

Now here's that script rewritten using the substitution functions:

awk '
{
gsub(/"/, "")
if (sub(/^\.Se /, "Chapter ")) print
if (sub(/^\.Ah /, "\tA. ")) print
if (sub(/^\.Bh /, "\t\tB.  ")) print
}' $*

The two scripts are exactly equivalent, printing out only those lines that are changed. For the first edition of this book, Dale compared the run-time of both scripts and, as he expected, the awk script was slower. For the second edition, new timings showed that performance varies by implementation, and in fact, all tested versions of new awk were faster than sed! This is nice, since we have the capabilities in awk to make the script do more things. For instance, instead of using letters of the alphabet, we could number the headings. Here's the revised awk script:

awk '# do.outline -- number headings in chapter.
{
gsub(/"/, "")
}
/^\.Se/ {
	sub(/^\.Se /, "Chapter ") 
	ch = $2
	ah = 0
	bh = 0
	print
	next
}
/^\.Ah/ {
	sub(/^\.Ah /, "\t " ch "." ++ah " ") 
	bh = 0
	print
	next
}
/^\.Bh/ {
	sub(/^\.Bh /, "\t\t " ch "."  ah "." ++bh " ")
	print
}' $*

In this version, we break out each heading into its own pattern-matching rule. This is not necessary but seems more efficient since we know that once a rule is applied, we don't need to look at the others. Note the use of the next statement to bypass further examination of a line that has already been identified.

The chapter number is read as the first argument to the ".Se" macro and is thus the second field on that line. The numbering scheme is done by incrementing a variable each time the substitution is made. The action associated with the chapter-level heading initializes the section-heading counters to zero. The action associated with the top-level heading ".Ah" zeroes the second-level heading counter. Obviously, you can create as many levels of heading as you need. Note how we can specify a concatenation of strings and variables as a single argument to the sub() function.

$ do.outline ch02
Chapter 2 Understanding Basic Operations
         2.1 Awk, by Sed and Grep, out of Ed 
         2.2 Command-line Syntax
                 2.2.1 Scripting
                 2.2.2 Sample Mailing List
         2.3 Using Sed
                 2.3.1 Specifying Simple Instructions
                 2.3.2 Script Files
         2.4 Using Awk
         2.5 Using Sed and Awk Together

If you wanted the option of choosing either numbers or letters, you could maintain both programs and construct a shell wrapper that uses some flag to determine which program should be invoked.
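Such a wrapper might look like the following sketch. Everything named here is an assumption for illustration: the -n flag, the OUTLINE_DIR variable, and the file names outline.letters.awk and outline.numbers.awk, which are taken to hold the two awk programs without their surrounding shell quoting.

```shell
# do_outline is a hypothetical wrapper around two separately saved awk programs.
do_outline() {
    script=outline.letters.awk          # default: lettered headings
    if [ "$1" = "-n" ]; then            # -n is an assumed flag meaning "numbered"
        script=outline.numbers.awk
        shift
    fi
    # Remaining arguments are input files; with none, awk reads standard input.
    awk -f "${OUTLINE_DIR:-.}/$script" "$@"
}
```

With this in place, "do_outline ch02" would run the lettered version and "do_outline -n ch02" the numbered one.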

9.2. String Functions

Table 9.2. Awk's Built-In String Functions
