Grabbing Parts of a String (Unix Power Tools, 3rd Edition)

36.23. Grabbing Parts of a String

How can you parse (split, search) a string of text to find the last word, the second column, and so on? There are a lot of different ways. Pick the one that works best for you -- or invent another one! (Unix has lots of ways to work with strings of text.)

36.23.1. Matching with expr

The expr command (Section 36.21) can grab part of a string with a regular expression. The example below is from a shell script whose last command-line argument is a filename. The two commands below use expr to grab the last argument and all arguments except the last one. The "$*" gives expr a list of all command-line arguments in a single word. (Using "$@" (Section 35.20) here wouldn't work because it gives individually quoted arguments. expr needs all arguments in one word.)

last=`expr "$*" : '.* \(.*\)'`     # LAST ARGUMENT
first=`expr "$*" : '\(.*\) .*'`    # ALL BUT LAST ARGUMENT

Let's look at the regular expression that gets the last word. The leading part of the expression, .* , matches as many characters as it can, followed by a space. This includes all words up to and including the last space. After that, the end of the expression, $.*$, matches the last word.

The regular expression that grabs the first words is the same as the previous one -- but I've moved the  pair. Now it grabs all words up to but not including the last space. The end of the regular expression, .*, matches the last space and last word -- and expr ignores them. So the final .* really isn't needed here (though the space is). I've included the final .* because it follows from the first example.

expr is great when you want to split a string into just two parts. The .* also makes expr good for skipping a variable number of words when you don't know how many words a string will have. But expr is poor at getting, say, the fourth word in a string. And it's almost useless for handling more than one line of text at a time.

36.23.2. Using echo with awk or cut

awk can split lines into words, but it has a lot of overhead and can take some time to execute, especially on a busy system. The cut (Section 21.14) command starts more quickly than awk but it can't do as much.

Both those utilities are designed to handle multiple lines of text. You can tell awk to handle a single line with its pattern-matching operators and its NR variable. You can also run those utilities with a single line of text, fed to the standard input through a pipe from echo. For example, to get the third field from a colon-separated string:

string="this:is:just:a:dummy:string"
field3_awk=`echo "$string" | awk -F: '{print $3}'`
field3_cut=`echo "$string" | cut -d: -f3`

Let's combine two echo commands. One sends text to awk or cut through a pipe; the utility ignores all the text from columns 1-24, then prints columns 25 to the end of the variable text. The outer echo prints The answer is and that answer. Notice that the inner double quotes are escaped with backslashes to keep the Bourne shell from interpreting them before the inner echo runs:

echo "The answer is `echo \"$text\" | awk '{print substr($0,25)}'`"
echo "The answer is `echo \"$text\" | cut -c25-`"

36.23.3. Using set and IFS

The Bourne shell set (Section 35.25) command can be used to parse a single-line string and store it in the command-line parameters (Section 35.20) "$@", $*, $1, $2, and so on. Then you can also loop through the words with a for loop (Section 35.21) and use everything else the shell has for dealing with command-line parameters. Also, you can set the Bourne shell's IFS variable to control how the shell splits the string.

NOTE: The formats used by stty and the behavior of IFS may vary from platform to platform.

By default, the IFS (internal field separator) shell variable holds three characters: SPACE, TAB, and NEWLINE. These are the places that the shell parses command lines.

If you have a line of text -- say, from a database -- and you want to split it into fields, you can put the field separator into IFS temporarily, use the shell's set (Section 35.25) command to store the fields in command-line parameters, then restore the old IFS.

For example, the chunk of a shell script below gets current terminal settings from stty -g, which looks like this:

2506:5:bf:8a3b:3:1c:8:15:4:0:0:0:11:13:1a:19:12:f:17:16:0:0

In the next example, the shell parses the line returned from stty by the backquotes (Section 28.14). It stores x in $1, which stops errors if stty fails for some reason. (Without the x, if stty made no standard output, the shell's set command would print a list of all shell variables.) Then 2506 goes into $2, 5 into $3, and so on. The original Bourne shell can handle only nine parameters (through $9); if your input lines may have more than nine fields, this isn't a good technique. But this script uses the Korn shell, which (along with most other Bourne-type shells) doesn't have that limit.

#!/bin/ksh
oldifs="$IFS"
# Change IFS to a colon:
IFS=:
# Put x in $1, stty -g output in $2 thru ${23}:
set x `stty -g`
IFS="$oldifs"
# Window size is in 16th field (not counting the first "x"):
echo "Your window has ${17} rows."

Because you don't need a subprocess to parse the output of stty, this can be faster than using an external command like cut (Section 21.14) or awk (Section 20.10).

There are places where IFS can't be used because the shell separates command lines at spaces before it splits at IFS. It doesn't split the results of variable substitution or command substitution (Section 28.14) at spaces, though. Here's an example -- three different ways to parse a line from /etc/passwd:

% cat splitter
#!/bin/sh
IFS=:
line='larry:Vk9skS323kd4q:985:100:Larry Smith:/u/larry:/bin/tcsh'
set x $line
echo "case 1: \$6 is '$6'"
set x `grep larry /etc/passwd`
echo "case 2: \$6 is '$6'"
set x larry:Vk9skS323kd4q:985:100:Larry Smith:/u/larry:/bin/tcsh
echo "case 3: \$6 is '$6'"

% ./splitter
case 1: $6 is 'Larry Smith'
case 2: $6 is 'Larry Smith'
case 3: $6 is 'Larry'

Case 1 used variable substitution and case 2 used command substitution; the sixth field contained the space. In case 3, though, with the colons on the command line, the sixth field was split: $6 became Larry and $7 was Smith. Another problem would have come up if any of the fields had been empty (as in larry::985:100:etc...) -- the shell would "eat" the empty field and $6 would contain /u/larry. Using sed with its escaped parentheses (Section 34.11) to do the searching and the parsing could solve the last two problems.

36.23.4. Using sed

The Unix sed (Section 34.1) utility is good at parsing input that you may or may not otherwise be able to split into words, at finding a single line of text in a group and outputting it, and many other things. In this example, I want to get the percentage-used of the filesystem mounted on /home. That information is buried in the output of the df (Section 15.8) command. On my system,[115] df output looks like:

[115]If you are using something other than GNU df, you may need to use the -k switch.

+% df
Filesystem            kbytes    used   avail capacity  Mounted on
   ...
/dev/sd3c            1294854  914230  251139    78%    /work
/dev/sd4c             597759  534123    3861    99%    /home
   ...

I want the number 99 from the line ending with /home. The sed address / \/home$/ will find that line (including a space before the /home makes sure the address doesn't match a line ending with /something/home). The -n option keeps sed from printing any lines except the line we ask it to print (with its p command). I know that the "capacity" is the only word on the line that ends with a percent sign (%). A space after the first .* makes sure that .* doesn't "eat" the first digit of the number that we want to match by [0-9]. The sed escaped-parenthesis operators (Section 34.11) grab that number:

usage=`df | sed -n '/ \/home$/s/.* \([0-9][0-9]*\)%.*/\1/p'`

Combining sed with eval (Section 27.8) lets you set several shell variables at once from parts of the same line. Here's a command line that sets two shell variables from the df output:

eval `df |
sed -n '/ \/home$/s/^[^ ]*  *\([0-9]*\)  *\([0-9]*\).*/kb=\1 u=\2/p'`

The left-hand side of that substitution command has a regular expression that uses sed's escaped parenthesis operators. They grab the "kbytes" and "used" columns from the df output. The right-hand side outputs the two df values with Bourne shell variable-assignment commands to set the kb and u variables. After sed finishes, the resulting command line looks like this:

eval kb=597759 u=534123

Now $kb gives you 597759, and $u contains 534123.

-- JP