[Chapter 48] 48.10 Working with Names and Addresses

One of the simplest applications of awk (33.11) is building a name and address database, which is also a good exercise for learning awk. It involves organizing the information as records and then writing programs that extract information from those records for display in reports. The scripts in this article use nawk (33.12) instead of awk, but the principles are the same.

The first thing to decide is the structure of a record. At the very least we'd like to have the following fields:

Name
Street
City
State
Zip

But we may wish to have a more complex record structure:

Name
Title
Company
Division
Street
City
State
Zip
Phone
Fax
Email
Directory
Comments

It doesn't matter to our programming effort whether the record has five fields or thirteen. What does matter is that you decide on the structure before you begin programming.

The next decision we must make is how to distinguish one field from the next and how to distinguish one record from another. If your records are short, you could have one record per line and use an oddball character as a field delimiter:

Name~Street~City~State~Zip
Name1~Street1~City1~State1~Zip1

The downside of this solution is that it can be difficult to edit the records. (We are going to try to avoid writing programs for automating data entry. Instead, we will assume that you create the record with a text editor, such as vi or Emacs.)
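
As a rough illustration of the one-line format, the sketch below splits each record on the tilde and prints it as a mailing label. It uses plain awk rather than nawk, and the function name print_labels and the sample record are made up for this example.

```shell
# Sketch: print a mailing label from one-record-per-line data.
# Assumes the five-field layout Name~Street~City~State~Zip shown above.
# print_labels and the sample record are hypothetical.
print_labels() {
    # -F'~' makes the tilde the field separator; lines that do not
    # have exactly five fields are silently skipped.
    awk -F'~' 'NF == 5 { printf "%s\n%s\n%s, %s %s\n\n", $1, $2, $3, $4, $5 }' "$@"
}

printf '%s\n' 'Jane Doe~1 Main Street~Springfield~MA~01101' | print_labels
```

The NF == 5 test is a design choice for this sketch: it keeps a malformed line from producing a half-filled label, at the cost of dropping that line without comment.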

Another solution is to put each field on a line by itself and separate the records with a blank line:

Name
Street
City
State
Zip

Name1
Street1
City1
State1
Zip1

This is a good solution, but you have to be careful that the data does not itself contain blank lines. For instance, if you add a field for Company name and not all records have a value for it, you must use a placeholder character to indicate an empty value.
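
One way to read records in this layout is awk's "paragraph mode": setting RS to the empty string makes blank lines the record separator, and setting FS to a newline makes each line a field. The sketch below assumes a six-line record (Name, Company, Street, City, State, Zip) and a "-" placeholder for an empty Company. Both are assumptions made for the example, as is the function name.

```shell
# Sketch: read blank-line-separated records, one field per line.
# Assumed layout per record: Name, Company, Street, City, State, Zip.
# The "-" placeholder for an empty field is hypothetical.
names_and_cities() {
    awk 'BEGIN { RS = ""; FS = "\n" }   # paragraph mode, one line per field
    {
        company = ($2 == "-") ? "(none)" : $2
        printf "%s | %s | %s\n", $1, company, $4
    }' "$@"
}

printf '%s\n' 'Jane Doe' '-' '1 Main Street' 'Springfield' 'MA' '01101' '' \
              'John Roe' 'Acme' '2 Elm Street' 'Boston' 'MA' '02101' |
    names_and_cities
```

Without the placeholder, an empty Company line would be taken as the end of the record, which is exactly the problem described above.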

Another solution is to put each record in its own file and put each field on its own line. This is the record organization we will implement for our program. Two advantages of it are that it permits variable length records and it does not require the use of special delimiter characters. It is therefore pretty easy to create or edit a record. It is also very easy to select a subset of records for processing.

We will give each file a name that uniquely identifies it in the current directory. A list of records is the same as a list of files. Here is a sample record in a file named pmui:

Peter Mui
International Sales Manager
O'Reilly & Associates, Inc.
East Coast Division
90 Sherman Street
Cambridge
MA
01240
617-354-5800
617-661-1116
peter@ora.com
/home/peter
Any number of lines may appear as 
a comment.

In this record, there are thirteen fields, any of which can be blank (but the blank line must be there to save the position), and the last field can have as many lines as needed.

Our record does not contain labels that identify what each field contains. While we could put that information in the record itself, it is better to maintain the labels separately so they can be changed in a single location. (You can create a record template that contains the labels to help you identify fields when adding a new record.)

We have put the labels for these fields in a separate file named dict. We won't show this file because its contents describe the record structure as shown above.
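
Since the article does not show dict, the following is only a guess at a workable layout: one label per line, in the same order as the lines of a record. The helper name make_dict is hypothetical.

```shell
# Sketch of a possible dict file: one label per line, in record order.
# The author's actual dict file is not shown, so this layout is an assumption.
# Usage: make_dict [file]    (writes to ./dict when no file is given)
make_dict() {
    cat > "${1:-dict}" <<'EOF'
Name
Title
Company
Division
Street
City
State
Zip
Phone
Fax
Email
Directory
Comments
EOF
}
```

A layout like this works with a reader that takes the first colon-separated field of each dict line as the label, because a line with no colon is a single field.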

We are going to have three programs and they share the same syntax:

command record-list

The record-list is a list of one or more filenames. You can use wildcard characters, of course, on the command line to specify multiple records.

The first program, read.base, reads the dict file to get the labels and outputs a formatted record.

% read.base record

pmui:
1.  Name:   Peter Mui
2.  Title:   International Sales Manager
3.  Company:   O'Reilly & Associates, Inc.
4.  Division:   East Coast Division
5.  Street:   90 Sherman Street
6.  City:   Cambridge
7.  State:   MA
8.  Zip:   01240
9.  Phone:   617-354-5800
10. Fax:   617-661-1116
11. Email:   peter@ora.com
12. Directory:   /home/peter
13. Comments:   Any number of lines may appear as 
a comment.

read.base first outputs the record name and then lists each field. Let's look at read.base:

nawk 'BEGIN { FS=":"
    # test to see that at least one record was specified
    if (ARGC < 2) {
        print "Please supply record list on command line"
        exit 
    }

    # name of local file containing field labels:
    record_template = "dict"

    # loop to read the record_template
    # field_inc = the number of fields  
    # fields[] = an array of labels indexed by position

    field_inc=0

    while ((getline < record_template) > 0) {
        ++field_inc
        fields[field_inc] = $1
    }
    field_tot=field_inc
}

# Now we are reading the records
# Print filename for each new record
FNR == 1 { 
    field_inc=0
    print "\n" FILENAME ":"
}
{ 

    # Print the field's position, label and value  
    # The last field can have any number of lines without a label.

    if (++field_inc <= field_tot){
        if (field_inc >= 10)
            space = ". "
        else
            space = ".  "

        print field_inc space fields[field_inc] ":\t" $NF 
    }
    else
        print $NF 
}' "$@"

Note that the program is not doing any input validation. If the record is missing a Division name (and you didn't leave the fourth line blank), whatever is on line 4 will match up with Division, even if it's really a street address. One of the uses of read.base is simply to verify that what you entered in the file is correct.
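
A very small check can catch the most common slip, a record that is missing a line. The sketch below is not part of the article's programs: check_records is a hypothetical helper that counts the labels in a dict-style file (assumed to hold one label per line) and reports any record file with fewer lines than that. It uses plain awk rather than nawk.

```shell
# Sketch: warn about record files that have fewer lines than dict has labels.
# check_records is a hypothetical helper, not one of the article's programs.
# Usage: check_records dict-file record...
check_records() {
    [ "$#" -ge 2 ] || { echo "usage: check_records dict-file record..." >&2; return 1; }
    dict=$1; shift
    awk -v dict="$dict" '
    function check(file, lines) {
        if (lines < want)
            printf "%s: %d lines, expected at least %d\n", file, lines, want
    }
    BEGIN {
        # count the labels: one per line of the dict file
        while ((getline line < dict) > 0) want++
        close(dict)
    }
    FNR == 1 && prev != "" { check(prev, n) }   # a new file starts: check the last one
    { prev = FILENAME; n = FNR }
    END { if (prev != "") check(prev, n) }      # check the final file
    ' "$@"
}
```

Run as `check_records dict *` in the record directory, it prints one line per short record and nothing when every record is long enough. It cannot tell whether a line holds the right kind of data, and awk never reports a completely empty file here, so it complements rather than replaces a visual check with read.base.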

If you specify more than one record, then you will get all of those records output in the order that you specified them on the command line.

The second program is mail.base. It extracts mailing label information.

% mail.base pmui

Peter Mui
International Sales Manager
O'Reilly & Associates, Inc.
East Coast Division
90 Sherman Street
Cambridge, MA 01240

If you supply a record-list with several names, you get one mailing label for each record.

Here is the mail.base program:

nawk 'BEGIN { FS="\n"; 

    # test that user supplies a record
    if (ARGC < 2) {

        print "Please supply record list on command line"
        exit 
    }
}

# ignore blank lines
/^$/ { next }

# this is hard-coded to record format;
# print first 5 fields and then print
# city, state zip on one line.
{ 
    # use "%s" so that a "%" in the data is not taken as a format
    if (FNR < 6)
        print $0 
    else if (FNR == 6)
        printf "%s, ", $0
    else if (FNR == 7)
        printf "%s", $0
    else if (FNR == 8)
        printf " %s\n\n", $0
}' "$@"

Variations on this very simple program can be written to extract or compile other pieces of information. You could also output formatting codes used when printing the labels.
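
As one possible variation, the sketch below pulls the name and email address out of each record and prints them as "Name <email>". It assumes the thirteen-line layout used in this article (line 1 is Name, line 11 is Email), uses plain awk rather than nawk, and the function name email_list is made up.

```shell
# Sketch: print "Name <email>" for each record file.
# Assumes line 1 = Name and line 11 = Email, as in the sample record.
# email_list is a hypothetical name.
# Usage: email_list record...
email_list() {
    awk 'FNR == 1              { name = $0 }
         FNR == 11 && $0 != "" { printf "%s <%s>\n", name, $0 }' "$@"
}
```

Like mail.base, this is hard-coded to the record format. Records whose Email line is blank are skipped rather than printed with an empty address.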

The last program is list.base. It prepares a tabular list of names and records and allows you to select a particular record.

% list.base lwalsh pmui jberlin

  # NAME & COMPANY                           FILE           
 1. Linda Walsh, O'Reilly & Associates, Inc. lwalsh         
 2. Peter Mui, O'Reilly & Associates, Inc.   pmui        
 3. Jill Berlin, O'Reilly & Associates, Inc. jberlin        
Select a record by number: 2

When you select the record number, that record is displayed by using read.base. I have not built in any paging capability, so a long list will scroll continuously rather than pause after 24 lines or so.

Here is the list.base program:

nawk 'BEGIN { 
    # Do everything as BEGIN procedure

    # test that user supplied record-list

    if (ARGC < 2) {
        print "Please supply record list on command line"
        exit 
    }

    # Define report format string in one place.
    FMTSTR = "%3s %-40s %-15s\n"

    # print report header

    printf(FMTSTR, "#","NAME & COMPANY", "FILE") 

    # For each record, get Name, Title and Company and print it.
    inc=0
    for (x=1; x < ARGC; x++){
        getline NAME < ARGV[x]
        getline TITLE < ARGV[x]
        getline COMPANY < ARGV[x]
        # close the file so a long record list cannot exhaust open files
        close(ARGV[x])
        record_list[x] = ARGV[x]
        printf(FMTSTR, ++inc ".", NAME ", "  COMPANY, ARGV[x]) 
    }

    # Prompt user to select a record by number

    printf "Select a record by number: "
    getline answer < "-"

    # Call read.base program to display the selected record

    system("read.base " record_list[answer])
}

' "$@"

Different versions of this program can be written to examine individual pieces of information across a set of records.
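
For instance, a version that looks at a single field across all records might count how many records there are for each state. The sketch below assumes the State field is on line 7, as in the sample record. The function name state_counts is hypothetical, and the script uses plain awk rather than nawk.

```shell
# Sketch: count records per state across a set of record files.
# Assumes line 7 of every record holds the State field.
# state_counts is a hypothetical name.
# Usage: state_counts record...
state_counts() {
    awk 'FNR == 7 && $0 != "" { count[$0]++ }
         END { for (s in count) printf "%s %d\n", s, count[s] }' "$@" | sort
}
```

The output is piped through sort because awk does not guarantee the order in which `for (s in count)` visits array elements.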

Article 45.22 shows how to write a shell script that creates a prompt-driven front end to collect names and addresses. (It needs to be modified to put out a blank line for empty fields and not to write the labels into the file.)

- DD