home | O'Reilly's CD bookshelfs | FreeBSD | Linux | Cisco | Cisco Exam  


1.15. Parsing Comma-Separated Data

Problem

You have a data file containing comma-separated values that you need to read in, but these data fields may have quoted commas or escaped quotes in them. Most spreadsheets and database programs use comma-separated values as a common interchange format.

Solution

Use the procedure in Mastering Regular Expressions .

sub parse_csv {
    my $text = shift;      # record containing comma-separated values
    my @new  = ();
    push(@new, $+) while $text =~ m{
        # the first part groups the phrase inside the quotes.
        # see explanation of this pattern in MRE
        "([^\"\\]*(?:\\.[^\"\\]*)*)",?
           |  ([^,]+),?
           | ,
       }gx;
       push(@new, undef) if substr($text, -1,1) eq ',';
       return @new;      # list of values that were comma-separated
}  

Or use the standard Text::ParseWords module.

use 

Text::ParseWords;

sub parse_csv {
    return quoteword(",",0, $_[0]);
}

Discussion

Comma-separated input is a deceptive and complex format. It sounds simple, but involves a fairly complex escaping system because the fields themselves can contain commas. This makes the pattern matching solution complex and rules out a simple split /,/ .

Fortunately, Text::ParseWords hides the complexity from you. Pass its quotewords function two arguments and the CSV string. The first argument is the separator (a comma, in this case) and the second is a true or false value controlling whether the strings are returned with quotes around them.

If you want to represent quotation marks inside a field delimited by quotation marks, escape them with backslashes " like \"this\" ". Quotation marks and backslashes are the only characters that have meaning backslashed. Any other use of a backslash will be left in the output string.

Here's how you'd use the parse_csv subroutines. The q<> is just a fancy quote so we didn't have to backslash everything.

$line = q<XYZZY,"","O'Reilly, Inc","Wall, Larry","a \"glug\" bit,",5,
    "Error, Core Dumped">;
@fields = parse_csv($line);
for ($i = 0; $i < @fields; $i++) {
    print "$i : $fields[$i]\n";
}




0 : XYZZY








1 :








2 : O'Reilly, Inc








3 : Wall, Larry








4 : a \"glug\" bit,








5 : 5








6 : Error, Core Dumped



See Also

The explanation of regular expression syntax in perlre (1) and Chapter 2 of Programming Perl ; the documentation for the standard Text::ParseWords module (also in Chapter 7 of Programming Perl ); the section "An Introductory Example: Parsing CSV Text" in Chapter 7 of Mastering Regular Expressions