home | O'Reilly's CD bookshelfs | FreeBSD | Linux | Cisco | Cisco Exam  


Book HomePHP CookbookSearch this book

1.11. Parsing Fixed-Width Delimited Data

1.11.3. Discussion

Data in which each field is allotted a fixed number of characters per line may look like this list of books, titles, and publication dates:

$booklist=<<<END
Elmer Gantry             Sinclair Lewis1927
The Scarlatti InheritanceRobert Ludlum 1971
The Parsifal Mosaic      Robert Ludlum 1982
Sophie's Choice          William Styron1979
END;

In each line, the title occupies the first 25 characters, the author's name the next 14 characters, and the publication year the next 4 characters. Knowing those field widths, it's straightforward to use substr( ) to parse the fields into an array:

$books = explode("\n",$booklist);

for($i = 0, $j = count($books); $i < $j; $i++) {
  $book_array[$i]['title'] = substr($books[$i],0,25);
  $book_array[$i]['author'] = substr($books[$i],25,14);
  $book_array[$i]['publication_year'] = substr($books[$i],39,4);
}

Exploding $booklist into an array of lines makes the looping code the same whether it's operating over a string or a series of lines read in from a file.

The loop can be made more flexible by specifying the field names and widths in a separate array that can be passed to a parsing function, as shown in the pc_fixed_width_substr( ) function in Example 1-3.

Example 1-3. pc_fixed_width_substr( )

function pc_fixed_width_substr($fields,$data) {
  $r = array();
  for ($i = 0, $j = count($data); $i < $j; $i++) {
    $line_pos = 0;
    foreach($fields as $field_name => $field_length) {
      $r[$i][$field_name] = rtrim(substr($data[$i],$line_pos,$field_length));
      $line_pos += $field_length;
    }
  }
  return $r;
}

$book_fields = array('title' => 25,
                     'author' => 14,
                     'publication_year' => 4);

$book_array = pc_fixed_width_substr($book_fields,$books);

The variable $line_pos keeps track of the start of each field, and is advanced by the previous field's width as the code moves through each line. Use rtrim( ) to remove trailing whitespace from each field.

You can use unpack( ) as a substitute for substr( ) to extract fields. Instead of specifying the field names and widths as an associative array, create a format string for unpack( ). A fixed-width field extractor using unpack( ) looks like the pc_fixed_width_unpack( ) function shown in Example 1-4.

Example 1-4. pc_fixed_width_unpack( )

function pc_fixed_width_unpack($format_string,$data) {
  $r = array();
  for ($i = 0, $j = count($data); $i < $j; $i++) {
    $r[$i] = unpack($format_string,$data[$i]);
  }
  return $r;
}

$book_array = pc_fixed_width_unpack('A25title/A14author/A4publication_year',
                                    $books);

Because the A format to unpack( ) means "space padded string," there's no need to rtrim( ) off the trailing spaces.

Once the fields have been parsed into $book_array by either function, the data can be printed as an HTML table, for example:

$book_array = pc_fixed_width_unpack('A25title/A14author/A4publication_year',
                                    $books);
print "<table>\n";
// print a header row
print '<tr><td>';
print join('</td><td>',array_keys($book_array[0]));
print "</td></tr>\n";
// print each data row
foreach ($book_array as $row) {
    print '<tr><td>';
    print join('</td><td>',array_values($row));
    print "</td></tr>\n";
}
print '</table>\n';

Joining data on </td><td> produces a table row that is missing its first <td> and last </td>. We produce a complete table row by printing out <tr><td> before the joined data and </td></tr> after the joined data.

Both substr( ) and unpack( ) have equivalent capabilities when the fixed-width fields are strings, but unpack( ) is the better solution when the elements of the fields aren't just strings.



Library Navigation Links

Copyright © 2003 O'Reilly & Associates. All rights reserved.