Recipe 1.5. Processing a String One Character at a Time (Perl Cookbook)

1.5. Processing a String One Character at a Time

Problem

You want to process a string one character at a time.

Solution

Use split with a null pattern to break up the string into individual characters, or use unpack if you just want their ASCII values:

@array = split(//, $string);

@array = unpack("C*", $string);

Or extract each character in turn with a loop:

    while (/(.)/g) { # . is never a newline here
        # do something with $1
    }

Discussion

As we said before, Perl's fundamental unit is the string, not the character. Needing to process anything a character at a time is rare. Usually some kind of higher-level Perl operation, like pattern matching, solves the problem more easily. See, for example, Recipe 7.7, where a set of substitutions is used to find command-line arguments.

Splitting on a pattern that matches the empty string returns a list of the individual characters in the string. This is a convenient feature when done intentionally, but it's easy to do unintentionally. For instance, /X*/ matches the empty string. Odds are you will find others when you don't mean to.
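A minimal sketch of the accidental case, using a made-up input string: the intent is to split on runs of X, but because /X*/ can also match the empty string, split breaks between every other character as well.

```perl
use strict;
use warnings;

# Hypothetical input; the goal is to split on runs of X.
my $string = "fooXXbarXbaz";

# /X+/ needs at least one X, so it splits only at the separators.
my @intended = split /X+/, $string;

# /X*/ can match the empty string, so split also breaks between
# every pair of adjacent non-X characters.
my @accidental = split /X*/, $string;

print join("|", @intended), "\n";
print join("|", @accidental), "\n";
```

Using + rather than * in a separator pattern is usually enough to avoid the surprise.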

Here's an example that prints the characters used in the string "an apple a day", sorted in ascending ASCII order:

%seen = ();
$string = "an apple a day";
foreach $byte (split //, $string) {
    $seen{$byte}++;
}
print "unique chars are: ", sort(keys %seen), "\n";




unique chars are:  adelnpy

These split and unpack solutions give you an array of characters to work with. If you don't want an array, you can use a pattern match with the /g flag in a while loop, extracting one character at a time:

%seen = ();
$string = "an apple a day";
while ($string =~ /(.)/g) {
    $seen{$1}++;
}
print "unique chars are: ", sort(keys %seen), "\n";




unique chars are:  adelnpy
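One caveat with the pattern-match loop is that . does not match a newline unless the /s modifier is given, so newlines are silently skipped. The sketch below, using a made-up two-line string, counts the characters seen with and without /s:

```perl
use strict;
use warnings;

my $string = "ab\ncd";    # hypothetical two-line input

# Without /s, . does not match the newline, so it is never visited.
my $without_s = 0;
$without_s++ while $string =~ /(.)/g;

# With /s, . matches any character, including the newline.
my $with_s = 0;
$with_s++ while $string =~ /(.)/gs;

print "without /s: $without_s, with /s: $with_s\n";
```

If the newlines matter to your processing, add /s; split(//, $string) includes them either way.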

In general, if you find yourself doing character-by-character processing, there's probably a better way to go about it. Instead of using index and substr or split and unpack, it might be easier to use a pattern. And instead of computing a 32-bit checksum by hand, as in the next example, you can let the unpack function compute it far more efficiently.

The following example calculates the checksum of $string with a foreach loop. There are better checksums; this just happens to be the basis of a traditional and computationally easy checksum. See the MD5 module from CPAN if you want a more sound checksum.

$sum = 0;
foreach $ascval (unpack("C*", $string)) {
    $sum += $ascval;
}
print "sum is $sum\n";
# prints "sum is 1248" if $string was "an apple a day"

This does the same thing, but much faster:

$sum = unpack("%32C*", $string);
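As a quick check, the following sketch computes the checksum of the sample string both ways and compares the results. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $string = "an apple a day";

# By hand: add up the value of every byte.
my $loop_sum = 0;
$loop_sum += $_ for unpack("C*", $string);

# Let unpack do the summing; %32 keeps the low 32 bits of the total.
my $unpack_sum = unpack("%32C*", $string);

print "loop: $loop_sum, unpack: $unpack_sum\n";
```

The two values agree as long as the total fits in 32 bits; beyond that, the %32 form wraps around while the loop keeps growing.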

This lets us emulate the SysV checksum program:

#!/usr/bin/perl
# sum - compute 16-bit checksum of all input files
$checksum = 0;
while (<>) { $checksum += unpack("%16C*", $_) }
$checksum %= (2 ** 16) - 1;
print "$checksum\n";

Here's an example of its use:

% perl sum /etc/termcap




1510

If you have the GNU version of sum, you'll need to call it with the --sysv option to get the same answer on the same file.

% sum --sysv /etc/termcap




1510 851 /etc/termcap

Another tiny program that processes its input one character at a time is slowcat, shown in Example 1.1. The idea here is to pause after each character is printed so you can scroll text before an audience slowly enough that they can read it.

Example 1.1: slowcat

#!/usr/bin/perl
# slowcat - emulate a   s l o w   line printer
# usage: slowcat [-DELAY] [files ...]
$DELAY = ($ARGV[0] =~ /^-([.\d]+)/) ? (shift, $1) : 1;
$| = 1;
while (<>) {
    for (split(//)) {
        print;
        select(undef,undef,undef, 0.005 * $DELAY);
    }
}
