Treating a Unicode String as Octets (Perl Cookbook, 2nd Edition)

1.10.3. Discussion

As explained in this chapter's Introduction, Perl knows about two types of string: those made of simple uninterpreted octets, and those made of Unicode characters whose UTF-8 representation may require more than one octet. Each individual string has a flag associated with it, identifying the string as either UTF-8 or octets. Perl's I/O and string operations (such as length) check this flag and give character or octet semantics accordingly.

Sometimes you need to work with bytes and not characters. For example, many protocols have a Content-Length header that specifies the size of the body of a message in octets. You can't simply use Perl's length function to calculate the size, because if the string you're calling length on is marked as UTF-8, you'll get the size in characters.

The use bytes pragma makes all Perl functions in its lexical scope use octet semantics for strings instead of character semantics. Under this pragma, length always returns the number of octets, and read always reports the number of octets read. However, because the use bytes pragma is lexically scoped, you can't use it to change the behavior of code in another scope (e.g., someone else's function).

For this you need to create an octet-encoded copy of the UTF-8 string. In memory, of course, the same byte sequence is used for both strings. The difference is that the copy of your UTF-8 string has the UTF-8 flag cleared. Functions acting on the octet copy will give octet semantics, regardless of the scope they're in.

There is also a no bytes pragma, which forces character semantics, and a decode_utf8 function, which turns octet-encoded strings into UTF-8 encoded strings. However, these functions are less useful because not all octet strings are valid UTF-8 strings, whereas all UTF-8 strings are valid octet strings.

1.10. Treating a Unicode String as Octets

1.10.1. Problem

1.10.2. Solution

1.10.3. Discussion

1.10.4. See Also