Molecules (Programming Perl)

2.2. Molecules

Perl is a free-form language, but that doesn't mean that Perl is totally free of form. As computer folks usually use the term, a free-form language is one in which you can put spaces, tabs, and newlines anywhere you like--except where you can't.

One obvious place you can't put a whitespace character is in the middle of a token. A token is what we call a sequence of characters with a unit of meaning, much like a simple word in natural language. But unlike the typical word, a token might contain other characters besides letters, just as long as they hang together to form a unit of meaning. (In that sense, they're more like molecules, which don't have to be composed of only one particular kind of atom.) For example, numbers and mathematical operators are considered tokens. An identifier is a token that starts with a letter or underscore and contains only letters, digits, and underscores. A token may not contain whitespace characters because this would split the token into two tokens, just as a space in an English word turns it into two words.[2]

[2] The astute reader will point out that literal strings may contain whitespace characters. But strings can get away with it only because they have quotes on both ends to keep the spaces from leaking out.

Although whitespace is allowed between any two tokens, whitespace is required only between tokens that would otherwise be confused as a single token. All whitespace is equivalent for this purpose. Newlines are distinguished from spaces and tabs only within quoted strings, formats, and certain line-oriented forms of quoting. Specifically, newlines do not terminate statements as they do in certain other languages (such as FORTRAN or Python). Statements in Perl are terminated with semicolons, just as they are in C and its various derivatives.
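As a minimal sketch of what this means in practice, the following snippet writes the same arithmetic twice, once with no optional whitespace and once spread across several lines. The variable names and the add_one subroutine are made up for illustration.

```perl
use strict;
use warnings;

# Whitespace is optional wherever tokens cannot run together.
my $compact=1+2*3;

# Newlines are just more whitespace; only the semicolon ends the statement.
my $spread
    =
    1 +
    2 * 3
    ;

# Here whitespace is required: "subadd_one" would be read as one identifier.
sub add_one { return $_[0] + 1 }

print "$compact $spread ", add_one($spread), "\n";
```

Both assignments produce the same value, because the parser sees the same sequence of tokens either way.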

Unicode whitespace characters are allowed in a Unicode Perl program, but you need to be careful. If you use the special Unicode paragraph and line separators, be aware that Perl may count line numbers differently than your text editor does, so error messages may be more difficult to interpret. It's best to stick with good old-fashioned newlines.

Tokens are recognized greedily; if at a particular point the Perl parser has a choice between recognizing a short token or a long token, it will choose the long one. If you meant it to be two tokens, just insert some whitespace between the tokens. (We tend to put extra space around most operators anyway, just for readability.)
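A small sketch of greedy tokenizing, using a run of plus signs. The variable names are ours, chosen only for illustration. Without whitespace, the parser takes ++ as soon as it can, so the expression is read as a postincrement followed by an addition; adding a space moves the split.

```perl
use strict;
use warnings;

my ($i, $j) = (1, 10);

# No whitespace: the longest token wins, so this is read as ($i++) + $j.
my $greedy = $i+++$j;
my @after_greedy = ($i, $j);    # $i was incremented, $j was not

($i, $j) = (1, 10);

# Whitespace forces the other split: $i + (++$j).
my $spaced = $i + ++$j;
my @after_spaced = ($i, $j);    # $j was incremented, $i was not

print "greedy=$greedy spaced=$spaced\n";
```

The two expressions differ only in whitespace, yet they increment different variables and yield different sums.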

Comments are indicated by the # character and extend from there through the end of the line. A comment counts as whitespace for separating tokens. The Perl language attaches no special meaning to anything you might put into a comment.[3]

[3] Actually, that's a small fib. The Perl parser does look for command-line switches on an initial #! line (see Chapter 19, "The Command-Line Interface"). It can also interpret the line number directives that various preprocessors produce (see the section "Generating Perl in Other Languages" in Chapter 24, "Common Practices").
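To illustrate how comments behave as whitespace, here is a short sketch in which a single statement is interrupted by several comments. The variable names are hypothetical.

```perl
use strict;
use warnings;

my $total = 1      # first term
          + 2      # second term
          + 3;     # the statement ends at the semicolon, not at a newline

# Inside a quoted string, # is an ordinary character and starts no comment.
my $text = "a # inside quotes is just a character";

print "$total\n$text\n";
```

Each comment runs only to the end of its own line, so the statement picks up again on the next line as if nothing but whitespace had intervened.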

One other oddity is that if a line begins with = anywhere a statement would be legal, Perl ignores everything from that line down to the next line that begins with =cut. The ignored text is assumed to be pod, or "plain old documentation". The Perl distribution has programs that will extract pod commentary from Perl modules and turn it into flat text, manpages, LaTeX, HTML, or (someday soon) XML documents. In a complementary fashion, the Perl parser extracts the Perl code from Perl modules and ignores the pod. So you may consider this an alternate, multiline form of commenting. You may also consider it completely nuts, but Perl modules documented this way never lose track of their documentation. See Chapter 26, "Plain Old Documentation", for details on pod, including a description of how to effect multiline comments in Perl.
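Here is a minimal sketch of pod acting as a multiline comment. The =pod and =cut lines must begin in the first column; the variable name is ours, and the "statement" inside the pod is there only to show that it never runs.

```perl
use strict;
use warnings;

my $count = 1;

=pod

Everything from the line starting with "=" down to the "=cut" line is
skipped by the Perl parser, including this would-be statement:

    $count = 99;

=cut

$count++;
print "count is $count\n";
```

Because the parser resumes right after the =cut line, the assignment buried in the pod has no effect on $count.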

But don't look down on the normal comment character. There's something comforting about the visual effect of a nice row of # characters down the left side of a multiline comment. It immediately tells your eyes: "This is not code." You'll note that even in languages with a multiline commenting mechanism, such as C, people often put a row of * characters down the left side of their comments anyway. Appearances are often more important than they appear.

In Perl, just as in chemistry and in language, you can build larger and larger structures out of the smaller ones. We already mentioned the statement; it's just a sequence of tokens that make up a command, that is, a sentence in the imperative mood. You can combine a sequence of statements into a block that is delimited by braces (also known affectionately as "curlies" by people who confuse braces with suspenders). Blocks can in turn be combined into larger blocks. Some blocks function as subroutines, which can be combined into modules, which can be combined into programs. But we're getting ahead of ourselves--those are subjects for coming chapters. Let's build some more tokens out of characters.
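As a rough preview of that layering, the sketch below builds a statement, a block, a nested block, and a subroutine. All the names in it (including the shout subroutine) are invented for illustration; modules and programs are left for the later chapters.

```perl
use strict;
use warnings;

my $greeting = "hello";              # a statement

{                                    # a block groups statements
    my $name = "world";
    {                                # blocks nest inside larger blocks
        $greeting = "$greeting, $name";
    }
}

sub shout {                          # a block that functions as a subroutine
    my ($text) = @_;
    return uc($text) . "!";
}

print shout($greeting), "\n";
```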