4.1 Lexical Structure
The lexical structure of a
programming language is the set of basic rules that govern how you
write programs in that language. It is the lowest-level syntax of the
language and specifies such things as what variable names look like
and what characters are used for comments. Each Python source file,
like any other text file, is a sequence of characters. You can also
usefully see it as a sequence of lines, tokens, or statements. These
different syntactic views complement and reinforce each other. Python
is very particular about program layout, especially with regard to
lines and indentation, so you'll want to pay
attention to this information if you are coming to Python from
another language.
4.1.1 Lines and Indentation
A Python program is composed of a sequence of
logical lines, each made up
of one or more physical
lines. Each physical line may end with a
comment. A pound sign (#) that is not inside a
string literal begins a comment. All characters after the
# and up to the physical line end are part of the
comment, and the Python interpreter ignores them. A line containing
only whitespace, possibly with a comment, is called a
blank line, and is ignored
by the interpreter. In an interactive interpreter session, you must
enter an empty physical line (without any whitespace or comment) to
terminate a multiline statement.
In Python, the end of a physical line marks the end of most
statements. Unlike in many other languages, Python statements are not
normally terminated with a delimiter, such as a semicolon
(;). When a statement is too long to fit on a
single physical line, you can join two adjacent physical lines into a
logical line by ensuring that the first physical line has no comment
and ends with a backslash (\). Python also joins
adjacent physical lines into one logical line if an open parenthesis
((), bracket ([), or brace
({) has not yet been closed. Triple-quoted string
literals can also span physical lines. Physical lines after the first
one in a logical line are known as continuation
lines. The indentation issues covered next do
not apply to continuation lines, but only to the first physical line
of each logical line.
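The line-joining rules above can be sketched as follows. This is only an illustration; the variable names (total, values, text) are made up for the example.

```python
# A comment runs from '#' to the end of the physical line.
total = 1 + 2 + \
        3 + 4          # a backslash joined this line to the previous one

# An open parenthesis, bracket, or brace also joins lines implicitly.
values = [
    10, 20,            # comments are allowed on these continuation lines
    30,
]

# A triple-quoted string literal may span physical lines as well.
text = """first
second"""
```

Note that the indentation of the continuation lines is arbitrary here: only the first physical line of each logical line is subject to the indentation rules.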
Python uses indentation to
express the block structure of a program. Unlike many other languages,
Python does not use braces or begin/end delimiters around blocks of
statements: indentation is the only way to indicate such blocks. Each
logical line in a Python program is indented by the whitespace on its
left. A block is a contiguous sequence of logical lines, all indented
by the same amount; the block is ended by a logical line with less
indentation. All statements in a block must have the same
indentation, as must all clauses in a compound statement. Standard
Python style is to use four spaces per indentation level. The first
statement in a source file must have no indentation (i.e., it must
not begin with any whitespace). Additionally, statements typed at the
interactive interpreter prompt >>>
(covered in Chapter 3) must have no indentation.
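As a minimal sketch of how indentation alone delimits blocks, consider the following function. The name classify and its labels are hypothetical, chosen only for this example.

```python
def classify(n):
    # The function body is one block, indented four spaces.
    if n < 0:
        # A nested block is indented one more level.
        label = "negative"
    else:
        label = "non-negative"
    # Back at the function-body level: the if/else blocks have ended.
    return label

# No indentation: this logical line ends the function body.
result = classify(-3)
```

The if and else clauses belong to the same compound statement, so their headers sit at the same indentation.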
A tab is logically replaced by up to 8 spaces, so that the next
character after the tab falls into logical column 9, 17, 25, etc.
Don't mix spaces and tabs for indentation, since
different tools (e.g., editors, email systems, printers) treat tabs
differently. The -t and -tt
options to the Python interpreter (covered in Chapter 3) check for inconsistent tab and space
usage in Python source code: -t issues warnings, and -tt raises errors. You can configure any good editor to
expand tabs to spaces so that all Python source code you write
contains only spaces, not tabs. You then know that all tools,
including Python itself, are going to be consistent in handling the
crucial matter of indentation in your source files.
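The tab-stop arithmetic described above can be checked with the string method expandtabs, which applies the same every-8-columns rule. This is just a way to model the rule, not a claim about how the interpreter is implemented; the helper name column_after_tab is an assumption of this sketch.

```python
def column_after_tab(prefix):
    """Return the 1-based column of the character following a tab after prefix."""
    expanded = (prefix + "\t").expandtabs(8)   # pad with spaces to the next tab stop
    return len(expanded) + 1

at_line_start = column_after_tab("")          # tab becomes 8 spaces -> column 9
after_seven = column_after_tab("a" * 7)       # tab becomes 1 space  -> column 9
after_eight = column_after_tab(" " * 8)       # tab becomes 8 spaces -> column 17
```

The second case shows why mixing tabs and spaces is fragile: seven characters plus a tab and eight spaces both reach column 9, yet they look different in a tool that uses another tab width.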
4.1.2 Tokens
Python
breaks each logical line into a sequence of elementary lexical
components, called tokens. Each token
corresponds to a substring of the logical line. The normal token
types are identifiers,
keywords, operators,
delimiters, and literals,
as covered in the following sections. Whitespace may be freely used
between tokens to separate them. Some whitespace separation is needed
between logically adjacent identifiers or keywords; otherwise, they
would be parsed as a single, longer identifier. For example,
printx is a single identifier—to write the
keyword print followed by identifier
x, you need to insert some whitespace (e.g., print x).
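To see how whitespace affects tokenization, you can ask the standard library's tokenize module to split a line into tokens. The sketch below is written for a Python 3 interpreter, and the helper token_strings is a hypothetical name introduced only for this illustration.

```python
import io
import tokenize

def token_strings(source):
    """Return the text of each name, operator, number, or string token."""
    wanted = (tokenize.NAME, tokenize.OP, tokenize.NUMBER, tokenize.STRING)
    readline = io.StringIO(source).readline
    return [tok[1] for tok in tokenize.generate_tokens(readline)
            if tok[0] in wanted]

joined = token_strings("printx\n")        # ['printx']
separated = token_strings("print x\n")    # ['print', 'x']
spaced = token_strings("x = y + 1\n")
unspaced = token_strings("x=y+1\n")       # same tokens as spaced
```

Whitespace is optional around operators and delimiters, but required between two adjacent names.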
4.1.2.1 Identifiers
An
identifier is a name used to identify a
variable, function, class, module, or other object. An identifier
starts with a letter (A to Z or
a to z) or underscore
(_) followed by zero or more letters, underscores,
and digits (0 to 9). Case is
significant in Python: lowercase and uppercase letters are distinct.
Punctuation characters such as @,
$, and % are not allowed in
identifiers.
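The identifier rule just stated maps directly onto a regular expression. The following is a sketch of that rule only: the names IDENTIFIER and looks_like_identifier are made up here, and the pattern does not exclude keywords, which match it but are reserved.

```python
import re

# A letter or underscore first, then any mix of letters, underscores, digits.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_0-9]*\Z")

def looks_like_identifier(text):
    """Return True if text has the lexical shape of an identifier."""
    return IDENTIFIER.match(text) is not None

checks = {
    "spam": looks_like_identifier("spam"),      # True
    "_x1": looks_like_identifier("_x1"),        # True
    "1x": looks_like_identifier("1x"),          # False: starts with a digit
    "a$b": looks_like_identifier("a$b"),        # False: $ is not allowed
}
```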
Normal Python style is to start class names with an uppercase letter
and other identifiers with a lowercase letter. Starting an identifier
with a single leading underscore indicates by convention that the
identifier is meant to be private. Starting an identifier with two
leading underscores indicates a strongly private identifier; if the
identifier also ends with two trailing underscores, the identifier is
a language-defined special name. The identifier _
(a single underscore) is special in interactive interpreter sessions:
the interpreter binds _ to the result of the last
expression statement evaluated interactively, if any.
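The naming conventions can be illustrated with a small class. The class Account and its attributes are hypothetical. The sketch also relies on one behavior the text above does not spell out: inside a class body, Python rewrites ("mangles") a name with two leading underscores by prefixing the class name, which is what makes such names strongly private.

```python
class Account:
    def __init__(self):              # __init__ is a language-defined special name
        self.owner = "pat"           # ordinary public attribute
        self._balance = 0            # single underscore: private by convention only
        self.__pin = 1234            # two leading underscores: the name is mangled

acct = Account()
attribute_names = sorted(vars(acct))
# attribute_names is ['_Account__pin', '_balance', 'owner']
```

Nothing stops outside code from reading acct._balance; the single underscore is a signal to other programmers, not an enforcement mechanism.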
4.1.2.2 Keywords
Python has
28 keywords (29 in Python 2.3 and later), which are identifiers that
Python reserves for special syntactic uses. Keywords are composed of
lowercase letters only. You cannot use keywords as regular
identifiers. Some keywords begin simple statements or clauses of
compound statements, while other keywords are used as operators. All
the keywords are covered in detail in this book, either later in this
chapter or in Chapter 5, Chapter 6, or Chapter 7. The keywords
in Python are:
and        del        for        is         raise
assert     elif       from       lambda     return
break      else       global     not        try
class      except     if         or         while
continue   exec       import     pass       yield
def        finally    in         print
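You can check at runtime whether a name is reserved. The sketch below uses the standard keyword module and the built-in compile function; the helper can_assign_to is a hypothetical name for this example. The exact set of keywords depends on the interpreter version (print and exec, for instance, are keywords only in the Python 2 releases this chapter describes), so the example sticks to words that are reserved in every version.

```python
import keyword

def can_assign_to(name):
    """Return True if `name = 1` compiles, i.e. name is usable as an identifier."""
    try:
        compile(name + " = 1", "<example>", "exec")
    except SyntaxError:
        return False
    return True

reserved = keyword.iskeyword("while")        # True
not_reserved = keyword.iskeyword("While")    # False: keywords are all lowercase
```

Because case is significant, While is an ordinary identifier even though while is a keyword.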
4.1.2.3 Operators
Python uses non-alphanumeric characters and
character combinations as operators. Python recognizes the following
operators, which are covered in detail later in this chapter:
+    -    *    /    %    **   //
<<   >>   &    |    ^    ~
<    <=   >    >=   <>   !=   ==
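As a quick preview, here are a few of these operators applied to small integers. The variable names are arbitrary, and the <> spelling of inequality is left out because only the older interpreters described in this chapter accept it.

```python
quotient = 7 // 2        # truncating (floor) division -> 3
remainder = 7 % 2        # remainder -> 1
power = 2 ** 10          # exponentiation -> 1024
shifted = 1 << 4         # left shift -> 16
masked = 6 & 3           # bitwise AND -> 2
combined = 6 | 3         # bitwise OR -> 7
toggled = 6 ^ 3          # bitwise XOR -> 5
inverted = ~5            # bitwise NOT -> -6
different = 1 != 2       # inequality -> True
```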
4.1.2.4 Delimiters
Python
uses the following symbols and symbol combinations as delimiters in
expressions, lists, dictionaries, various aspects of statements, and
strings, among other purposes:
(    )    [    ]    {    }
,    :    .    `    =    ;
+=   -=   *=   /=   //=  %=
&=   |=   ^=   >>=  <<=  **=
The period (.) can also appear in floating-point
and imaginary literals. A sequence of three periods
(...) has a special meaning in slices. The last
two rows of the table list the augmented assignment operators, which
serve lexically as delimiters but also perform an operation.
I'll discuss the syntax for the various delimiters
when I introduce the objects or statements with which they are used.
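For instance, here is a short sketch of a few augmented assignment operators at work on an integer. The variable name count is arbitrary.

```python
count = 10
count += 5      # add in place           -> 15
count //= 2     # floor-divide in place  -> 7
count **= 2     # raise to a power       -> 49
count %= 10     # remainder in place     -> 9
```

Each line both performs the operation and rebinds count to the result.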
The following characters have special meanings as part of other
tokens:
'    "    #    \
The characters @, $, and
?, all control characters except whitespace, and
all characters with ISO codes above 126 (i.e., non-ASCII characters,
such as accented letters), can never be part of the text of a Python
program except in comments or string literals.
4.1.2.5 Literals
A
literal is a data value that appears directly in
a program. The following are all literals in Python:
42 # Integer literal
3.14 # Floating-point literal
1.0J # Imaginary literal
'hello' # String literal
"world" # Another string literal
"""Good
night""" # Triple-quoted string literal
Using literals and delimiters, you can create data values of other
types:
[ 42, 3.14, 'hello' ] # List
( 100, 200, 300 ) # Tuple
{ 'x':42, 'y':3.14 } # Dictionary
The syntax for literals and other data values is covered in detail
later in this chapter, when we discuss the various data types
supported by Python.
4.1.3 Statements
You can
consider a Python source file as a sequence of simple and compound
statements. Unlike many other languages, Python has no declarations or
other top-level syntax elements.
4.1.3.1 Simple statements
A simple statement is one
that contains no other statements. A simple statement lies entirely
within a logical line. As in other languages, you may place more than
one simple statement on a single logical line, with a semicolon
(;) as the separator. However, one statement per
line is the usual Python style, as it makes programs more
readable.
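The two layouts compare as follows; the names a, b, and c are placeholders.

```python
a = 1; b = 2; c = a + b     # three simple statements on one logical line

# The same statements in the usual style, one per line:
a = 1
b = 2
c = a + b
```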
Any expression can stand on its own as a simple statement;
we'll discuss expressions in detail later in this
chapter. The interactive interpreter shows the result of an
expression statement entered at the prompt
(>>>), and also binds the result to a
variable named _. Apart from interactive sessions,
expression statements are useful only to call functions (and other
callables) that have side effects (e.g., that perform output or
change global variables).
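A brief sketch of expression statements in a source file, using a made-up list named items:

```python
items = [3, 1, 2]
items.sort()      # expression statement: a call made for its side effect
len(items)        # legal, but outside an interactive session the value is discarded
```

The call to sort is useful because it changes items in place. The call to len computes 3 and then throws it away.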
An
assignment is a simple statement that assigns a
value to a variable, as we'll discuss later in this
chapter. Unlike in some other languages, an assignment in Python is a
statement, and therefore can never be part of an
expression.
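One way to confirm this is to hand the built-in compile function a piece of source text and see whether it raises SyntaxError. The helper compiles below is a hypothetical name used only in this sketch.

```python
def compiles(source):
    """Return True if source is syntactically valid Python."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True

as_statement = compiles("x = 1")                  # True
inside_expression = compiles("y = (x = 1) + 2")   # False: = cannot appear in an expression
```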
4.1.3.2 Compound statements
A
compound statement contains
other statements and controls their execution. A compound statement
has one or more clauses, aligned at the same
indentation. Each clause has a header that
starts with a keyword and ends with a colon (:),
followed by a body, which is a sequence of one
or more statements. When the body contains multiple statements, also
known as a block, these statements should be
placed on separate logical lines after the header line and indented
rightward from the header line. The block terminates when the
indentation returns to that of the clause header (or further left
from there). Alternatively, the body can be a single simple
statement, following the : on the same logical
line as the header. The body may also be several simple statements on
the same line with semicolons between them, but as
I've already indicated, this is not good Python
style.
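The pieces of a compound statement can be seen in the following sketch. The function sign and its return format are invented for the example.

```python
def sign(n):
    if n > 0:                 # clause header: keyword first, colon last
        label = "positive"    # body as a block: indented, one statement per line
        symbol = "+"
    elif n < 0:               # second clause, aligned with the first
        label = "negative"
        symbol = "-"
    else: label = "zero"; symbol = "0"   # body on the header line: legal, not good style
    return symbol + " " + label
```

For example, sign(5) returns '+ positive' and sign(0) returns '0 zero'. The def statement is itself a compound statement whose single clause has the whole if statement and the return statement as its body.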