10.3 File Objects

As discussed earlier in this chapter, file is a built-in type in Python. With a file object, you can read and/or write data to a file as seen by the underlying operating system. Python reacts to any I/O error related to a file object by raising an instance of built-in exception class IOError. Errors that cause this exception include open failing to open or create a file, calling a method on a file object to which that method doesn't apply (e.g., calling write on a read-only file object or calling seek on a non-seekable file), and I/O errors diagnosed by a file object's methods. This section documents file objects, as well as some auxiliary modules that help you access and deal with their contents.
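As a minimal sketch of the error behavior just described, the following code tries to open a file that does not exist and catches the resulting exception. The helper name try_open and the temporary directory are hypothetical, chosen only for illustration. The code is written so that it also runs on current interpreters, where IOError is still available as a name.

```python
import os
import shutil
import tempfile

def try_open(path):
    """Return (file_object, None) on success, or (None, error) on IOError."""
    try:
        return open(path, 'r'), None
    except IOError as err:
        return None, err

# A path inside a fresh temporary directory, so it cannot already exist.
tmpdir = tempfile.mkdtemp()
missing = os.path.join(tmpdir, 'no_such_file.txt')

f, err = try_open(missing)   # mode 'r' requires an existing file
shutil.rmtree(tmpdir)
```

After this runs, f is None and err holds the exception instance that open raised.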

10.3.1 Creating a File Object with open

You normally create a Python file object with the built-in open, which has the following syntax:

open(filename,mode='r',bufsize=-1)

open opens the file named by filename, which must be a string that denotes a path to a file. open returns a Python file object, which is an instance of the built-in type file. Calling file is just like calling open, although file was first introduced in Python 2.2. If you explicitly pass a mode string, open can also create filename if the file does not already exist (depending on the value of mode, as we'll discuss in a moment). In other words, despite its name, open is not limited to opening existing files, but is also able to create new ones if needed.

10.3.1.1 File mode

mode is a string that denotes how the file is to be opened (or created). mode can have the following values:

'r': The file must already exist, and it is opened in read-only mode.
'w': The file is opened in write-only mode. The file is truncated and overwritten if it already exists, or created if it does not exist.
'a': The file is opened in write-only mode. The file is kept intact if it already exists, and the data you write is appended to what's already in the file. The file is created if it does not exist. Calling f.seek is harmless, but has no effect on where data is written.
'r+': The file must already exist and is opened for both reading and writing, so all methods of f can be called.
'w+': The file is opened for both reading and writing, so all methods of f can be called. The file is truncated and overwritten if it already exists, or created if it does not exist.
'a+': The file is opened for both reading and writing, so all methods of f can be called. The file is kept intact if it already exists, and the data you write is appended to what's already in the file. The file is created if it does not exist. Calling f.seek has no effect if the next I/O operation on f writes data, but works normally if the next I/O operation on f reads data.
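The following sketch walks through a few of these modes on a scratch file. The file comes from tempfile.mkstemp and the variable names are hypothetical; the point is only to show truncation with 'w', appending with 'a', and reading an existing file with 'r+'.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)               # only the path is needed; reopen it below

f = open(path, 'w')        # truncate (or create), write-only
f.write('first\n')
f.close()

f = open(path, 'a')        # keep existing data, append new writes
f.write('second\n')
f.close()

f = open(path, 'r')        # read-only; the file must exist
after_append = f.read()
f.close()

f = open(path, 'r+')       # must exist; reading and writing both allowed
head = f.readline()
f.close()

f = open(path, 'w')        # reopening with 'w' discards the old contents
f.close()
size_after_w = os.path.getsize(path)

os.remove(path)
```

Here after_append is 'first\nsecond\n', head is 'first\n', and size_after_w is 0, since opening with 'w' truncated the file again.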

10.3.1.2 Binary and text modes

The mode string may also have any of the values just explained followed by a b or t. b denotes binary mode, while t denotes text mode. When the mode string has neither b nor t, the default is text mode (i.e., 'r' is like 'rt', 'w' is like 'wt', and so on).

On Unix, there is no difference between binary and text modes. On other platforms, when a file is open in text mode, '\n' is returned each time the string that is the value of os.linesep (the line termination string) is encountered while reading the file. Conversely, a copy of os.linesep is written each time you write '\n' to the file.

This widespread convention, originally developed in the C language, lets you read and write text files on any platform, without worrying about the platform's line-separation conventions. However, except on Unix platforms, you do have to know (and tell Python, by passing the proper mode argument to open) whether a file is binary or text. In this chapter, for simplicity, I use \n to refer to the line termination string, but remember that the string is in fact os.linesep in files on the filesystem, translated to and from \n in memory only for files opened in text mode.
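To make the translation concrete, here is a small sketch that writes one line in text mode, then reads the same file back in binary mode and in text mode. The scratch file and variable names are hypothetical. The expected on-disk bytes are computed from os.linesep rather than hard-coded, since they differ by platform.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, 'w')     # text mode: '\n' is translated on the way out
f.write('line\n')
f.close()

f = open(path, 'rb')    # binary mode: bytes exactly as stored on disk
raw = f.read()
f.close()

f = open(path, 'r')     # text mode: the terminator comes back as '\n'
text = f.read()
f.close()

os.remove(path)

# What the file should hold on this platform.
expected_raw = ('line' + os.linesep).encode('ascii')
```

On Unix, raw and text hold the same characters; on platforms where os.linesep is not '\n', only raw shows the platform's line termination string.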

Python 2.3 will introduce a new concept, known as universal newlines, letting you open a text file for reading in mode 'U' when you don't know how line separators are encoded in the file. This is useful, for example, when you share files across a network between machines with different operating systems. Mode 'U' works out which line separator string is in use from each file's contents. However, mode 'U' is not available in Python 2.2 and earlier.

10.3.1.3 Buffering

bufsize is an integer that denotes what buffering you request for the file. When bufsize is less than 0, the operating system's default is used. Normally, this default is line buffering for files that correspond to interactive consoles, and some reasonably sized buffer, such as 8192 bytes, for other files. When bufsize equals 0, the file is unbuffered; the effect is as if the file's buffer were flushed every time you write anything to the file. When bufsize equals 1, the file is line-buffered, which means the file's buffer is flushed every time you write \n to the file. When bufsize is greater than 1, the file uses a buffer of about bufsize bytes, rounded up to some reasonable amount. On some platforms, you can change the buffering for files that are already open, but there is no cross-platform way to do this.
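The effect of bufsize can be observed by checking the file's size on disk before and after a flush. This is only a sketch: the scratch file and variable names are hypothetical, binary mode is used so the byte counts are exact, and what the size is before the flush depends on the platform and the buffer actually chosen.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, 'wb', 8192)                  # ask for a buffer of about 8192 bytes
f.write(b'hello')
size_before_flush = os.path.getsize(path)   # typically 0: data still buffered
f.flush()
size_after_flush = os.path.getsize(path)    # the 5 bytes have reached the system
f.close()

f = open(path, 'wb', 0)                     # unbuffered
f.write(b'hi')
size_unbuffered = os.path.getsize(path)     # visible without any flush
f.close()

os.remove(path)
```

With the unbuffered file, the two bytes are visible to the operating system as soon as write returns, with no call to flush.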

10.3.1.4 Sequential and non-sequential access

A file object f is inherently sequential (i.e., a stream of bytes). When you read from a file, you get bytes in the sequential order in which the bytes are present in the file. When you write to a file, the bytes you write are put in the file in the sequential order in which you write them.

To allow non-sequential access, the built-in file object keeps track of its current position (i.e., the position on the underlying file where the next read or write operation will start transferring data). When you open a file, the file's initial current position is at the start of the file. Any call to f.write on a file object f opened with a mode of 'a' or 'a+' always sets f's current position to the end of the file before writing data to f. Whenever you read or write some number n of bytes on file object f, f's current position advances by n. You can query the current position by calling f.tell, and change the current position by calling f.seek, both covered in the next section.
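A short sketch of how the current position behaves, using a hypothetical ten-byte scratch file opened in binary mode so that positions are exact byte offsets:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, 'wb')
f.write(b'0123456789')
f.close()

f = open(path, 'rb')
start = f.tell()          # a newly opened file starts at position 0
chunk = f.read(4)         # reading 4 bytes advances the position by 4
after_read = f.tell()
f.close()

f = open(path, 'ab')      # append mode: every write goes to the end
f.seek(0)                 # does not change where the write lands
f.write(b'XY')
f.close()

f = open(path, 'rb')
contents = f.read()
f.close()
os.remove(path)
```

The position moves from 0 to 4 after the read, and the bytes written in append mode end up after the original ten bytes despite the call to f.seek(0).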

10.3.2 Attributes and Methods of File Objects

A file object f supplies the attributes and methods documented in this section.

close

f.close(  )

Closes the file. You can call no other method on f after f.close. Multiple calls to f.close are allowed and innocuous.

closed

f.closed is a read-only attribute that is True if f.close( ) has been called, otherwise False.

flush

f.flush(  )

Requests that f's buffer be written out to the operating system, ensuring that the file as seen by the system has exactly the contents that Python's code has written to f. Depending on the platform and on the nature of f's underlying file, f.flush may or may not be able to ensure the desired effect.

isatty

f.isatty(  )

Returns True if f's file is an interactive terminal, otherwise False.

fileno

f.fileno(  )

Returns an integer, the file descriptor of f's file at operating system level. File descriptors were covered in Section 10.2.8 earlier in this chapter.

mode

f.mode is a read-only attribute that is the value of the mode string used in the open call that created f.

name

f.name is a read-only attribute that is the value of the filename string used in the open call that created f.
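The closed, mode, and name attributes can be inspected as in the following sketch. The scratch file and variable names are hypothetical. The exception raised by a method call on a closed file is not specified in this section, so the code accepts either ValueError or IOError.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, 'w')
name_matches = (f.name == path)   # name is the filename passed to open
mode_used = f.mode                # mode is the mode string passed to open
closed_before = f.closed
f.close()
f.close()                         # a second close is allowed and harmless
closed_after = f.closed

try:
    f.write('too late')           # no other method may be called after close
    write_failed = False
except (ValueError, IOError):
    write_failed = True

os.remove(path)
```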

read

f.read(size=-1)

Reads up to size bytes from f's file and returns them as a string. read reads and returns less than size bytes if the file ends before size bytes are read. When size is less than 0, read reads and returns all bytes up to the end of the file. read returns an empty string only if the file's current position is at the end of the file or if size equals 0.

readline

f.readline(size=-1)

Reads and returns one line from f's file, up to the end of line (\n) included. If size is greater than or equal to 0, readline reads no more than about size bytes. In this case, the returned string may not end with \n. \n may also be absent if readline reads up to the end of the file without finding \n. readline returns an empty string only if the file's current position is at the end of the file or if size equals 0.

readlines

f.readlines(size=-1)

Reads and returns a list of all lines in f's file, each a string ending in \n (except possibly the last one, if the file does not end with \n). If size>0, readlines stops and returns the list after collecting data for a total of about size bytes, rather than reading all the way to the end of the file.
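The three reading methods can be combined on the same file object, since each one picks up at the current position. The sketch below uses a hypothetical scratch file whose last line deliberately lacks a final \n.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, 'w')
f.write('alpha\nbeta\ngamma')   # note: no '\n' after the last line
f.close()

f = open(path, 'r')
first_three = f.read(3)         # up to 3 bytes
rest_of_line = f.readline()     # the remainder of the first line, '\n' included
remaining = f.readlines()       # all lines that are left, as a list
at_eof = f.read()               # nothing left to read
f.close()

os.remove(path)
```

The values are 'alp', 'ha\n', ['beta\n', 'gamma'], and the empty string, in that order.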

seek

f.seek(pos,how=0)

Sets f's current position to the signed integer byte offset pos from a reference point. how indicates the reference point: when how is 0, the reference is the start of the file; when it is 1, the reference is the current position; and when it is 2, the reference is the end of the file. When f is opened in text mode, the effects of f.seek may not be as expected, due to the implied translations between os.linesep and \n. This troublesome effect does not occur on Unix platforms, nor when f is opened in binary mode, nor when f.seek is called with a pos that is the result of a previous call to f.tell and how is 0. When f is opened in mode 'a' or 'a+', all data written to f is appended to the data that is already in f, regardless of calls to f.seek.
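The three values of how can be sketched as follows, on a hypothetical ten-byte scratch file. The file is opened in binary mode to sidestep the text-mode caveat just mentioned.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, 'wb')
f.write(b'abcdefghij')
f.close()

f = open(path, 'rb')
f.seek(3)                  # how=0: 3 bytes from the start of the file
from_start = f.read(1)     # reads the byte at offset 3; position is now 4
f.seek(2, 1)               # how=1: 2 bytes past the current position
from_current = f.read(1)   # reads the byte at offset 6
f.seek(-2, 2)              # how=2: 2 bytes before the end of the file
from_end = f.read(1)       # reads the byte at offset 8
final_pos = f.tell()
f.close()

os.remove(path)
```

The three reads return b'd', b'g', and b'i', and the final position is 9.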

softspace

f.softspace is a read-write attribute that is used internally by the print statement to keep track of its own state. A file object does not alter nor interpret softspace in any way: it just lets the attribute be freely read and written, and print takes care of the rest.

tell

f.tell(  )

Returns f's current position, an integer offset in bytes from the start of the file.

truncate

f.truncate([size])

Truncates f's file. When size is present, truncates the file to be at most size bytes. When size is absent, uses f.tell( ) as the file's new size.
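Both forms of truncate are shown in this sketch, again on a hypothetical ten-byte scratch file opened in binary update mode:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, 'wb')
f.write(b'0123456789')
f.close()

f = open(path, 'r+b')
f.truncate(4)              # explicit size: keep at most 4 bytes
f.close()
size_explicit = os.path.getsize(path)

f = open(path, 'r+b')
f.seek(2)
f.truncate()               # no size: cut the file at the current position
f.close()
size_at_position = os.path.getsize(path)

os.remove(path)
```

The file shrinks to 4 bytes after the first call and to 2 bytes after the second.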

write

f.write(str)

Writes the bytes of string str to the file.

writelines

f.writelines(lst)

Like:

for line in lst: f.write(line)

It does not matter whether the strings in sequence lst are lines: despite its name, method writelines just writes the strings to the file, one after another, without alterations or additions.
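A quick sketch confirms that writelines adds nothing between the strings. The scratch file and the strings themselves are hypothetical.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

pieces = ['no newline here, ', 'still the same line\n', 'last']

f = open(path, 'w')
f.writelines(pieces)       # strings are written back to back, unchanged
f.close()

f = open(path, 'r')
written = f.read()
f.close()

os.remove(path)
```

The file's contents equal ''.join(pieces): if you want each string on its own line, you must supply the \n characters yourself.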

xreadlines

f.xreadlines(  )

Like xreadlines.xreadlines(f), as covered in Section 10.4.4 later in this chapter. Method xreadlines will be deprecated in Python 2.3.

10.3.3 Iteration on File Objects

A file object f open for text-mode reading supports iteration. In other words, iter(f) returns an iterator whose items are the file's lines, so that the loop:

for line in f:

iterates on each line of the file. Interrupting such a loop prematurely (e.g., with break) leaves the file's current position with an arbitrary value. Calling methods that modify f's state, such as f.seek, during such a loop has an undefined effect. On the plus side, such a loop has very good performance, since these specifications allow the loop to use internal buffering to minimize I/O. Iteration on file objects is available only in Python 2.2 and later.
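For instance, the following sketch loops over a hypothetical three-line scratch file and records the length of each line without its terminating \n. The loop runs to completion, so the caveat about interrupting the loop does not arise.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, 'w')
f.write('red\ngreen\nblue\n')
f.close()

lengths = []
f = open(path, 'r')
for line in f:                          # each item is one line, '\n' included
    lengths.append(len(line.rstrip('\n')))
f.close()

os.remove(path)
```

After the loop, lengths is [3, 5, 4].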

10.3.4 File-Like Objects and Polymorphism

An object x is file-like when it behaves polymorphically to a file, meaning that a function (or some other subset of a program) can use x as if x were a file. Code that uses such an object (known as client code of that object) typically receives the object as an argument or obtains it by calling a factory function that returns the object as the result. If the only method that a client-code function calls on x is x.read( ), without arguments, all that x needs to supply in order to be file-like for that function is a method read that is callable without arguments and returns a string. Other client-code functions, however, may need x to implement a broader subset of file object methods. Thus, file-like objects and polymorphism are not absolute concepts, but are instead relative to demands placed upon an object by client code.
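The x.read( ) case can be sketched as follows. The class FixedText and the function count_words are hypothetical names invented for this illustration. count_words is the client code, and it accepts either a real file object or any object that supplies a suitable read method.

```python
import os
import tempfile

class FixedText:
    """Hypothetical file-like object: supplies only read() with no arguments."""
    def __init__(self, text):
        self._text = text

    def read(self):
        return self._text

def count_words(source):
    """Client code: needs nothing from source except a no-argument read()."""
    return len(source.read().split())

# Used with the minimal file-like object...
count_from_object = count_words(FixedText('to be or not to be'))

# ...and with a real file object holding the same text.
fd, path = tempfile.mkstemp()
os.close(fd)
f = open(path, 'w')
f.write('to be or not to be')
f.close()
f = open(path, 'r')
count_from_file = count_words(f)
f.close()
os.remove(path)
```

Both calls return 6. FixedText would not be file-like for client code that calls, say, readline, which is exactly the sense in which the concept is relative.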

Polymorphism is a powerful aspect of object-oriented programming, and file-like objects are an excellent example of polymorphism. A client-code module that writes to or reads from files can automatically be reused for data residing elsewhere, as long as the module does not break polymorphism by the dubious practice of type testing. When we discussed the built-ins type and isinstance in Chapter 8, I mentioned that type testing is often best avoided, since it blocks the normal polymorphism that Python otherwise supplies. Sometimes you may have no choice. For example, the marshal module, covered in Chapter 11, demands real file objects. Therefore, if your client code needs to use marshal, your code must also deal with real file objects, not just file-like ones. However, such situations are rare. Most often, supporting polymorphism in your client code takes nothing more than some care in avoiding type testing.

You can implement a file-like object by coding your own class, as covered in Chapter 5, and defining the specific methods needed by client code, such as read. A file-like object fl need not implement all the attributes and methods of a true file object f. If you can determine which methods client code calls on fl, you can choose to implement only that subset. For example, when fl is only meant to be written, fl doesn't need methods read, readline, and readlines.

When you implement a file-like object fl, make sure that fl.softspace can be read and written if you want fl to be usable by print. You need not alter nor interpret softspace in any way. Note that this behavior is the default when you write fl's class in Python. You need to take specific care only when fl's class overrides special methods __getattr__ and __setattr__ or otherwise controls access to its instances' attributes (e.g., by defining __slots__) as covered in Chapter 5. For example, if your class is a new-style class and defines __slots__, your class must have a slot named softspace, assuming you want instances of your class to be usable with the print statement.

If the main reason you want to use a file-like object instead of a real file object is to keep the data in memory, you can often make use of modules StringIO and cStringIO, covered later in this chapter. These modules supply file-like objects that hold data in memory while behaving polymorphically to file objects to a wide extent.
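As a rough sketch of that use, the code below passes an in-memory object to client code that only calls write. The function write_report is a hypothetical name. The import falls back to io.StringIO on interpreters where the StringIO module is absent, which is an assumption of this example rather than something covered in this chapter.

```python
try:
    from StringIO import StringIO   # the module named in the text
except ImportError:
    from io import StringIO         # assumed stand-in where StringIO is absent

def write_report(out, items):
    """Hypothetical client code: calls only out.write."""
    for item in items:
        out.write('%s\n' % item)

buf = StringIO()                    # file-like object that keeps data in memory
write_report(buf, ['spam', 'eggs'])
report = buf.getvalue()             # everything written so far, as one string
```

The same write_report function works unchanged when out is a real file object opened for writing.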