Appendix B. Internet and Web Protocols
In this appendix, we introduce the networking protocols and standards of the Internet. The first part gives a brief overview of the TCP/IP networking protocols and their basic principles. The second, larger part of this appendix is a discussion of HTTP.
The introduction is brief, and we don't attempt to cover these topics completely. Appendix E provides pointers to selected resources on the topics of the Internet and web protocols.
B.1. The Internet
The Internet had its beginnings in the late 1960s with the development of ARPAnet. A primary goal of ARPAnet was to provide a decentralized network of computing resources that did not rely on any one machine or system to operate; that is, no single point of failure could bring the network down. For a network to achieve this, the topology has to provide multiple paths between the computers connected to the network. Such a topology is shown in Figure B-1. Computers are connected to nodes in the network—or form nodes themselves—and so long as a path can be followed through the links between nodes, the computers can communicate.
Figure B-1. A network topology that provides multiple communication paths
Another feature of ARPAnet was the use of packet switching. Unlike telephone networks, where a dedicated circuit is established to carry the conversation between two parties, ARPAnet carried data between two communicating systems as a stream of packets, each sent as an individual transmission over the network. Sending a message as a stream of packets allows valuable network bandwidth—the amount of data that can be transmitted for a given period of time—to be shared between different communications.
Packet switching adds complexity. Breaking a message into small packets, deciding on the path along which to send each packet, and reassembling the message before presenting the data to the receiving computer system required the development of network protocols. One of the first protocols was the Network Control Protocol (NCP); it was replaced in 1982 by the Transmission Control Protocol (TCP) and the Internet Protocol (IP). The protocol suite is commonly known as TCP/IP.
Other networks using packet technologies were also being developed and, with the introduction of TCP/IP, interconnections between these networks became possible. Small office-based networks could be connected to main backbone networks such as ARPAnet or CSNET, the university-based Computer Science Network. These backbone networks were connected to similar networks in other countries over satellite links and submarine cables, and the Internet was born. The Internet isn't one single network: it is many interconnected networks.
B.1.1. An Analogy
Before we discuss the TCP and IP protocols further, we present a broader picture of how data is transmitted over the Internet by drawing an analogy to the service provided by a courier company.
Imagine that we want to send some hand-drawn illustrations from our office in Melbourne, Australia, to the O'Reilly & Associates, Inc. office in Cambridge, Massachusetts, U.S.A. We would put our drawings into an envelope addressed to our editor Lorrie at O'Reilly's Cambridge office, and a courier would carry the envelope back to the courier company's city office. At the courier's city office our envelope would be sorted from the locally bound envelopes and packed into an air freight bag for Los Angeles and then on to Boston. A similar process would happen, but in reverse, once the bag was unloaded from the plane in Boston. We don't know exactly where Cambridge is, so for all we know the envelope may then be put on another plane, a train, or a donkey. Eventually, our drawings arrive on Lorrie's desk. The details of the journey aren't important to us, because the courier company is providing a door-to-door service.
Our courier analogy demonstrates a message service over heterogeneous transport technologies. The details on the envelope are understood by all courier companies regardless of how they operate. At each point in the network of courier offices, someone reads the details and makes a decision about where the envelope should go next, and how. The Internet is a collection of interconnected networks together with a set of protocols that, like the addresses and serial numbers on the envelope, provide an end-to-end service over the heterogeneous transport technologies.
Our analogy fails to demonstrate one other network characteristic. The set of drawings makes up one message as far as we and our editor are concerned. If it were not for privacy expectations, our courier company could have opened the envelope, repackaged each sheet of paper into an individual envelope, and sent some by air via Sydney, some via Auckland, and even some by sea. No doubt these separate messages would not arrive at the courier's office at Cambridge in order (some might not arrive at all and would have to be sent again), but as long as there was information on each envelope that related them, the original message could be reassembled. The courier's Cambridge office would have to hold on to messages that arrived out of order, decide when to ask for missing envelopes to be resent, then reassemble them into the one envelope and deliver the original message as if nothing had happened. Of course, if a courier company did this, it would go out of business, but this is what happens when applications such as web browsers and servers send a message on the Internet.
B.1.2. TCP/IP
The Transmission Control Protocol and the Internet Protocol manage the sending and receiving of messages as packets over the Internet. The two protocols together provide a service to applications that use the Internet: communication through a network.
The World Wide Web is a network application that uses the services of TCP and IP to communicate over the Internet. When a web browser requests a page from a web server, the TCP/IP services provide a virtual connection (a virtual circuit) between the two communicating systems. Remember that packet-switched networks don't operate like telephone networks that create an actual circuit dedicated to a particular call.
Once a connection is established and acknowledged, the two systems can communicate by sending messages. These messages can be large, such as the binary representation of an image, and TCP may break the data into fragments that are carried in a series of IP datagrams. An IP datagram is equivalent to the courier's envelope in that it holds the fragment of the message along with the destination address and several other fields that manage its transmission through the network.
Each node in the network runs IP software, and IP moves the datagrams through the network, one node at a time. When an IP node receives a datagram, it inspects the address and other header fields, looks up a table of routing information, and sends it on to the next node. Often these nodes are dedicated routers—systems that form interconnections between networks—but the nodes can also include the computer systems on which the applications are running. IP datagrams are totally independent of each other as far as IP is concerned: the IP software just moves them from node to node through a network.
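The table lookup that a node performs can be sketched in a few lines of Python. This is only an illustration: the routing table entries and next-hop names below are invented, and the rule of preferring the most specific matching network (longest-prefix matching) is an assumption about typical IP software rather than something described above.

```python
import ipaddress

# Hypothetical routing table: (destination network, next hop).
# The networks and next-hop names are made up for illustration.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "gateway"),      # default route
    (ipaddress.ip_network("10.0.0.0/8"), "router-a"),
    (ipaddress.ip_network("10.1.0.0/16"), "router-b"),
]

def next_hop(destination):
    """Return the next hop of the most specific route that matches."""
    address = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in ROUTES if address in net]
    # The longest prefix is the most specific match; the default
    # route (prefix length 0) matches everything, so matches is never empty.
    net, hop = max(matches, key=lambda entry: entry[0].prefixlen)
    return hop

print(next_hop("10.1.2.3"))    # router-b
print(next_hop("10.2.0.1"))    # router-a
print(next_hop("192.0.2.1"))   # gateway
```

Each call stands for one node's decision about a single datagram; the node keeps no memory of earlier datagrams, which matches the independence described above.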
The maximum size of a datagram is primarily determined by the largest message that each part of the network along its path can carry; the most restrictive part sets the limit. Going back to our courier example: if Lorrie at O'Reilly wanted to send three dozen books to our office, a single package would be fine for air freight but would have to be broken up into smaller packages if the last leg of the journey was by bicycle.
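As a rough sketch of the bicycle problem, the following Python function splits a message so that every piece fits the most restrictive link on a path. The function name and the link limits are hypothetical, and the sketch ignores the header bytes that real datagrams carry.

```python
def split_message(message, link_limits):
    """Split message into pieces no larger than the smallest link limit."""
    if not link_limits or min(link_limits) <= 0:
        raise ValueError("link limits must be positive")
    size = min(link_limits)   # the most restrictive link sets the piece size
    return [message[i:i + size] for i in range(0, len(message), size)]

# Hypothetical path: two links that carry 1500 bytes and one that carries 576.
pieces = split_message(b"x" * 2500, [1500, 576, 1500])
print(len(pieces))                # 5
print(max(len(p) for p in pieces))  # 576
```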
TCP software performs the function of gluing the fragments together at the destination, using the sequence number that TCP places in the header of each fragment it sends. Because IP datagrams are transmitted through the network independently, there is no guarantee they will arrive at the destination in order, and TCP stores the fragments in a buffer until all preceding fragments are received.
IP doesn't guarantee that datagrams are delivered. If an IP node receives a corrupt datagram, it throws it away. Datagrams may be missing from the stream the TCP software receives because a datagram was corrupt and not passed on from the IP software or was delayed in the network. TCP buffers the fragments to allow the out-of-order datagrams to arrive. If a missing datagram fails to arrive, TCP eventually requests that it be resent. This can cause datagrams to be received twice; however, TCP recognizes and discards the duplicate datagram when it arrives.
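The buffering, reordering, and duplicate handling described above can be illustrated with a simplified Python sketch. It is not how TCP is actually implemented: the class name is hypothetical, and the sketch numbers whole pieces starting from zero, whereas real TCP numbers the bytes of the stream.

```python
class Reassembler:
    """Reorder numbered pieces and drop duplicates (simplified sketch)."""

    def __init__(self):
        self.next_expected = 0   # number of the next piece to deliver
        self.buffer = {}         # pieces that arrived early, keyed by number
        self.delivered = []      # pieces handed to the application, in order

    def receive(self, seq, data):
        # A piece already delivered or already buffered is a duplicate.
        if seq < self.next_expected or seq in self.buffer:
            return
        self.buffer[seq] = data
        # Deliver every piece that is now contiguous with what came before.
        while self.next_expected in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_expected))
            self.next_expected += 1

    def missing(self):
        """Numbers still awaited below the highest buffered piece."""
        if not self.buffer:
            return []
        return [n for n in range(self.next_expected, max(self.buffer))
                if n not in self.buffer]

r = Reassembler()
for seq, data in [(1, b"B"), (0, b"A"), (3, b"D"), (1, b"B")]:
    r.receive(seq, data)
print(r.delivered)   # [b'A', b'B']
print(r.missing())   # [2]
r.receive(2, b"C")
print(r.delivered)   # [b'A', b'B', b'C', b'D']
```

The `missing()` list corresponds to the pieces the receiver would eventually ask to have resent; deciding how long to wait before asking is left out of the sketch.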
B.1.2.1. IP addresses
To allow communication over heterogeneous networks, each with its own addressing standard, every location in a network needs a globally unique IP address. A computer that is connected to the Internet needs at least one IP address; a node that interconnects two networks needs two.
IP addresses are 32-bit numbers that are commonly represented as a series of four decimal numbers between 0 and 255, separated by periods. An example IP address is 22.214.171.124. Some IP addresses have special meanings; for example, addresses that begin with 127, such as 127.0.0.1, are reserved for loopback testing on a host. If a connection is to be made from a client to a server, both running on the same machine, the address 127.0.0.1 can be used. This address loops back to the local machine and is conventionally given the name localhost. The address 0.0.0.0 is used by IP to identify the default route out of a node.
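To make the relationship between the dotted notation and the underlying 32-bit number concrete, here is a small Python sketch. The helper `dotted_to_int` is our own illustrative name; the `ipaddress` module from the Python standard library is used only to cross-check the result and to test for a loopback address.

```python
import ipaddress

def dotted_to_int(dotted):
    """Pack four decimal fields (0-255 each) into one 32-bit number."""
    fields = [int(part) for part in dotted.split(".")]
    if len(fields) != 4 or any(not 0 <= f <= 255 for f in fields):
        raise ValueError(f"not a dotted-quad address: {dotted!r}")
    value = 0
    for field in fields:
        value = (value << 8) | field   # each field occupies 8 bits
    return value

print(dotted_to_int("127.0.0.1"))                        # 2130706433
print(int(ipaddress.IPv4Address("127.0.0.1")))           # 2130706433
print(ipaddress.ip_address("127.0.0.1").is_loopback)     # True
```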
A system's network file contains the links between network devices and IP addresses. The IP network information can usually be found in the file /etc/networks on a Linux system.
B.1.2.2. Ports
When a virtual connection is set up between two communicating systems, each end is tied to a port. The port is an identifier used by the TCP software rather than an actual physical device, and it allows multiple network connections to be made on one machine by different applications.
When a message is received by the TCP software running on a host computer, the data is sent to the correct application based on the port number. By convention, a well-known port is normally used by a server providing a well-known service. A list of well-known ports for various applications is maintained by the Internet Assigned Numbers Authority (IANA) and can be found at http://www.iana.org/assignments/port-numbers. For example, the File Transfer Protocol (FTP) uses port 21, and a web server uses port 80.
Systems with TCP/IP software installed have a services file that lists the ports used on that machine. This file is often preconfigured for well-known applications and is maintained by the system administrator to reflect the actual port usage on the machine. This file is usually /etc/services on a Linux system.
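A services file is plain text, with one service per line in the form name, port/protocol, optional aliases, and an optional comment introduced by #. The Python sketch below parses text in that layout. The sample entries are written for this illustration using the two ports mentioned above; they are not copied from any particular machine, and the function name is our own.

```python
# Sample text in the layout of a services file (illustrative entries only).
SAMPLE = """\
# service-name  port/protocol  [aliases]
ftp             21/tcp
http            80/tcp          www     # web server
"""

def parse_services(text):
    """Map (service name, protocol) to a port number."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        name, port_proto, *aliases = line.split()
        port, proto = port_proto.split("/")
        for service in (name, *aliases):
            table[(service, proto)] = int(port)
    return table

services = parse_services(SAMPLE)
print(services[("ftp", "tcp")])    # 21
print(services[("www", "tcp")])    # 80
```

On a real system, the text of /etc/services could be passed to the same function. Python's standard library also offers `socket.getservbyname("http", "tcp")`, which consults the system's own service database, so its answer depends on how the machine is configured.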
Copyright © 2003 O'Reilly & Associates. All rights reserved.