The Internet Protocol is the glue that holds together modern computer networks. IP specifies the way that messages are sent from computer to computer; it essentially defines a common "language" that is spoken by every computer stationed on the Internet.
This section describes IPv4, the fourth version of the Internet Protocol, which has been used on the Internet since 1982. As this book is going to press, work is continuing on IPv6, previously called "IP: The Next Generation," or IPng. (IPv5 was an experimental protocol that was never widely used.) We do not know when (or if) IPv6 will be widely used on the network.
As we said earlier, at a very abstract level the Internet is similar to the phone network. However, as we look more closely at the underlying protocols, we find that it is quite different. On the telephone network, each conversation is assigned a circuit (either a pair of wires or a channel on a multiplexed connection) that it uses for the duration of the telephone call. Whether you talk or not, the channel remains open until you hang up the phone.
On the Internet, the connections between computers are shared by all of the conversations. Data is sent in blocks of characters called datagrams, or more colloquially, packets. Each packet has a small block of bytes called the header, which identifies the packet's sender and intended destination. The header is followed by another, usually larger, block of data called the packet's contents. (See Figure 16.3.) After the packets reach their destination, they are often reassembled into a continuous stream of data; this fragmentation and reassembly process is usually invisible to the user. As there are often many different routes from one system to another, each packet may take a slightly different path from source to destination. Because the Internet switches packets, instead of circuits, it is called a packet-switching network.
We'll borrow an analogy from Vint Cerf, one of the original architects of the ARPANET: think of IP as sending a novel a page at a time, numbered and glued to the back of postcards. All the postcards from every user get thrown together and carried by the same trucks to their destinations, where they get sorted out. Sometimes, the postcards get delivered out of order. Sometimes, a postcard may not get delivered at all, but you can use the page numbers to request another copy. And, key for security, anyone in the postal service who handles the postcards can read the contents without the recipient or sender knowing about it.
There are three distinct ways to directly connect two computers together using IP:
IP is a scalable network protocol: it works as well with a small office network of ten workstations as it does with a university-sized network supporting a few hundred workstations, or with the national (and international) networks that support tens of thousands of computers. IP scales because it views these large networks merely as collections of smaller ones. Computers connected to a network are called hosts. Computers that are connected to two or more networks can be programmed to forward packets automatically from one network to another; today, these computers are called routers (originally they were called gateways). Routers use routing tables to determine where to send packets next.
Every interface that a computer has on an IP network is assigned a unique 32-bit address. These addresses are often expressed as a set of four 8-bit numbers, called octets. A sample address is 220.127.116.11. Think of an IP address as if it were a telephone number: if you know a computer's IP address, you can connect to it and exchange information.
Theoretically, the 32-bit IP address allows a maximum of 2^32 = 4,294,967,296 computers to be attached to the Internet at a given time. In practice, the total number of computers that can be connected is much less, because of the way that IP addresses are assigned. Organizations are usually assigned blocks of addresses, not all of which are used. This approach is similar to the method by which the phone company assigns area codes to a region. The approach has led to a problem with IP addresses similar to that faced by the telephone company: we're running out of numbers.
Here are some more sample Internet addresses:
18.104.22.168 22.214.171.124 126.96.36.199
IP addresses are typically abbreviated ii.jj.kk.ll, where the numbers ii, jj, kk, and ll are between 0 and 255. Each decimal number represents an 8-bit octet. Together, they represent the 32-bit IP address.
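The relationship between the four decimal octets and the single 32-bit address can be illustrated with a few lines of code. The following is a minimal sketch; the function names and the sample address 192.0.2.1 (taken from a range reserved for documentation) are our own choices for illustration.

```python
def dotted_quad_to_int(address):
    """Pack a dotted-quad string such as '192.0.2.1' into one 32-bit integer."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= octet <= 255 for octet in octets):
        raise ValueError("not a valid IPv4 address: %r" % (address,))
    value = 0
    for octet in octets:
        # Shift the bits gathered so far left by one octet, then append the next.
        value = (value << 8) | octet
    return value


def int_to_dotted_quad(value):
    """Unpack a 32-bit integer back into ii.jj.kk.ll notation."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


packed = dotted_quad_to_int("192.0.2.1")
print(packed, int_to_dotted_quad(packed))
```

The most significant octet is written first, so the leftmost number in the dotted form occupies the top eight bits of the address.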
The Internet is a network of networks. Although most people think of these networks as major networks, such as those belonging to companies like AT&T, MCI, and Sprint, the networks that make up the Internet are actually local area networks, such as the network in your office building or the network in a small research laboratory. Each of these small networks is given its own network number.
There are two methods of looking at network numbers. The "classical" network numbers were distinguished by a unique prefix of bits in the address of each host in the network. This approach partitioned the address space into a well-defined set of different-sized networks. However, several of these networks had large "holes": sets of host addresses that were never used. With the explosion of sites on the Internet, a somewhat different interpretation of network addresses has been proposed, to result in some additional addresses that can be assigned to networks and hosts. This approach is the CIDR (Classless InterDomain Routing) scheme. We briefly describe both schemes below.
The CIDR method may not be adequate to provide addresses for all the expected hosts on the network; therefore, as we've mentioned, a new protocol, IPv6, is being developed. This new protocol will provide a bigger address space for hosts and networks, and will provide some additional security features. Host addresses will be 128 bits long in IPv6. As this book goes to press, the features of IPv6 are not completely finalized, so we won't try to detail them here.
There are five primary kinds of IP addresses in the "classical" address scheme; the first few bits of the address (the most significant bits) define the class of network to which the address belongs. The remaining bits are divided into a network part and a host part:
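The way the leading bits select a class can be sketched in code. This is a minimal illustration, assuming the standard classful prefixes (0 for Class A, 10 for Class B, 110 for Class C, 1110 for Class D, and 1111 for Class E); the function name and sample addresses are hypothetical.

```python
def address_class(address):
    """Return the classical class letter for a dotted-quad IPv4 address."""
    first_octet = int(address.split(".")[0])
    if (first_octet & 0b10000000) == 0b00000000:
        return "A"   # leading bit 0
    if (first_octet & 0b11000000) == 0b10000000:
        return "B"   # leading bits 10
    if (first_octet & 0b11100000) == 0b11000000:
        return "C"   # leading bits 110
    if (first_octet & 0b11110000) == 0b11100000:
        return "D"   # leading bits 1110
    return "E"       # leading bits 1111


for sample in ("10.1.2.3", "128.10.2.1", "192.0.2.1", "224.0.0.1", "240.0.0.1"):
    print(sample, address_class(sample))
```

Only the first octet needs to be examined, because the class is decided entirely by the most significant bits of the address.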
In recent years, a new form of address assignment has been developed. This assignment is the CIDR, or Classless InterDomain Routing, method. As the name implies, there are no "classes" of addresses as in the classical scheme. Instead, networks are defined as being the most significant k bits of each address, with the remaining 32-k bits being used for the host part of the address. Thus, a service provider could be given a range of addresses whereby the first 12 bits of the address are fixed at a particular value (the network address), and the remaining 20 bits represent the host portion of the address. This method allows the service provider to allocate up to 2^20 distinct addresses to customers.
In reality, the host portion of an address is further divided into subnets. This subdivision is done by fixing the first j bits of the host portion of the address to some set value, and using the remaining bits for host addresses. Those can be further divided into subnets, and so on. A CIDR-format address is of the form k.j.l.(m...n), where each of the fields is of variable length. Thus, the fictional service-provider network address described above could be subdivided into 1,024 subnets, one for each customer. Each customer would have 10 bits of host address (2^10 = 1,024 addresses), which they could further subdivide into local subnets.
The CIDR scheme is compatible with the classical address format, with Class A addresses using an 8-bit network field, Class B networks using a 16-bit network address, and so on. CIDR is being adopted as this book goes to press. Combined with new developments in IP address rewriting, there is the potential to extend the useful life of IPv4 for many years to come.
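The arithmetic of the service-provider example can be checked with Python's standard ipaddress module. This is only a sketch: the block 10.0.0.0/12 is a hypothetical stand-in for the provider's allocation, chosen because its first 12 bits are fixed and 20 bits remain for hosts.

```python
import ipaddress

# Hypothetical provider block: 12 network bits, 20 host bits.
provider = ipaddress.ip_network("10.0.0.0/12")
print(provider.num_addresses)            # 2**20 addresses in the whole block

# Fix 10 more bits to carve out one subnet per customer (12 + 10 = 22).
customers = list(provider.subnets(new_prefix=22))
print(len(customers))                    # 2**10 customer subnets

# Each customer is left with 32 - 22 = 10 bits of host address.
first = customers[0]
print(first, first.num_addresses)
```

A customer could repeat the same step, calling subnets() on its own block to create local subnets.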
Despite the complexity of the Internet and addressing, computers can easily send each other messages across the global network. To send a packet, most computers simply set the packet's destination address and then send the packet to a computer on their local network called a gateway. If the gateway decides where to send the packet next, the gateway is a router. The router takes care of sending the packet to its final destination by forwarding the packet on to a directly connected gateway that is one step closer to the destination host.
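A router's forwarding decision can be sketched as a lookup in a small routing table. The example below is an illustration only: the table entries and addresses are hypothetical, and the rule used to choose among matching routes (prefer the longest matching prefix) is the common convention rather than something spelled out in the text above.

```python
import ipaddress

# Hypothetical routing table: (destination network, what to do with the packet).
ROUTES = [
    (ipaddress.ip_network("192.0.2.0/24"), "deliver directly on the local network"),
    (ipaddress.ip_network("198.51.100.0/24"), "forward to gateway 192.0.2.254"),
    (ipaddress.ip_network("0.0.0.0/0"), "forward to default gateway 192.0.2.1"),
]


def next_hop(destination):
    """Pick the route for a destination address, preferring the most specific match."""
    address = ipaddress.ip_address(destination)
    matches = [(network, hop) for network, hop in ROUTES if address in network]
    # The default route (0.0.0.0/0) matches everything, so matches is never empty.
    network, hop = max(matches, key=lambda entry: entry[0].prefixlen)
    return hop


print(next_hop("192.0.2.17"))
print(next_hop("203.0.113.5"))
```

The sending host needs only the destination address; each gateway along the way repeats a lookup of this kind in its own table.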
Many organizations configure their internal networks as a large tree. At the root of the tree is the organization's connection to the Internet. When a gateway receives a packet, it decides whether to send it to one of its own subnetworks, or to direct it towards the root.
Out on the Internet, major IP providers such as AT&T, BBN Planet, MCI, and Sprint have far more complicated networks with sophisticated routing algorithms. Many of these providers have redundant networks, so that if one link malfunctions other links can take over.
Nevertheless, from the point of view of any computer on the Internet, routing is transparent, regardless of whether packets are being sent across the room or across the world. The only information that you need to know to make a connection to another computer on the Internet is the computer's 32-bit IP address; you do not need to know the route to the host, or on what type of network the host resides. You do not even need to know if the host is connected by a high-speed local area network, or if it is at the other end of a modem-based SLIP connection. All you need to know is its address, and your packets are on their way.
Of course, if you are the site administrator and you are configuring the routing on your system, you do need to be concerned with a little more than the IP number of a destination machine. You must know at least the addresses of the gateways out of your network so you can configure your routing tables. We'll assume you know how to do that, but we will point out that if your routes are fairly stable and simple, it is safer to set the routes statically than to allow them to be set dynamically with a mechanism such as the routed daemon.
A hostname is the name of a computer on the Internet. Hostnames make life easier for users: they are easier to remember than IP addresses. You can change a computer's IP address but keep its hostname the same. If you think of an IP address as a computer's phone number, think of its hostname as the name under which it is listed in the telephone book. Some hosts can also have more than one address on more than one network. Rather than needing to remember each one, you can remember a single hostname and let the underlying network mechanisms pick the most appropriate addresses to use.
Let us repeat that: a single hostname can have more than one IP address, and a single IP address can be associated with more than one hostname. Both of these facts have profound implications for people who are attempting to write secure network programs.
Hostnames must begin with a letter or number and may contain letters, numbers, and a few symbols, such as the dash (-). Case is ignored. A sample hostname is arthur.cs.purdue.edu. For more information on host names, see RFC 1122 and RFC 1123.
Each hostname has two parts: the computer's machine name and its domain. The computer's machine name is the name to the left of the first period; the domain name is everything to the right of the first period. In our example above, the machine name is arthur and the domain is cs.purdue.edu. The domain name may represent further hierarchical domains if there is a period in the name. For instance, cs.purdue.edu represents the Computer Sciences department domain, which is part of the Purdue University domain, which is, in turn, part of the Educational Institutions domain.
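The split at the first period, and the hierarchy implied by any remaining periods, can be shown in a short sketch. The function names are our own; the hostname is the example used above.

```python
def split_hostname(hostname):
    """Split a hostname at the first period into (machine name, domain)."""
    # Case is ignored in hostnames, so normalize to lowercase first.
    machine, _, domain = hostname.lower().partition(".")
    return machine, domain


def domain_hierarchy(domain):
    """List a domain followed by each of its enclosing domains."""
    labels = domain.split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


machine, domain = split_hostname("arthur.cs.purdue.edu")
print(machine)                    # arthur
print(domain_hierarchy(domain))   # ['cs.purdue.edu', 'purdue.edu', 'edu']
```

A bare machine name with no period yields an empty domain, which is the case in which a resolver may append a default domain.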
Here are some other examples:
whitehouse.gov next.cambridge.ma.us jade.tufts.edu
If you specify a machine name, but do not specify a domain, then your computer might append a default domain when it tries to resolve the name's IP address. Alternatively, your computer might simply return an "unknown host" error message.
Early UNIX systems used a single file called /etc/hosts to keep track of the network address for each host on the Internet. Many systems still use this file today to keep track of the IP addresses of computers on the organization's LAN.
A sample /etc/hosts file for a small organization might look like this:
    # /etc/hosts
    #
    188.8.131.52 server
    184.108.40.206 art
    220.127.116.11 science sci
    18.104.22.168 engineering eng
In this example, the computer called server has the network address 188.8.131.52. The computer called engineering has the address 18.104.22.168. The hostname sci following the computer called science means that sci can be used as a second name, or alias, for that computer.
In the early 1980s, the number of hosts on the Internet started to jump from thousands to tens of thousands and more. Maintaining a single file of host names and addresses soon proved to be impossible. Instead, the Internet adopted a distributed system for hostname resolution known as the Domain Name System (DNS). For more information, see the "Name Service" section later in this chapter.
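How a program might read a file in this format can be sketched as follows. This is a simplified illustration, not the resolver's actual code; the addresses are hypothetical ones from the 192.0.2.0/24 documentation range, and the parser handles only comments, blank lines, and whitespace-separated names.

```python
SAMPLE_HOSTS = """\
# /etc/hosts (hypothetical addresses for illustration)
#
192.0.2.1   server
192.0.2.2   art
192.0.2.3   science sci
192.0.2.4   engineering eng
"""


def parse_hosts(text):
    """Map every name and alias in hosts-file text to its address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        address, *names = line.split()         # first field is the address
        for name in names:                     # the rest are name and aliases
            table[name.lower()] = address
    return table


hosts = parse_hosts(SAMPLE_HOSTS)
print(hosts["science"], hosts["sci"])
```

Because an alias is simply another key that maps to the same address, science and sci resolve identically.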
The Internet Control Message Protocol is used to send messages between gateways and hosts regarding the low-level operation of the Internet. For example, ICMP Echo packets are commonly used to test for network connectivity; the response is usually either an ICMP Echo Reply or an ICMP Destination Unreachable message type. ICMP packets are identified by an 8-bit TYPE field (see Table 16.1):
Although we have included all types for completeness, the most important types for our purposes are types 3, 4, and 5. An attacker can craft ICMP packets of these types to redirect your network traffic away, or to perform a denial of service. If you use a firewall (discussed in Chapter 21, Firewalls), you will want to be sure that these types are blocked or monitored.
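Reading the TYPE field and applying a policy to it can be sketched briefly. The sketch assumes the standard ICMP layout, in which the message begins with an 8-bit type, an 8-bit code, and a 16-bit checksum, and the standard names for types 3, 4, and 5; the policy strings and function names are hypothetical and do not describe any particular firewall product.

```python
import struct

# Standard names for the three types singled out above.
SENSITIVE_ICMP_TYPES = {
    3: "Destination Unreachable",
    4: "Source Quench",
    5: "Redirect",
}


def icmp_type_of(message):
    """Extract the 8-bit TYPE field from the start of a raw ICMP message."""
    icmp_type, _code, _checksum = struct.unpack("!BBH", message[:4])
    return icmp_type


def icmp_action(message):
    """Hypothetical policy: log and drop the sensitive types, accept the rest."""
    icmp_type = icmp_type_of(message)
    if icmp_type in SENSITIVE_ICMP_TYPES:
        return "log-and-drop (%s)" % SENSITIVE_ICMP_TYPES[icmp_type]
    return "accept"


redirect = bytes([5, 1, 0, 0])       # type 5, code 1, checksum left as zero
echo_request = bytes([8, 0, 0, 0])   # type 8 (Echo)
print(icmp_action(redirect), icmp_action(echo_request))
```

Whether to drop or merely log such messages is a site decision; the point of the sketch is only that the decision can be made from the first byte of the message.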
TCP provides a reliable, ordered, two-way transmission stream between two programs that are running on the same or different computers. "Reliable" means that every byte transmitted is guaranteed to reach its destination (or you are notified that the transmission failed), and that each byte arrives in the order in which it is sent. Of course, if the connection is physically broken, bytes that have not yet been transmitted will not reach their destination unless an alternate route can be found. In such an event, the computer's TCP implementation will send an error message to the process that is trying to send or receive characters, rather than give the impression that the link is still operational.
Each TCP connection is attached at each end to a port. Ports are identified by 16-bit numbers. Indeed, at any instant, every connection on the Internet can be identified by a set of two 32-bit numbers and two 16-bit numbers: the IP address of the host at each end of the connection, and the port number at each end.
For example, Figure 16.5 shows three people on three separate workstations logged into a server using the rlogin program. Each process's TCP connection starts on a different host and at a different originating port number, but each connection terminates on the same host (the server) and the same port (513).
The idea that the workstations are all connecting to port number 513 can be confusing. Nevertheless, these are all distinct connections, because each one comes from a different originating host-port pair. The server's end of every connection remains port 513; it is the combination of both addresses and both port numbers that keeps the connections separate.
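The set of four numbers that identifies a connection can be modeled as a tuple. In this sketch the addresses and originating port numbers are hypothetical; only the destination port, 513, comes from the rlogin example.

```python
from collections import namedtuple

# Two 32-bit addresses and two 16-bit ports identify one TCP connection.
Connection = namedtuple("Connection", "src_addr src_port dst_addr dst_port")

# Three workstations logged into the same server on the same port.
connections = {
    Connection("192.0.2.11", 1021, "192.0.2.1", 513),
    Connection("192.0.2.12", 1022, "192.0.2.1", 513),
    Connection("192.0.2.13", 1023, "192.0.2.1", 513),
}

# The server address and port are shared, yet the set still holds three
# entries, because each originating address/port pair is different.
print(len(connections))
print(sorted(conn.dst_port for conn in connections))
```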
The TCP protocol uses two special bits in the packet header, SYN and ACK, to negotiate the creation of new connections. To open a TCP connection, the requesting host sends a packet that has the SYN bit set but does not have the ACK bit set. The receiving host acknowledges the request by sending back a packet that has both the SYN and the ACK bits set. Finally, the originating host sends a third packet, again with the ACK bit set, but this time with the SYN bit unset. This process is called the TCP "three-way handshake," and is shown in Figure 16.6. By looking for packets that have the ACK bit unset, one can distinguish packets requesting new connections from those which are being sent in response to connections that have already been created. This distinction is useful when constructing packet filtering firewalls, as we shall see in Chapter 21.
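The test a packet filter applies to these two bits can be sketched as follows. The bit masks are the standard positions of SYN and ACK in the TCP flags field; the function name and the labels it returns are our own.

```python
SYN = 0x02   # synchronize: used while a connection is being opened
ACK = 0x10   # acknowledge: set on every packet after the first


def classify(flags):
    """Label a TCP packet by the state of its SYN and ACK bits."""
    syn = bool(flags & SYN)
    ack = bool(flags & ACK)
    if syn and not ack:
        return "connection request"       # first packet of the handshake
    if syn and ack:
        return "connection accepted"      # second packet of the handshake
    if ack:
        return "established traffic"      # third packet and all later ones
    return "other"


for flags in (SYN, SYN | ACK, ACK):
    print(hex(flags), classify(flags))
```

A filter that rejects inbound packets classified as "connection request" blocks new connections from outside while still allowing replies to connections opened from inside.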
TCP is used for most Internet services which require the sustained synchronous transmission of a stream of data in one or two directions. For example, TCP is used for remote terminal service, file transfer, and electronic mail. TCP is also used for sending commands to displays using the X Window System.
Table 16.2 identifies some TCP services commonly enabled on UNIX machines. These services and port numbers are usually found in the /etc/services file. (Note that non-UNIX hosts can run most of these services; the protocols are usually specified independent of any particular implementation.)
The User Datagram Protocol provides a simple, unreliable system for sending packets of data between two or more programs running on the same or different computers. "Unreliable" means that the operating system does not guarantee that every packet sent will be delivered, or that packets will be delivered in order. UDP does make its best effort to deliver the packets, however. On a LAN, UDP often approaches 100% reliability.
UDP's advantage is that it has less overhead than TCP; the lower overhead lets UDP-based services transmit information with as much as 10 times the throughput. UDP is used primarily for Sun's Network Filesystem, for NIS, for resolving hostnames, and for transmitting routing information. It is also used for services that aren't affected negatively if they miss an occasional packet because they will get another periodic update later, or because the information isn't really that important. This includes services such as rwho, talk, and some time services.
UDP packets are often broadcast to a given port on every host that resides on the same local area network. Broadcast packets are used frequently for services such as time of day.
As with TCP, UDP packets are also sent from a port on the sending host to another port on the receiving host. Each UDP packet also contains user data. If a program is listening to the particular port and is ready for the packet, it will be received. Otherwise, the packet will be ignored.
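Sending a single datagram from one port to another can be tried on the loopback interface. This is a minimal sketch under stated assumptions: both sockets live in one program, the receiver's port is chosen by the operating system, and the two-second timeout is an arbitrary value.

```python
import socket

# Receiver: listen on a UDP port chosen by the operating system.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)
port = receiver.getsockname()[1]

# Sender: no connection is set up; the datagram is simply addressed and sent.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"what time is it?", ("127.0.0.1", port))

try:
    data, (src_addr, src_port) = receiver.recvfrom(1024)
    print(data, "from port", src_port)
except socket.timeout:
    data = None   # UDP makes no delivery guarantee, so plan for silence
finally:
    sender.close()
    receiver.close()
```

The sender learns nothing about whether the packet arrived; any acknowledgment would have to be built by the application itself.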
Ports are identified by 16-bit numbers. Table 16.3 lists some common UDP ports.
Most Internet services are based on the client/server model. Programs called clients initiate connections over the network to other programs called servers, which wait for the connections to be made. One example of a client/server pair is the network time system. The client program is the program that asks the network server for the time. The server program is the program that listens for these requests and transmits the correct time. In UNIX parlance, server programs that run in the background and wait for user requests are often known as daemons.
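The division of labor between a time client and a time server can be sketched with a pair of TCP sockets on the loopback interface. This is an illustration only: the server handles a single request in a background thread, the port is chosen by the operating system, and the reply format is our own rather than that of any real time protocol.

```python
import socket
import threading
import time


def serve_once(listener):
    """Server side: wait for one client, send the time, and hang up."""
    conn, _peer = listener.accept()
    with conn:
        conn.sendall(time.ctime().encode("ascii") + b"\r\n")


# The server waits for connections on a port chosen by the operating system.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
server = threading.Thread(target=serve_once, args=(listener,), daemon=True)
server.start()

# The client initiates the connection and reads until the server closes it.
chunks = []
with socket.create_connection(("127.0.0.1", port), timeout=2.0) as client:
    while True:
        chunk = client.recv(1024)
        if not chunk:
            break
        chunks.append(chunk)
reply = b"".join(chunks).decode("ascii")

server.join(timeout=2.0)
listener.close()
print(reply.strip())
```

A real daemon would loop, accepting one connection after another, instead of exiting after the first request.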
    % telnet athens.com
    Trying...
    Connected to ATHENS.COM
    Escape character is '^]'.
    4.4 BSD Unix (ATHENS.COM)
    login:
When you type telnet, the client telnet program on your computer (usually the program /usr/bin/telnet, or possibly /usr/ucb/telnet) connects to the telnet server (in this case, named /usr/etc/in.telnetd) running on the computer athens.com. As stated, clients and servers normally reside in different programs. One exception to this rule is the sendmail program, which includes the code for both the server and a client, bundled together in a single application.
The telnet program can also be used to connect to any other TCP port that has a process listening. For instance, you might connect to port 25 (the SMTP port) to fake some mail without going through the normal mailer:
    % telnet control.mil 25
    Trying 188.8.131.52 ...
    Connected to hq.control.mil.
    Escape character is '^]'.
    220-hq.control.mil Sendmail 8.6.10 ready at Tue, 17 Oct 1995 20:00:09 -0500
    220 ESMTP spoken here
    HELO kaos.org
    250 hq.control.mil Hello kaos.org, pleased to meet you
    MAIL FROM:<firstname.lastname@example.org>
    250 <agent86>... Sender ok
    RCPT TO:<email@example.com>
    250 <agent99>... Recipient ok
    DATA
    354 Enter mail, end with "." on a line by itself
    To: agent99
    From: Max <agent86>
    Subject: tonight

    99, I know I was supposed to take you out to dinner tonight, but I have
    been captured by KAOS agents, and they won't let me out until they
    finish torturing me. I hope you understand.
    Love, Max
    .
    250 UAA01441 Message accepted for delivery
    quit
    221 hq.control.mil closing connection
    Connection closed by foreign host.
    %
As we mentioned, in the early days of the Internet, a single /etc/hosts file contained the address and name of each computer on the Internet. But as the file grew to contain thousands of lines, and as changes to the list of names (or the namespace) started being made on a daily basis, a single /etc/hosts file soon became impossible to maintain. Instead, the Internet developed a distributed network-based naming service called the Domain Name System (DNS).
DNS implements a large-scale distributed database for translating hostnames into IP addresses and vice-versa, and performing related name functions. The software performs this function by using the network to resolve each part of the hostname distinctly. For example, if a computer is trying to resolve the name girigiri.gbrmpa.gov.au, it would first get the address of the root domain server (usually stored in a file) and ask that machine for the address of the au domain server. The computer would then ask the au domain server for the address of the gov.au domain server, and then would ask that machine for the address of the gbrmpa.gov.au domain server. Finally, the computer would ask the gbrmpa.gov.au domain server for the address of the computer called girigiri.gbrmpa.gov.au. (Name resolution is shown in Figure 16.7.) A variety of caching techniques are employed to minimize overall network traffic.
DNS is based on UDP, but can also use a TCP connection for some operations.
The standard UNIX implementation of DNS is called BIND and was originally written at the University of California at Berkeley. This implementation is based on three parts: a library for the client side, and two programs for the server:
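The order in which the domain servers are consulted follows directly from the labels of the name. The sketch below only computes that order; it performs no network queries, the function name is hypothetical, and it assumes (as the example does) that every label to the right of the machine name has its own domain server.

```python
def delegation_chain(hostname):
    """List the domain servers asked, in order, when resolving a hostname."""
    labels = hostname.rstrip(".").split(".")
    zones = ["."]   # resolution starts at the root domain server
    # Work from the rightmost label toward the left, one domain per step,
    # stopping before the machine name itself.
    for i in range(len(labels) - 1, 0, -1):
        zones.append(".".join(labels[i:]))
    return zones


for zone in delegation_chain("girigiri.gbrmpa.gov.au"):
    print("ask the server for", zone)
```

The last server in the list is the one that finally answers with the address of the machine itself; caching lets a resolver skip the earlier steps on later lookups.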
More details about DNS and the BIND name server may be found in the O'Reilly & Associates book DNS and BIND by Paul Albitz and Cricket Liu.
In addition to DNS, there are at least four vendor-specific systems for providing nameservice and other information to networked workstations. They are:
All of these systems are designed to distribute a variety of administrative information throughout a network. All of these systems must also use DNS to resolve hostnames outside the local organization.