6.2 Sidestep Slow Hosts

A slow host is one that requires more than a few seconds to accept delivery of a modestly sized email message. To illustrate, consider the following example produced by a verbose transaction of sending email to such a slow host:

% /usr/sbin/sendmail -v -Rslowhost.com -q 
Running /var/spool/mqueues/q.2/df/f0DHnvO02567 (sequence 1 of 1)
bob@slowhost.com... Connecting to mx.slowhost.com. via esmtp...
220 mx.slowhost.com ESMTP Sendmail 8.10.1/8.10.1; Fri, 13 Dec 2002 10:50:20 -0700 (MST)
>>> EHLO mx.slowhost.com
250-mx.slowhost.com Hello you@yourhost.com [123.45.678.9], pleased to meet you
250-ENHANCEDSTATUSCODES
250-8BITMIME
250-SIZE
250-DSN
250-ONEX
250-ETRN
250-XUSR
250 HELP
>>> MAIL From:<you@yourhost.com> SIZE=16
...
 You wait 2 minutes for slowhost to look you up.
...

This situation can get worse, especially if the slow site runs slow antispam software, because such a site can take 9 or 10 minutes to validate you. During that time sendmail appears to hang, then suddenly continues with:

250 2.1.0 <you@yourhost.com>... Sender ok
>>> RCPT To:<bob@slowhost.com>
250 2.1.5 <bob@slowhost.com>... Recipient ok
>>> DATA
354 Enter mail, end with "." on a line by itself
>>> .

Furthermore, some mail transfer agents (MTAs) start to place a message on disk only after all the data has been received, so writing to an NFS-mounted disk can appear to hang for several seconds:

250 2.0.0 f0DHoNh91321 Message accepted for delivery
bob@slowhost.com... Sent (f0DHoNh91321 Message accepted for delivery)
Closing connection to mx.slowhost.com.
>>> QUIT
221 2.0.0 mx.slowhost.com closing connection

With all of this to contend with, a simple email message to a slow host might not be delivered for many seconds, or even many minutes. In actual practice, 99% of all hosts are very swift to accept mail. But it takes only one message to a slow host to badly degrade your overall delivery performance.
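To see why a few slow hosts matter so much, consider a rough model of a single queue processor that delivers messages one after another. The following sketch is an illustration only: the batch of 1,000 messages, the two-second fast delivery, and the 10-minute slow delivery are assumed figures, and the function name is hypothetical.

```python
def serial_delivery_seconds(messages, slow_percent, fast_secs, slow_secs):
    """Split the serial delivery time of one queue processor into the part
    spent on fast hosts and the part spent on slow hosts (toy model)."""
    slow = messages * slow_percent // 100
    fast = messages - slow
    return fast * fast_secs, slow * slow_secs

# Assumed workload: 1,000 messages, 1% of them bound for slow hosts.
fast_time, slow_time = serial_delivery_seconds(1000, 1, 2, 600)
slow_share = slow_time / (fast_time + slow_time)
print(f"fast hosts: {fast_time}s  slow hosts: {slow_time}s  slow share: {slow_share:.0%}")
```

Under these assumptions, the 1% of messages bound for slow hosts consumes about three quarters of the total delivery time.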

As distributed, the default timeouts for sending messages are generous.^[6] So generous, in fact, that the following defaults (as found in your .cf file) will never prevent delivery to such slow hosts:

^[6] The recommended timeouts are defined in various RFCs. The sendmail program prefers the long recommended timeouts to prevent unwanted delivery failure.

% grep Timeout /etc/mail/sendmail.cf 
O ConnectionCacheTimeout=5m
#O Timeout.initial=5m
#O Timeout.connect=5m
#O Timeout.aconnect=0s
#O Timeout.iconnect=5m
#O Timeout.helo=5m
#O Timeout.mail=10m
#O Timeout.rcpt=1h
#O Timeout.datainit=5m
#O Timeout.datablock=1h
#O Timeout.datafinal=1h
#O Timeout.rset=5m
#O Timeout.quit=2m
#O Timeout.misc=2m
... etc.
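If you want to compare these intervals numerically, the notation is easy to convert to seconds. The following is a minimal sketch, not part of sendmail: the function names are hypothetical, the sample text is a few lines copied from the listing above, and the sketch assumes every value carries an explicit unit (s, m, h, d, or w).

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}

def to_seconds(value):
    """Convert an interval such as '10m' or '1h30m' to seconds."""
    parts = re.findall(r"(\d+)([smhdw])", value)
    # Reject anything that is not made up entirely of number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != value:
        raise ValueError(f"unrecognized interval: {value!r}")
    return sum(int(n) * UNIT_SECONDS[u] for n, u in parts)

CF_LINES = """\
#O Timeout.helo=5m
#O Timeout.mail=10m
#O Timeout.rcpt=1h
#O Timeout.datainit=5m
"""

def parse_timeouts(text):
    """Return {name: seconds} for each simple Timeout line, commented out or not."""
    found = {}
    for line in text.splitlines():
        match = re.match(r"#?O\s+Timeout\.(\w+)=(\S+)", line)
        if match:
            found[match.group(1)] = to_seconds(match.group(2))
    return found

print(parse_timeouts(CF_LINES))
```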

The Timeout.mail=10m, for example, says that sendmail will wait up to 10 minutes for the receiving site to reply to its MAIL FROM: command. During a nine-minute wait such as the one just described, that particular queue-processing daemon does nothing but wait for a reply. If you deliver many messages to such a slow host, you might find many queue-processing daemons blocked in parallel, waiting for replies. If you were to do a process listing, you would find many sendmail daemons in client MAIL states. For example:

% /usr/ucb/ps axw | grep sendmail | grep -v grep
 2600 ?  IW    0:00 sendmail: f0DHoNh91321 slowhost.com [123.45.67.89] client MAIL
 2608 ?  IW    0:00 sendmail: f0DIofY02647 slowhost.com [123.45.67.89] client MAIL
 2642 ?  IW    0:00 sendmail: f0DIorg02649 slowhost.com [123.45.67.89] client MAIL

Here, three queue-processing daemons wait for a reply to the MAIL FROM: command. None has accumulated much CPU time (the 0:00) because all spend most of their time blocked waiting for input.
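On a busy machine you might prefer to summarize such a listing rather than read it. The following sketch tallies blocked processes by host and SMTP state. It assumes the process titles have the layout shown in the listing above (queue identifier, host, address, the word client, then the state); the function name is hypothetical, and the sample text is the listing itself.

```python
from collections import Counter

PS_OUTPUT = """\
 2600 ?  IW    0:00 sendmail: f0DHoNh91321 slowhost.com [123.45.67.89] client MAIL
 2608 ?  IW    0:00 sendmail: f0DIofY02647 slowhost.com [123.45.67.89] client MAIL
 2642 ?  IW    0:00 sendmail: f0DIorg02649 slowhost.com [123.45.67.89] client MAIL
"""

def blocked_clients(ps_text):
    """Count (host, state) pairs for sendmail processes in a client state."""
    counts = Counter()
    for line in ps_text.splitlines():
        _, sep, title = line.partition("sendmail: ")
        if not sep or " client " not in title:
            continue  # not a sendmail process acting as an SMTP client
        fields = title.split()
        # Assumed layout: queue-id host [address] client STATE
        counts[(fields[1], fields[-1])] += 1
    return counts

for (host, state), count in blocked_clients(PS_OUTPUT).items():
    print(f"{count} process(es) waiting on {host} in client {state}")
```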

Normally, slow hosts are not a problem. However, if your site needs to send high volumes of email rapidly, such slow hosts can prove a serious impediment to performance. Such high-volume sending sites can include those that:

Handle delivery for many mailing lists
Deliver solicited advertising and announcements on behalf of commercial customers
Send large numbers of notices that need to be delivered in a narrow window of time
Function as an ISP in support of a huge number of outbound mailing clients

6.2.1 Run Separate Fast and Slow sendmail Daemons

One way to handle slow hosts is to take advantage of sendmail's tenacity in its continual attempts to send email messages. When sendmail cannot send a message, or when delivery of that message times out, sendmail queues or requeues the message so that delivery can be tried again later. One reason sendmail sets such generous timeouts by default is that it prefers to deliver all messages on the first try.

Real-world experience has consistently demonstrated that most email^[7] is delivered by sendmail in less than two seconds per message per recipient. You can demonstrate this to yourself by looking at sendmail's log files and examining the xdelay= equates. This tendency to deliver most email quickly suggests a strategy in which fast messages are delivered by a "fast" sendmail daemon, and slow messages are handled by separate "slow" queue processors.

^[7] Where the typical message is 4K in size.
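One way to check this claim against your own logs is to extract every xdelay= equate and compute the fraction at or under a chosen limit. The following is a minimal sketch: the function names are hypothetical, and the four log records are made up for illustration (in practice you would read your own syslog file instead).

```python
import re

# Made-up sample records; only the xdelay= equate matters for this sketch.
SAMPLE_LOG = """\
Dec 13 10:50:22 yourhost sendmail[2567]: f0DHnvO02567: to=<amy@fast.example>, delay=00:00:01, xdelay=00:00:01, mailer=esmtp, stat=Sent
Dec 13 10:50:24 yourhost sendmail[2567]: f0DHnvP02568: to=<ben@fast.example>, delay=00:00:02, xdelay=00:00:02, mailer=esmtp, stat=Sent
Dec 13 10:50:24 yourhost sendmail[2567]: f0DHnvQ02569: to=<cat@fast.example>, delay=00:00:00, xdelay=00:00:00, mailer=esmtp, stat=Sent
Dec 13 10:59:25 yourhost sendmail[2567]: f0DHnvR02570: to=<bob@slowhost.com>, delay=00:09:01, xdelay=00:09:00, mailer=esmtp, stat=Sent
"""

# xdelay is logged as HH:MM:SS, optionally preceded by a day count and a plus sign.
XDELAY = re.compile(r"\bxdelay=(?:(\d+)\+)?(\d+):(\d\d):(\d\d)")

def xdelays(log_text):
    """Yield each xdelay= value found in log_text, converted to seconds."""
    for match in XDELAY.finditer(log_text):
        days, hours, minutes, seconds = (int(g or 0) for g in match.groups())
        yield ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def fraction_within(log_text, limit_secs):
    """Return the fraction of logged deliveries with xdelay <= limit_secs."""
    delays = list(xdelays(log_text))
    if not delays:
        raise ValueError("no xdelay= equates found")
    return sum(d <= limit_secs for d in delays) / len(delays)

print(fraction_within(SAMPLE_LOG, 2))
```

If the fraction for your site is high, a short timeout in the range discussed here will defer only a small share of your mail.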

Consider configuring your main sendmail process to be less tolerant of slow hosts by including the following lines in your mc configuration file:

define(`TO', `2s')
define(`confTO_ICONNECT',         `TO')
define(`confTO_CONNECT',          `TO')
define(`confTO_COMMAND',          `TO')
define(`confTO_DATAINIT',         `TO')
define(`confTO_HELO',             `TO')
define(`confTO_HOSTSTATUS',       `TO')
define(`confTO_INITIAL',          `TO')
define(`confTO_MAIL',             `TO')
define(`confTO_QUIT',             `TO')
define(`confTO_RCPT',             `TO')
define(`confTO_RESOLVER_RETRANS', `TO')
define(`confTO_RESOLVER_RETRY',   `TO')
define(`confTO_RSET',             `TO')
define(`confTO_DATABLOCK',        `1m')
define(`confTO_DATAFINAL',        `1m')

The first line defines the m4 macro TO with the value 2s, for two seconds, the timeout used for all the critical outbound timeouts. A macro is used so that you can easily modify this timeout based on your actual needs. Note that the meaning of each timeout is explained in Chapter 24.

To create a configuration file to be used by a queue-processing daemon that runs often, add the preceding lines to a copy of your normal mc file. Then use that copy to create a cf file with a custom name, such as /etc/mail/fast.cf.

To install a "fast" queue-processing sendmail, edit whatever system startup script starts sendmail on your machine. It might, for example, be /etc/rc.local, or /etc/init.d/sendmail, or /etc/rc, or some other file based on your operating system, and will likely contain an invocation such as this:

/usr/sbin/sendmail -bd -q30m

This line runs a listening daemon (the -bd) and a queue processor (the -q30m; see Section 11.8.1) all at once.^[8]

^[8] Beginning with V8.12 sendmail, you might also see a separate invocation for a local mail-submission daemon.

Make a backup copy of your file, then change the earlier invocation into a new two-line invocation, something such as this:

/usr/sbin/sendmail -L sendmail-fast -C /etc/mail/fast.cf -bd
/usr/sbin/sendmail -L sendmail-slow -C /etc/mail/slow.cf -q5m

These two lines replace the original one-line listening daemon and queue-processor invocation. The first creates a listening daemon for acceptance of inbound mail. The second creates a queue processor that processes the queue once every five minutes. The -L command-line switch defines how sendmail will label itself in syslog records.

The first line uses the fast.cf configuration file we created earlier that had short timeouts and is intolerant of slow hosts. Any mail that cannot be sent on the first try will be queued for a later try.

In the second line, the queue processor labeled sendmail-slow retries mail for slow hosts once every five minutes. Its configuration file is called slow.cf and contains generous timeouts to ensure that all queued mail will eventually be delivered.^[9]

^[9] You should probably add a declaration for the MinQueueAge option to this file so that delivery will be retried only once every 30 minutes or so for any given message.

To illustrate, consider a queued message destined for bob@slowhost.com. First sendmail-fast attempts to deliver the message. You can simulate this yourself from the command line like this:

# /usr/sbin/sendmail -C /etc/mail/fast.cf -v -Rslowhost.com -q 
Running /var/spool/mqueues/q.2/df/f0DHnvO02567 (sequence 1 of 1)
bob@slowhost.com... Connecting to mx.slowhost.com. via esmtp...
220 mx.slowhost.com ESMTP Sendmail 8.10.1/8.10.1; Fri, 13 Dec 2002 11:23:42 -0700 (MST)
>>> EHLO mx.slowhost.com
250-mx.slowhost.com Hello you@yourhost.com [123.45.678.9], pleased to meet you
250-ENHANCEDSTATUSCODES
250-8BITMIME
250-SIZE
250-DSN
250-ONEX
250-ETRN
250-XUSR
250 HELP
>>> MAIL From:<you@yourhost.com> SIZE=16
 again a wait of two minutes for slowhost to look you up, but this time the wait times out after two seconds

The message fails to be sent (but does so swiftly, because of the short timeouts), so sendmail-fast queues it for a later delivery attempt.

Once every five minutes, the "slow" queue-processing daemon will attempt to deliver the message. Again you can simulate this for yourself from the command line like this:

% /usr/sbin/sendmail -C /etc/mail/slow.cf -v -Rslowhost.com -q 
Running /var/spool/mqueues/q.2/df/f0DHnvO02567 (sequence 1 of 1)
bob@slowhost.com... Connecting to mx.slowhost.com. via esmtp...
220 mx.slowhost.com ESMTP Sendmail 8.10.1/8.10.1; Fri, 13 Dec 2002 11:35:10 -0700 (MST)
>>> EHLO mx.slowhost.com
250-mx.slowhost.com Hello you@yourhost.com [123.45.678.9], pleased to meet you
250-ENHANCEDSTATUSCODES
250-8BITMIME
250-SIZE
250-DSN
250-ONEX
250-ETRN
250-XUSR
250 HELP
>>> MAIL From:<you@yourhost.com> SIZE=16
 again a wait of two minutes

In this instance, you again wait two minutes for slowhost to look up your site. Even if all the waits combine to 15 minutes, the message will eventually be delivered because the "slow" queue processor has generous timeouts.

By combining short-timeout with normal-timeout queue processors, slow hosts can be prevented from bogging down the normal outflow of email.
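The effect of the short timeout can be illustrated with a toy model of one pass by the "fast" daemon. This sketch is a simplification under stated assumptions: it treats the timeout as applying once per message (real timeouts apply to each SMTP phase), the workload of 990 one-second hosts and 10 nine-minute hosts is invented, and the function name is hypothetical.

```python
def fast_pass(response_secs, timeout_secs):
    """Model one serial pass with a short timeout.

    Returns (elapsed seconds, number delivered, response times of deferred mail).
    """
    elapsed, delivered, deferred = 0, 0, []
    for secs in response_secs:
        if secs <= timeout_secs:
            elapsed += secs
            delivered += 1
        else:
            elapsed += timeout_secs  # give up at the timeout and queue the message
            deferred.append(secs)
    return elapsed, delivered, deferred

# Assumed workload: 990 hosts that answer in 1 second, 10 that take 9 minutes.
workload = [1] * 990 + [540] * 10

elapsed, delivered, deferred = fast_pass(workload, timeout_secs=2)
print(f"short timeouts: {delivered} delivered in {elapsed}s, {len(deferred)} left for the slow daemon")
print(f"generous timeouts only: all {len(workload)} delivered in {sum(workload)}s")
```

In this model the fast pass finishes in 1,010 seconds instead of 6,390, and only the 10 slow messages are left for the "slow" queue processor to retry.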

Note that the timeouts we show in this section are not intended to be authoritative for all sites, and that we have simplified this example for clarity. Many other settings, both inside and outside sendmail, contribute to a successful outflow of email. In addition to understanding the properties of timeouts (see Chapter 24), you should also apply the information in Chapter 9 and combine it with an understanding of the Timeout.resolver.retry option.

Beginning with V8.12 sendmail, you can use queue groups (Section 11.4) to divide mail into separate groups of queues. If you know beforehand, for example, that the domain slowhost.com is always slow, you can use queue groups to have all its mail queued onto inexpensive slow disks. All undefined domains would then be queued onto expensive fast disks. Queue groups, however, cannot be used to set different timeouts per group. Instead, you must use separate configuration files as we have illustrated.

6.2.2 Run a Fallback Host

Another alternative for handling slow email, if you can spare the extra machine, is to set up a separate host with generous timeouts. This "fallback" host is given all mail that fails to be delivered on the first try by other hosts on your network.

You cause failed messages to be sent to that machine by using the FallbackMXhost option on your fast mail machine. In addition to the short timeouts that we showed in the previous section, you could also add one of the following declarations to the mc configuration for your fast.cf file:

define(`confFALLBACK_MX', `IP-number')
define(`confFALLBACK_MX', `hostname')

You declare this option with either the IP number of the fallback host or the hostname of the fallback host.

This causes all failed mail to be forwarded to the fallback host, which then attempts to deliver all the problem messages that the fast hosts could not. Because most email is fast, you can expect the fallback host to handle only about 5% to 10% of your total mail volume. But, because unexpected failures are a way of life with email, you should also plan for the fallback host to get half or more of your outbound email in a pinch, and size its disks accordingly.
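A back-of-the-envelope calculation shows how far apart the normal and worst cases are when sizing disks. The following sketch uses the 4K typical message size noted earlier; the daily volume of one million messages is purely hypothetical, the function name is ours, and the model counts only raw message data for one day's worth of deferred mail (it ignores queue-file overhead and filesystem block rounding).

```python
AVG_MESSAGE_BYTES = 4 * 1024   # the 4K typical message size noted earlier
DAILY_MESSAGES = 1_000_000     # hypothetical outbound volume per day

def queued_mib(percent_to_fallback):
    """MiB of message data if this percentage of a day's mail sits in the queue."""
    messages = DAILY_MESSAGES * percent_to_fallback // 100
    return messages * AVG_MESSAGE_BYTES / 2**20

for percent in (5, 10, 50):
    print(f"{percent:>2}% of mail -> {queued_mib(percent):7.1f} MiB queued")
```

Under these assumptions the queue grows from roughly 200 MiB in the normal case to nearly 2 GiB in a pinch, a tenfold difference that the fallback host's disks must absorb.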

In theory you could extend this fallback host idea to a series of fallback hosts, where each is given progressively longer timeouts. In actual practice, however, a single fallback host tends to be sufficient because email is generally very fast or very slow. There is rarely any middle ground.

Instead of a series of hosts, consider using different timeouts for initial and subsequent attempts. When a message is first forwarded to a fallback host, the fallback host immediately tries to deliver it. That first, immediate attempt is called the initial attempt. If a message fails to be delivered on the initial attempt, it remains queued on the fallback host for subsequent attempts.

V8.8 and above sendmail allows you to set different timeouts for the initial connection and for subsequent connections. These are timeouts for establishing a TCP/IP or other network connection. Here is a way to set up part of your mc file on the fallback host:

define(`ITO', `20s')dnl initial timeout
define(`TO',  `5m')dnl subsequent timeout
define(`confTO_ICONNECT',        `ITO')
define(`confTO_CONNECT',          `TO')
define(`confTO_COMMAND',          `TO')
define(`confTO_DATAINIT',         `TO')
define(`confTO_HELO',             `TO')
define(`confTO_HOSTSTATUS',       `TO')
define(`confTO_INITIAL',          `TO')
define(`confTO_MAIL',             `TO')
define(`confTO_QUIT',             `TO')
define(`confTO_RCPT',             `TO')
define(`confTO_RESOLVER_RETRANS', `TO')
define(`confTO_RESOLVER_RETRY',   `TO')
define(`confTO_RSET',             `TO')
define(`confTO_DATABLOCK',        `1m')
define(`confTO_DATAFINAL',        `1m')

The initial connection attempt will time out after 20 seconds. Subsequent connection attempts will time out after five minutes. None of the other timeouts shares this idea of initial versus subsequent timeouts. If two sets of distinctly different timeouts are important to you, you can employ that strategy by running two different daemons as shown in the previous section, but this time running them on the fallback host with much longer timeouts. One daemon would accept network connections and have medium timeouts. A separate queue-processing daemon (using a separate configuration file) would have longer timeouts to ensure delivery of all remaining mail.

On the fallback host, note that the message failed twice before it was turned over to the queue-processing daemon. It failed once on the fast server, and so was punted to the fallback host. It failed again when it was immediately retried on the fallback host, and was then left in the queue. Because failure is likely, the queue interval for the queue-processing daemon on the fallback host should be long. We suggest something in the range of one to several hours.

If you are running a very large site, you might need to run multiple fallback hosts. To do this you need to run V8.12 or above sendmail because only those versions look up MX records (Section 9.3) for the fallback host, and add those records to the list of fallback MX host addresses. If the DNS zone file for these fallback MX hosts lists MX records with equal costs, the additional MX records will be added in random order. For example, one way to set up part of such a zone file might look like this:

fallback1     IN      A     123.45.67.81
fallback2     IN      A     123.45.67.82
fallback3     IN      A     123.45.67.83
fallback4     IN      A     123.45.67.84
fallback5     IN      A     123.45.67.85

fallback      IN      MX 10 fallback1
              IN      MX 10 fallback2
              IN      MX 10 fallback3
              IN      MX 10 fallback4
              IN      MX 10 fallback5

Here the costs are all equal (the 10s), so any of the hosts fallback1 through fallback5 is equally likely to receive a failed message.
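The effect of equal costs can be modeled with a short simulation. This sketch is not sendmail's own code: it simply sorts records by cost while randomizing the order among records that share a cost, then counts which host lands first over many trials. The function name and the trial count are our own choices.

```python
import random
from collections import Counter

# The five equal-cost MX records from the zone file fragment above.
MX_RECORDS = [(10, f"fallback{n}") for n in range(1, 6)]

def order_mx(records, rng):
    """Sort by cost, leaving records of equal cost in random order."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    # sorted() is stable, so the shuffled order survives among equal costs.
    return sorted(shuffled, key=lambda record: record[0])

rng = random.Random(0)  # seeded only so that repeated runs match
first_choice = Counter(order_mx(MX_RECORDS, rng)[0][1] for _ in range(10_000))

for host in sorted(first_choice):
    print(host, first_choice[host])
```

Each host should be chosen first in roughly one fifth of the trials. If you gave one host a lower cost than the others, it would always be tried first.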

Finally, consider using V8.12 sendmail's queue groups (Section 11.4) on the fallback host. With queue groups you can dedicate a separate disk or disks to each of the several well-known large ISPs. If you run only a few queue processors in each queue group, a large site that is down will have little impact, yet delivery will still be mildly parallel and reasonably fast when that site comes back up.