9.5 Prepare for Disaster

Disasters can take many forms and, by their very nature, are unexpected. If DNS and mail are to continue to work, expecting the unexpected is vital. The kinds of disasters that one must anticipate vary from the mundane to the catastrophic:

A reboot or scheduled downtime for maintenance on the mail or DNS server should only cause mail to be delayed, not lost.
A failed component on the mail or DNS server could cause mail delivery to be delayed anywhere from a few hours to a few days. A delay of more than three to five days could cause many hosts to bounce queued mail unless steps are taken to receive that mail elsewhere.
Natural disasters can disrupt site or network connectivity for weeks. The Loma Prieta earthquake on the West Coast of the United States lasted only a few minutes but knocked out electric power to many areas for far longer. Fear of gas leaks prevented repowering many buildings for up to two weeks. A hurricane, flood, fire, or even an errant backhoe could knock out your institution for weeks.

9.5.1 Offsite MX Hosts

When mail can't be received, whether because of a small event or a large disaster, an offsite MX host can save the day. An offsite MX host is simply another machine that can receive mail for your site when your site is unavailable. The location of the offsite machine depends on your situation. For a subdomain at one end of a microwave link, having an offsite host on the other side of the microwave might be sufficient. For a large site, such as a university, a machine at another university (possibly in a different state or country) would be wise.

Before we show how to set up offsite MX hosts, note that offsite MX hosts are a mixed blessing. If an offsite MX host does not handle mail reliably, you could lose mail. In many cases it is better not to have an offsite MX host than to have an unreliable one. Without an offsite MX host, mail will normally be queued on the sending host until your site can receive it again. A reliable MX backup is useful, but an unreliable one is a disaster.

You should not unilaterally select a host to function as an offsite MX host. To set up an offsite MX host, you need to negotiate with the managers of other sites. By mutual agreement, another site's manager will configure that other machine to accept mail bound for your site (possibly queueing weeks' worth of mail) and configure that site to forward that mail to yours when your site comes back up. Naturally you should do the same thing for that site if requested.

For example, suppose your site is in the state of Iowa, in the United States. Further suppose that in Northern Japan there is a site with which you are friendly. You could negotiate with that site's manager to receive and hold your mail in a disaster. When the site is set up to do so, you first add a high-cost MX record for it:

mailhost.uiowa.edu.   IN   MX 2     mailhost.uiowa.edu.
mailhost.uiowa.edu.   IN   MX 10    backup.uiowa.edu.
mailhost.uiowa.edu.   IN   MX 900   pacific.north.jp.
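
The effect of the high-cost record can be illustrated with a minimal Python sketch. It assumes that a sending host tries MX targets in order of ascending cost and uses the first one it can reach. The names delivery_order and first_reachable are hypothetical helpers written for this illustration, not part of sendmail or of any DNS library.

```python
# MX records from the example above, as (cost, host) pairs.
mx_records = [
    (900, "pacific.north.jp."),
    (2, "mailhost.uiowa.edu."),
    (10, "backup.uiowa.edu."),
]

def delivery_order(records):
    """Return the MX targets sorted by ascending cost (lowest cost first)."""
    return [host for _, host in sorted(records)]

def first_reachable(records, reachable):
    """Return the lowest-cost target found in the reachable set, or None."""
    for host in delivery_order(records):
        if host in reachable:
            return host
    return None

# Normal operation: every host is up, so the primary receives the mail.
print(first_reachable(mx_records, {h for _, h in mx_records}))

# Disaster: only the offsite host can be reached.
print(first_reachable(mx_records, {"pacific.north.jp."}))
```

With everything up, the sketch picks mailhost.uiowa.edu.; with the whole local site down, it falls through to pacific.north.jp., which is the behavior the cost of 900 is meant to produce.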

To be sure the MX works, send mail to yourself via that new MX site:^[14]

% mail you%mailhost.uiowa.edu@pacific.north.jp

^[14] This example presumes that pacific.north.jp can handle the % "hack." Most places do, so this is probably a safe assumption. If they don't, just use the ConnectOnlyTo option (ConnectOnlyTo) to send mail to yourself directly through them, or add a temporary mailertable entry (FEATURE(mailertable)).

Here, the % in the address causes the message to first be delivered to pacific.north.jp. That machine then throws away its own name and converts the remaining % to an @. The result is then mailed back to you at you@mailhost.uiowa.edu.
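
The rewrite just described can be sketched in Python. This is only an illustration of the address transformation, not sendmail's actual rule sets. The function name percent_hack is hypothetical, and the choice to convert the rightmost % when several are present is an assumption made for the sketch.

```python
def percent_hack(address, local_host):
    """Rewrite user%target@local_host to user@target, as the relay would."""
    local_part, at, domain = address.rpartition("@")
    # Only rewrite addresses that name this relay after the @.
    if not at or domain.lower().rstrip(".") != local_host.lower().rstrip("."):
        return address
    user, pct, target = local_part.rpartition("%")
    if not pct:
        return address  # no % present, so this is ordinary local mail
    # Drop the relay's own name and turn the rightmost % into an @.
    return f"{user}@{target}"

print(percent_hack("you%mailhost.uiowa.edu@pacific.north.jp", "pacific.north.jp"))
```

For the test address above, the sketch yields you@mailhost.uiowa.edu, the address the relay then delivers to.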

This verifies that the disaster MX machine can get mail to your site when your site returns to service.

For this scheme to work, the mail administrator at pacific.north.jp must allow mail bound for your site to be relayed through that site. The easy way to do this with V8.10 sendmail and above is to add uiowa.edu to the /etc/mail/relay-domains file.

During a disaster the first sign of trouble will be mail for your site suddenly appearing in the queue at pacific.north.jp. The manager there should notice and set up a separate queue to hold the incoming mail until your site returns to service (Section 11.9.1). When your site recovers, you can contact that manager and arrange for a queue run to deliver the backlog of mail.

If your site is out of service for weeks, the backlog of mail might be partially on tape or some other backup medium. You might even want to negotiate an artificially slow feed so that your local spool directory won't overfill, or for them to send you the backup media so that you can recover it yourself.

Even in minor disasters an MX host can save much grief because delivery will be serialized. Without an MX host, every machine in the world that had mail for your machine might try to send it at nearly the same time—that is, soon after your machine returns to service. That could overload your machine and even crash it, causing the problem to repeat over and over.

9.5.2 Offsite Servers

A disaster MX is good only as long as your DNS services stay alive to advertise it. Most sites have multiple name server machines to balance the load of DNS lookups and to provide redundancy in case one fails. Unfortunately, few sites have offsite name servers as a hedge against disaster. Consider the disaster MX record developed earlier:

mailhost.uiowa.edu.   IN   MX 900   pacific.north.jp.

Ideally, one would want pacific.north.jp to queue all mail until the local site is back in service. Unfortunately, all DNS records contain a time to live (TTL) that might or might not be present in the declaration line:

mailhost.uiowa.edu.   IN   MX 900   pacific.north.jp.
                    ^
                   TTL implied

mailhost.uiowa.edu. 86400  IN   MX 900   pacific.north.jp.
                     ^
                    TTL specified as 24 hours in seconds

When other sites look up the local site, they cache this record. They will not look it up again until 24 hours have passed. Therefore, if an earthquake strikes and takes your name servers offline, other sites will forget this record within 24 hours and will be unable to look it up again.

In general, records set up for disaster purposes should be given TTLs that are greater than a month:

mailhost.uiowa.edu. 3600000  IN   MX 900   pacific.north.jp.
                     ^
                    TTL specified as 41 days in seconds
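
The TTL values used in these examples are plain counts of seconds, and the arithmetic is easy to check. The following Python sketch converts a TTL into whole days and leftover hours; describe_ttl is a hypothetical helper written only for this illustration.

```python
from datetime import timedelta

def describe_ttl(seconds):
    """Split a TTL in seconds into whole days and leftover whole hours."""
    span = timedelta(seconds=seconds)
    return span.days, span.seconds // 3600

for ttl in (86400, 3600000):
    days, hours = describe_ttl(ttl)
    print(f"{ttl} seconds = {days} days, {hours} hours")
```

A TTL of 86400 is exactly one day, and 3600000 works out to 41 days and 16 hours, comfortably more than a month.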

But note that TTLs should be the same for all records so that they all time out together:

mailhost.uiowa.edu.  3600000 IN   MX 2     mailhost.uiowa.edu.
mailhost.uiowa.edu.  3600000 IN   MX 10    backup.uiowa.edu.
mailhost.uiowa.edu.  3600000 IN   MX 900   pacific.north.jp.

If you gave the disaster record a long TTL and left the default for your normal MX records, your normal records would time out and disappear from other sites' caches. This would result in all mail suddenly and mysteriously going to the disaster host when there was no disaster to cause it.
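
The failure mode just described can be modeled with a short Python sketch. It is a deliberately simplified model that follows the reasoning in the text: each record is treated as cached independently and dropped once its own TTL has elapsed. Real resolvers may treat a set of records with the same name and type differently. The function cached_mx and the 24-hour value assumed for the default TTL are illustrative choices, not taken from any DNS implementation.

```python
DAY = 86400  # seconds in one day

def cached_mx(records, elapsed):
    """Return the (cost, host) pairs still cached after elapsed seconds.

    records is a list of (ttl_seconds, cost, host) tuples.
    """
    return sorted((cost, host) for ttl, cost, host in records if elapsed < ttl)

# Long TTL on the disaster record only; 24 hours assumed for the others.
mismatched = [
    (DAY, 2, "mailhost.uiowa.edu."),
    (DAY, 10, "backup.uiowa.edu."),
    (3600000, 900, "pacific.north.jp."),
]

# The same records with one long TTL applied to all of them.
uniform = [(3600000, cost, host) for _, cost, host in mismatched]

two_days = 2 * DAY
print(cached_mx(mismatched, two_days))  # only the disaster host survives
print(cached_mx(uniform, two_days))     # all three records survive
```

After two days the mismatched set leaves only pacific.north.jp. in the cache, so a sender relying on that cache would route mail offsite even though nothing is wrong. With uniform TTLs the lowest-cost record is still present and is tried first.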

Note that long TTLs can negatively affect DNS file updates. Updates won't take effect until the TTL times out. If you anticipate a future change (say, a rearrangement of your MX records), you can change the TTLs to two hours, wait for the long TTL to time out everywhere (about 41 days for the records shown above), then make and test your changes.^[15]

^[15] And hope that no disaster strikes in the meanwhile. A couple of better techniques are to set up an offsite secondary DNS server with a large TTL in the SOA record; or to set up an offsite primary DNS server with a large TTL that keeps its records synchronized with yours via rsync (or some other protocol).
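
The waiting period involved in lowering a long TTL can be worked out with a few lines of Python. This sketch assumes the worst case, in which some remote cache fetched the old record just before the TTL was lowered. The helper name earliest_safe_change and the date used are hypothetical.

```python
from datetime import datetime, timedelta

def earliest_safe_change(lowered_at, old_ttl_seconds):
    """Return the time by which every cached copy with the old TTL has expired."""
    return lowered_at + timedelta(seconds=old_ttl_seconds)

# Hypothetical: the TTLs were lowered to two hours at midnight on this date.
lowered = datetime(2024, 1, 1)
print(earliest_safe_change(lowered, 3600000))
```

Only after that time can you count on the two-hour TTL being the one that remote sites honor.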

If many hosts at your site receive mail (rather than a central mail server), it is necessary to add a disaster record for each. Unfortunately, when the number of such hosts at your site is greater than 100 or so, individual disaster MX records become difficult to manage simply because of scale.

At such sites, a better method of disaster preparedness is to set up pacific.north.jp as another primary DNS server for the local site. There are two advantages to this "authoritative" backup server approach:

An offsite primary server eliminates the need to set up individual MX disaster records.
An out-of-country primary server can lower the network impact of DNS lookups of your site.

Unfortunately, setting up an offsite or out-of-country server can be difficult. We won't show you how to do that here. Instead, we refer you to the book DNS and BIND by Paul Albitz and Cricket Liu (O'Reilly & Associates, 4th Edition, 2001).