9.5 Prepare for Disaster
Disasters can take many forms and, by their very nature, are
unexpected. If DNS and mail are to continue to work, expecting the
unexpected is vital. The kinds of disasters that one must anticipate
vary from the mundane to the catastrophic:
A reboot or scheduled downtime for maintenance on the mail or DNS
server should only cause mail to be delayed, not lost.
A failed component on the mail or DNS server could cause mail
delivery to be delayed anywhere from a few hours to a few days. A
delay of more than three to five days could cause many hosts to
bounce queued mail unless steps are taken to receive that mail
elsewhere.
Natural disasters can disrupt site or network connectivity for weeks.
The Loma Prieta earthquake on the West Coast of the United States
lasted only a few minutes but knocked out electric power to many
areas for far longer. Fear of gas leaks prevented repowering many
buildings for up to two weeks. A hurricane, flood, fire, or even an
errant backhoe could knock out your institution for weeks.
9.5.1 Offsite MX Hosts
When mail can't be received, whether because of a small
event or a large disaster, an offsite MX host can save the day. An
offsite MX host is simply another machine that can receive mail for
your site when your site is unavailable. The location of the offsite
machine depends on your situation. For a subdomain at one end of a
microwave link, having an offsite host on the other side of the
microwave might be sufficient. For a large site, such as a
university, a machine at another university (possibly in a different
state or country) would be wise.
Before we show how to set up offsite MX hosts, note that offsite MX
hosts are a mixed blessing. If an offsite MX host does not handle
mail reliably, you could lose mail. In many cases it is better not to
have an offsite MX host than to have an unreliable one. Without an MX
site, mail will normally be queued on the sending host. A reliable MX
backup is useful, but an unreliable one is a disaster.
You should not unilaterally select a host to function as an offsite
MX host. To set up an offsite MX host, you need to negotiate with the
managers of other sites. By mutual agreement, another
site's manager will configure that other machine to
accept mail bound for your site (possibly queueing
weeks' worth of mail) and configure that site to
forward that mail to yours when your site comes back up. Naturally
you should do the same thing for that site if requested.
For example, suppose your site is in the state of Iowa, in the United
States. Further suppose that in Northern Japan there is a site with
which you are friendly. You could negotiate with that
site's manager to receive and hold your mail in a
disaster. When the site is set up to do so, you first add a high-cost
MX record for it:
mailhost.uiowa.edu. IN MX 2 mailhost.uiowa.edu.
mailhost.uiowa.edu. IN MX 10 backup.uiowa.edu.
mailhost.uiowa.edu. IN MX 900 pacific.north.jp.
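The effect of the cost values can be illustrated with a short Python sketch. It is a simplified model, not sendmail's actual code: the function names `order_mx` and `first_reachable` are hypothetical, and real senders also handle details such as ties between equal-cost hosts. It shows only that the lowest-cost reachable host wins, so the cost-900 host is used only when the others are down.

```python
def order_mx(records):
    """Return MX hosts sorted by cost, lowest (most preferred) first."""
    return [host for cost, host in sorted(records)]


def first_reachable(records, reachable):
    """Return the most preferred MX host that is currently reachable.

    `reachable` is a set of hostnames assumed to be up (an illustrative
    stand-in for actually attempting an SMTP connection).
    """
    for host in order_mx(records):
        if host in reachable:
            return host
    return None  # no MX host answers; the sender keeps the mail queued


# The three MX records from the example, as (cost, host) pairs.
records = [
    (900, "pacific.north.jp."),
    (2, "mailhost.uiowa.edu."),
    (10, "backup.uiowa.edu."),
]

# Normal operation: the local mail host is chosen.
print(first_reachable(records, {"mailhost.uiowa.edu.", "pacific.north.jp."}))

# Disaster: only the offsite host answers.
print(first_reachable(records, {"pacific.north.jp."}))
```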
To be sure the MX works, send mail to yourself via that new MX site:
% mail you%mailhost.uiowa.edu@pacific.north.jp
Here, the % in the
address causes the message to first be delivered to
pacific.north.jp. That machine then throws away
its own name and converts the remaining % to an
@. The result is then mailed back to you at
you@mailhost.uiowa.edu.
This verifies that the disaster MX machine can get mail to your site
when your site returns to service.
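The address rewriting just described can be sketched in Python. This is only an illustration of the two steps (discard the relay's own name, then turn the remaining % into an @); the function name `percent_hack_next_hop` is hypothetical, and a real relay applies its own rule sets and relaying policy.

```python
def percent_hack_next_hop(address, relay_host):
    """Rewrite user%host@relay to user@host, as the relay would.

    Addresses not addressed to `relay_host`, or without a %, are
    returned unchanged.
    """
    local_part, at, domain = address.rpartition("@")
    if not at or domain.lower() != relay_host.lower():
        return address  # not addressed to this relay
    user, percent, host = local_part.rpartition("%")
    if not percent:
        return address  # ordinary address with no % to convert
    # The relay's own name is discarded and the rightmost % becomes @.
    return f"{user}@{host}"


print(percent_hack_next_hop(
    "you%mailhost.uiowa.edu@pacific.north.jp", "pacific.north.jp"))
```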
For this scheme to work, the mail administrator at
pacific.north.jp will need to allow all mail bound for your site to be
relayed through that machine. The easy way to do this with V8.10
sendmail and above is to add uiowa.edu to the
/etc/mail/relay-domains file.
During a disaster the first sign of trouble will be mail for your
site suddenly appearing in the queue at
pacific.north.jp. The manager there should
notice and set up a separate queue to hold the incoming mail until
your site returns to service (Section 11.9.1). When
your site recovers, you can contact that manager and arrange for a
queue run to deliver the backlog of mail.
If your site is out of service for weeks, part of the backlog of mail
might be on tape or some other backup medium. You might even want
to negotiate an artificially slow feed so that your local spool
directory won't overfill, or ask that site to send you
the backup media so that you can recover the mail yourself.
Even in minor disasters an MX host can save much grief because
delivery will be serialized. Without an MX host, every machine in the
world that had mail for your machine might try to send it at nearly
the same time—that is, soon after your machine returns to
service. That could overload your machine and even crash it, causing
the problem to repeat over and over.
9.5.2 Offsite Servers
A disaster MX is good only as long as your DNS services stay alive to advertise
it. Most sites have multiple name server machines to balance the load
of DNS lookups and to provide redundancy in case one fails.
Unfortunately, few sites have offsite name servers as a hedge against
disaster. Consider the disaster MX record developed earlier:
mailhost.uiowa.edu. IN MX 900 pacific.north.jp.
Ideally, one would want pacific.north.jp to
queue all mail until the local site is back in service.
Unfortunately, all DNS records contain a
time to live (TTL) that might or might not be present in the
declaration line:
mailhost.uiowa.edu. IN MX 900 pacific.north.jp.        ; TTL implied
mailhost.uiowa.edu. 86400 IN MX 900 pacific.north.jp.  ; TTL specified as 24 hours in seconds
When other sites look up the local site, they cache this record. They
will not look it up again until 24 hours have passed. Therefore, if
an earthquake strikes, all other sites will forget about this record
after 24 hours and, with your name servers unreachable, will not be
able to look it up again.
In general, records set up for disaster purposes should be given TTLs
that are greater than a month:
mailhost.uiowa.edu. 3600000 IN MX 900 pacific.north.jp.  ; TTL specified as 41 days in seconds
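The TTL arithmetic is easy to check. The following Python sketch (the helper name `ttl_in_days` is purely illustrative) converts the two TTL values used in this section from seconds into hours and days:

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def ttl_in_days(ttl_seconds):
    """Convert a DNS TTL in seconds to days."""
    return ttl_seconds / SECONDS_PER_DAY


print(86400 / SECONDS_PER_HOUR)         # 24.0 hours
print(round(ttl_in_days(3600000), 2))   # 41.67 days, comfortably over a month
```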
But note that TTLs should be the same for all records so that they
all expire at the same time:
mailhost.uiowa.edu. 3600000 IN MX 2 mailhost.uiowa.edu.
mailhost.uiowa.edu. 3600000 IN MX 10 backup.uiowa.edu.
mailhost.uiowa.edu. 3600000 IN MX 900 pacific.north.jp.
If you gave the disaster record a long TTL and left the default for
your normal MX records, your normal records would time out and
disappear from other sites' caches. This would
result in all mail suddenly and mysteriously going to the disaster
host when there was no disaster to cause it.
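The problem with mismatched TTLs can be illustrated with a deliberately simplified Python model of a remote cache. It assumes, as the paragraph above does, that each record expires independently and that nothing is refreshed in the meantime; the function name `cached_after` and the record layout are hypothetical, and real resolver behavior varies by implementation.

```python
DAY = 86400  # seconds


def cached_after(records, elapsed_seconds):
    """Return the records a remote cache still holds after elapsed_seconds.

    Simplified model: a record is dropped once its TTL has passed, and
    no record is looked up again.
    """
    return [r for r in records if r["ttl"] > elapsed_seconds]


# Mismatched TTLs: default (24 hours) on the normal records, long on
# the disaster record.
mixed = [
    {"ttl": 86400, "cost": 2, "host": "mailhost.uiowa.edu."},
    {"ttl": 86400, "cost": 10, "host": "backup.uiowa.edu."},
    {"ttl": 3600000, "cost": 900, "host": "pacific.north.jp."},
]

# Matching TTLs: every record carries the same long TTL.
uniform = [dict(r, ttl=3600000) for r in mixed]

# After two days, only the disaster host survives in the mixed case.
print([r["host"] for r in cached_after(mixed, 2 * DAY)])
print([r["host"] for r in cached_after(uniform, 2 * DAY)])
```

In the mixed case the only MX left in the cache is the cost-900 host, which is the misdirected-mail situation described above; with uniform TTLs all three records remain.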
Note that long TTLs can negatively affect DNS file updates. Updates
won't take effect at other sites until the old TTL times out. If you
anticipate a future change (say, a rearrangement of your MX
records), you can change the TTLs to two hours, wait for the old
long TTL to time out (more than a month), then make and test your changes.
If many hosts at your site receive mail (rather than a central mail
server), it is necessary to add a disaster record for each.
Unfortunately, when the number of such hosts at your site is greater
than 100 or so, individual disaster MX records become difficult to
manage simply because of scale.
At such sites, a better method of disaster preparedness is to set up
pacific.north.jp as another primary DNS server
for the local site. There are two advantages to this
"authoritative" backup server
approach:
Unfortunately, setting up an offsite or out-of-country server can be
difficult. We won't show you how to do that here.
Instead, we refer you to the book DNS and BIND
by Paul Albitz and Cricket Liu (O'Reilly &
Associates, 4th Edition, 2001).