Failover provides a mechanism for LocalDirector to be redundant by allowing two identical units to serve the same functionality. One LocalDirector unit is considered the "primary" unit while the other is considered the "secondary" unit (determined by the failover cable). The primary unit is also the active unit by default, and it performs normal network functions while the backup unit (standby) only monitors, ready to take control should the active unit fail.
The two units must be running the same version of software (1.6 or later). Configuration replication will occur under the following conditions:
- When the Standby unit completes its initial boot-up, the Active unit replicates its entire configuration to the Standby unit.
- As commands are entered on the Active unit they are sent across to the Standby unit. (The commands are sent via the failover cable.)
- Entering the write standby command on the Active unit forces the entire configuration to the Standby unit.
The active unit uses the system IP address and the MAC address of the primary unit. The standby unit uses the failover IP address and the secondary MAC address. Because the active unit always uses the same IP and MAC addresses (regardless of which physical unit it is), no ARP entries need to change or time out anywhere on the network.
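The address rule above can be modeled in a few lines. This is only an illustrative sketch, not LocalDirector code: the IP addresses are borrowed from the show failover examples in this appendix, the MAC addresses are made-up placeholders, and the names `addresses_for` and `switch_roles` are hypothetical.

```python
from collections import namedtuple

Addresses = namedtuple("Addresses", ["ip", "mac"])

# Placeholder values for illustration only; substitute your own addresses.
SYSTEM_IP = "192.168.89.1"
FAILOVER_IP = "192.168.89.2"
PRIMARY_MAC = "00:00:0c:00:00:01"    # hypothetical MAC of the primary unit
SECONDARY_MAC = "00:00:0c:00:00:02"  # hypothetical MAC of the secondary unit


def addresses_for(role):
    """Return the addresses a unit presents while in the given role."""
    if role == "active":
        return Addresses(SYSTEM_IP, PRIMARY_MAC)
    if role == "standby":
        return Addresses(FAILOVER_IP, SECONDARY_MAC)
    raise ValueError("unknown role: %r" % (role,))


def switch_roles(roles):
    """Swap the active and standby roles between the two physical units."""
    swap = {"active": "standby", "standby": "active"}
    return {unit: swap[role] for unit, role in roles.items()}
```

After a switch, the addresses follow the role rather than the hardware, which is why neighboring devices see no change in their ARP caches.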
Failover monitors failover communications, the power status of the other unit, and hello packets that are received on each interface. A failure of any of these parameters on the active unit will cause the standby unit to take active control. The standby unit assumes the active role using the system IP address and the primary MAC addresses. When a failure or switch occurs SYSLOG messages are generated indicating the cause of the failure.
To take a unit out of the "failed" state, cycle the power or use the failover reset command. When a failed primary unit is fixed and brought back online, it will not automatically resume as the active unit. This ensures that active control will not resume on a unit that could immediately enter a failed state again. However, if a failure is due to a lost signal on a network interface card, failover will "auto-recover" when the network is available again.
Use the failover active command to initiate a failover switch from the standby unit, or the no failover active command from the active unit to initiate a failover switch. You can use this feature to return a failed unit to service, or to force an active unit offline for maintenance. Because the standby unit does not keep state information on each connection, all active connections will be dropped and must be re-established by the clients.
Note Because configuration replication is automatic from the active unit to the standby unit, configuration changes should only be entered from the active unit.
With LocalDirector version 1.6.3, failover works in a switched environment. The section "Highly Redundant, Fault-Tolerant Configuration" in Chapter 3, "Configuring LocalDirector," provides details for configuring failover to work with switches.
Failover also works with the FDDI interface. Figure A-1 illustrates failover using FDDI interfaces. Note that Port-B is on the top of the FDDI card, and Port-A is on the bottom.
Figure A-1: Failover with FDDI Interfaces
Attach the end of the cable labeled "Primary" to the unit that will be the primary unit, as shown in Figure A-2. Attach the other end to the secondary unit. Connect interface 0 on both LocalDirector units to the hub or switch that goes to the outside network, and connect interface 1 on both LocalDirector units to the hub or switch that connects to your servers.
Use the failover ip address command to set the IP address for the standby unit.
Figure A-2: Attach Failover Cable
The show failover command indicates the status of the connection and which unit is active. See the "Show Failover Output" section for more information about this command. The show ip address command shows the current IP address of the unit. If the unit is active the system IP address is displayed, and if the unit is standby the failover IP address is displayed.
If a failure is due to a condition other than a loss of power on the other unit, failover begins a series of tests to determine which unit has failed. This series of tests begins when hello messages are not heard for two consecutive 15-second intervals. Hello messages are sent over both network interfaces and the serial cable.
The purpose of these tests is to generate network traffic in order to determine which (if either) unit is failed. At the start of each test, each unit clears its received packet count for its interfaces. At the conclusion of each test, each unit looks to see if it has received any traffic. If it has, the interface is considered operational. If one unit receives traffic for a test and the other unit does not, the unit that received no traffic is considered failed. If neither unit has received traffic, they go to the next test.
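The decision made at the end of each test can be summarized as a small function. This is a sketch of the rule described above, not the unit's actual implementation; the function name and the returned strings are hypothetical.

```python
def evaluate_test(this_received, mate_received):
    """Compare the received-packet counts of the two units after one test.

    Each count is the number of packets the unit received since it
    cleared its counter at the start of the test.
    """
    this_ok = this_received > 0
    mate_ok = mate_received > 0
    if this_ok and mate_ok:
        return "both operational"
    if this_ok:
        return "mate failed"        # only this unit saw traffic
    if mate_ok:
        return "this unit failed"   # only the other unit saw traffic
    return "next test"              # neither saw traffic, so keep testing
```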
Note If the failover IP address has not been set, failover will not work and the Network Activity, ARP, and Broadcast ping tests are not performed.
- This is a test of the NIC itself. If an interface card is not plugged in to an operational network, it is considered failed (for example, the hub or switch has failed, has a failed port, or a cable is unplugged).
- This is a received network activity test. The unit will count all received packets for up to 5 seconds. If any packets are received at any time during this interval the interface is considered operational and testing stops. If no traffic is received, the ARP test begins.
- The ARP test consists of reading the unit's ARP cache for the 10 most recently acquired entries. One at a time, the unit sends ARP requests to these machines in an attempt to stimulate network traffic. After each request, the unit counts all received traffic for up to 5 seconds. If traffic is received, the interface is considered operational. If no traffic is received, an ARP request is sent to the next machine. If no traffic has been received by the end of the list, the test moves to the ping test.
- The ping test consists of sending out a broadcast ping request. The unit then counts all received packets for up to 5 seconds. If any packets are received at any time during this interval the interface is considered operational and testing stops. If no traffic is received the testing starts over again with the ARP test.
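The order of the four tests above can be sketched as follows. This is a simplified model for one interface, under several assumptions: the network operations are passed in as callables so the sequence can be simulated, the function and parameter names are hypothetical, and `max_rounds` is an invented safety limit (the text says only that testing starts over with the ARP test).

```python
LISTEN_SECONDS = 5    # each listening window lasts up to 5 seconds
ARP_CANDIDATES = 10   # the 10 most recently acquired ARP cache entries


def run_interface_tests(link_up, traffic_seen, arp_cache,
                        send_arp, send_broadcast_ping, max_rounds=3):
    """Return "failed", "operational", or "undetermined" for one interface.

    link_up()              -> bool, True if the NIC has link
    traffic_seen(seconds)  -> bool, True if any packet arrives in the window
    arp_cache              -> hosts, most recently acquired first
    """
    # NIC test: no link to an operational network means the interface failed.
    if not link_up():
        return "failed"
    # Network activity test: any received packet proves the interface works.
    if traffic_seen(LISTEN_SECONDS):
        return "operational"
    for _ in range(max_rounds):
        # ARP test: prod recently seen hosts one at a time.
        for host in list(arp_cache)[:ARP_CANDIDATES]:
            send_arp(host)
            if traffic_seen(LISTEN_SECONDS):
                return "operational"
        # Ping test: one broadcast ping, then listen again.
        send_broadcast_ping()
        if traffic_seen(LISTEN_SECONDS):
            return "operational"
        # No traffic at all: start over with the ARP test.
    return "undetermined"
```

Passing the network operations in as parameters keeps the sketch free of real socket code and makes the control flow easy to check with canned responses.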
In the messages that follow, P|S can be either Primary or Secondary depending on which LocalDirector is sending the message. Failover messages always have a SYSLOG priority level of 2, which indicates critical condition. All failover SYSLOG messages are also sent as SNMP SYSLOG traps.
To receive SNMP SYSLOG traps (SNMP failover traps), you must configure the SNMP agent to send SNMP traps to SNMP management stations, define a SYSLOG host, and compile the Cisco SYSLOG MIB into your SNMP management station. See the snmp-server and syslog command descriptions in Chapter 4, "Command Reference" for more information.
The SYSLOG messages sent to record failover events are:
- System okay messages:
- "P|S: Cable OK."
- "P|S: Disabling failover." The no failover command was entered.
- "P|S: Enabling Failover." Either a LocalDirector is booting that has the failover command in its configuration file or the failover command was just entered in the current configuration.
- "P|S: Mate ifc number OK." The interface (ifc) is now working correctly after being brought back online after a failure. The number corresponds to the physical interface number labeled on the back panel.
- "P|S: Monitoring on interface <number> normal." Monitoring on this interface has started, and all operations of failover are normal.
- Cabling problem messages:
- "P|S: Bad cable." The cable is connected on both units, but is not a Cisco failover cable or has developed a wiring problem.
- "P|S: Cable not connected my side." The cable on the current LocalDirector is not connected.
- "P|S: Cable not connected other side." The cable on the current unit is connected, but the connector on the other unit is disconnected.
- "P|S: Error reading cable status." The cable state cannot be determined. Ensure that you are using a Cisco failover cable and all connectors are securely attached.
- Failure in process messages:
- "P|S: Monitoring on interface <number> waiting." The interface has not yet heard two hello packets from the other unit, so failover monitoring on the interface has not started.
- "P|S: Lost Failover communications with mate on interface <number>." This means that 2 consecutive hello packets have not been heard on this interface. This will result in testing of that interface and the failure of one of the units.
- "P|S: Testing Interface <number>." The unit has detected a problem on that interface and is performing testing to determine if the problem is with itself or with the other unit. This will result in the failure of one of the units.
- "P|S: Testing on interface <number> <Passed | Failed>." The results of the testing.
- "P|S: Link status Down on interface <number>." The interface has been forced down, or is not plugged in to an operational port. This will result in the failure of this unit.
- "P|S: No response from mate." The other LocalDirector has not responded in the last 30 seconds over the failover cable.
- "P|S: Power failure other side." The other unit has lost power.
- "P|S: Mate ifc number failed." The interface (ifc) for the other unit failed.
- "P|S: Mate says I'm failed." This unit has been determined failed by the other unit.
- "P|S: Mate reporting failure." The other unit has failed itself.
- Status messages:
- "P|S: Switching to FAILED." The unit has entered a failed state.
- "P|S: Switching to ACTIVE." The unit has brought the network back online and is receiving connections. This message also occurs if you force a unit active with the failover active command, or force the other unit inactive with the no failover active command.
- "P|S: Switching to STANDBY." The active unit has switched to the standby mode. This could be due to a failure, or be a result of entering no failover active on the active unit or failover active on the standby unit.
- "P|S: Switching to OK." The unit has been cleared of any failures with the failover reset command and is restarting failover.
- Configuration replication messages:
- "P|S: Begin Configuration Replication: Receiving from mate." This is the standby unit and it is being configured by the active unit.
- "P|S: Begin Configuration Replication: Sending to mate." This is the active unit and it is configuring the standby unit.
- "P|S: End Configuration Replication." Configuration replication is complete.
- "WARNING Configuration replication is NOT performed from Standby unit to Active unit." The configuration you are performing is not being replicated on the active unit.
- "Configuration replication to Standby unit FAILED for command." An error occurred during configuration replication.
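When failover messages are collected on a SYSLOG host, it can help to sort them into the five groups listed above. The following is a rough sketch, assuming the message text arrives exactly as quoted in this list and that interface numbers appear as a single token; the `classify` function and its pattern table are hypothetical and are not part of LocalDirector.

```python
import re

# (pattern, category) pairs built from the message texts quoted above.
_PATTERNS = [
    (r"Cable OK|Disabling failover|Enabling Failover"
     r"|Mate ifc \S+ OK|Monitoring on interface \S+ normal",
     "system okay"),
    (r"Bad cable|Cable not connected|Error reading cable status",
     "cabling problem"),
    (r"Monitoring on interface \S+ waiting|Lost Failover communications"
     r"|Testing Interface|Testing on interface|Link status Down"
     r"|No response from mate|Power failure other side"
     r"|Mate ifc \S+ failed|Mate says I'm failed|Mate reporting failure",
     "failure in process"),
    (r"Switching to (FAILED|ACTIVE|STANDBY|OK)", "status"),
    (r"Configuration [Rr]eplication", "configuration replication"),
]


def classify(message):
    """Return the category of a failover SYSLOG message, or None."""
    for pattern, category in _PATTERNS:
        if re.search(pattern, message):
            return category
    return None
```

For example, `classify("Secondary: Switching to ACTIVE.")` returns `"status"`, and a message that matches none of the patterns returns `None`.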
The following is the normal output of the show failover command. Note that the IP address that each unit is using is displayed.
ld-prim(config)# show failover
Failover On
Cable status: Normal
This host: Primary - Active
Active time: 6885 (sec)
Interface 0 (192.168.89.1): Normal
Interface 1 (192.168.89.1): Normal
Other host: Secondary - Standby
Active time: 0 (sec)
Interface 0 (192.168.89.2): Normal
Interface 1 (192.168.89.2): Normal
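If this output is captured by a monitoring script, it can be turned into a structured record. The parser below is a minimal sketch that assumes the line layout shown in the examples in this appendix; `parse_show_failover` is a hypothetical helper, and output from other software versions may differ.

```python
import re

HOST_RE = re.compile(r"^(This|Other) host:\s*(\S+)\s*-\s*(.+)$")
TIME_RE = re.compile(r"^Active time:\s*(\d+)\s*\(sec\)$")
IFC_RE = re.compile(r"^Interface\s+(\d+)\s+\(([\d.]+)\):\s*(.+)$")


def parse_show_failover(text):
    """Parse captured show failover text into a nested dict."""
    report = {"failover": None, "cable_status": None, "hosts": {}}
    host = None  # the host block currently being filled in
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Failover "):
            report["failover"] = line[len("Failover "):]
            continue
        if line.startswith("Cable status:"):
            report["cable_status"] = line.split(":", 1)[1].strip()
            continue
        m = HOST_RE.match(line)
        if m:
            host = {"unit": m.group(2), "state": m.group(3),
                    "active_time": None, "interfaces": {}}
            report["hosts"][m.group(1).lower()] = host
            continue
        if host is None:
            continue  # skip the prompt line and anything before a host block
        m = TIME_RE.match(line)
        if m:
            host["active_time"] = int(m.group(1))
            continue
        m = IFC_RE.match(line)
        if m:
            # interface number -> (IP address in use, status text)
            host["interfaces"][int(m.group(1))] = (m.group(2), m.group(3))
    return report
```

The status text is kept verbatim, so values such as "Normal (Waiting)" or "Standby (Failed)" are preserved for the caller to interpret.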
Failover will not start monitoring the network interfaces until it has heard the second hello packet from the other unit on that interface. This should happen within 30 to 60 seconds.
If the unit is attached to a switch running spanning tree, this will take twice the forward delay time configured in the switch (typically 15 seconds) plus 30 seconds. This is because at bootup (and immediately following a failover event) the network switch will detect a temporary bridge loop. When this bridge loop is detected, the switch stops forwarding packets for the duration of the forward delay time. It then enters "listen" mode for an additional forward delay time, during which the switch listens for bridge loops but still does not forward traffic (and thus does not forward failover hello packets).
After twice the forward delay time (30 seconds) traffic should resume. The LocalDirector will remain in "waiting" mode until it hears two hello packets (1 every 15 seconds for a total of 30 seconds). During this time the LocalDirector is passing traffic, and it will not fail the unit based on not hearing the hello packets. All other failover monitoring is still occurring (power, interface, and failover cable hello).
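The timing described above is simple arithmetic, shown here as a short sketch. The function name is hypothetical; the 15-second hello interval and the two-hello requirement come from the text, and the forward delay is whatever is configured on your switch.

```python
HELLO_INTERVAL = 15   # seconds between failover hello packets
HELLOS_REQUIRED = 2   # hellos that must be heard before monitoring starts


def monitoring_start_delay(forward_delay=15):
    """Approximate seconds before interface monitoring starts behind a
    spanning-tree switch, given the switch's forward delay in seconds."""
    blocked = 2 * forward_delay                    # two forward-delay periods
    hello_wait = HELLOS_REQUIRED * HELLO_INTERVAL  # two hellos, 15 s apart
    return blocked + hello_wait
```

With the typical 15-second forward delay this gives 30 + 30 = 60 seconds.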
Note If a failover IP address has not been entered, show failover will display 0.0.0.0 for the IP address, and monitoring of the interfaces will remain in the "waiting" state. A failover IP address must be set in order for failover to work.
The following example shows the output if failover has not started monitoring the network interfaces:
ld-prim(config)# show failover
Failover On
Cable status: Normal
This host: Primary - Active
Active time: 6930 (sec)
Interface 0 (192.168.89.1): Normal (Waiting)
Interface 1 (192.168.89.1): Normal (Waiting)
Other host: Secondary - Standby
Active time: 15 (sec)
Interface 0 (192.168.89.2): Normal (Waiting)
Interface 1 (192.168.89.2): Normal (Waiting)
Note Waiting indicates that monitoring of the other unit's network interfaces has not yet started.
The following example shows that a failure has been detected. Note that interface 1 on the primary unit is the source of the failure. The units are back in waiting mode because of the failure. The failed unit has removed itself from the network (interfaces are down) and it is no longer sending hello packets on the network. The active unit will remain in the waiting state until the failed unit is replaced and failover communications start again.
ld-prim(config)# show failover
Failover On
Cable status: Normal
This host: Primary - Standby (Failed)
Active time: 7140 (sec)
Interface 0 (192.168.89.2): Normal (Waiting)
Interface 1 (192.168.89.2): Failed (Waiting)
Other host: Secondary - Active
Active time: 30 (sec)
Interface 0 (192.168.89.1): Normal (Waiting)
Interface 1 (192.168.89.1): Normal (Waiting)
This section contains some frequently asked questions about the failover feature.
- Can the failover feature work without using the failover cable?
- No, failover will not work without the cable. If you run without the failover cable you are essentially running two separate LocalDirectors. This will result in a bridge loop and flood the network. The failover cable is an essential part of failover.
- Can modems be used to extend the length of the failover cable?
- No, the cable cannot be extended using modems or other RS-232 line extenders. Part of what the failover cable does is indicate the presence and power status of the other unit. When you place line extenders in this path you are relaying the status of the line extender rather than of the other LocalDirector unit.
- What happens when failover is triggered?
- A switch can be initiated by either unit. When a switch takes place each unit changes state. The newly active unit assumes the IP address and MAC address of the previously active unit and begins accepting traffic for it. The new standby unit assumes the IP address and MAC address of the unit that was previously the standby unit. The two units do not share connection states. Any active connections will be dropped when a failover switch occurs. The clients must re-establish the connections through the newly active unit.
- How is startup initialization accomplished between two units?
- When a unit boots up it defaults to Failover Off and Secondary, unless the failover cable is present or failover has been saved in the configuration. The configuration from the active unit is also copied to the standby unit. If the cable is not present, the unit automatically becomes the active unit. If the cable is present, the unit that has the primary end of the failover cable plugged into it becomes the primary unit by default.
- How can both units be configured the same without manually entering the configuration twice?
- The configuration is automatically replicated, and can be forced with the write standby command.
- What happens if a primary unit has a power failure?
- When the primary active LocalDirector experiences a power failure, the standby LocalDirector comes up in active mode. If the Primary unit is powered up again it will become the Standby unit.
- What happens if an interface card is disconnected?
- When the primary active LocalDirector is failed by disconnecting the e0 (or e1) interface cable, the standby LocalDirector takes over in active mode. When the interface is plugged back in, the unit recovers automatically.
- Does failover work in a switched environment?
- Yes, if you are running LocalDirector version 1.6.3 on both units.
- What constitutes a failure?
- Fault detection is based on the following:
- Failover hello packets are received on each interface. If hello packets are not heard for two consecutive 15-second intervals, the interface will be tested to determine which unit is at fault.
- Cable errors. The cable is wired so that each unit can distinguish between a power failure in the other unit and an unplugged cable. If the standby unit detects that the active unit is powered off (or resets), it will take active control. If the cable is unplugged, a SYSLOG message is generated but no switching occurs. An exception to this is at boot-up, at which point an unplugged cable will force the unit active. If both units are powered up without the failover cable installed, they will both become active, creating a duplicate IP address conflict on your network. The failover cable must be installed for failover to work correctly.
- Failover communication. The two units share information every 15 seconds. If the standby unit doesn't hear from the active unit in two communication attempts (and the cable status is OK) the standby unit will take over as active.
- How long does it take to detect a failure?
- Network errors are detected within 30 seconds (two consecutive 15-second intervals).
- Power failure (and cable failure) is detected within 15 seconds.
- Failover communications errors are detected within 30 seconds (two consecutive 15-second intervals).
- What maintenance is required?
- SYSLOG messages will be generated when any errors or switches occur. Evaluate the failed unit and fix or replace it.
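The "two consecutive 15-second intervals" rule that appears in several of the answers above can be illustrated with a small counter. This is a sketch of the rule only, not the unit's implementation; the `HelloMonitor` class and its method names are hypothetical.

```python
class HelloMonitor:
    """Track missed hello intervals for one interface or for the cable."""

    def __init__(self, missed_limit=2):
        self.missed_limit = missed_limit  # consecutive misses that trigger action
        self.missed = 0

    def interval_elapsed(self, hello_heard):
        """Call once per 15-second interval; return True when the limit is hit."""
        if hello_heard:
            self.missed = 0       # any hello resets the count
        else:
            self.missed += 1
        return self.missed >= self.missed_limit
```

Because a single heard hello resets the count, one missed interval on its own never triggers testing; two misses in a row (about 30 seconds) do.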