Wednesday, March 7, 2012

HIGH AVAILABILITY (HA) technologies - Cisco NSF

Cisco NSF and Timer Manipulation for Fast Convergence


Cisco Nonstop Forwarding (NSF) with Stateful Switchover (SSO) is a Cisco innovation for systems with dual route processors. Cisco NSF with SSO allows a router whose active route processor has suffered a hardware or software failure to maintain data link layer connections and to continue forwarding packets during the switchover to the standby route processor. This forwarding can continue despite the loss of routing protocol peering arrangements with other routers. Routing information is recovered dynamically in the background, while packet forwarding proceeds uninterrupted.
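
To make the split between the forwarding plane and the control plane concrete, here is a toy Python model of a dual-RP router. It is a sketch under stated assumptions, not Cisco code: the class and method names are hypothetical, and the "mark stale, then flush what was not relearned" step is a simplified assumption about how the background recovery reconciles the forwarding table.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RouteProcessorPair:
    """Toy model of a dual route processor router with NSF/SSO (hypothetical)."""

    # Forwarding table held by the line cards; it survives an RP switchover.
    fib: dict[str, str] = field(default_factory=dict)
    # Routing table built by the active RP's control plane; lost on switchover.
    rib: dict[str, str] = field(default_factory=dict)
    # Prefixes still forwarded but not yet confirmed by the new active RP.
    stale: set[str] = field(default_factory=set)

    def learn(self, prefix: str, next_hop: str) -> None:
        # A route learned (or relearned) from a peer refreshes both tables.
        self.rib[prefix] = next_hop
        self.fib[prefix] = next_hop
        self.stale.discard(prefix)

    def switchover(self) -> None:
        # The standby RP takes over with an empty routing table, but the line
        # cards keep forwarding with the existing FIB; mark every entry stale.
        self.rib = {}
        self.stale = set(self.fib)

    def forward(self, prefix: str) -> str | None:
        # Packet forwarding only consults the FIB, so it continues uninterrupted.
        return self.fib.get(prefix)

    def finish_resync(self) -> None:
        # Once routing has been recovered, drop entries that were never relearned.
        for prefix in self.stale:
            del self.fib[prefix]
        self.stale.clear()
```

For example, after `learn("10.0.0.0/8", "192.0.2.1")` and a `switchover()`, `forward("10.0.0.0/8")` still returns `"192.0.2.1"` even though `rib` is empty; only routes that peers fail to re-advertise disappear when `finish_resync()` runs.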

Initially, it appears that Cisco NSF and OSPF/IS-IS/EIGRP timer manipulation have complementary objectives. Each feature is dedicated to achieving the fastest possible convergence in the event of a failure on a router. However, more careful analysis reveals that these technologies also have conflicting goals. Cisco NSF attempts to maintain the flow of traffic through a router that has experienced a failure; conversely, OSPF/IS-IS/EIGRP timer manipulation tries to quickly redirect the flow of traffic away from a router that has experienced a failure towards an alternate path. While not mutually exclusive, the two technologies address different aspects of the same problem in disparate ways. It is therefore important to carefully consider the network design goals and decide which form of redundancy takes precedence.

The network designer has three alternatives:

1. Raise the IGP hold-timers to seven seconds to accommodate all failure scenarios. Setting the timer to this value accounts for the situation in which a route processor failure has to be detected via IPC keep-alive failure (3 seconds), plus the safe value for post-switchover behavior (4 seconds for the Cisco 10000 and 12000 Series Internet Routers).
2. Leave the IGP hold-timers at 4 seconds. This allows Cisco NSF with SSO to operate as expected in the majority of failure scenarios. In the exceptional cases, where the route processor must rely on IPC keep-alive to determine the need to switch over to the redundant route processor, the traffic will fail over to a redundant path on a different system. Remember, the keep-alive procedure is a "failsafe" mechanism, while the internal switchover signaling procedures are expected to cover most failures.
3. Lower the IPC keep-alive timer. This can be achieved with the command "redundancy/main-cpu/switchover timeout <milliseconds>". By default, this timer is set to 3 seconds, and it can be lowered with the preceding command. It should be strongly emphasized that there is an element of risk in lowering this timer. If the standby route processor does not hear from the active route processor within the timeout period, a route processor switchover will be initiated. Thus, if this timer is set to a very low value, there is the danger of false alarms, causing a route processor switchover when one is not required. In addition, setting this timer to a very low value increases CPU and IPC bandwidth usage.
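
The arithmetic behind these three alternatives can be sketched in a few lines of Python. This is a simplified model, assuming that a switchover completes after the detection time (zero for internal signaling, the keep-alive timeout otherwise) plus the 4-second post-switchover value quoted above, and that a peer keeps its adjacency, and therefore NSF keeps forwarding, as long as its IGP hold timer has not expired. The function names and the 1000 ms keep-alive value used for alternative 3 are hypothetical illustrations; the model also ignores timer jitter.

```python
# Values quoted in the text; they are platform-specific, not universal.
DEFAULT_KEEPALIVE_MS = 3000  # default IPC keep-alive (switchover) timeout
POST_SWITCHOVER_MS = 4000    # safe post-switchover value, Cisco 10000/12000


def switchover_time_ms(detected_by_keepalive: bool,
                       keepalive_ms: int = DEFAULT_KEEPALIVE_MS,
                       post_switchover_ms: int = POST_SWITCHOVER_MS) -> int:
    """Worst-case time until the standby RP is active and talking to IGP peers."""
    # Internal signaling triggers the switchover immediately; otherwise the
    # standby RP first has to wait for the keep-alive timeout to expire.
    detection_ms = keepalive_ms if detected_by_keepalive else 0
    return detection_ms + post_switchover_ms


def nsf_preserved(igp_hold_ms: int, detected_by_keepalive: bool,
                  keepalive_ms: int = DEFAULT_KEEPALIVE_MS) -> bool:
    """True if the peers' hold timer outlasts the switchover.

    False means the peers declare the adjacency down and reroute traffic to
    an alternate path instead of letting NSF forward through this router.
    """
    return igp_hold_ms >= switchover_time_ms(detected_by_keepalive, keepalive_ms)


if __name__ == "__main__":
    scenarios = [
        ("1: hold 7 s, default keep-alive", 7000, DEFAULT_KEEPALIVE_MS),
        ("2: hold 4 s, default keep-alive", 4000, DEFAULT_KEEPALIVE_MS),
        ("3: hold 5 s, keep-alive 1000 ms (hypothetical)", 5000, 1000),
    ]
    for name, hold_ms, keepalive_ms in scenarios:
        signaled = nsf_preserved(hold_ms, False, keepalive_ms)
        keepalive = nsf_preserved(hold_ms, True, keepalive_ms)
        print(f"{name}: signaling -> NSF={signaled}, keep-alive -> NSF={keepalive}")
```

Under these assumptions, alternative 1 preserves NSF in both detection cases (7 s >= 3 s + 4 s), alternative 2 preserves it only when internal signaling detects the failure, and alternative 3 shows how a shorter keep-alive timeout lowers the hold timer needed to cover the keep-alive case, at the cost of the false-switchover risk described above.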

From: Cisco NSF and Timer Manipulation for Fast Convergence
