Fault detection and recovery, Fault recovery – Allied Telesis AlliedWare Plus Operating System Version 5.4.4C (x310-26FT,x310-26FP,x310-50FT,x310-50FP) User Manual

Page 1486

EPSR Introduction and Configuration

Software Reference for x310 Series Switches

57.4

AlliedWare Plus Operating System - Version 5.4.4C

C613-50046-01 REV A

Fault Detection and Recovery

EPSR uses the following methods to detect outages in a node or a link in the ring:

■ Master node polling fault detection

■ Transit node unsolicited fault detection

Master node polling

The master node issues healthcheck messages from its primary port as a means of
checking the condition of the EPSR network ring. These messages are sent at regular
intervals, controlled by the hellotime parameter of the epsr command on page 58.3. A
failover timer is set each time a healthcheck message leaves the master node’s primary
port. The timeout value for this timer is set by the failovertime parameter of the epsr
command on page 58.3. If the failover timer expires before the transmitted healthcheck
message is received by the master node’s secondary port, the master node assumes that
there is a fault in the ring, and implements its fault recovery procedures. Because this
method relies on a timer expiry, its operation is inherently slower than the transit node
unsolicited detection method described next.
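The polling logic described above can be pictured with a small Python sketch. It is only an illustrative model, not switch code: the function names are hypothetical, times are plain numbers in seconds, and the "complete" label for a healthy ring is an assumption (this section names only the "failed" state).

```python
from typing import List, Optional


def healthcheck_send_times(hellotime: float, count: int) -> List[float]:
    """Times (seconds) at which the master sends its first `count` healthchecks."""
    return [i * hellotime for i in range(count)]


def ring_state_after_poll(sent_at: float,
                          received_at: Optional[float],
                          failovertime: float) -> str:
    """Outcome of a single healthcheck round.

    sent_at      -- when the healthcheck left the primary port (timer starts)
    received_at  -- when it arrived at the secondary port, or None if it never did
    failovertime -- timeout of the failover timer
    """
    deadline = sent_at + failovertime
    if received_at is not None and received_at <= deadline:
        return "complete"  # assumed name for the healthy ring state
    # Timer expired first: the master assumes a fault and starts recovery.
    return "failed"
```

For example, with a failovertime of 2, a healthcheck sent at time 0 that never reaches the secondary port leaves the ring in the "failed" state, which is why detection by polling can take up to the full failover interval.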

Transit node unsolicited

Transit node unsolicited fault detection relies on transit nodes detecting faults at their
interfaces, and immediately notifying master nodes about the break. When a transit node
detects a connectivity loss, it sends a “links down” message over its good link. Because a
link spans two nodes, both nodes send the “links down” message back to the master node.
These nodes also change their state from “links up” to “links down,” and change the state
of the port connecting to the broken link, from “forwarding” to “blocking.”
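As a rough illustration of the transit node behavior just described, the following Python sketch models one transit node reacting to a lost link. The class, the port names, and the list used to record sent messages are hypothetical; only the state names and the "links down" message come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TransitNode:
    """Toy model of a transit node with exactly two ring ports."""
    name: str
    ports: Dict[str, str] = field(
        default_factory=lambda: {"port1": "forwarding", "port2": "forwarding"})
    state: str = "links up"
    sent: List[Tuple[str, str]] = field(default_factory=list)  # (port, message)

    def link_lost(self, port: str) -> None:
        # Notify the master over the ring port that still has connectivity.
        good_port = next(p for p in self.ports if p != port)
        self.sent.append((good_port, "links down"))
        # Record the new node state and block the port on the broken link.
        self.state = "links down"
        self.ports[port] = "blocking"


# A broken link between two neighbors is reported by both of them.
a, b = TransitNode("A"), TransitNode("B")
a.link_lost("port2")
b.link_lost("port1")
```

Each node sends its notification away from the break, so the master receives a "links down" message from both sides of the failed link.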

Fault Recovery

When the master node detects an outage in the ring by using its detection methods, it
does the following:

■ Declares the ring to be in a “failed” state.

■ Unblocks its secondary port to enable the data VLAN traffic to pass between its
primary and secondary ports.

■ Flushes its own forwarding database (FDB) for (only) the two ring ports.

■ Sends an EPSR Ring-Down-Flush-FDB control message to all the transit nodes, via
both its primary and secondary ports.

Transit nodes respond to the Ring-Down-Flush-FDB message by flushing their forwarding
databases for each of their ring ports. As the data starts to flow in the ring’s new
configuration, each of the nodes (master and transit) re-learns its Layer 2 addresses.
During this period, the master node continues to send healthcheck messages over the
control VLAN. This situation continues until the faulty link or node is repaired. For a
multi-domain ring, this process occurs separately for each domain within the ring.
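The recovery sequence can be summarized in a short Python sketch. It is a simplified model under stated assumptions: the classes, port names, and the dictionary standing in for the FDB are hypothetical, message delivery is modeled as a direct method call, and the "complete" and "forwarding" labels for the pre-fault ring and the unblocked port are assumed rather than taken from this section.

```python
RING_DOWN_FLUSH_FDB = "Ring-Down-Flush-FDB"


class Node:
    """Toy ring node; the FDB is modeled as a MAC address -> port mapping."""

    def __init__(self, name, ring_ports, fdb=None):
        self.name = name
        self.ring_ports = tuple(ring_ports)
        self.fdb = dict(fdb or {})

    def flush_ring_ports(self):
        # Drop only the entries learned on ring ports; other entries are kept.
        self.fdb = {mac: port for mac, port in self.fdb.items()
                    if port not in self.ring_ports}

    def receive(self, message):
        if message == RING_DOWN_FLUSH_FDB:
            self.flush_ring_ports()


class MasterNode(Node):
    def __init__(self, name, primary, secondary, fdb=None):
        super().__init__(name, (primary, secondary), fdb)
        self.primary, self.secondary = primary, secondary
        self.ring_state = "complete"        # assumed name for the healthy state
        self.secondary_state = "blocking"

    def recover(self, transit_nodes):
        self.ring_state = "failed"            # declare the ring failed
        self.secondary_state = "forwarding"   # unblock the secondary port
        self.flush_ring_ports()               # flush own FDB for the ring ports
        for node in transit_nodes:            # tell the transit nodes to flush
            node.receive(RING_DOWN_FLUSH_FDB)
        return [self.primary, self.secondary]  # ports the message is sent from


master = MasterNode("M", "port1", "port2",
                    fdb={"00:00:00:00:00:01": "port1", "00:00:00:00:00:02": "port5"})
transit = Node("T", ("port1", "port2"), fdb={"00:00:00:00:00:03": "port2"})
sent_from = master.recover([transit])
```

After `recover` runs, only the entry learned on the non-ring port (`port5` in this made-up example) remains in the master's table, and the transit node's ring-port entries are gone until traffic in the new topology lets the nodes re-learn them.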

The following figure shows the flow of control frames under fault conditions.

Note

When VCStack is used with EPSR, the EPSR failovertime must be set to at least 5
seconds to avoid any broadcast storms during failover. Broadcast storms may
occur if the switch cannot fail over quickly enough before the EPSR failovertime
expires. See the epsr command for further information about the EPSR failovertime.
See the reboot rolling command for further information about VCStack failover.
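A planning script could encode the constraint from this note as a simple check. The sketch below is hypothetical Python, not part of AlliedWare Plus; the function and constant names are invented for illustration, and it checks only the 5-second minimum stated in the note.

```python
VCSTACK_MIN_FAILOVERTIME = 5  # seconds, the minimum stated in the note


def failovertime_ok(failovertime: int, vcstack: bool) -> bool:
    """Return True if the planned EPSR failovertime satisfies the VCStack note.

    Without VCStack this check imposes no limit; any other limits of the
    epsr command are outside the scope of this sketch.
    """
    if vcstack:
        return failovertime >= VCSTACK_MIN_FAILOVERTIME
    return True
```

For instance, `failovertime_ok(2, vcstack=True)` returns False, flagging a value that risks a broadcast storm during VCStack failover.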