Failover and failback, Automatic failover – HP MSA 2040 SAN Storage User Manual

Using SRM for disaster recovery

makes changes at both sites that require significant time and effort to reverse. Because of this, the privilege to test a recovery plan and the privilege to run a recovery plan must be separately assigned.

When an SRM test failover to the recovery site is requested, the SRA performs the following steps:

Select the replicated volumes.

Identify the latest complete Remote Copy snapshot.

Delete any temporary writable space on that snapshot to ensure an unedited snapshot is presented to ESX hosts.

Configure authentication for ESX hosts to directly mount snapshots.

When testing stops, to conserve space on the SAN, delete the temporary writable space that was used during the test.
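The test failover steps above can be sketched as a small in-memory model. This is only an illustration, not the SRA's implementation: the `Snapshot` class, its fields, and both functions are hypothetical names, and the host mapping is a stand-in for whatever authentication the array actually configures.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Snapshot:
    """Hypothetical model of one Remote Copy snapshot on the recovery array."""
    name: str
    taken_at: int                     # assumed ordering key; larger means newer
    complete: bool                    # True once replication of this snapshot finished
    temp_writes: List[str] = field(default_factory=list)   # temporary writable space
    mapped_hosts: List[str] = field(default_factory=list)  # hosts allowed to mount


def latest_complete(snapshots: List[Snapshot]) -> Optional[Snapshot]:
    """Return the newest snapshot that finished replicating, or None."""
    done = [s for s in snapshots if s.complete]
    return max(done, key=lambda s: s.taken_at, default=None)


def start_test_failover(snapshots: List[Snapshot], hosts: List[str]) -> Snapshot:
    """Pick the latest complete snapshot, clean it, and expose it to the hosts."""
    snap = latest_complete(snapshots)
    if snap is None:
        raise RuntimeError("no complete Remote Copy snapshot available")
    snap.temp_writes.clear()          # present an unedited snapshot
    snap.mapped_hosts = list(hosts)   # stand-in for configuring host authentication
    return snap


def stop_test_failover(snap: Snapshot) -> None:
    """Discard writes made during the test to reclaim space."""
    snap.temp_writes.clear()
```

In this model an incomplete snapshot is never selected, even if it is the newest one, which mirrors the "latest complete" wording of the steps.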

Failover and failback

Failback is the process of returning the replication environment to the state it had at the protected site before failover. With SRM 5.0 or later, failback is an automated process that occurs after recovery. This makes failback of the protected virtual machines relatively simple in the case of a planned migration. If the entire SRM environment remains intact after recovery, failback is done by running the “reprotect” recovery steps in SRM and then running the recovery plan again, which moves the virtual machines configured within their protection groups back to the original protected SRM site.
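The planned failback sequence described above (reprotect, then run the recovery plan again) can be illustrated with a toy state model. This is a sketch under stated assumptions: `ReplicationPair`, `run_recovery_plan`, and `reprotect` are hypothetical names for illustration and do not correspond to any SRM or SRA API.

```python
from dataclasses import dataclass


@dataclass
class ReplicationPair:
    """Hypothetical two-site model: who replicates to whom, and where VMs run."""
    protected: str   # site that is currently the replication source
    recovery: str    # site that is currently the replication target
    vms_at: str      # site where the virtual machines are running


def run_recovery_plan(pair: ReplicationPair) -> None:
    """Move the virtual machines to the current recovery site."""
    pair.vms_at = pair.recovery


def reprotect(pair: ReplicationPair) -> None:
    """Reverse the replication direction after a completed recovery."""
    if pair.vms_at != pair.recovery:
        raise RuntimeError("reprotect expects the VMs to be running at the recovery site")
    pair.protected, pair.recovery = pair.recovery, pair.protected


def failback(pair: ReplicationPair) -> None:
    """Reprotect, then run the recovery plan again to return the VMs."""
    reprotect(pair)
    run_recovery_plan(pair)
```

Starting from a pair protected at one site, `run_recovery_plan` followed by `failback` leaves the virtual machines running at the original site again.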
In disaster scenarios, failback steps vary with the degree of failure at the protected site. For example, the failover could have been due to an array failure or the loss of the entire data center. Manual configuration of failback is important because the protected site may have a different hardware or SAN configuration after a disaster. After failback is configured in SRM, it can be managed and automated like any planned SRM failover. The recovery steps can differ based on the conditions of the last failover that occurred. If failback follows an unplanned failover, a full data re-mirroring between the two sites may be required. This step usually accounts for most of the time in a failback scenario.

All recovery plans in SRM include an initial attempt to synchronize data between the protected and recovery sites, even during a disaster recovery scenario.

During disaster recovery, an initial attempt is made to shut down the protection group’s virtual machines and establish a final synchronization between the sites. This is designed to ensure that virtual machines are static and quiescent before the recovery plan runs, in order to minimize data loss wherever possible. If the protected site is no longer available, the recovery plan continues to execute and runs to completion even if errors are encountered.

This behavior minimizes the possibility of data loss during a disaster recovery, balancing the requirement for virtual machine consistency with the ability to achieve aggressive recovery-point objectives.
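The "attempt, then continue despite errors" behavior described above can be sketched as a runner that records failures instead of aborting. This is a minimal illustration, not SRM's logic: `build_steps`, `run_plan`, and the step names are hypothetical, and an unreachable protected site is simulated with a `ConnectionError`.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def build_steps(protected_site_up: bool) -> List[Step]:
    """Build an assumed step list; the first two need the protected site."""
    def needs_protected_site() -> None:
        if not protected_site_up:
            raise ConnectionError("protected site unreachable")

    def recovery_side_only() -> None:
        return None  # placeholder for work done entirely at the recovery site

    return [
        ("shut down protected VMs", needs_protected_site),
        ("final synchronization", needs_protected_site),
        ("recover VMs at recovery site", recovery_side_only),
    ]


def run_plan(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run every step in order, logging errors rather than stopping."""
    log = []
    for name, action in steps:
        try:
            action()
            log.append((name, "ok"))
        except Exception as exc:
            log.append((name, f"error: {exc}"))
    return log
```

With the protected site down, the first two entries in the log are errors and the recovery-side step still runs; with the site up, the shutdown and final synchronization succeed before recovery.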

Automatic failover

SRM automates recovery plans to ensure accurate and consistent execution. Through vCenter Server, you have full visibility into and control of the process, including the status of each step, progress indicators, and detailed descriptions of any errors that occur.

In the event of a disaster, when an actual SRM failover is requested, the SRA performs the following steps:

Select the replicated volumes.

Identify and remove any incomplete remote copies that are in progress and present the most recently completed Remote Copy as a primary volume.

Convert remote volumes into primary volumes and configure authentication for ESXi hosts to mount them.

If an actual failover does not run to completion for any reason, the failover can be requested repeatedly until the run completes. If, for example, only one volume failed to restore because a normal snapshot was present, that snapshot could be manually deleted and the failover requested again.
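The reason a failover can safely be requested more than once is that each request only has to act on volumes that are not yet converted. The following sketch models that idea; it is not the SRA's code. `RemoteVolume`, `failover_volume`, and `request_failover` are hypothetical names, and treating a normal snapshot as a blocker is taken from the example in the text above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RemoteVolume:
    """Hypothetical model of one replicated volume at the recovery site."""
    name: str
    copies: List[Tuple[int, bool]] = field(default_factory=list)  # (sequence, complete)
    normal_snapshots: List[str] = field(default_factory=list)     # assumed blockers
    role: str = "secondary"
    presented_copy: Optional[int] = None


def failover_volume(vol: RemoteVolume) -> bool:
    """Try to convert one volume to primary; return True on success."""
    if vol.role == "primary":
        return True                                   # already done by an earlier request
    vol.copies = [c for c in vol.copies if c[1]]      # remove incomplete remote copies
    if vol.normal_snapshots or not vol.copies:
        return False                                  # leave as secondary for a later retry
    vol.presented_copy = max(seq for seq, _ in vol.copies)
    vol.role = "primary"
    return True


def request_failover(volumes: List[RemoteVolume]) -> List[str]:
    """Attempt failover on every volume; return the names that still failed."""
    return [v.name for v in volumes if not failover_volume(v)]
```

A first request converts every volume it can and reports the rest; after the blocking snapshot is deleted by hand, a second request finishes the remaining volume without touching those already converted.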