Failover and failback, Automatic failover – HP MSA 2040 SAN Storage User Manual
Using SRM for disaster recovery
makes changes at both sites that require significant time and effort to reverse. Because of this, the privilege
to test a recovery plan and the privilege to run a recovery plan must be separately assigned.
When an SRM test failover to the recovery site is requested, the SRA performs the following steps:
1. Select the replicated volumes.
2. Identify the latest complete Remote Copy snapshot.
3. Delete any temporary writable space on that snapshot to ensure an unedited snapshot is presented to ESX hosts.
4. Configure authentication for ESX hosts to directly mount snapshots.
5. When testing stops, to conserve space on the SAN, delete the temporary writable space that was used during the test.
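The test failover steps above can be sketched as a small model. This is an illustration only, not the SRA's implementation: the `Snapshot` class, its fields, and both function names are hypothetical, and "latest" is assumed to mean the highest sequence number among complete snapshots.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical model of one Remote Copy snapshot on the recovery array."""
    name: str
    sequence: int                # assumption: higher means more recent
    complete: bool               # False while replication is still in progress
    writable_space_mb: int = 0   # temporary writable space left by earlier use
    authorized_hosts: set = field(default_factory=set)


def prepare_test_failover(snapshots, esx_hosts):
    """Steps 2-4: pick the latest complete snapshot, reset it, grant host access."""
    complete = [s for s in snapshots if s.complete]
    if not complete:
        raise RuntimeError("no complete Remote Copy snapshot available")
    latest = max(complete, key=lambda s: s.sequence)
    latest.writable_space_mb = 0                # step 3: present an unedited snapshot
    latest.authorized_hosts.update(esx_hosts)   # step 4: allow direct mounts
    return latest


def finish_test_failover(snapshot):
    """Step 5: release the writable space consumed during the test."""
    snapshot.writable_space_mb = 0
```

In this model an incomplete snapshot is never chosen, even when it is the most recent one, which mirrors the requirement in step 2 that the snapshot be complete.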
Failover and failback
Failback is the process of returning the replication environment to its original state at the protected site, as it was before failover. With SRM 5.0 or later, failback is an automated process that occurs after recovery, which makes failing back the protected virtual machines relatively simple in the case of a planned migration. If the entire SRM environment remains intact after recovery, failback is done by running the “reprotect” recovery steps in SRM and then running the recovery plan again, which moves the virtual machines configured within their protection groups back to the original protected SRM site.
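The order of operations in a failback (reprotect first, then run the recovery plan again) can be illustrated with a toy state model. All names here (`ReplicationPair`, `reprotect`, `run_recovery_plan`, `failback`) are hypothetical and do not correspond to SRM or SRA APIs. The sketch also assumes that reprotect swaps the protected and recovery roles.

```python
from dataclasses import dataclass


@dataclass
class ReplicationPair:
    """Hypothetical model: which site is protected and where the VMs run."""
    protected_site: str
    recovery_site: str
    vms_running_at: str


def run_recovery_plan(pair):
    """Move the protected virtual machines to the current recovery site."""
    pair.vms_running_at = pair.recovery_site


def reprotect(pair):
    """Assumed behavior: swap roles so the site now running the VMs is protected."""
    if pair.vms_running_at != pair.recovery_site:
        raise RuntimeError("reprotect is only meaningful after a recovery")
    pair.protected_site, pair.recovery_site = pair.recovery_site, pair.protected_site


def failback(pair):
    """Reprotect, then run the recovery plan again to return the VMs."""
    reprotect(pair)
    run_recovery_plan(pair)
```

Starting from a pair protected at `site-a`, calling `run_recovery_plan` and then `failback` leaves the virtual machines running at `site-a` again. In this toy model the site roles remain swapped afterwards.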
In disaster scenarios, the failback steps vary with the degree of failure at the protected site. For example, the failover could have been caused by an array failure or by the loss of the entire data center. Manual configuration of failback is important because the protected site may have a different hardware or SAN configuration after a disaster. After failback is configured, SRM can manage and automate it like any planned SRM failover. The recovery steps can differ depending on the conditions of the last failover. If failback follows an unplanned failover, a full data re-mirroring between the two sites may be required. This step usually takes most of the time in a failback scenario.
All recovery plans in SRM include an initial attempt to synchronize data between the protected and recovery sites, even during a disaster recovery scenario.
During the disaster recovery, an initial attempt will be made to shut down the protection group’s virtual
machines and establish a final synchronization between the sites. This is designed to ensure that virtual
machines are static and quiescent before running the recovery plan, in order to minimize data loss
wherever possible. If the protected site is no longer available, the recovery plan will continue to execute
and will run to completion even if errors are encountered.
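The behavior described above amounts to a best-effort shutdown and synchronization followed by a plan that keeps running past errors. A minimal sketch of that control flow follows. The function name, the `(name, callable)` step format, and the log strings are all assumptions made for illustration, not SRM internals.

```python
def run_disaster_recovery(protected_site_reachable, steps):
    """Best-effort shutdown and final sync, then run every recovery step.

    `steps` is a list of (name, callable) pairs; all names are illustrative.
    Returns (log, errors) so the caller can see what happened.
    """
    log, errors = [], []
    if protected_site_reachable:
        log.append("shut down protected VMs")
        log.append("final synchronization")
    else:
        log.append("protected site unavailable: skipping shutdown and sync")
    for name, action in steps:
        try:
            action()
            log.append(f"ok: {name}")
        except Exception as exc:
            # Record the error and keep going: the plan runs to completion.
            errors.append((name, str(exc)))
            log.append(f"error: {name}")
    return log, errors
```

Collecting errors instead of raising on the first one is the design choice that matches the text: a failed step is reported, but later steps still execute.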
This behavior minimizes the possibility of data loss during a disaster recovery, balancing the requirement for virtual machine consistency with the ability to achieve aggressive recovery-point objectives.
Automatic failover
SRM automates the execution of recovery plans to ensure accurate and consistent execution. Through the
vCenter Server you can gain full visibility and control of the process, including the status of each step,
progress indicators, and detailed descriptions of any error that occurs.
When an actual SRM failover is requested in the event of a disaster, the SRA performs the following steps:
1. Select the replicated volumes.
2. Identify and remove any incomplete remote copies that are in progress and present the most recently completed Remote Copy as a primary volume.
3. Convert remote volumes into primary volumes and configure authentication for ESXi hosts to mount them.
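For comparison with the test failover, the actual failover steps can be modeled the same way. This is a sketch under stated assumptions: `RemoteCopy`, `RemoteVolume`, the `role` strings, and `actual_failover` are hypothetical names, and the most recent copy is assumed to be the one with the highest sequence number.

```python
from dataclasses import dataclass, field


@dataclass
class RemoteCopy:
    """Hypothetical record of one replication point for a volume."""
    sequence: int    # assumption: higher means more recent
    complete: bool   # False while the copy is still in progress


@dataclass
class RemoteVolume:
    """Hypothetical model of a replicated volume at the recovery site."""
    name: str
    copies: list
    role: str = "secondary"
    authorized_hosts: set = field(default_factory=set)


def actual_failover(volume, esxi_hosts):
    """Steps 2-3: drop incomplete copies, promote the volume, grant host access."""
    # Step 2: remove copies still in progress and keep only completed ones.
    volume.copies = [c for c in volume.copies if c.complete]
    if not volume.copies:
        raise RuntimeError(f"{volume.name}: no completed Remote Copy to present")
    latest = max(volume.copies, key=lambda c: c.sequence)
    # Step 3: the remote volume becomes a primary volume that hosts may mount.
    volume.role = "primary"
    volume.authorized_hosts.update(esxi_hosts)
    return latest
```

Unlike the test failover, this path changes the volume's role, which is part of why an actual failover takes significant effort to reverse.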
If an actual failover does not run to completion for any reason, the failover can be requested again as many times as needed to complete the run. If, for example, only one volume failed to restore because a normal snapshot was present, the snapshot could be deleted manually and the failover requested again.
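The retry behavior can be pictured as a failover pass that is safe to repeat: volumes restored in an earlier pass are skipped, and only the ones that failed are attempted again. The sketch below is a toy model with hypothetical names (`request_failover`, the `restored` and `blocking_snapshot` keys). It does not show how the SRA tracks state.

```python
def request_failover(volumes):
    """Run one failover pass over a dict of hypothetical volume-state dicts.

    Returns the names of volumes that still failed to restore.
    """
    failed = []
    for name, state in volumes.items():
        if state["restored"]:
            continue                  # already done in an earlier pass
        if state["blocking_snapshot"]:
            failed.append(name)       # cannot restore until it is removed
        else:
            state["restored"] = True
    return failed
```

A typical sequence would be: call `request_failover`, inspect the returned names, clear the blocking snapshot on each one manually, and call `request_failover` again until it returns an empty list.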