Failover and Failback


Originating Author

Michelle Laverick




Version: vCenter SRM 5.0

When I started writing this chapter I really struggled to explain how much SRM has improved, given that it now has an automated method of failing back after a DR event. When I finished writing the chapter it became very clear to me. In my previous books this chapter ran to about 50 pages and 10,000 words; with the improvements that VMware has made to SRM this chapter will probably run to about 25 pages and 5,000 words. Those small stats will hopefully give you some idea of how much has improved since SRM 4.0. By simplifying the process significantly, VMware has given me less work to do, and you can’t say that isn’t a good deal for everyone concerned.

The one thing we have yet to cover is what SRM is really all about: a disaster occurs and you must trigger your Recovery Plan for real. This is sometimes referred to as “Hitting the Big Red Button.” I left this discussion for this chapter because it’s a major decision that permanently changes the configuration of your SRM environment and your virtual machines, and so it is not to be undertaken lightly. It’s worth considering how long it takes to get senior management approval to run the Recovery Plan, because the time it takes for that decision to be made adds directly to your hoped-for recovery time objective. An actual DR event responds well to prompt and decisive leadership.

My second reason for covering this issue now is that before I started this chapter I wanted to change the viewpoint of the book completely, to cover bidirectional configuration. The previous chapter was a precursor to preparing a failover and failback situation. I didn’t want to trigger a real run of the DR plan before making sure a bidirectional configuration was in place and understood. A real execution of your Recovery Plan is just like a test, except the Recovery Step mode of the plan is actually executed. You can view the recovery steps by selecting a Recovery Plan and changing the View options (see Figure 14.1).


Figure 14.1 The pull-down options next to View enable the SRM administrator to see the steps taken during recovery.

The Recovery Steps mode differs from the test steps in a few ways. First, it includes a step to presynchronize storage. This attempts to ensure that during a planned migration the VMs recovered at the Recovery Site are in the same consistent state as those at the Protected Site. The synchronization occurs twice: once while the protected VMs are still powered on (step 1), and again after they have been powered off (step 6). Of course, this will only be successful during a planned migration when both the Protected and Recovery Sites are available at the same time. In a real disaster this step would be skipped because it’s unlikely that there would be any communication paths available between the sites—what happens when the Recovery Plan is run very much depends on the nature of the disaster itself. Second, during a planned migration SRM attempts to gracefully shut down the VMs in the Protected Site that are covered by the plan. In other words, if possible, SRM will power off VMs at the Protected Site (New York) if it’s available. Finally, unlike a test of the plan, no attempt is made to “clean up” afterward by resetting all the recovery VMs; they are left up and running. Essentially, the VMs that once lived in the Protected Site—in my case, New York—will now live in the Recovery Site of New Jersey.

In the real world, clicking the big red button requires senior management approval, usually at the C-Class level (CEO, CTO, CIO), unless these guys are in the building that was demolished by the disaster itself, and someone farther down the management chain is delegated to make the decision. You could regard this issue as part and parcel of the DR/BC plan. If we have lost senior decision makers either temporarily or permanently, someone else will have to take on their roles and responsibilities. Additionally, there will be subtle and important changes at the storage array. The storage vendors’ SRA will automatically stop the normal cycle of replication between the Protected Site and the Recovery Site, and will usually change the status of the LUNs/volumes from being secondary/slave/remote/replica (or whatever terminology the storage vendor uses) to primary or master. All this is done without you having to bug those guys on the storage team. For example, if you were using an HP P4000 VSA you would see the volumes that are normally marked as being “Remote” switched to being “Primary.”

Triggering the plan from SRM is very easy to do—some might say too easy. You click the Recovery button, read a warning, click the radio button to confirm that you understand the consequences, and click OK. Then you watch the sparks fly, depending on the plan’s success and whether you sought higher approval! Because actually running a recovery is so easy, I think this is a good use case for deploying a proper permissions and rights model so that recoveries are not run accidentally. Of course, a run of your disaster Recovery Plan need not be executed merely for an out-and-out loss of a site. If you have planned both your datastores and Protection Groups properly, there should be no reason why you can’t failover and failback at the application level. This becomes very important when you consider that, by default, VMs at the Protected Site are powered down. If you have not planned your LUN/volume layout correctly, and you have different applications and business units mixed together on the same datastores, it will be very difficult to run a Recovery Plan without powering off and moving over a VM that you were not actually ready to failover to the new location. If this is the case, I would consider a root-and-branch review of your datastore layout, and seriously consider some Storage vMotion work to relocate VMs into more discrete datastores. SRM does not react well to folks who simply sort datastores by free space and use that as the criterion for where they store their VMs. You have been warned!

There are some larger issues to consider before hitting the big red button. Indeed, we could and should argue that these issues need addressing before we even think about implementing SRM. First, depending on how you have licensed SRM, you may need to arrange the transfer of the SRM license between the Protected and Recovery Sites to remain covered by your EULA with VMware. VMware issues SRM licenses on the basis of trust: you need a license assigned to any site that has protected VMs in its vCenter. As a failback is essentially a reversal of the day-to-day configuration, the Recovery Site will need licensing, albeit temporarily, to facilitate the failback process. VMware assumes you won’t abuse this; using a single-site SRM license to protect two sites on an ongoing basis would be regarded as a breach of the license agreement. VMware allows it merely so that failback can occur without the purchase of additional licenses. Of course, if you have a bidirectional configuration, this license concern does not apply, as both sites possess Protection Groups and are therefore both licensed. The process is significantly simplified by the use of linked mode, as the license follows the VMs as they are moved from one site to another. SRM 5.0 will never prevent a failover from happening due to license issues. If a license has expired or doesn’t exist it will prevent you from creating new Protection Groups, but it won’t stop Recovery Plans from running to completion. If you are using the Enterprise SRM Edition it will still let you protect VMs; you’ll just get warnings about your violation of the EULA.

Second, if you are changing the IP addresses of the virtual machines, your DNS systems will need to hold the correct corresponding hostname-to-IP-address records. Ideally, this will be achieved mainly through your DNS server’s dynamic DNS name registration feature, but watch out for any static records in DNS, and for the caching of those DNS records on other systems.
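If you want to sanity-check this after a failover, a short script can compare what DNS currently returns against what you expect. Below is a minimal sketch using the dnspython library; the hostnames and Recovery Site addresses are hypothetical examples, not values from my lab.

    # A minimal sketch using dnspython (pip install dnspython); the
    # hostnames and expected Recovery Site addresses are hypothetical.
    import dns.resolver

    EXPECTED = {
        "web01.corp.example.com": "10.20.1.101",
        "web02.corp.example.com": "10.20.1.102",
    }

    resolver = dns.resolver.Resolver()

    for host, expected_ip in EXPECTED.items():
        try:
            answers = {rr.address for rr in resolver.resolve(host, "A")}
        except dns.resolver.NXDOMAIN:
            print(f"{host}: no A record at all - check dynamic registration")
            continue
        if expected_ip in answers:
            print(f"{host}: OK ({expected_ip})")
        else:
            print(f"{host}: STALE - got {sorted(answers)}, expected {expected_ip}")

Remember that a clean answer from your DNS server does not rule out stale entries still cached on the clients themselves.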

Planned Failover

The most obvious difference when running a Recovery Plan while the Protected Site is still available is that the virtual machines in the Protected Site get powered off, based on the order specified in the plan. However, a subtler change is effected as well: the suspension of the replication or snapshot cycle between the Protected Site and the Recovery Site. The diagram in Figure 14.2 illustrates the suspension of the normal replication cycle. This must happen to prevent replication conflicts and data loss; after all, it’s the virtual machines at the Recovery Site that users will be connecting to and making data changes to. For all intents and purposes they are the primary virtual machines after a failover has occurred. In the figure, the volumes in the Recovery Site are marked as primary/master and are set as read-write by the SRA from your storage vendor.


Figure 14.2 Here the X represents the suspension of the normal pattern of replication.

As you can see, the X indicates replication of data has been suspended, and the LUNs that were once marked as R/O in our tests are being marked as R/W in our execution of the Recovery Plan. In manual DR without SRM, this is normally a task triggered at the storage array by a human operator using the vendor’s failover/failback options, but as the SRA has administrative rights to the storage array this can be automated by SRM. Once the plan has successfully completed, you should be able to see this change in the status pages of your given storage system. For example, in a NetApp system you will see that the SnapMirror relationship will be broken off and the destination volume held in New Jersey is no longer in a SnapMirror relationship (see Figure 14.3).

Additionally, for each and every volume affected by the planned recovery, the array manager’s configuration, when viewed from the Recovery Site, will display that replication has been stopped between the two sites, shown with a broken gray line (see Figure 14.4). In this example I’m going to assume that the New York site will be unavailable for some hours or days. This could be caused by an incoming hurricane about which the local weather service has given us reliable information, or a warning from the power company that there will be a major outage of the mains supply for some time as part of a power grid upgrade and maintenance plan.


Figure 14.3 The vol6_web volume is marked as no longer being in a snapmirrored state, and the relationship state is marked as “broken-off.”


Figure 14.4 SRM indicates a broken replication relationship with a broken gray line rather than the blue arrow that shows the direction of replication.

STOP!: Before running the plan, you might want to carry out a test first to make sure all VMs recover properly. VMs that display errors will be left merely as placeholders in the Recovery Site.

To run the plan, follow these steps.

1. Select the Recovery Plan and click the Recovery button. In my case, I picked the Web Recovery Plan within the New York Recovery Plan folder.

2. Read the Confirmation text, enable the checkbox that indicates you understand the results of running the plan, and select the Planned Migration radio button (see Figure 14.5). Note that the Recovery dialog box shows both the Planned Migration and Disaster Recovery options only if the SRM plug-in is able to connect to both the Protected and Recovery Sites at the same time. In a truly unexpected DR event you most likely would find the Planned Migration option is, by definition, unavailable.

If everything goes to plan (forgive the pun) you won’t see much difference between a true run of the plan and a test of the plan. What you will see are power-off events at the Protected Site. I have a live recording of me running this very plan on the RTFM website. If you want to see what happened when this plan was executed you can watch it at www.rtfm-ed.co.uk/srm.html.


Figure 14.5 The Planned Migration option is only available if you are connected to both the Protected and Recovery Sites at the same time.

3. Once the Recovery Plan has completed successfully it will be marked with a status of “Recovery Complete,” and the SRM administrator will be left with a default message indicating the next stages (see Figure 14.6).

Alongside this message you should see that any of the Protection Groups that were leveraged during the running of the Recovery Plan will have had their status changed, and be marked as “recovered” (see Figure 14.7).
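For those who would rather drive this from a script than from the vSphere client, SRM does expose a SOAP-based API for working with Recovery Plans. The sketch below is heavily hedged: srm_client and everything it exposes is an assumed wrapper of my own invention, not a real VMware library, so consult the SRM API guide for the actual bindings in your release.

    # Hypothetical sketch: "srm_client" is an assumed wrapper around the
    # SRM SOAP API, not a real VMware module; method names are illustrative.
    from srm_client import SrmSoapClient

    client = SrmSoapClient(host="srmnj.corp.example.com",
                           user="administrator", password="...")
    plan = client.get_recovery_plan("Web Recovery Plan")

    # Enforce the same safety gate as the GUI: an explicit, typed
    # confirmation before a real (non-test) run.
    if input(f"Type RUN to execute '{plan.name}' for real: ") == "RUN":
        plan.start(mode="plannedMigration")  # as opposed to mode="test"
        plan.wait_for_completion()
        print("Status:", plan.status)        # expect "Recovery Complete"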

Before we look at the reprotect and failback process, I want to look at each storage vendor array that I’m using, and reveal the changes made by the SRA to the storage systems themselves. This is pulling back the covers to reveal what goes on at the storage layer when you carry out a planned failover.


Figure 14.6 The option to Reprotect inverts the replication process to facilitate the running of the Recovery Plan for failback purposes.


Figure 14.7 In this case, just the Web Recovery Plan was run, and therefore only the Web Protection Group is so marked.

Dell EqualLogic and Planned Recovery

When you run a Recovery Plan for planned migration a number of changes take place at the Dell EqualLogic array, as the plan stops the replication between the Protected Site and the Recovery Site. Assuming you have access to the storage array at the Protected Site you would see that volumes affected by the scope of the Protection Groups utilized by the Recovery Plan would be taken “offline” (see Figure 14.8).

At the Recovery Site, the replication object under Inbound Replicas would be removed for the affected volumes, and new volumes created, with the hosts in the Recovery Site automatically added to the access control list (see Figure 14.9).

At the time of this writing, the Dell EqualLogic SRA does not reestablish a schedule for replication. EqualLogic arrays hold their schedules at the source of the replication, not the destination. Currently, this means that whenever you failover or failback with SRM, you will need to remember to create a new replication schedule at the site where the volumes now reside.


Figure 14.8 At the Protected Site, volumes are marked as “offline” after a Recovery Plan is executed.


Figure 14.9 At the Recovery Site, the volumes touched by the Recovery Plan are brought online and marked as being read-writable.

NetApp and Planned Recovery

When you run a Recovery Plan with NetApp-based storage a number of changes take place to your existing configuration. At the Protected Site location you will see that because the SnapMirror relationship has been “broken off,” no management can be done there (see Figure 14.10). All you can do is view the SnapMirror relationship. As with all relationships, if the break was not done in a timely manner there isn’t much you can do about it. I don’t think the current state of my NetApp filers says much about the relationship that New Jersey enjoys with New York. That’s another one of my jokes, by the way—laughing is not mandatory.

Additionally, if you were viewing the volumes in FilerView rather than System Manager you would see the affected volumes are no longer marked as being “read-only.” In my case, I ran a Recovery Plan just against the VMs held in the Web volume. As you can see in Figure 14.11, unlike the other volumes, the replica_of_vol6_web volume does not show the “read-only” or “snapmirrored” option, indicating it is available as a writable volume to the ESX hosts in the Recovery Site.


Figure 14.10 In NetApp System Manager, vol6_web is no longer in a snapmirrored state, and the relationship state is marked as “broken-off.”


Figure 14.11 The replica_of_vol6_web volume is not marked as being “snapmirrored” or “read-only.”
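If you prefer to confirm this state from the command line rather than from FilerView or System Manager, something like the following sketch works. It assumes SSH access to a Data ONTAP 7-Mode filer and that the snapmirror status command is available; the filer address is a hypothetical example, while the volume name is the one used in this chapter.

    # A minimal sketch: check SnapMirror state over SSH. Assumes a Data
    # ONTAP 7-Mode filer; the filer address is a hypothetical example.
    import subprocess

    FILER = "njfiler01.corp.example.com"   # hypothetical management address
    VOLUME = "replica_of_vol6_web"

    status = subprocess.run(
        ["ssh", f"root@{FILER}", "snapmirror", "status"],
        capture_output=True, text=True, check=True,
    ).stdout

    for line in status.splitlines():
        if VOLUME in line:
            # After a planned failover you would expect the relationship
            # for this volume to be reported as "Broken-off".
            print(line)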

Automated Failback from Planned Migration

As you probably know, this release of SRM marks a significant milestone in improving the process of failback to the Protected Site. It’s worth remembering that previous versions of SRM supported a failback process; SRM 5.0 simply makes that process much easier. From a storage perspective, failback means inverting your normal path of replication from the Protected Site to the Recovery Site. Previously this was a manual task carried out with great care following the storage array vendor’s documentation on how to complete the reconfiguration. SRM 5.0 introduces a reprotect feature that now handles this process for you by using an enhanced and improved SRA that handles the communication to the array. As well as handling the array relationships, the reprotect feature also removes much of the manual cleanup that was required in previous releases. The automated failback is really the combination of three discrete stages: the reprotect, a test of the Recovery Plan, and, assuming that was successful, a planned migration back to the original site. VMware introduced reprotect and failback as a result of direct consultation with customers; it found that existing SRM customers were doing many planned migrations and failbacks, and it is these events that really benefit from reprotect. Remember, a true physical site loss will always require some kind of manual intervention.

If the array at the Protected Site has not been destroyed in the disaster, the data held on it will be out of sync with the Recovery Site. How far out of sync depends on how long you were running at the Recovery Site, and the rate of data change that took place there. If the data change rate is small, it is possible to merely replicate the differences (the deltas) that have accrued over this time. Alternatively, if the arrays are massively out of sync, you might find yourself bringing the new array to the Recovery Site and doing the replication locally, or even resorting to a courier service to move large quantities of data around.
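A quick back-of-the-envelope calculation helps decide which option is realistic. The numbers below are invented purely for illustration.

    # Estimate resync time for the accrued deltas; all figures are
    # made-up examples, not measurements.
    delta_gb = 500        # data changed while running at the Recovery Site
    link_mbps = 100       # nominal site-to-site bandwidth
    efficiency = 0.7      # allow for protocol overhead and competing traffic

    seconds = (delta_gb * 8 * 1024) / (link_mbps * efficiency)
    print(f"Estimated resync time: {seconds / 3600:.1f} hours")
    # Roughly 16 hours in this example; at some point shipping the array
    # or syncing it locally beats replicating over the WAN.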

After the use of planned migration, you will find the Recovery Plan will have finished, indicating that you should now carry out the reprotect process. For this to be successful you need a placeholder datastore for both the Protected Site and the Recovery Site; if you don’t do this, the reprotect process will fail. You can see if this is the case in a few areas. First, there should be a placeholder datastore assigned for both sites. This can be viewed on the properties of the sites affected in the Placeholder Datastores tab (see Figure 14.12).

Additionally, you should be able to see that in the old Protected Site (New York), the VMs there have a different label next to their names. Each VM affected by the planned migration should have an underscore character prefixed to its name, as shown in Figure 14.13.


Figure 14.12 NYC_SRM_Placeholders, allocated to the New York site, resides at the Recovery Site and holds the Recovery Plan placeholder files.


Figure 14.13 These VMs are ready to be made into full-fledged placeholder files during the reprotect phase.

These are not real VMs, as those are now residing over at the Recovery Site (in my case, New Jersey). Instead, they are “shadow” VMs waiting to be converted into fully protected VMs; when you carry out the reprotect procedure these shadow VMs’ icons will change to be the default for protected VMs—the so-called “boxes-in-boxes” logo with a lightning bolt appended to them. You should see that these shadow VMs are being held on the placeholder datastore configured for the site (see Figure 14.14).

Once you are satisfied that the old Protected Site is ready for the reprotect phase, you can proceed to run this from your system. Just as there are test and recovery steps there are reprotect steps; you can view them by selecting the Recovery Plan you wish to use, and switching from View mode to Reprotect mode (see Figure 14.15).

As you can see, the reprotect process reverses the direction of replication and reverses the use of Protection Groups for the failback process. In previous releases of SRM there would have been a convoluted “cleanup” phase that is now handled by SRM. It’s hard to appreciate the level of automation that has been added here by VMware engineers. Previously the step required for failback would have comprised 20 to 30 individual processes, and would require very detailed knowledge of the storage array system. In my previous books, this chapter was exceedingly long as I detailed those steps.


Figure 14.14 NJ_SRM_Placeholders is used as part of the reprotect process, and holds VMs protected by the SRM administrator at the Recovery Site.


Figure 14.15 Switching to Reprotect reveals the reprotect steps, including those instructing the array to reverse the direction of replication.

You can trigger the reprotect steps by selecting the Recovery Plan and then clicking the Reprotect button or link. When this is done you will receive a warning box that will summarize the changes being made by SRM and the SRA. You can view this process as a reversal of personality in your SRM configuration. At the end of the reprotect process your Recovery Site will be regarded as the Protected Site, and the Protected Site will be the Recovery Site. The path of replication will be inverted so that, in my case, New York receives updates from New Jersey (see Figure 14.16). The synchronization process should not take too long; how long depends on the volume of the changes since the failover process. As both the source and destination have been initialized before (albeit in the opposite direction), all that needs to be synchronized are the changes to the volume since the failover event.


Figure 14.16 It’s important to confirm that after the replication path is inverted, both sites are in sync and the normal replication pattern is reestablished.

If the reprotect is successful you should see that the shadow VMs are now full-fledged SRM placeholders (see Figure 14.17). Additionally, you should notice that the cycle of replication has now been inverted, so that volumes that were being replicated from the Protected Site to the Recovery Site (New York → New Jersey) are now being replicated back in the other direction (New York ← New Jersey). You can see this by viewing the array manager configuration. In my case, you can see in Figure 14.18 that /vol/replica_of_vol6_web is now being replicated back to the original source /vol/vol6_web. In other words, the source is now the destination and the destination is now the source. Essentially, the Recovery Site now “owns” this datastore. The replica_of_vol6_web datastore is read-writable, whereas /vol/vol6_web is read-only and merely receiving updates.

This situation can have some interesting and unexpected consequences. Suppose, for example, that you had a Recovery Plan whose Protection Groups covered five datastores replicating from New York to New Jersey, but a separate Recovery Plan was executed to failover just one of those five datastores, followed by the reprotect process. It is then possible for the first Recovery Plan to contain a mix of Protection Groups that are replicating in opposite directions; if this happens the Recovery Plan is marked as having a “direction error” (see Figure 14.19).


Figure 14.17 The _vmname placeholders are registered as full-fledged placeholder VMs replete with their special SRM protected icon.


Figure 14.18 The replica_of_vol6_web datastore is now being replicated from New Jersey to New York.

This is a direction error, and it’s precisely the configuration I have at the moment in my environment. Some datastores are being replicated from Site A to Site B, whereas others referenced in the same Recovery Plan are being replicated from Site B to Site A.


Figure 14.19 In the DR Recovery Plan there are multiple recovery Protection Groups that have opposing replication directions.

You might not find the help contained in the SRM message useful. It may be easier simply to complete the process of reprotect and failback so that all the datastores referenced in the Protection Groups point in the right direction. If you like, the message is generated by a process that has yet to be fully completed by the SRM administrator.

So, it is important to consider how you want to run your Recovery Plans. For a planned failover, I would prefer to take applications from the Protected Site to the Recovery Site in a staged manner—perhaps taking each application over to the new location step by step, or perhaps carrying out the planned migration on a business unit by business unit basis. As each application or business unit was recovered I would validate that those VMs were functioning before embarking on my next block of moves. Admittedly, this would depend highly on how much time I had for the planned migration—the less advance notification there is the less time there is for time-consuming validation. I’d be more tempted to run my DR Recovery Plan in the event of a real disaster where time was of the essence.

To be honest, I wasn’t expecting to see this happen. I’ve found 99% of what I’ve learned in life has been from mistakes or unexpected errors; this doesn’t mean, however, that I make mistakes or errors 99% of the time. Thought about logically, this message makes perfect sense. Normally, the Protection Groups in a Recovery Plan contain datastores that are all being replicated in the same direction. It’s understandable that SRM would be irritated by a Recovery Plan that contains Protection Groups whose datastores are being replicated in different directions. In the situation above, so long as I failback my Web VMs relatively quickly, the DR Recovery Plan will return to a normal state with all the Protection Groups pointing in the same direction.

Once we are satisfied that the reprotect process has been successful, the next stage is to test the Recovery Plan, which now tests the recovery of the VMs to the original location. Before you run the Recovery Plan to failback the VMs, I think it is a best practice to confirm that the LUNs/volumes on the arrays at the Protected and Recovery Sites are in sync with each other. Another best practice is to test the Recovery Plan. In my case, this means checking that when I run the Web Recovery Plan the VMs power on correctly at the New York location. Once that test is successful, we run the Recovery Plan as part of a planned migration, this time from New Jersey back to New York. Of course, once that is complete, we need to carry out yet another reprotect process to reestablish the normal pattern of New York being protected by the New Jersey Recovery Site. This is easily summarized once you understand the concepts of failover, reprotect, and failback. So, for a planned failover, the workflow would be as follows (a scripted sketch of the whole cycle follows the numbered steps).

1. Test the Recovery Plan from the Protected Site (New York) to the Recovery Site (New Jersey).

2. Run the Recovery Plan using the Planned Migration option.

3. Run the reprotect steps.

At this point you could stop if you wanted the VMs that failed over to the Recovery Site to remain there for the foreseeable future—say, as part of a longer-term datacenter move. Once you were ready you could decommission the old datacenter using the approach I outlined in Chapter 13, Bidirectional Relationships and Shared Site Configurations. Alternatively, if this was just a temporary move to cover the business—say, during a planned power outage—you would continue at a later stage.

4. Test the Recovery Plan from the Protected Site (New Jersey) to the Recovery Site (New York).

5. Run the Recovery Plan using the Planned Migration option.

6. Run the reprotect steps.

7. Test the Recovery Plan as part of your normal round of testing of your preparedness for disaster.
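Expressed in code, the whole cycle is pleasingly symmetrical. This sketch reuses the same hypothetical srm_client wrapper from earlier in the chapter; it is an illustration of the seven steps above, not a real VMware API.

    # Hypothetical sketch of the failover/reprotect/failback cycle,
    # using the assumed srm_client wrapper (not a real VMware library).
    from srm_client import SrmSoapClient

    client = SrmSoapClient(host="srmnj.corp.example.com",
                           user="administrator", password="...")
    plan = client.get_recovery_plan("Web Recovery Plan")

    def run_and_wait(mode):
        plan.start(mode=mode)
        plan.wait_for_completion()

    run_and_wait("test")              # 1. test New York -> New Jersey
    plan.cleanup()                    #    every test ends with a cleanup
    run_and_wait("plannedMigration")  # 2. failover to New Jersey
    plan.reprotect()                  # 3. invert replication: NJ -> NY

    # Stop here for a long-term datacenter move; otherwise continue:

    run_and_wait("test")              # 4. test New Jersey -> New York
    plan.cleanup()
    run_and_wait("plannedMigration")  # 5. failback to New York
    plan.reprotect()                  # 6. reinstate NY -> NJ protection
    run_and_wait("test")              # 7. routine preparedness testing
    plan.cleanup()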

It is certainly possible to observe this reprotect process at the storage management layer if you wish, and some of these systems do a good job of showing you in real time what is happening. For example, Dell EqualLogic Group Manager does a good job of showing you the reversal of the replication process during the reprotect stages in the Failback tab on volumes contained in the Outbound Replicas folder (see Figure 14.20).

The status bar in the Dell EqualLogic Group Manager shows active tasks. In Figure 14.20 it is demoting the primary volume to be the secondary, and then creating a replica that sends updates from New Jersey to New York. Not all SRAs reinstate the schedule for replication during this reprotect phase, so it is worth validating with your storage management software that the LUNs/volumes are in sync, and that replication has returned to its normal, expected cycle.


Figure 14.20 Many storage vendors’ management tools show SRA tasks in real time. There is educational value in watching this process.

Finally, it’s worth stating that after a failover, reprotect, and failback, you may have some nominal cleanup work to do of your own. First, if you use datastore folders, as I do, to organize your datastores logically you will probably find yourself having to relocate datastores into the correct location. By default, when a datastore is remounted it is placed in the default datacenter container rather than its original folder. This is because SRM sees this datastore as brand new to the system. In Figure 14.21 you can see my Web datastore has been dropped into the NYC datacenter rather than in its normal location of SRM Datastores.

Additionally, if you are dealing with block-based storage and the VMware VMFS file system, you will probably need to reinstate the original VMFS volume name. As you might recall, when either a test or a recovery takes place, VMFS volumes are resignatured (see Figure 14.22).

Of course, when you carry out a reprotect process the VMFS volume, including its metadata, is replicated to the original Protected Site—and this includes the resignatured VMFS volume name. For neatness you might wish to rename the VMFS volume back to its proper name, and if you are using datastore folders you will want to drag and drop the volume to its intended location. Renaming is very easy to do—just click the VMFS volume once, and edit out the reference to “snap-NNNNNNNN” in the datastore label.
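This is also easy to script if you have many volumes to tidy up. Below is a minimal pyVmomi sketch; the vCenter address and credentials are hypothetical, and it assumes the resignatured labels follow the usual snap-xxxxxxxx-originalname pattern, so print before you rename.

    # A minimal pyVmomi sketch: strip the "snap-xxxxxxxx-" prefix from
    # resignatured datastore labels. Hostname/credentials are examples.
    import re
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # lab only; verify certs in production
    si = SmartConnect(host="vcnyc.corp.example.com", user="administrator",
                      pwd="...", sslContext=ctx)
    content = si.RetrieveContent()

    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.Datastore], True)
    for ds in view.view:
        match = re.match(r"snap-[0-9a-f]{8}-(.+)", ds.name)
        if match:
            original = match.group(1)
            print(f"Renaming {ds.name} -> {original}")
            ds.Rename_Task(newName=original)  # restore the original label

    Disconnect(si)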


Figure 14.21 SRM does not know to place your recovered datastores into the correct datastore folder if you use them.


Figure 14.22 Because it’s often tricky to know if volumes are snapshots created by tests or recovered volumes, you should rename them to their original names after running a Recovery Plan.

Unplanned Failover

Since my planned test, I’ve put my configuration back in place, and I even went so far as to test my Recovery Plan for my protected virtual machines in New York to make sure they work properly. Now I want to document the same process of failover, reprotect, and failback based on a total loss of the Protected Site (New York).

Protected Site Is Dead

To emulate this I did a hard power off of all my ESX hosts in the Protected Site using my ILO cards. I also stopped communications between New York and New Jersey by disabling my routers—the effect of this is to stop all forms of IP-based replication between the two locations. To achieve the same scenario at the Fibre Channel level, I modified my zone configuration on the FC switches to stop replication from occurring between the two locations. This emulates a total catastrophic failure—nothing is running at the Protected Site (New York) anymore. I did a hard power off to emulate a totally unexpected and dirty loss of the system. My main reason for doing this is so that I can document how it feels to manage SRM when this situation happens. You might never get to try this until the fateful day arrives.

The first thing you’ll notice when you arrive at the Recovery Site—apart from a lot of worried faces, that is—is that your storage management tools will not be able to communicate with the Protected Site. The main difference when the Protected Site (New York) is unavailable is that as you log in to the Recovery Site (New Jersey) the vSphere client will ask you to log in to the vCenter in the Protected Site (New York)—and this will fail because, remember, the vCenter at the Protected Site is dead. You will be presented with a dialog box stating that there are problems communicating with the vCenter at the Protected Site, and if you are in linked mode you’ll find that the vCenter for the Protected Site will not appear in the inventory. If you have the vSphere client open at the time of the site outage you will see that the vCenter server and the SRM server become unavailable. Additionally, you will see in the bottom right-hand corner of the vSphere client that connectivity to the Protected Site vCenter has been lost (see Figure 14.23).

Of course, in this scenario you would want to run your Recovery Plan. It may be possible in this state to run a test of the Recovery Plan, if you wish. Whether you can test a Recovery Plan depends highly on the storage array vendor as some do not allow tests in this DR state. When you do switch to the SRM view you will be asked for the username and password of the vCenter at the Protected Site. There is little point in trying to complete the authentication dialog box as the Protected Site is down and unavailable. So you may as well just click Cancel. Once you have bypassed the authentication dialog box, you will find that both the Protected and Recovery Sites will be marked “Not Connected” and “Unknown” (see Figure 14.24).

In my case, this is because the Recovery Site of New Jersey cannot connect to its Protected Site SRM server in New York. The status of the New York site is unknown because we are unable to communicate or authenticate to it.

Even in this state it may be possible to test the Recovery Plan prior to running it for real. When you do this you will find that the option to “Replicate recent changes to recovery site” is dimmed in the dialog box (see Figure 14.25), because in this disconnected state SRM and the SRA cannot work together to sync up the storage at the Protected and Recovery Sites.


Figure 14.23 There are many reasons you cannot connect to a vCenter. Here, the New York location was made unavailable deliberately.


Figure 14.24 The status of the New York site is Unknown. The New Jersey site cannot connect to New York as it is down and unavailable.


Figure 14.25 If the Protected and Recovery Sites are not connected the ability to replicate recent changes becomes unavailable.

Similarly, when you attempt to run the Recovery Plan you should find the option to carry out a planned migration will be disabled, as this is only available when both the Protected and Recovery Sites are connected (see Figure 14.26).

When you run the Recovery Plan you will see a number of errors in the early stages. The SRM service will attempt to trigger a replication event at step 1, “Pre-synchronize storage,” in an attempt to capture recent changes, but clearly there is no guarantee that this will be possible. If the storage arrays are unable to communicate, SRM will bypass the synchronization process. It will also attempt to power off VMs in the Protected Site, despite the fact that the Protected Site is unavailable. After all, SRM cannot power off VMs at the Protected Site if it’s a smoking crater—there would be nothing to power off. Additionally, you will see errors as SRM at the Recovery Site attempts to “Prepare Protected Site VMs for Migration” and “Synchronize Storage” a second time. You should regard these errors as benign and to be expected, given that they are all tasks in the Recovery Plan that require a connection from the Recovery Site to the Protected Site. As you can see in Figure 14.27 at step 8, “Change Recovery Site Storage to Writeable,” the storage has been successfully presented and VMs are in the process of being powered on.

At the end of the plan you should see that the plan is marked as “Disaster Recovery Complete” and other plans that contain the same Protection Groups will be marked with the same status (see Figure 14.28).


Figure 14.26 Planned migrations are only available when both the Protected and Recovery Sites are connected together.


Figure 14.27 Certain errors are expected as SRM tries to execute all the recovery stages but cannot because the Protected Site is unavailable.


Figure 14.28 Green arrows on Protection Groups, together with a Disaster Recovery Complete status, indicate a successful outcome from a DR plan.

Planned Failback after a Disaster

Of course, the reprotect and failback process can, by definition, only proceed if the Protected Site (New York) is available again. In this respect, it shouldn’t be that different from the failback process covered earlier in this chapter. Nonetheless, for completeness I wanted to cover it; I don’t intend to cut and paste the entire previous section, so for brevity I will only cover what makes this failback look and feel different. The extent of your failback work will depend highly on the nature of your disaster. Let’s say, for example, you invoked DR and ran your Recovery Plan because a hurricane was on its way. At the time of running the Recovery Plan you have no idea of the extent of the damage you might incur. You might be lucky and find that although many structures have been destroyed, your facility has more or less remained intact. In this scenario, you would confirm that the underlying physical infrastructure (power, servers, storage, and network) and the virtual infrastructure (vSphere and SRM) were in a fit state, and carry out quite a graceful and seamless failback. If, on the other hand, you experienced extensive damage you would be faced with a rebuild of your vSphere environment (and indeed, major repair work to your building!). This could well entail a brand-new installation of vSphere and SRM, and necessitate re-pairing the sites, configuring inventory mappings, and creating Protection Groups—in short, everything we covered in the early chapters of this book.

To emulate this I powered my ESX hosts back on, allowing the systems to come up in any order to generate a whole series of errors and failures. I wanted to make this as difficult as possible, and so made sure my storage array, SQL, vCenter, and SRM server were all back online again, but starting up in the wrong order. I thought I would repeat the process to see if there were any unexpected gotchas I could warn you about.

Once you have logged in to both the Protected and Recovery Sites, the first thing you will see is that the Recovery Plans will have a status warning on them stating that “Original Protected VMs are not shutdown,” and you will be asked to run the recovery process again (see Figure 14.29). This does not affect the recovered VMs. As the Protected Site was not contactable during the disaster the Recovery Plan needs to be run again to clean up the environment. As far as the Protected Site is concerned, it thinks the relationship is still one where it is the Protected Site (New York), and the other site was the recovery location (New Jersey). The failover to the Recovery Site was carried out when it was in an unavailable state. If you like, it’s as if both New York and New Jersey believe they are the Protected Site. Running the Recovery Plan again corrects this relationship to make New York the Recovery Site and New Jersey the Protected Site.

In this scenario, the VMs will still be registered on the Protected Site, rather than being replaced with the shadow VMs that have the underscore as their prefix. Once this mandatory recovery has been completed, you will be at the stage at which a reprotect phase can be run, and you can consider a failback to the original Protected Site. Before issuing the reprotect plan it’s perhaps worth thinking about the realities of a true disaster. If the disaster destroyed your storage layer at the Protected Site, and new storage has to be provisioned, it will need to be configured for replication. That volume of data may be beyond what is reasonable to synchronize across your site-to-site links. It’s highly likely in this case that the new array for the Protected Site will need to be brought to the DR location so that it can be synchronized locally first, before being shipped to the Protected Site locale.


Figure 14.29 As SRM was unable to gracefully shut down VMs in the Protected Site, these are marked as not being shut down.

Summary

As you have seen, the actual process of running a plan does not differ that greatly from running a test. The implications of executing a Recovery Plan are so immense that I can hardly find the words to describe them. Clearly, a planned failover and failback is much easier to manage than one caused by a true failure. I’ve spent some time on this topic because this is chiefly why you buy the product, and perhaps if you are lucky you will never have to do this for real. As with all insurance against disaster, SRM is a waste of money until you have to file a claim against the policy.

Despite this, if you look back through the chapter, most of what I have written is about failback, not failover. This improved automation will be very welcome to some of the big companies that use SRM. I know of certain banks, financial institutions, and big pharmaceuticals that test their DR strategies rigorously—some to the degree of invoking them for real once per quarter, despite not actually experiencing a real disaster. The idea behind this is twofold. First, the only way to know if your DR plan will work is to use it. Think of it like a UPS system—there’s nothing like pulling the power supply to see if a UPS actually works. Second, it means the IT staff is constantly preparing and testing the strategy, and improving and updating it as the Protected Site changes. For large organizations the lack of an automated failback process was a significant pain point in previous releases of the SRM product.

Perhaps this will be a good opportunity to move on to another chapter. I’m a firm believer in having a Plan B in case Plan A doesn’t work out. At the very least, you could abandon SRM and do everything we have done so far manually. Perhaps the next chapter will finally give you the perspective to understand the benefits of the SRM product. One of the frequent cries I hear about SRM is folks saying, “Well, we could script all that.” Of course, these people are right; they could more or less cook up a home-brewed DR solution using scripting tools. What they forget is that this needs to be constantly maintained and kept up-to-date. Our industry is somewhat notorious for people building their own solutions, and then poorly documenting them. If I could say one thing about home-brewed scripting solutions it would be this: “There used to be this clever guy called Bob, who built this excellent scripted solution. Anyway, he left the company about six months ago, and now I’m in charge…”