Recovery Site Configuration

From vmWIKI

Originating Author

Michelle Laverick


Video Content [TBA]


Version: vCenter SRM 5.0

We are very close to being able to run our first basic test plan. I’m sure you’re just itching to press a button that tests failover. I want to get to that stage as quickly as possible so that you get a feel for the components that make up SRM. I want to give you the “bigger picture” view, if you like, before we get lost in the devil that is the detail.

So far all our attention has been on a configuration held at the Protected Site’s vCenter. Now we are going to change tack to look at the Recovery Site’s vCenter configuration. The critical piece is the creation of a Recovery Plan. It is likely you will have multiple Recovery Plans based on the possibility of different types of failures, which yield different responses. If you lost an entire site, the Recovery Plan would be very different from a Recovery Plan invoked due to loss of an individual storage array or, for that matter, a suite of applications.

Creating a Basic Full-Site Recovery Plan

Our first plan will include every VM within the scope of our Protection Group with little or no customization. We will return to creating a customized Recovery Plan in Chapter 11, Custom Recovery Plans. This is my attempt to get you to the testing part of the product as soon as possible without overwhelming you with too many customizations. The Recovery Plan contains many settings, and you have the ability to configure:

• The Protection Group covered by the plan

• Network settings during the testing of plans

To create a plan follow these steps.

1. Log on with the vSphere client to the Recovery Site’s vCenter.

2. Click the Site Recovery icon.

3. Select the Recovery Plans tab, and click the Create Recovery Plan button as shown in Figure 10.1.

As with Protection Groups, Recovery Plans can now be organized into folders, should you require them. This is especially useful in bidirectional or shared-site configurations where you may want to use folders to separate the Recovery Plans of site A from those of site B.

4. Select the Recovery Site where the plan will be recovered, as shown in Figure 10.2. In my case, this will be the New Jersey site. The appearance of “(Local)” next to a site name is dictated by which site’s vCenter you logged in to with the vSphere client. As a general rule, Protection Groups reside at the Protected Site and Recovery Plans at the Recovery Site.

5. Select the Protection Groups that will be included in the plan. In my case, I just have one demo Protection Group, shown in Figure 10.3. But later on I will develop a model where each Protection Group represents an application or service I wish to protect, with datastores created for each application. This will allow for the failover and failback of a discrete application or business unit.

Recovery-site-configuration- (01).jpg

Figure 10.1 As with Protection Groups, Recovery Plans now support folders.

Recovery-site-configuration- (02).jpg

Figure 10.2 Selecting the Recovery Site where the plan will be recovered

Recovery-site-configuration- (03).jpg

Figure 10.3 Protection Groups can contain many datastores. Here there is just one.

Creating a Recovery Plan for vSphere Replication (VR) is no different from creating one based on array-based replication, so long as you have re-created a Protection Group that covers your vSphere replicated VMs.

6. In the Test Networks dialog box you can control what happens to the networking of VMs, specifically when you test a Recovery Plan (see Figure 10.4). You may want the networking to behave differently while you’re testing the Recovery Plan compared to when you’re running the Recovery Plan. Some customers like to set up a distinct network used just for testing purposes.

The Auto option creates an “internal” Standard vSwitch (SvSwitch) called a “bubble.” This ensures that no IP or NetBIOS conflicts can occur between the Protected Site VMs and the Recovery Site VMs. As Figure 10.4 shows, you can override this behavior and map the port group to a Recovery Site vSwitch that would allow communication between the VMs—but watch out for the possibility of creating conflicts with your production VMs.

On the surface the Auto feature sounds like a good idea; it will stop conflicts based on IP or NetBIOS names. However, it can also stop two virtual machines that should communicate with each other from doing so. Here is an example. Say you have four ESX hosts in a DRS cluster; when the virtual machines are powered on you will have no control over where they will execute. They will automatically be patched to an internal SvSwitch, which means, by definition, that although the virtual machines on that vSwitch will be able to communicate with each other, they will be unable to speak to any virtual machine on any other ESX host in the cluster. This is what happens if you use the Auto option. The consequences of this are clear:

Recovery-site-configuration- (04).jpg

Figure 10.4 You can set up a dedicated VLAN if you don’t want the system to automatically create a new isolated network environment.

Despite your ability to order the power on of virtual machines to meet whatever service dependencies you have, network services on VMs isolated on different hosts would fail to meet those dependencies, and therefore fail to start properly.

For the moment, the Auto feature in the Create Recovery Plan Wizard is best regarded as a “safety valve” that allows you to test a plan without fear of generating an IP or NetBIOS name conflict in Windows VMs. In my case, I’m going to leave Auto selected simply so that I can show you the effect of using this option. In the real world, I would actually choose a test VLAN configuration.

7. Set a friendly name for the Recovery Plan, as shown in Figure 10.5. As this Protection Group contains all my VMs, I called mine “Complete Loss of Site Plan – Simple Test.”

You should spend some time thinking of good names and descriptions for your Recovery Plans, as they will appear in any history reports you might export. You might want to develop your own versioning system that you can increment as you make major changes to the Recovery Plan. The Description field is very useful. It is your opportunity to express how the Recovery Plan is used, what is being recovered, and what the expected RTO/RPO values are. Try to view the Description field as your opportunity to embed documentation directly within the SRM product. As such, you could include information about the storage backend and network VLANs used.
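
The naming and versioning idea above can be sketched as a tiny helper. The field names, version format, and example values here are purely illustrative conventions for building consistent plan names and descriptions; they are not anything SRM itself defines or requires:

```python
def plan_name(title, major, minor):
    """Build a versioned Recovery Plan name (illustrative convention only)."""
    return f"{title} - v{major}.{minor}"


def plan_description(recovers, rto_hours, rpo_minutes, storage, vlan):
    """Embed run-book style documentation in the plan's Description field.

    The RTO/RPO, storage, and VLAN fields mirror the advice above: keep the
    documentation inside SRM itself, where it appears in history reports.
    """
    return (f"Recovers: {recovers}. "
            f"RTO: {rto_hours}h, RPO: {rpo_minutes}min. "
            f"Storage: {storage}. Test VLAN: {vlan}.")


print(plan_name("Complete Loss of Site Plan", 1, 0))
print(plan_description("all protected VMs", 4, 15, "EMC NS-120", "VLAN 101"))
```

A simple major.minor scheme like this is enough to tell at a glance, in an exported history report, which revision of a plan was actually tested.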

Recovery-site-configuration- (05).jpg

Figure 10.5 A Recovery Plan being created. Try to come up with more meaningful and descriptive names for your plans than I have.

8. Click Next and then click Finish to create the Recovery Plan.

As with Protection Groups, Recovery Plans can be much more sophisticated than the plan I just created. Once again, I will return to Recovery Plans in Chapter 11.

Testing Storage Configuration at the Recovery Site

By now you’re probably itching to hit the big blue button in SRM, labeled “Test” (see Figure 10.6).

But before you do, if you want your test to complete correctly it is worth confirming that the ESX hosts in the Recovery Site will be able to access the storage array at the Recovery Site. Previously when we were setting this up we focused on making sure the ESX hosts in the Protected Site had access to the VMFS volume. You may have to take the same considerations into account at the Recovery Site as well.

It might be a good practice to make sure the ESX hosts in the Recovery Site have visibility to the storage, especially if you’re using iSCSI where post-configuration of the ESX hosts is required to allow access to the storage array. In most cases, the hosts at both the Protected and Recovery Sites will already have been granted access to the array location where they reside, especially if you have set up and configured placeholder datastores on the very same arrays that hold the volumes you are replicating.

Recovery-site-configuration- (06).jpg

Figure 10.6 The Test button

You may not even have to manually grant the ESX hosts in the Recovery Site access to a volume or LUN when executing the Recovery Plan. For example, the HP PS4000 SRA will automatically grant the ESX hosts access to the very latest snapshot if you have set these up using the Scheduled Remote Copy format. The HP PS4000 VSA knows how to do this because that is one of its main jobs, and because we provided the IP address and user credentials during the array manager configuration at the Protected Site. This may not be the case in other storage vendors’ management systems; you may well need to create management groups in the storage array, and allow your ESX hosts to access them, for SRM to then present replicated LUNs/volumes to the ESX hosts. If you are unsure about this, refer back to Chapters 2 through 6 where I covered array-based replication for different storage vendors. The way to learn more about your vendor’s SRA functions and capabilities is to locate the vendor’s ReadMe files and release notes. Ask your storage representative to point you in the direction of implementation and deployment guides; remember, in life you don’t get what you don’t ask for.

This level of automation does vary from one storage array vendor to another. For example, with your type of storage array you may need to use your “LUN masking” feature to grant your Recovery Site ESX host access to the storage group (a.k.a. volume group, contingency group, consistency group, recovery group) that contains the replicated or snapshot LUN. It is worth double-checking the ReadMe file information that often ships with an SRA to confirm its functionality. Additionally, many storage vendors have such good I/O performance that they create a snapshot on-the-fly for the test and then present this snapshot to the ESX hosts in the Recovery Site. At the end of the test, they will normally delete this temporary snapshot—this is the case with NetApp’s FlexClone technology. Figure 10.7 shows what happens at the storage layer during a Recovery Plan test, regardless of whether you use synchronous or asynchronous replication.

The only thing you need to configure on the array is for the ESX hosts in the Recovery Site to have access to the storage groups, which include the replicated LUNs. As long as you do this, when the test is executed the storage vendor’s SRA will send an instruction to the storage array to create a snapshot on-the-fly, and it will then instruct the array to present the snapshot (not the R/O replicated LUN) to the ESX hosts (this is indicated by the dashed lines in the diagram). This means that when tests are executed, your production system is still replicating changes to the Recovery Site. In short, running tests is an unobtrusive process, and does not upset the usual pattern of replication that you have configured, because the ESX hosts in the Recovery Site are presented a snapshot of the replicated volume which is marked as read-write, whereas the replicated volume is marked as read-only—and is still receiving block updates from the Protected Site storage array. Storage vendors that do not create a snapshot on-the-fly will mount the latest snapshot in the replication cycle. This is the case with the HP PS4000 VSA.

Recovery-site-configuration- (07).jpg

Figure 10.7 Array-based snapshots present a copy of the replicated volume without interrupting the normal cycle of replication.

Overview: First Recovery Plan Test

Well, we’re finally here! If all goes to plan you should be able to run this basic Recovery Plan we have created and find that the VMs in the Recovery Site are powered on. A great many events take place at this point. If you have some software that records the screen, such as HyperCam or Camtasia, you might even want to record the events so that you can play them back. If you want to watch a video of the test, you can view one that I captured and uploaded to RTFM.

What Do We Mean by the Word Test?

Before we actually “test” our Recovery Plan, I think we should discuss what constitutes a proper test of your DR plan. In many ways the Test button in SRM is actually testing that the SRM software works and that your SRM Recovery Plan functions as expected. For many organizations, a real test would be a hard test of the Recovery Plan—literally hitting the red button, and actually failing over the Protected Site to the Recovery Site.

Think of it this way. If you have a UPS system or a diesel generator at your site, you could do all manner of software tests of the power management system, but you won’t really know if the system behaves as hoped until you’ve lost power. With this attitude in mind, it’s not unheard of for large companies to invoke and hard-test their DR plans twice a year. This allows them to identify flaws in the plan, to update their “run books” accordingly, and to keep the team in charge of controlling the DR plan up-to-date with those procedures and unexpected events that can and do happen. In short, clicking the Test button in SRM does not prove or guarantee that the business IT functions will still operate after a disaster. What it does allow is for nonintrusive tests to take place on a much more frequent basis than previously possible—this in itself is a huge advantage. It means you can validate your Recovery Plan on a daily or weekly basis, without ever having to interrupt normal business operations; and for many customers this is one of SRM’s biggest advantages.

A common question customers ask is whether it is possible to trigger a plan automatically, perhaps with a script. Currently, the answer to that question is yes, but I urge you to use caution if you take this approach. No PowerCLI cmdlets exist that would allow you to test or run a Recovery Plan—although my sources at VMware say the company plans to develop cmdlets for SRM, as it has with other products in its lineup. There is, however, a SOAP-based API that allows for some automation of the tasks in SRM. For example, a method called RecoveryPlanStart can be invoked with the SRM API.
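
To give a feel for what driving the SOAP-based API involves, here is a minimal sketch in Python that builds a request envelope for a RecoveryPlanStart call. The XML namespace, element names, and the endpoint mentioned in the comment are assumptions made purely for illustration; consult the SRM API guide and its WSDL for the real schema before attempting this against a live server:

```python
# Sketch of a SOAP request body for the SRM API's RecoveryPlanStart method.
# The namespace URI ("urn:srm"), element names, and endpoint below are
# assumptions for illustration only, NOT the documented SRM schema.
SOAP_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                  xmlns:srm="urn:srm">
  <soapenv:Body>
    <srm:RecoveryPlanStart>
      <srm:planName>{plan}</srm:planName>
    </srm:RecoveryPlanStart>
  </soapenv:Body>
</soapenv:Envelope>"""


def build_recovery_plan_start(plan_name):
    """Return the SOAP envelope that would invoke RecoveryPlanStart."""
    return SOAP_TEMPLATE.format(plan=plan_name)


# In practice this envelope would be POSTed over HTTPS to the SRM server's
# SOAP endpoint within an authenticated session (details vendor-documented).
print(build_recovery_plan_start("Complete Loss of Site Plan - Simple Test"))
```

Even this skeleton makes the earlier caution concrete: a few lines of code are all it takes to trigger a plan, which is exactly why you should keep a human approval step in front of any such automation.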

With that said, even if there were cmdlets for SRM, many people would worry about false positives—in other words, a DR plan could be acted on automatically, even if no disaster has actually occurred. Generally, people who ask this question are trying to make SRM an availability technology like VMware HA or FT. Nonetheless, some of the more enterprising members of the community have investigated use of the SRM API and .NET objects. So the message is this: Where there is a will, there is always a way, but the idea of automating the execution of a Recovery Plan without a human operator or senior management approval is an idea best avoided.

Another common question is whether it is possible to run two plans simultaneously. The answer depends very much on your storage array vendor’s architecture. For example, on my EMC NS-120 I can run two Recovery Plans almost at the same time. Remember that SRM 5.0 actually increases the number of plans you can run simultaneously from three in SRM 4.0 to 30 in SRM 5.0. However, you do need sufficient resources at the Recovery Site for this to be successful. It stands to reason that the more Recovery Plans you run concurrently, the more likely you are, depending on the number and configuration of the VMs they contain, to run out of the physical resources needed to make this a viable use case. In the real world, customers often have to make compromises about what they recover based on the limited capabilities of their DR location.

What Happens during a Test of the Recovery Plan?

A significant number of changes take place at the Recovery Site location when a test is run. You can see these basic steps in the Recovery Steps tab shown in Figure 10.8. In previous releases the “test” phases of the plan would include a “cleanup phase” once the SRM administrator was satisfied that the test was successful. In this release, this cleanup phase has now been hived off into separate discrete steps. This will allow you to keep the Recovery Plan running in the “test” phase for as long as your storage at the Recovery Site has space for snapshots. One “guerrilla” use of this new functionality is being able to spin up an exact copy of your live environment for testing purposes, something some of my customers have been doing with SRM since its inception.

When you click the Test button, you can follow the progress of your plan. All errors are flagged in red and successes are marked in green; active processes are marked in red with a % value for how much has been completed. I would like to drill down through the plan and explain in detail what’s happening under the covers when you click the Test button. Initially, SRM runs the Synchronize Storage process that allows the SRM administrator to say she wants to carry out an initial synchronization of data between the arrays covered by the Protection Group. This ensures that the data held at the Recovery Site exactly matches the data held at the Protected Site. This can take some time to complete, depending on the volume of data to be synchronized between the two locations and the bandwidth available between the two locations. When you test a Recovery Plan a dialog box appears explaining this process, as shown in Figure 10.9. This dialog box gives the administrator the opportunity to run the test without storage synchronization taking place.

Recovery-site-configuration- (08).jpg

Figure 10.8 Steps 1 and 2 in the Recovery Plan are new to SRM 5.0.

Recovery-site-configuration- (09).jpg

Figure 10.9 Tests now include the option to bring the Protected and Recovery Sites in sync with each other.

If you don’t want the storage synchronization event to take place, merely uncheck the option that states “Replicate recent changes to recovery site.” This replication of recent changes can take some time, and if you just want to crack on and test the plan you may wish to bypass it. After you click Next and the test is triggered, the SRA communicates with the storage array and requests that synchronization take place, and then a snapshot is taken of the volumes containing the replicated VMs. This takes place at the Create a Writable Storage Snapshot phase of the plan. Between these two stages, the plan takes the opportunity to bring out of standby any ESX hosts that are in a standby power state, and, if specified, suspend any unneeded VMs in the Recovery Site.

If you are working with either Fibre Channel or iSCSI-based storage, the ESX hosts’ HBAs are rescanned. This causes ESX to discover the snapshot volumes that contain the VMs replicated from the Protected Site to the Recovery Site. In the case of NFS, these snapshots are simply mounted; with FC/iSCSI-based storage the ESX hosts rescan and the snapshot VMFS volumes appear. This replicated VMFS snapshot is resignatured and given a volume name of “snap-nnnnnnn-virtualmachines” where “virtualmachines” is, in my case, the original VMFS volume name. In the screen grab in Figure 10.10 you can see the Recovery Plan has been started, and the storage layer of the ESX hosts is in the process of being rescanned. In the screen grab in Figure 10.11 you can see a VMFS volume has been mounted and resignatured. All resignatured volume names start with “snap-” so that they can be clearly identified. In this case, the Recovery Plan was executed using the Dell EqualLogic array I have in my test lab.

Recovery-site-configuration- (10).jpg

Figure 10.10 Several steps take place during a test. You may wish to watch one of my prerecorded videos to see this in real time.

Recovery-site-configuration- (11).jpg

Figure 10.11 A temporary snapshot created on a Dell EqualLogic system. SRM will clean these up once you are finished with your test.
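
The “snap-” naming convention makes it easy to map a resignatured volume back to its source datastore in reports or scripts. The helper below is a hypothetical illustration of that convention; the exact format of the generated ID is assumed here to be hexadecimal, which may vary by release:

```python
import re

# SRM-resignatured VMFS volumes are renamed "snap-<id>-<original-name>".
# This helper (illustrative only) recovers the original datastore name so
# that a script or report can map a snapshot volume back to its source.
# The <id> portion is assumed to be hexadecimal for this sketch.
SNAP_RE = re.compile(r"^snap-[0-9a-f]+-(?P<original>.+)$")


def original_datastore_name(volume_name):
    """Return the source datastore name, or None if not a resignatured volume."""
    m = SNAP_RE.match(volume_name)
    return m.group("original") if m else None


print(original_datastore_name("snap-3f1a2b4c-virtualmachines"))  # virtualmachines
print(original_datastore_name("virtualmachines"))                # None
```

A check like this is handy when auditing a Recovery Site after a test: any datastore whose name matches the pattern is a temporary snapshot that the cleanup phase should eventually remove.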

Once the snapshot of the replicated datastore has been presented to the ESX hosts at the Recovery Site, the next step is to unregister the placeholder VMs. Depending on the number of VMs covered by the Protection Groups and Recovery Plans, this can produce many “Delete file” entries in the Recent Tasks pane. This can be a little bit misleading. The placeholder files haven’t been physically deleted from the placeholder datastore; rather, they have been “unregistered” from vCenter, as shown in Figure 10.12.

Recovery-site-configuration- (12).jpg

Figure 10.12 Placeholder VMs are unregistered, and replaced with the real VMs located on the mounted snapshot.

Next, the real VMX files located on the snapshot volume are loaded and then reconfigured, as shown in Figure 10.13. This reconfiguration is caused by changes to their port group settings. As you might recall, these VMX files would have an old and out-of-date port group label: the port group they were configured for at the Protected Site. These need to be remapped either to the internal switch that is created (if you used Auto when you defined your Recovery Plan; see Figure 10.14) or to the port group on the virtual switches at the Recovery Site.

At this stage, the VMs start to be powered on in the list, as controlled by their location in the priority groups. By default, all VMs from all Protection Groups backing the Recovery Plan are mapped to internal switches with the label “srmvs-recovery-plan-NNNNN” as shown in Figure 10.15. This SvSwitch also contains a port group called “srmpg-recoveryplan-NNNN.” This naming convention is used to guarantee that both the SvSwitch and the port group are uniquely named.

Previous versions of SRM created a vSwitch called “testBubble-1 vSwitch” with a port group called “testBubble-1 group.” For this reason, historically this vSwitch has been called the “bubble” vSwitch because its lack of a vmnic essentially means the VMs are locked in a network bubble on which they communicate to one another on the same vSwitch. So, if you come across colleagues who have used SRM before, it’s this format of networking that they are referring to. Remember, this behavior only happens if you used the Auto feature when you first created the Recovery Plan.

Recovery-site-configuration- (13).jpg

Figure 10.13 VMs are registered to the hosts during the Recovery Plan.

Recovery-site-configuration- (14).jpg

Figure 10.14 If Auto has been used as the switch type for the test, SRM adds a virtual standard switch on each host in the affected cluster.

Recovery-site-configuration- (15).jpg

Figure 10.15 The Auto option generates a standard switch called “srmpg-recovery-plan.”

IMPORTANT NOTE: Occasionally, when a test hangs or goes seriously wrong, in previous versions of SRM I’ve seen the cleanup phase fail, leaving the vSwitch/port group behind. It is safe to remove them manually once the test has completed. If your Recovery Plan fails halfway through the process, you can have components left over. For this reason, SRM 5.0 now includes a separate process to run a cleanup. If this cleanup fails for whatever reason, the cleanup plan can be run in a Forcible Cleanup mode. I have found this cleanup process to be very reliable, and the separation of the Recovery Plan from the cleanup process offers more flexibility than previous releases of SRM.

Practice Exercise: First Recovery Plan Test

To practice creating a Recovery Plan, follow these steps.

1. Log on with the vSphere client to the Recovery Site’s vCenter.

2. Click the Site Recovery icon.

3. Open the Recovery Plans tab.

4. Select your plan. In my case, the plan was called “Complete Loss of Site Plan – Simple Test.”

5. Click the Test button.

6. In the dialog box that opens, select whether you want to replicate recent changes.

WARNING: Do not click the Recovery button. This actually invokes DR proper. Unless you are in a lab environment, you will need to seek higher approval from the senior management team in your organization to do this.

During the test phase, the icon of the Recovery Plan will change, and the status will change to Test In Progress, as shown in Figure 10.16.

Once all the VMs are in the process of being powered on, the process will progress to around 56% and SRM will wait for a successful response from the VMware Tools heartbeat service, indicating that the plan has been successful. At that point, the Recovery Plan will change to a Test Complete status indicated by a green arrow, as shown in Figure 10.17.

This icon will be accompanied by a “message” event in the Recovery Plan, as shown in Figure 10.18. Messages can be viewed in any of the tabs of a Recovery Plan. The administrator can add these messages to pause the plan to allow for some kind of manual intervention in the recovery steps, or to confirm that a recovery phase has completed successfully. (We will discuss custom messages in more detail in Chapter 11.) At this point, the test of your recovery has completed and the recovery VMs should be powered on. Of course, it is useful to examine any errors as a way to troubleshoot your configuration.

For example, I had two errors: The first appears to be benign, and the second is a configuration error. The first error indicated a timeout waiting for a heartbeat signal from the VM called db02. I checked the VM at the Recovery Site, and found that VMware Tools was running but was not up-to-date. The second error was caused by a mistake in my inventory mappings for a template. I had not correctly allocated a valid port group for my template VM in the inventory. I had to let the Recovery Plan complete before I could review my inventory mappings.

Recovery-site-configuration- (16).jpg

Figure 10.16 Pay close attention to plan status information; it can prompt you for the next step in the process, such as clean up or reprotect.

Recovery-site-configuration- (17).jpg

Figure 10.17 A successfully completed plan

Recovery-site-configuration- (18).jpg

Figure 10.18 A “message” event in the Recovery Plan

Cleaning Up after a Recovery Plan Test

SRM 5.0 introduces a new and subtle change to the way Recovery Plans work. Previously, the Recovery Plan included a built-in cleanup phase. The Recovery Plan would be halted by a message step, leaving the administrator with a “continue” button. Once the button was clicked, the Recovery Plan would clean up after the test. In this release, VMware decided to separate these steps from each other, to give SRM administrators more control over when the cleanup phase begins. The secondary use of the cleanup option is to force a cleanup if a Recovery Plan has gone wrong due to, for example, a failure of the vCenter or SRM service, or (more likely) an unexpected error at the storage layer, such as a failure to communicate with the array at the Recovery Site. In the past, if something went wrong during the plan it would be up to the SRM administrator to manually clean out any objects created during the Recovery Plan. A good example of this is manually deleting the temporary SvSwitch that was created when you used the Auto feature for handling the network during recovery.

If you’re happy with the success of your test, you can trigger the cleanup process by clicking the Cleanup button or link (see Figure 10.19).

This will produce a pop-up dialog box asking you if you want to proceed with the cleanup process. Notice how the dialog box in Figure 10.20 has the option to Force Cleanup if something has gone wrong with the test of the plan.

Recovery-site-configuration- (19).jpg

Figure 10.19 The Cleanup button is used to power off and unregister VMs, as well as remove the temporary snapshot created during the test.

Recovery-site-configuration- (20).jpg

Figure 10.20 The Force Cleanup option is only available when the first attempt to clean up has failed.

This cleanup phase will carry out three main steps at the Recovery Site, as shown in Figure 10.21.

First, all the VMs recovered during the test will be powered off. Then, if VMs were suspended at the Recovery Site (to save precious memory resources, for example), these VMs are resumed. Finally, the temporary writable snapshot of the replicated volume is unmounted from the ESX hosts; this will normally trigger a rescan of the ESX host HBAs if you are using Fibre Channel or iSCSI-based storage. This reset of storage actually includes a number of steps. In the case of iSCSI-based VMFS volumes, you will first see SRM unmount the iSCSI volumes once the recovered VMs have been powered off and unregistered. Next, the iSCSI LUN will be detached and any static iSCSI targets will be removed from the ESX iSCSI initiator. Finally, SRM will rescan all the HBAs of the ESX hosts and carry out a general refresh of the storage system. Figure 10.22 shows the list of cleanup activities.

Recovery-site-configuration- (21).jpg

Figure 10.21 The steps used in the cleanup phase of carrying out a test

Recovery-site-configuration- (22).jpg

Figure 10.22 The complete list of cleanup tasks for a Recovery Plan test

Controlling and Troubleshooting Recovery Plans

This section addresses management of your Recovery Plan.

Pause, Resume, and Cancel Plans

In previous releases of SRM, you were able to pause, resume, and stop a Recovery Plan. These features have been deprecated in this release. Manually canceling a test was not without consequences: If you didn’t allow the system to complete the test, SRM could be left in a pending state where it thought a test was still running when it had in fact been canceled. I think this is why VMware removed these options; either they created more problems than they solved, or other methods were deemed more efficient. With that said, there is still an option to cancel a plan if you need to, perhaps because you have encountered a serious problem or tested a plan accidentally. If you do cancel a plan, you will most likely have to run a cleanup process, and in many cases that will require the new forcible cleanup capability. If possible, you should let the plan continue normally.

The resume option appears only if the SRM service has failed while testing a plan at the Recovery Site. Once the SRM service is restarted, you will be given the option to resume the plan.

Personally, if I want to pause or resume the progress of a Recovery Plan regularly, I always prefer to add a message to the plan at the appropriate point. For example, the SRM service can issue a “Test Interrupted” message if it detects a serious error, such as an inability to access a replicated LUN/snapshot, or if it believes another test is in progress or has hung. The screen capture in Figure 10.23 shows such a situation where the SRM service failed on my New Jersey SRM server.

The problem can be quickly remedied by restarting the SRM service using the Services MMC, or with the command net start vmware-dr. If the SRM service fails in the middle of testing a plan, once you reconnect to the service using the vSphere client you will be prompted with the error message shown in Figure 10.24.

Generally, rerunning the plan as indicated by the message is enough to fix the issue. However, sometimes this may be unsuccessful. For example, in one of my tests the recovery process worked but stopped midway through. Although the VMs had been registered with the temporary writable snapshot, they were still seen by the SRM host in the Recovery Site as though they were placeholder VMs complete with their special icon. As vCenter saw these VMs as only placeholders, vCenter could not power them on. Simply rerunning the plan didn’t fix the problem; it just started another Recovery Plan that wouldn’t complete. In the end I fixed the issue by manually unregistering each VM and then reregistering it. This allowed the plan to finish, and I was able to carry out a forcible cleanup of the plan. Once that process was completed, I was able to remove these VMs from the vCenter inventory and repair the placeholders that had not been restored by the Recovery Plans. Problems like this are rare, but it’s important to know they are recoverable. The way to achieve this is to understand what is happening under the covers of SRM, so if it does fail you know how to handle the cleanup phase manually as well.

Recovery-site-configuration- (23).jpg

Figure 10.23 A failure of the SRM service while testing a Recovery Plan. Once the SRM service has restarted, you can resume the test.

Recovery-site-configuration- (24).jpg

Figure 10.24 The message prompt generated if the test is interrupted for whatever reason

Error: Cleanup Phase of the Plan Does Not Always Happen with iSCSI

The cleanup and reset phase of the test plan does not always automatically stop access to the replicated iSCSI LUN/volumes. In my experience of using SRM, it’s not unusual to see that the replicated LUN/volume is still listed under datastores for the recovery ESX hosts after a test has completed. This was especially true in vSphere 4. In vSphere 5 and SRM 5 I have seen marked improvement in the process of unpresenting iSCSI volumes to the ESX host.

Of course, between one test and another a new snapshot may be created, and by default SRM will always prefer to use the most recent snapshot. However, some SRAs do not deny access to the older snapshot after the test has completed. This can lead to a situation where the VMFS volume remains visible to the ESX host after the test is finished.
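That default preference amounts to “pick the newest replica.” The sketch below is purely illustrative (my own names and dates, not code from any vendor’s SRA), but it captures the selection behavior described above:

```python
from datetime import datetime

def select_test_snapshot(snapshots):
    """Pick the replica snapshot SRM would prefer by default: the most recent.
    snapshots: list of (name, created_datetime) tuples."""
    return max(snapshots, key=lambda snap: snap[1])

# Hypothetical replicas; "test1-virtualmachines" is the renamed older snapshot
# from the workaround discussed below.
replicas = [
    ("test1-virtualmachines", datetime(2012, 1, 10, 2, 0)),  # older snapshot
    ("virtualmachines",       datetime(2012, 1, 11, 2, 0)),  # newest snapshot
]
```

If the older snapshot is still presented to the host, renaming it keeps the two datastores distinguishable when the newer one is mounted.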

The actual cause of this is quite tricky to explain, as it depends on when the test plan was run relative to the cycle of replication adopted at the storage array. The error occurs if SRM fails to resignature both volumes. There is an easy workaround to this issue: Rename your “older” snapshot VMFS volume to something else, such as “test1-virtualmachines.” This should allow the additional snapshot to be presented without the rename annoyance recurring.

You might be interested to know why this problem specifically afflicts iSCSI, and does not affect Fibre Channel SAN or NFS systems. I saw this a number of times back in the days of ESX 3.5 and ESX 4.0, with or without SRM; even if you make big changes to the ESX host (changing its IQN, blocking the iSCSI port 3260, or denying access to an iSCSI volume), the mounted iSCSI volumes persist. While I was at VMworld 2009, it was my very good fortune to finally meet VMware’s Andy Banta. I’d known Andy for some time via the forums but hadn’t met him face to face. Andy was part of the team that developed the ESX host’s iSCSI stack. So, after the event, I made a point of asking Andy why this happens with iSCSI and ESX. It turns out that it all hinges on how your storage vendor implements iSCSI, and that ESX hosts will keep iSCSI sessions alive because it is too dangerous to simply tear them down when they could be in use by other targets. Here is Andy’s explanation, albeit with a little tidying up by yours truly:

First off, there’s a distinction to be made: CLARiiONs serve LUNs, whereas HP PS4000 and EqualLogic systems present targets. The behavior on a CLARiiON should be the same for both FC and iSCSI: LUNs that go away don’t show up after a rescan.

What you’re seeing are the paths to targets remaining after the target has been removed. In this case, there are no sessions to the targets and the system no longer has any access to the storage.

However, ESX does hang on to the handle for a path to storage even after the storage has gone away. The reason for this is to prevent transient target outages from allowing one piece of storage to take over the target number for another piece of storage. NMP uses the HBA, target, and channel number to identify a path. If the paths change while I/O is going on, there’s a chance the path won’t go to the target that NMP expects it to. Because of this, we maintain a persistent mapping of target numbers to paths.

We also never get rid of the last path to storage, so in this case, since SRM used the snapshot as storage, the ESX storage system won’t get rid of it (at least for a while). In 3.5 we used an aging algorithm to let us know when we could reuse a target ID. What Dell EqualLogic finally ended up recommending to customers for 3.5 was to rescan after removal ten times (our length of aging before we gave up on the target).
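Andy’s description of persistent target IDs and aging can be modeled in a few lines. This is a toy illustration of the idea only, not ESX source code; the constant of ten rescans is taken from the 3.5-era behavior he describes:

```python
AGE_LIMIT = 10  # rescans before a dead target's ID may be reused (per the 3.5 advice)

class TargetMap:
    """Toy model of NMP's persistent target-number-to-path mapping."""
    def __init__(self):
        self.paths = {}  # target_id -> (path, age); age counts rescans since last seen

    def rescan(self, visible_paths):
        # Paths seen in this rescan reset their age; unseen paths age,
        # and their target ID is only released once they age out.
        for tid, (path, age) in list(self.paths.items()):
            if path in visible_paths:
                self.paths[tid] = (path, 0)
            elif age + 1 >= AGE_LIMIT:
                del self.paths[tid]          # finally give up on the target
            else:
                self.paths[tid] = (path, age + 1)
        # Newly visible paths get the lowest target ID not currently held,
        # so a transient outage can't hand one target's number to another.
        for path in visible_paths:
            if path not in (p for p, _ in self.paths.values()):
                tid = next(i for i in range(256) if i not in self.paths)
                self.paths[tid] = (path, 0)
```

In this model a removed target’s ID survives nine rescans and disappears on the tenth, which is why the advice for ESX 3.5 was to rescan ten times after removing a volume.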

Error: Loss of the Protection Group Settings

Occasionally, I’ve seen Recovery Plans lose their awareness of the storage setup. An SRM administrator deleting the Protection Group at the Protected Site usually causes this. If the administrator does this, all the placeholder VMs disappear. The Recovery Plan then becomes “orphaned” from the storage configuration at the other location—and doesn’t know how to contact the storage array to request access to the replicated volumes/LUNs. Essentially it becomes a plan without any VMs to recover. If you delete a Protection Group or disable the last Protection Group within a plan, the plan’s status will be updated to reflect this error state (see Figure 10.25).

You can fix this problem by reconfiguring the Recovery Plan and ensuring that it can see the Protection Group(s). If the Protection Group has been deleted, you will need to re-create it at the Protected Site.

1. Right-click each Recovery Plan.

2. Choose Edit.

3. Click Next to accept the existing plan site location.

4. Ensure that the checkboxes next to the affected Protection Groups are selected.

Recovery-site-configuration- (25).jpg

Figure 10.25 Recovery Plans can have their Protection Groups removed from them, resulting in this error message.

5. In the Edit Recovery Plan, set the options to handle networking when you run a test.

6. Click Finish.

As you can see, casually removing or deleting Protection Groups has a huge, huge impact on your configuration, essentially destroying the hard work you put into creating an effective Recovery Plan. For this reason you should, at all costs, avoid the Windows Administrator solution to all problems, which is “Let’s just delete it and add it back in, and see if it works again.” I highly doubt you will pay much heed to this warning—until you experience the problem firsthand. Like me, you probably learn as much from your mistakes as from your successes. In fairness you could say that the software should at least warn you about the impact of removing or deleting Protection Groups, and to some degree I think that’s a fair comment.

Error: Cleanup Fails; Use Force Cleanup

When a Recovery Plan fails, you might find that the test completes (with errors), and that when you run the cleanup part of the plan it fails as well (see Figure 10.26).

Recovery-site-configuration- (26).jpg

Figure 10.26 The error message if the cleanup failed

In my case, this was caused by a problem in presenting a writable snapshot to the recovery hosts. This was triggered by an error in the replication setup in one of my arrays. When the Recovery Plan was run no volume was presented to the host, and the cleanup failed as it attempted to destroy a volume that didn’t exist (see Figure 10.27).

If this happens, you will have no choice but to run the cleanup process again, using the Force Cleanup option (see Figure 10.28).

Error: Repairing VMs

Occasionally, I’ve seen errors on both my Protection Groups and my Recovery Plans caused by an ESX host problem, in which the placeholder VMs for the affected VMs become “disconnected” or connectivity to the placeholder datastores is lost. Generally, I have to restart the management services on the ESX host, and then use the Restore All or Restore Placeholder option to fix the problem. It’s important to resolve the cause of the problem before clicking the Repair All option; I’ve seen SRM wait a very long time before it gives up on the process.

Error: Disconnected Hosts at the Recovery Site

Before embarking on a test or a run of a Recovery Plan, it’s worth confirming that your ESX hosts in vCenter are active and are not displaying any errors or warnings, for a number of reasons. Any ESX host that is unavailable in a VMware cluster can be excluded by DRS during initial placement, and so will not be selected as the target for powering on VMs. That could significantly reduce the amount of compute resources at the Recovery Site, and as a consequence degrade the overall effectiveness of the Recovery Plan or the quality of service provided by the Recovery Site. In some cases, a disconnected host can cause unwanted errors in a Recovery Plan. For example, in one of my Recovery Plans I asked for a VM in the Recovery Site to be suspended. Regrettably, at the time I ran the plan an ESX host had become disconnected in the vCenter inventory—and that host was running the VM that was the target to be suspended. This caused the Recovery Plan error shown in Figure 10.29.

Recovery-site-configuration- (27).jpg

Figure 10.27 The system failing to remove the snapshot on one of my arrays

Recovery-site-configuration- (28).jpg

Figure 10.28 Failure to clean up successfully will result in a Force Cleanup.

Recovery-site-configuration- (29).jpg

Figure 10.29 Failure to suspend a VM called “test01” because the host it was on was disconnected from vCenter

Recovery Plans and the Storage Array Vendors

In this section I will expose the changes taking place at the storage layer when a Recovery Plan is tested. A number of changes take place, and of course they vary significantly from one array vendor to another. On a day-to-day basis you shouldn’t have to concern yourself with this functionality—after all, that’s the point of SRM and the SRA from your storage vendor. However, I feel it’s useful to know more about these processes, for a few reasons. First, I think it’s interesting to know what’s actually going on under the covers. Second, as the SRM administrator you may need to explain to your storage team the changes that are taking place, and the more you understand the process the more confidence they will have in you. Third, if something does go wrong with the SRA’s interaction with the storage array, you may have to manually clean up the array after a test or a run of a Recovery Plan if the plan fails or if the cleanup process within SRM is not successful.

Dell EqualLogic and Testing Plans

During the test of a Recovery Plan, you should see that the new VMFS volume is mounted and resignatured by the Dell EqualLogic SRA (see Figure 10.30).

The snapshot in Figure 10.30 relates to the snapshot in Figure 10.31 of the volume being replicated at the New Jersey group.

Remember, for Recovery Plans to work with iSCSI you will need to ensure that the ESX hosts in the Recovery Site are configured for iSCSI, and critically, that they have a target IP set so that they can communicate with the iSCSI array in the Recovery Site. If you don’t do this, you will receive an error message in the Recovery Plan similar to the following.

Error – Failed to create snapshots of replica devices. Failed to create snapshot of replica device 8d90d4d838-virtualmachines. 1. No initiators were found for hosts.

Recovery-site-configuration- (30).jpg

Figure 10.30 As seen earlier, a snapshot of a Dell EqualLogic volume

Recovery-site-configuration- (31).jpg

Figure 10.31 The temporary snapshot called “test-failover,” created at the New Jersey Recovery Site

EMC Celerra and Testing Plans

During the test of a Recovery Plan, you should see that the new VMFS volume is mounted and resignatured by the Celerra SRA. In my case, this was a LUN with an ID of 128 at the ESX host (see Figure 10.32).

You can see this from the Unisphere management system, under the Recovery Site array and the sharing and iSCSI options (see Figure 10.33).

In the same window, under the LUN Masks tab, you can see that LUN 128 has been presented to the ESX hosts in the Recovery Site (see Figure 10.34).

Recovery-site-configuration- (32).jpg

Figure 10.32 Snapshot of an EMC Celerra iSCSI volume

Recovery-site-configuration- (33).jpg

Figure 10.33 The Unisphere management system and creation of a temporary snapshot with a LUN ID of 128

Recovery-site-configuration- (34).jpg

Figure 10.34 The LUN masks

Remember, for Recovery Plans to work with iSCSI you will need to ensure that the ESX hosts in the Recovery Site are configured for iSCSI, and critically, that they have a target IP set so that they can communicate with the iSCSI array in the Recovery Site. If you don’t do this, you will receive an error message in the Recovery Plan similar to the following.

Error – Failed to create snapshots of replica devices. Failed to create snapshot of replica device 8d90d4d838-virtualmachines. 1. No initiators were found for hosts.

NetApp and Testing Plans

With a NetApp FAS array you should see, when you’re testing plans, that the ESX hosts in the Recovery Site mount the replicated NFS datastore automatically. With NFS there is no resignature process to be concerned with, as VMware’s file system (VMFS) is not in use. In the screen grab in Figure 10.35 you can see that the recovery hosts have both mounted the NAS datastore. The name of the datastore is exactly the same as the name in the Protected Site location.

The NetApp filer should also create a temporary FlexClone of the volume that contains the replicated VMs. This allows the existing SnapMirror schedule to continue working during the period of the test, which means SRM tests can be carried out during operational hours without affecting replication. Using FilerView on the NetApp FAS you should see, under +Storage and +Volumes, that a “testfailoverClone” is created for each volume you have set up for SnapMirror that is configured in a Recovery Plan in SRM. This temporary FlexClone is deleted from the array when the test is completed (see Figure 10.36).

Recovery-site-configuration- (35).jpg

Figure 10.35 The mounting of an NFS-based snapshot from a NetApp system. The mount point name remains the same as in the Protected Site.

Recovery-site-configuration- (36).jpg

Figure 10.36 A temporary snapshot called “testfailoverClone_nss_v” in NetApp System Manager


Summary

In this chapter I explained how to run a Recovery Plan. In fact, that’s been my intention from the very beginning of this chapter, believe it or not, as I think that seeing a product “in action” is the quickest way to learn it.

As you saw, clicking the Test button generated a great deal of activity and changes; VMware’s SRM is a very dynamic product in that respect. My hope is that you are following my configuration for real while you read. I know that is a big thing to ask, so if you were not in this fortunate position I highly recommend that you watch the accompanying video. No matter how much I screen-grab and document what happens in text, you won’t really get a feel for running a Recovery Plan unless you work your way through the process yourself or watch a video of the event.

I also tried to explain some of the “strangeness” you might see in the SRM product. In fact, it’s not strangeness at all; it’s by design. It’s all about understanding how your storage layer’s cycle of replication interacts with and intersects your activity in SRM. Critically, the quality of your vSphere build will have an impact on your SRM implementation. For example, suppose you incorrectly scaled your VMware cluster environment such that the cluster was so overloaded or misconfigured that you began to trigger “admission control.” VMware uses the term admission control to describe a condition statement that is applied to resources. It essentially asks the system, “Are there enough resources in the cluster for both performance and failover to satisfy the VM’s requirements?” By default, if the answer to this question is no, the VM will not be powered on and you will see something similar to the error shown in Figure 10.37.
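The admission control question boils down to a simple resource check. The sketch below is my own illustration with made-up numbers and parameter names, not VMware’s actual algorithm:

```python
def admission_control(cluster_capacity, already_reserved, failover_reserve, vm_reservation):
    """Illustrative admission-control condition: is there enough unreserved
    capacity left, after holding back the failover reserve, to power on a VM
    with this reservation? All units are illustrative (e.g., MHz)."""
    available = cluster_capacity - already_reserved - failover_reserve
    return vm_reservation <= available

# A cluster with headroom admits the VM; an overloaded one refuses to power it on.
healthy    = admission_control(10000, 6000, 2000, 1000)
overloaded = admission_control(10000, 8000, 2000, 1000)
```

When the check fails, the VM simply is not powered on, which is exactly the behavior that surfaces during a Recovery Plan as the error in Figure 10.37.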

Anyway, it’s time to move away from this behind-the-scenes approach to Recovery Plans now that you are up to speed with the principles. In the next chapter you will spend most of your time creating custom plans that leverage all the features of SRM, so that you can test your DR plans against one another and for different scenarios. So far this book has been about getting the SRM product to work; the next chapter is really about why your organization bought the product in the first place.

Recovery-site-configuration- (37).jpg

Figure 10.37 The type of error you will see if the cluster lacks the performance and failover resources to satisfy the VM