Custom Recovery Plans

From vmWIKI

Originating Author

Michelle Laverick




Version: vCenter SRM 5.0

So far we have simply accepted the default settings from the Recovery Plan. As you might expect and hope, it is possible to heavily customize the Recovery Plan. Customized Recovery Plans allow you to control the flow of the recovery process. Together with customized virtual machine mappings they allow you to completely automate the common tasks performed when invoking your DR plan. The creation of multiple Recovery Plans with different settings allows you to deal with different causes and scenarios that trigger the use of your Recovery Site—and additionally allows you to test those plans to measure their effectiveness. With custom Recovery Plans you can control and automate a number of settings. For example, you can

• Power on virtual machines at the Recovery Site by priority or order

• Create VM dependency groups that reflect the relationships between VMs

• Control VM start-up times and timeouts

• Suspend VMs at the Recovery Site that are not needed

• Add steps to a Recovery Plan to send a prompt or run a script

• Invoke scripts in the guest operating system

• Change the IP settings of the virtual machines

• Set custom per-VM inventory mappings

Additionally, in this chapter I will delve a little deeper into SRM to discuss managing changes at the Protected Site, and cover how changes in the operating environment impact your SRM configuration. I also will use this chapter to analyze the impact of changes to the storage on your SRM configuration. In short, I will be examining these types of issues:

• Creating/renaming/moving vCenter objects at the Protected/Recovery Site

• Using VMware’s raw device mapping (RDM) feature

• More complicated storage scenarios where there are VMs with multiple virtual disks, held on multiple VMFS datastores with potential use of VMFS extents

• Adding new virtual machines, new storage, and new networks

• The impact of cold migration with file relocation and Storage vMotion

Before I begin, it’s worth mentioning that some of the events that occur in the Recovery Plan will be executed depending on whether you are merely testing your Recovery Plan or actually running failover for real. For example, when you run a plan in Planned Migration mode, SRM will gracefully power down virtual machines at the Protected Site. This never happens in a Recovery Plan test, because it would impact your production operations.

Essentially, the next couple of pages of this book cover everything you can do with Recovery Plans. Once you know you can include scripts embedded in your plans the sky is really the limit. Your plan could include scripted components that affect a physical system or even modify the settings of the very VMs you are recovering. For now it’s important to notice that the scope of our customization is limited to what happens when you run a test of a Recovery Plan. We do have other controls that allow us to control the cleanup process, recovery steps (when you execute the plan for real), and reprotect steps (carried out after a failover). I will be covering this aspect of Recovery Plans as we move on to look at real failover and failback.

SRM 5.0 has a new View component, shown in Figure 11.1, that allows you to see the four main types of “steps” associated with the core features of SRM: test, cleanup, recovery, and reprotect. If you toggle between these views you can see a high-level overview of what’s involved in each process.

Custom-recovery-plans- (01).jpg

Figure 11.1 The new View pull-down list allows you to see the different steps that SRM is currently capable of carrying out.

Controlling How VMs Power On

It’s this aspect of the Recovery Plan that many regard as being one of the most important configuration steps to complete. Our virtual machines in the Recovery Site must be brought online in the correct order for multitier applications to work. Core infrastructure systems such as domain controllers and DNS will need to be on-stream first. Most people have these services up and running in the Recovery Site before they even begin, and the services are not protected by SRM but use their own internal systems for availability, such as Microsoft Active Directory replication and integrated DNS transfers. Perhaps the earliest systems to be recovered are email and database systems. Those database services will no doubt be using domain accounts during start-up, and without the core Microsoft Active Directory and DNS services available, they are likely to fail to start. Additionally, there’s little point in bringing up front-end services such as Web servers if the back-end services with which they have dependencies are not yet functional. In fact, that’s the shorthand we normally use for this concept: “service dependencies.” It is often the case that for VM3 to function, VM1 and VM2 must be running, and for VM2 to work, VM1 must be started first, and so on.

Of course, determining the exact order you need for your plan to be successful is beyond the scope of this book, and is highly specific to your organization. SRM comes with a whole gamut of features to enable you to configure your VMs correctly, including the ability to configure such features as priorities, VM Dependencies, and start-up times. These features are all designed to ensure that VMs come up in a graceful and predictable manner. Therefore, one of the major tasks before you even think of deploying SRM is to appreciate how your applications and services work. In larger organizations this will often mean liaising with the application owner. Don’t imagine for a moment that this will be an easy process, as regrettably, many application owners don’t really understand the service dependencies surrounding the applications they support. Those that do may have already implemented their own “in-guest” availability solution that they will want to “stretch” across two sites if they have not done so already. You might find that these application owners are hostile or skeptical about protecting their applications with SRM. In an enterprise environment, expect to see as many different availability solutions in use as there are business units. VMware does have technologies you can use to assist in this process, such as Application Discovery Manager, which can map out the dependency relationships, both physical and virtual, that make an application work.

Configuring Priorities for Recovered Virtual Machines

Priorities replace the use of the High, Low, and Normal settings which were available in the previous versions of SRM. Without this feature configured, as you might have seen, the virtual machines are more or less powered on randomly, though they are all contained under priority group 3. In this release, SRM introduces five priority levels, with 1 being the highest and 5 being the lowest, and with VMs being powered on relative to their location in the priority group. If it helps, you could regard each priority as representing different tiers of applications or services in your environment. If you can, try to map the VM priorities with the business priorities within the organization.

SRM, in conjunction with vCenter, attempts to start all the VMs within one priority group before starting the next batch of VMs within another priority group. By default, all the VMs are started in parallel unless ordered with the new VM Dependencies feature, which I will cover shortly. VM dependencies replace the old method of ordering the start-up of VMs that involved the use of up and down arrows to move the VM around in the plan. So, regardless of which priority order you use (1, 2, 3, 4, or 5) the VMs are started up in parallel, not serially, unless VM dependencies are configured.
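To make the batching concrete, here is a minimal Python sketch of the idea, not SRM's actual implementation. It assumes a hypothetical inventory expressed as a dictionary of VM name to priority level, and simply groups the VMs so that each priority level is handled as one batch before the next begins. The function name and the sample VM names are illustrative only.

```python
from collections import defaultdict


def group_by_priority(vm_priorities):
    """Return VM names batched by priority level, highest priority (1) first."""
    groups = defaultdict(list)
    for vm, priority in vm_priorities.items():
        # SRM 5.0 offers five levels: 1 is the highest, 5 the lowest.
        if priority not in range(1, 6):
            raise ValueError(f"{vm}: priority must be 1-5, got {priority}")
        groups[priority].append(vm)
    # Sort the levels so batch 1 is handled before batch 2, and so on.
    return [(level, sorted(groups[level])) for level in sorted(groups)]


# Hypothetical inventory; a VM left unconfigured would sit in priority group 3.
plan = {"db01": 1, "db02": 1, "web01": 2, "web02": 2, "fs01": 3}
for level, vms in group_by_priority(plan):
    print(f"Priority {level}: power on in parallel -> {', '.join(vms)}")
```

Within each batch the sketch imposes no order at all, which mirrors the default parallel start-up; ordering inside a batch is the job of VM dependencies.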

For me this makes perfect sense. After all, if every virtual machine was started up serially rather than in parallel, people with very large numbers of virtual machines would have to wait a very long time to get their virtual machines up and running. Some VMs can be powered on at the same time because there is no dependency between them; others will have to start first in order to meet our precious service dependencies. Of course, the use of parallel start-up by SRM does not mean that all VMs within a priority group are powered on simultaneously. The number of VMs you can power on at any one time is governed by the amount of available compute resources in the Recovery Site. SRM does not power on all the VMs simultaneously, as this runs the risk of overloading your VMware clusters with an excessive burst or demand for resources. SRM will work together with DRS to ensure the efficient placement of VMs. With this said, SRM cannot defy the laws of VMware clusters, and your admission control settings on the cluster and resource pools will take precedence. If you do try to power on more VMs than you have resources available—for example, if a number of ESX hosts are down and unavailable—expect to see admission control error messages within the Recovery Plan, as shown in Figure 11.2.

Whatever you do, remember that you cannot fit War and Peace on a postage stamp. By this I mean you cannot power on your entire estate of VMs from a 32-node cluster on a two-node ESX cluster at the Recovery Site. The buck will have to stop somewhere, and in the world of virtualization, that usually means memory. You’d be shocked at the number of customers who don’t think along these lines.
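A rough back-of-the-envelope check can make this point before you ever run a test. The Python sketch below is only an illustration under simplifying assumptions: it compares configured VM memory against physical host memory and ignores memory overcommitment, page sharing, and the exact admission control policy on the cluster. The function name, the optional reserve fraction, and all the figures are hypothetical.

```python
def fits_in_recovery_site(vm_memory_gb, host_memory_gb, reserve_fraction=0.0):
    """Compare total configured VM memory with usable memory at the Recovery Site.

    vm_memory_gb: mapping of VM name to configured memory in GB.
    host_memory_gb: list of physical memory sizes, one entry per ESX host.
    reserve_fraction: share of host memory held back (for example, for failover).
    """
    usable = sum(host_memory_gb) * (1 - reserve_fraction)
    needed = sum(vm_memory_gb.values())
    return needed <= usable, needed, usable


# Hypothetical estate: 400 VMs of 4 GB each, recovered onto two 256 GB hosts.
estate = {f"vm{n:03d}": 4 for n in range(400)}
ok, needed, usable = fits_in_recovery_site(estate, [256, 256], reserve_fraction=0.25)
print(f"needed={needed} GB, usable={usable} GB, fits={ok}")
```

If the check fails, the options are the ones the chapter goes on to describe: recover fewer VMs, leave some powered off, or suspend noncritical VMs at the Recovery Site.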

Custom-recovery-plans- (02).jpg

Figure 11.2 A poorly implemented vSphere deployment will result in problems within Recovery Plans such as an inconsistent Standard vSwitch configuration.

To change the priority level of a VM, follow these steps.

1. Click your Recovery Plan, and select the Recovery Steps tab.

2. Expand the priority that contains your VM.

3. Right-click the VM and select the Priority menu option, as shown in Figure 11.3.

4. Select the priority level appropriate for this VM.

Additionally, you can modify priority orders by using the Configure option of each VM, as shown in Figure 11.4.

Custom-recovery-plans- (03).jpg

Figure 11.3 You might find that priority levels meet your service dependency needs, without having to use VM dependencies.

Custom-recovery-plans- (04).jpg

Figure 11.4 Priority orders can be set on the properties of the VM listed in the Recovery Plan.

When the Recovery Plan is tested, all VMs, regardless of their priority status, will be taken through the primary step of storage configuration. They will then pause and wait for the power-on instruction.

Although these per-VM methods may be helpful for just one or two VMs that need to be moved, if you have many VMs that need to be relocated to the correct priority level you can use the Virtual Machines tab on the properties of the Recovery Plan, as shown in Figure 11.5. In the list that appears, you can use the Ctrl or Shift key to select the VMs you wish to move to a new priority.

Adding VM Dependencies

It’s unlikely that priorities alone will deliver the Recovery Plan you need. Critically, you will also want to control the start-up order to meet your service and application dependencies. Within a priority order, it’s possible to configure a power-on order on the property of a VM. The important point to know is that these new VM dependencies are ignored if you include a VM that is not in the same priority group. As an example, my priority 1 category has four VMs (db01, db02, web01, and web02); I want to configure them so that they power on in the correct order. You apply the VM dependency settings by selecting a VM, and on the properties of that VM, indicating which VMs will be started before it. So, if the databases (db01 and db02) must start before the Web servers (web01 and web02) I would select the Web servers in the UI first, and then use their configuration options to set the dependency. The way to look at it is the Web servers have a dependency on the database, so it’s the Web servers that need to be selected to express that relationship. To configure VM dependencies, follow these steps.

Custom-recovery-plans- (05).jpg

Figure 11.5 The bulk method of setting priority orders is a welcome enhancement in SRM 5.0.

1. Click your Recovery Plan and select the Recovery Steps tab.

2. Expand the priority that contains your VM.

3. Right-click the VM and select the Configure menu option.

4. In the dialog box that opens select VM Dependencies from the list, as shown in Figure 11.6.

5. Use the Add button to add a VM to the list to indicate that it must be started first, before the currently selected VM.

Custom-recovery-plans- (06).jpg

Figure 11.6 Express dependencies by selecting a VM and indicating which VMs must start before it. VMs must reside within the same priority group.

In my case, I indicated that db01 must be powered on before web01. Notice that if the power on of db01 failed, this would not stop the power on of web01. Although a failure to start db01 would create a warning in the Recovery Plan, it would not halt the recovery process. If I added to this list the VM called db02, both db01 and db02 would have to power on before SRM would power on web01. In that case, I would be stating that web01 had dependencies on two database systems (db01 and db02).

This process does reorder the VMs in the Recovery Plan. In my case, I repeated this VM dependency configuration for both db02 and web02; as a result, you can see in Figure 11.7 that the VMs are ordered in the priority list accordingly.

Sadly, in the Recovery Steps tab the UI’s 2D representation can give a false impression that this power-on process will happen serially, especially because the numbering system suggests that db01 will start first, followed by db02. In fact, in this case db01 and db02 will be started in parallel. Then, depending on which database completes its power on first (db01 or db02), the VM dependency will trigger the power on of either web01 or web02. For example, Figure 11.8 shows that when I ran this test for the first time, both web01 and web02 completed their storage configuration process before db01 and db02. They then waited for db01 and db02 (at steps 5.3.2 and 5.4.2) to complete their power-on process. The VM db02 was actually the first VM to complete the power-on process (at step 5.2.2), and as such, this triggered the power-on event for web02.
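Because the flat list hides the parallelism, it can help to sketch the relationships yourself. The Python below is a simplified model rather than SRM's own logic: it takes one priority group and a hypothetical mapping of each VM to the VMs that must start before it, drops any dependency on a VM outside the group (as described earlier, those are ignored), and returns "waves" of VMs that are free to start together. In a real plan each dependent VM is triggered as soon as its own parents are up, so the waves are a coarser picture than what you will observe, and the sketch does not model the fact that a failed parent only raises a warning.

```python
def power_on_waves(group, depends_on):
    """Split one priority group into waves; VMs in a wave can start in parallel."""
    members = set(group)
    # Dependencies on VMs outside this priority group are ignored.
    pending = {
        vm: {dep for dep in depends_on.get(vm, ()) if dep in members}
        for vm in group
    }
    started, waves = set(), []
    while pending:
        # A VM is ready once every VM it depends on has already started.
        ready = sorted(vm for vm, deps in pending.items() if deps <= started)
        if not ready:
            raise ValueError("circular dependency among: " + ", ".join(sorted(pending)))
        waves.append(ready)
        started.update(ready)
        for vm in ready:
            del pending[vm]
    return waves


group = ["db01", "db02", "web01", "web02"]
# "dc01" is a hypothetical VM in a different priority group, so it is ignored.
deps = {"web01": ["db01"], "web02": ["db02"], "db01": ["dc01"]}
print(power_on_waves(group, deps))
```

For the four-VM example in this section, the two databases land in the first wave and the two Web servers in the second, which matches the behavior described above better than the serial numbering in the UI suggests.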

I think it would be nice if the UI of the recovery steps could be improved to show the VM dependency relationship, and for the VMs that will be started in parallel to be marked in some way; it is my hope that VMware will improve the UI in the future to make this more intuitive. If you want to see a list of your VM dependencies you can view them in the Virtual Machines tab of the Recovery Plan, as shown in Figure 11.9.

Custom-recovery-plans- (07).jpg

Figure 11.7 In this case, db01 and db02 will start simultaneously. Web01 and web02 will not start until the “parent” VM has started first.

Custom-recovery-plans- (08).jpg

Figure 11.8 Sadly, the rather 2D representation of VM dependencies makes it difficult to “visualize” the relationship between VMs.

Custom-recovery-plans- (09).jpg

Figure 11.9 The Virtual Machines tab allows the SRM administrator to see the VM dependencies that have been configured.

Configuring Start-Up and Shutdown Options

Our third and last method of controlling the boot process of a VM in a Recovery Plan is via the VM’s start-up and shutdown options. By default, all VMs that are configured in a Protection Group and enrolled in a Recovery Plan are powered on. It’s possible to turn off this behavior and allow a VM to be recovered but not powered on automatically. This would allow you to make a manual choice about whether a VM is available.

A good example of this is where you have an application that can be scaled out for performance and availability. Let’s say you have ten VMs, and in your capacity planning you assume you will only need seven up and running for production-level quality of service. You could protect all ten, but only power on seven. This would give you a “Plan B” on a number of levels. First, if one of the seven malfunctions in the recovery process for whatever reason, you would have three spare VMs ready to take its place. Second, if your capacity planning was incorrect, you would still have three additional VMs to power on to add capacity, assuming your vSphere environment has enough resources. If, however, the seven come up correctly without error, you have saved resources by not powering on an unnecessary number of VMs. So, by default, any VM that is protected is added to the Recovery Plan by virtue of being included in a Protection Group. If you do disable this automatic power on of a VM, it disables it for all Recovery Plans where it is present.

In addition to controlling whether a VM is recovered or recovered and powered on, there are also controls for how long the Recovery Plan waits until powering on the next VM. These are quite closely tied to the VM Dependencies feature we saw just a moment ago. After all, it’s one thing to say that db01 must be booted up before web01, but certain conditions control whether that is regarded as a successful boot process, not to mention the time it takes for the operating system to boot up and for the services it hosts to be available too. By default, the next VM powers on in the sequence based on the interplay of two values in the start-up options: time and the VMware Tools heartbeat service. The default behavior is that a VM will wait five minutes before proceeding with the next step in the plan, unless SRM receives a response from the VMware Tools heartbeat service sooner. In many cases, when you run a Recovery Plan the VMs will start in parallel so long as there are enough compute resources to allow this. But where a relationship has been made between VMs, as is the case with VM dependencies, the start-up conditions will come into play. It’s at this point that the start-up delay times become significant, as you may have a service that takes longer to start than normal, and longer to start than VMware Tools. In this case, you may need to adjust the gap between one power-on event and another. A classic example of this is a two-node Microsoft cluster pair that has been virtualized: one of the nodes in the MSCS cluster must come up and be started fully before the second partner joins the cluster. If the power-on process is not staged correctly, the cluster service will not start properly, and in some extreme cases this could cause the corruption of data.
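The interplay between the timeout and the VMware Tools heartbeat can be summarized in a few lines. The Python sketch below is a simplified model of the rule as described above, not SRM's internal code: the function name and parameters are hypothetical, the five-minute default is taken from the text, and any additional configurable delay is deliberately left out.

```python
def seconds_until_next_step(tools_ready_after, timeout=300, wait_for_tools=True):
    """Return how long the plan waits on one VM before moving to the next step.

    tools_ready_after: seconds until the VMware Tools heartbeat is seen,
                       or None if it never arrives.
    timeout:           maximum wait in seconds (five minutes by default).
    wait_for_tools:    False models clearing the "VMware Tools are ready" tick.
    """
    if wait_for_tools and tools_ready_after is not None:
        # The heartbeat ends the wait early, but never later than the timeout.
        return min(tools_ready_after, timeout)
    # No heartbeat check, or no heartbeat at all: the full timeout elapses.
    return timeout


print(seconds_until_next_step(45))                        # heartbeat arrives early
print(seconds_until_next_step(None))                      # heartbeat never arrives
print(seconds_until_next_step(45, wait_for_tools=False))  # heartbeat check disabled
```

The model shows why a service that starts more slowly than VMware Tools needs extra care: the heartbeat ends the wait as soon as Tools responds, whether or not the application inside the guest is actually ready.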

Finally, start-up actions in SRM also control the shutdown process. As I hope you saw when you ran a test of a Recovery Plan, the VMs that were recovered by the test were shut down gracefully by vSphere using a Guest Shutdown option. These shutdown options only apply when you clean up the plan. Some VMs do not respond well to this command, so you may wish to carry out a hard power-off for certain VMs that react in this way. As with start-up actions, shutdown actions also have a timeout configuration that allows the next VM to be shut down, based on a condition controlled by time, or when a VM has been powered off completely. By selecting a VM in the plan and clicking the Configure option in the menu, you can control these start-up and shutdown parameters, as shown in Figure 11.10.

Beware of removing the tick next to “VMware Tools are ready.” Without the heartbeat check, SRM waits out the full timeout (five minutes by default) before a VM is classed as completed, which can add a significant wait time before the next VM is powered on. Remember, too, that it is entirely possible to recover a VM but not power it on.

If you want to change the start-up action in terms of whether a VM is powered on or off, you can use the standard multiple select options in the Virtual Machines tab of a Recovery Plan, as shown in Figure 11.11.

Custom-recovery-plans- (10).jpg

Figure 11.10 SRM has a sophisticated set of power-on and guest shutdown options.

Custom-recovery-plans- (11).jpg

Figure 11.11 The Virtual Machines tab on a Recovery Plan allows for bulk changes of the power-on state of VMs.

Suspending VMs at the Recovery Site

As I mentioned earlier in the book, it is possible to suspend VMs at the Recovery Site. If you include this as part of your Recovery Plan, the selected VMs are suspended. By configuring this option, you are freeing up valuable compute resources, which should mean you can recover more VMs. Of course, suspending a VM is not free from a resource perspective; during the suspend process the selected VM’s memory contents are flushed to a suspend file. This process is similar to using the “hibernate” option that is available on most Windows-based PCs. So, critically, suspending VMs generates disk IOPS as the suspend file is created—and you need the spare disk space to accommodate the suspend file. I think the final point to make about this feature is that although a VM might seem trivial to you, bear in mind that it might be in use when you run a test! To configure VMs to be suspended, follow these steps:

1. Select the Recovery Plan in the SRM inventory.

2. Right-click the Suspend Non-critical VMs at the Recovery Site option. By default, this should be step 3, but adding additional steps above it can alter this number.

3. In the context menu, select the Add Non-Critical VM option to suspend VMs at the Recovery Site (see Figure 11.12).

4. In the Add Local VMs to Suspend window select the VMs you consider noncritical. In my case, I have just one, the Test & Dev VM, that will be suspended whenever the Recovery Plan is tested or run, as shown in Figure 11.13, but this dialog does allow for multiple selections.

When the plan is run you should see this as a successful event. You should also see that the selected VMs have the “paused” icon next to their names (see Figure 11.14). These VMs are “resumed” as part of the standard cleanup process when you have finished with your test. During this time, it is possible to manually resume the VMs that have been suspended (see Figure 11.15).

Custom-recovery-plans- (12).jpg

Figure 11.12 Selecting the Add Non-Critical VM option frees up spare CPU and memory resources for recovered VMs.

Custom-recovery-plans- (13).jpg

Figure 11.13 You can suspend test or ancillary VMs during a test or run of a Recovery Plan. These VMs are “resumed” during cleanup.

Custom-recovery-plans- (14).jpg

Figure 11.14 The “pause” icon on a VM that has been suspended by SRM

Custom-recovery-plans- (15).jpg

Figure 11.15 VMs that have been suspended by SRM can be manually resumed if needed.

Adding Additional Steps to a Recovery Plan

Within SRM you are not limited to the predefined steps within the Recovery Plan. It’s possible to add additional steps where you deem it appropriate. You can add additional steps to a Recovery Plan to prompt the operator running the plan, as well as to execute scripts to further automate the recovery process.

Adding Prompt Steps

It is possible to interrupt the flow of a Recovery Plan to send a prompt to the operator. If you have been following this book as a tutorial you will have seen this already. By default, every Recovery Plan includes a built-in message at the end of a test, informing you that the test has completed and prompting you to carry out the cleanup phase once you are satisfied with the results. In this case, the message is intended to give the operator an opportunity to review the results of the test and confirm/diagnose the success (or otherwise) of the Recovery Plan.

It’s possible to add your own messages to your custom Recovery Plan. In my case, I would like a message to occur after all my priority 1 virtual machines have been powered on. I want a message to appear before the priority 2 VMs are started, asking me to confirm that the primary VMs are up and functioning before allowing the other VMs to be powered on.

1. In the Recovery Plan select Power on Priority 2 VMs; by default, this would be step 6.

2. Click the Add Step icon, as shown in Figure 11.16. Alternatively, you can right-click and choose Add Step.

3. In the Add Step dialog box, select the radio button labeled “Prompt (requires a user to dismiss before the plan will continue).” Enter a friendly name in the Name field for the additional step in the plan, and then enter helpful text in the Content field, as shown in Figure 11.17. Remember that a real execution of the plan might not be carried out by the administrator who configures SRM, so these prompts must be meaningful to all concerned. With that said, you want to avoid anything overly specific that is open to change in the longer term, such as “Call John Doe on 123-456-7891,” as without proper maintenance this type of content can quickly become stale and meaningless.

Timeout settings do not apply to prompts, only to scripts, which we will configure shortly. Selecting the “Before selected step” setting will make my prompt occur before the priority 2 VMs are run. This will cause a global renumbering of all the steps that come after the prompt as well (see Figure 11.18).
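The renumbering is easy to picture as a list insertion. The following Python sketch is purely illustrative: the step names and numbers are a shortened, hypothetical plan rather than the real default step list, and the function is not part of any SRM API. It inserts a new step before or after the selected one and then renumbers everything from 1.

```python
def add_step(steps, new_step, selected, before=True):
    """Insert new_step relative to the selected step and renumber the plan."""
    index = steps.index(selected)
    if not before:
        index += 1  # "After selected step" places it one position later.
    updated = steps[:index] + [new_step] + steps[index:]
    # Every step from the insertion point onward receives a new number.
    return list(enumerate(updated, start=1))


# Hypothetical, shortened plan for illustration only.
plan = ["Power on Priority 1 VMs", "Power on Priority 2 VMs", "Power on Priority 3 VMs"]
for number, name in add_step(plan, "Prompt: Priority 1 Completed", "Power on Priority 2 VMs"):
    print(f"{number}. {name}")
```

The practical consequence is that any runbook or prompt text that refers to steps by number should be checked again whenever a step is added above them.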

When the plan is tested or run, it will halt at the prompt statement. You should see the icon for the Recovery Plan change (see Figure 11.19), as well as the status value being adjusted to Waiting for User Input.

Finally, it is possible to insert messages and commands on the properties of each virtual machine. In the Virtual Machines tab of the Recovery Plan, each virtual machine can be edited and per-virtual machine messages can be added. These are referred to as Pre-Power On Steps and Post Power On Steps, as shown in Figure 11.20. In fact, you can have as many steps as you need before or after a VM powers on. These run in the order they are listed, and there are small up and down arrows that allow you to reorder these steps on the VM.

Custom-recovery-plans- (16).jpg

Figure 11.16 The Add Step button can be used to add a message or command steps to a Recovery Plan.

Custom-recovery-plans- (17).jpg

Figure 11.17 Adding a message prompt

Custom-recovery-plans- (18).jpg

Figure 11.18 Adding prompt steps or command steps causes a renumbering within the Recovery Plan.

Configuration of these VM steps is almost the same as for the Recovery Plan steps. The only difference is that they are a property of the VM, and because a VM can have more than one step attached to it, the interface allows you to control their running order. I will be using these pre-power-on steps when we cover scripting later in this section. Figure 11.21 shows that the properties of a VM in the Recovery Plan change as you add message or command steps.

Custom-recovery-plans- (19).jpg

Figure 11.19 The Recovery Plan’s status has changed, indicating that the plan is waiting for a human operator to accept the message.

Custom-recovery-plans- (20).jpg

Figure 11.20 SRM demonstrates its granularity by allowing as many pre-power-on and post-power-on steps as you need.

Custom-recovery-plans- (21).jpg

Figure 11.21 A prompt step is indicated by an icon of a human, and a command step is indicated by an icon of a scroll or script with a checkmark.

Adding Command Steps

As with messages, it’s possible to add commands to the Recovery Plan. These commands could call scripts in a .bat, .cmd, .vbs, .wmi, PowerShell, or Perl file to automate other tasks. In fact, you’re not limited to these scripting languages; as long as the guest operating system supports the command-line engine you want to run, it should work. As I hope you can tell, SRM is all about providing a friendly environment within which you can test Recovery Plans or carry out datacenter moves. It’s unlikely that you will be able to get the plan to do what you want it to do without some level of scripting. So try not to see SRM as being a killer to scripts. Instead, try to see it as removing most of the heavy lifting required to get virtual recovery working, and as empowering you to spend more time handling the unique requirements of your organization. SRM is not a silver bullet.

When you call these scripts you must provide the full path to the script engine and the script file in question. For example, to run a Microsoft .bat or .cmd script you would supply the following path:

c:\windows\system32\cmd.exe /C c:\alarmscript.cmd

The path in this sample is relative to the SRM server. So, in this case, the script would be located on the C: drive of the Recovery Site SRM server. Of course, there’s no need to store your scripts on the disk of the SRM host. They could be located on a network share instead.

IMPORTANT NOTE: Although I’ve used an uppercase /C at the command line that is just a convention. The command will work with a lowercase /c. Also, remember that in the world of Windows, if you have filenames or directory names which contain spaces you will need to use “speech marks” around these paths.

These scripts are executed at the Recovery Site SRM server, and as a consequence you should know they are executed under the security context of the SRM’s “local system” services account. As a test, I used the msg command in Windows to send a message to another Windows system. In case you don’t know, the msg command replaced the older net send command and the older Messenger service. So, in the alarmscript.cmd file, I set up the msg script and stored it on the C: drive of the Recovery Site host:

@echo off

msg /server:mf 1 administrator "The Process has completed. Please come back to SRM and examine the Plan."

1. In the Recovery Plan select where you would like the step to be located. In my case, I chose “+ 6. Prompt: Priority 1 Completed.”

2. Click the Add Step icon.

3. Select the Command on SRM Server radio button, as shown in Figure 11.22. Enter a friendly name for the step, the path to the script command interpreter, and the script you would like to run.

4. Configure a timeout value that befits the duration that the script needs to execute successfully.

5. Select whether your step is triggered before or after your selection in the plan.

I sometimes add a pop-up message to my management PC at the end of the Recovery Plan (see Figure 11.23). This means I can start a Recovery Plan and go off and perform other tasks, such as checking my email, browsing Facebook, and tweeting my every thought in a stream-of-consciousness manner that James Joyce would be proud of. My pathetic attempts at social media will be conveniently interrupted by the pop-up message that reminds me I’m supposed to be doing real work.

Custom-recovery-plans- (22).jpg

Figure 11.22 A simple script that will use cmd.exe to run a basic Microsoft “batch” file

Custom-recovery-plans- (23).jpg

Figure 11.23 A command prompt has been added to the Recovery Plan.

Adding Command Steps with VMware PowerCLI

Although using classic-style “batch” files to carry out tasks is nice, it’s not an especially powerful API for manipulating and modifying the vSphere platform. If you want to make subtler scripts, you really need a more robust scripting engine. Fortunately, VMware has for some time embraced the Microsoft PowerShell environment, supplementing it with cmdlets that are specific to managing VMware vSphere and other technologies such as VMware View. If you are new to PowerCLI and PowerShell I heartily recommend following Alan Renouf’s “Back to Basics” series of tutorials that will get you quickly up to speed. Alan helped me with this chapter immensely, and some of his scripts are reproduced here with his kind permission.

PowerShell offers the SRM administrator a plethora of scripting opportunities. PowerShell could be used to send commands to services running in the Windows operating system, or VMware PowerCLI could be used to actually modify the properties of the Recovery Site or the VMs being recovered. Start by downloading and installing VMware PowerCLI to the SRM server in the Recovery Site. I think it is very likely that you will want to use these scripts in your failback process as well, so install PowerCLI to the Protected Site SRM server too.

After the installation, confirm that the PowerCLI execution policy is correctly set, and that you can successfully communicate with the vCenter for which the SRM server is configured. In order to allow PowerCLI scripts to execute silently I set the policy to RemoteSigned. You may need to consult your datacenter policies regarding this configuration, as you might find you need a higher level of trusted certificate signing for your scripts. Remember, your PowerCLI scripts may run a combination of both 32-bit and 64-bit cmdlets. These are controlled by separate execution policies, so make sure the policy is correctly configured for both types of cmdlets.

set-executionpolicy remotesigned

Next, I confirm that I can connect successfully to the vCenter server with the Connect-VIServer cmdlet, like so:

connect-viserver vcnj -username corp\administrator -password vmware

If you fail to configure the execution policy correctly you will get this error message:

Warning - The command 'c:\windows\system32\cmd.exe' has returned a non-zero value: 1

This error is normally caused by a typo in the string in the Edit step or an error in your script, for example, typing “redirect.bat” instead of “redirect.cmd” or giving a script file a bad name, such as “ 1.ps1”. Test your scripts manually from an interactive login prompt at the SRM server, and invoke them with the same redirect.cmd file that the Recovery Plan will use. I also recommend running each line of a PowerCLI script to confirm that it works. Another good tip is to create a Recovery Plan with just a couple of VMs. This will allow you to test your Recovery Plan and associated scripts quickly. There’s nothing more annoying than discovering you have a typo in a script when the Recovery Plan takes 30 minutes to test each time, and ten minutes to clean up. I speak from bitter experience. Nine times out of ten any problems you have will arise from typos and you will kick yourself when you find out it’s your fault and not the system’s!

One of the most common questions asked on the SRM Forums is how to reduce the amount of RAM used by the VMs during the recovery process. This is because people sometimes have less powerful ESX hosts in the Recovery Site. For example, an organization could have ESX hosts in the Recovery Site that are configured with less physical memory than the production ESX hosts in the Protected Site. Using PowerCLI we can automate the process of reducing the VMs’ RAM allocation by running .ps1 scripts before the power-on event. There are a couple of ways to do this with PowerCLI.

Example 1

You could have a .ps1 script for every VM and reduce its memory. The scripts that modify the configuration of the VMs themselves would need to execute before the VM powers on for them to take effect.

Following is a sample .ps1 script that will do just that for my web01 VM. This script uses the Set-VM cmdlet to reduce the recovery VM’s memory allocation to 1,024MB. The -Confirm:$false parameter prevents the script from waiting for a human operator to confirm the change. The Disconnect-VIServer statement at the end of the script is important. Without it, a session could remain open at the vCenter, wasting resources there.

Connect-VIServer vcnj -User corp\administrator -Password vmware

Set-VM web01 -MemoryMB 1024 -Confirm:$false

Disconnect-VIServer -Server vcnj -Confirm:$false

Example 2

Of course, a .ps1 script for every VM would be very administrator-intensive, but it may be necessary, as VMs can be quite varied in their requirements. So you might prefer to search for VMs based on their name, and make changes that affect many VMs simultaneously. For example, in the following .ps1 script, the Get-VM cmdlet is used to find every VM whose name starts with the text “web” and then pipes the results to the Set-VM cmdlet. This will modify the memory of VMs web01, web02, and so on.

Connect-VIServer vcnj -User corp\administrator -Password vmware

Start-Sleep -s 300

Get-VM web* | Set-VM -MemoryMB 1024 -Confirm:$false

Disconnect-VIServer -Server vcnj -Confirm:$false

The line that pauses the script for 300 seconds (five minutes) needs a little explanation. Clearly, we can’t use PowerCLI to decrease the amount of memory after the VM is powered on. Additionally, we cannot start modifying the VMs’ settings until we are sure the Configure Storage stage on each affected VM has completed; you know this process has completed when all the “Reconfigure virtual machine” events have finished. If we carried out this task too early in the Recovery Plan we might find the script changes the memory settings on the placeholder, not on the recovered VM. As you might recall from Chapter 10, Recovery Site Configuration, it’s during the Configure Storage step in a Recovery Plan that a placeholder VM is removed and replaced with the real .vmx file from the replicated storage. How long it takes for all your VMs to be converted from placeholders to VMs will vary based on the number of VMs in the Recovery Plan. If the script were attached directly to the VM (as would be the case with Example 1) as part of its pre-power-on step, this sleep stage would not be required. If the script only modifies VMs in the lower-priority lists such as level 2, 3, 4, or 5, it’s likely that these VMs will already be in the right format for the PowerCLI script to modify them. So you can regard this sleep stage as being somewhat overcautious on my part, but I would rather have a script that works perfectly all the time than a script that only works intermittently.

Example 3

One of the difficulties with using the VM’s name as the scope for applying a change like this is that you cannot always rely on a robust naming convention being used within an organization from the very first day of deploying virtualization. Additionally, VMs can easily be renamed by an operator with even very limited rights. So you might prefer to run your scripts against a resource pool or folder, where these objects can be protected with permissions. This would allow for bulk changes to take place in all VMs that are in that resource pool or folder location.

Connect-VIServer vcnj -User corp\administrator -Password vmware

Start-Sleep -s 300

Get-Folder web | Get-VM | Set-VM -MemoryMB 1024 -Confirm:$false

Disconnect-VIServer -Server vcnj -Confirm:$false

Example 4

Perhaps a more sophisticated script would not set a flat amount of memory, but instead would check the amount of memory assigned to the VM and then reduce it by a certain factor. For example, perhaps I want to reduce the amount of memory assigned to all the recovered VMs by 50%. The following script does this for each VM found with the web* string in its name: it reads the amount of memory currently assigned and then uses the Set-VM cmdlet to set the new value by dividing the $VM.MemoryMB amount by 2.

Connect-VIServer vcnj -User corp\administrator -Password vmware

Start-Sleep -s 300

Foreach ($VM in Get-VM web*) {

$NewMemAmount = $VM.MemoryMB / 2
Set-VM $VM -MemoryMB $NewMemAmount -Confirm:$false

}

Disconnect-VIServer -Server vcnj -Confirm:$false

In my case, I decided to use this final method to control the amount of memory assigned to the Web VMs. I would like to thank Al Renouf from the UK, as he helped write this last example. In case you don’t know, Al is very handy with PowerShell, and his Virtu-Al blog is well worth a read. He has also coauthored a recent book on VMware PowerCLI, and in 2011 became a VMware employee.

Step 1: Create a redirect.cmd File

The next phase involves getting these .ps1 files to be called by SRM. One method is not to call the .ps1 script directly, but instead to create a .cmd file that will call the script at the appropriate time. This helps to reduce the amount of text held within the Add Step dialog box. By using variables in the .cmd/.bat file, we can reuse it time and time again to call any number of .ps1 files held on the SRM server. I first came across the redirect.cmd file while reading Carter Shanklin’s PowerCLI blog, which discussed using .ps1 scripts with vCenter alarms.

And with help from Virtu-Al’s website, I was able to come up with a .cmd file that would call my .ps1 PowerShell files. The script loads the Microsoft PowerShell environment together with the PowerShell Console file (.psc1) that allows VMware’s PowerCLI to function. The variable at the end (%1) allows for any .ps1 file to be called with a single redirect.cmd file. The only real change in recent times is my move to using Windows 2008 R2 64-bit, as this changed the default paths for the location of the vim.psc1 file. The contents of the file are then saved as “redirect.cmd”:

@echo off

C:\WINDOWS\system32\windowspowershell\v1.0\powershell.exe -psc "C:\Program Files (x86)\VMware\Infrastructure\vSphere PowerCLI\vim.psc1" "& '%1'"

Step 2: Copy the redirect.cmd and PowerCLI Scripts to the Recovery SRM Server

Now you need to copy your redirect.cmd and .ps1 file(s) to a location on the recovery SRM server if they are not there already, or to some shared location that is used to hold your scripts. It doesn’t really matter where you place them, so long as you correctly type the path to the script when you add a command to the Recovery Plan. In this case, web01-ram.ps1 and the other .ps1 files represent some of the different examples discussed previously (see Figure 11.24).

Step 3: Add a Command to the Recovery Plan

At this point, you need to add a command step to the Recovery Plan. In this example, I’m using the script I created to reduce all the Web-based VMs’ memory allocation by half. I am adding the step high in the plan, before any of the VMs have powered on, to make sure that wherever my Web VMs are located—be they in priority 1 or priority 5—they are properly reconfigured.

Custom-recovery-plans- (24).jpg

Figure 11.24 Arrangement of different PowerShell script files together with my redirect.cmd file

1. In the Recovery Plan, select Priority 1.

2. Click the Add Step button.

3. Enter the full path to the command interpreter (cmd.exe) and include the redirect.cmd file and the .ps1 file you would like to execute (see Figure 11.25). In my case, this was the script:

c:\windows\system32\cmd.exe /C c:\redirect.cmd c:\web-ram-half.ps1

This will appear in the plan as shown in Figure 11.26.

Custom-recovery-plans- (25).jpg

Figure 11.25 A command step being added to run the script that reduces the RAM allocation by 50%

Custom-recovery-plans- (26).jpg

Figure 11.26 The command step appears before the first priority level, enabling the reconfiguration to complete before power-on events begin.

Step 4: Add a Command to the Individual VM

My script assumes that all VMs that are Web servers need to have their RAM reduced. But what if I have exceptions to this? What if I have a VM for which I want to increase the allocation of memory, or have some unique per-VM setting? This is where VM-based pre-power-on and post-power-on scripts come in handy. They work in much the same way as scripts added as major steps to the Recovery Plan.

1. In the Recovery Plan, select the VM. In my case, I chose the web01 VM that I had created a specific script for earlier.

2. Right-click and select Configure in the context menu.

3. In the property list select the Pre-power On Step option, and in the dialog box click Add.

4. Enter the full path to the command interpreter (cmd.exe) and include the redirect.cmd file and the .ps1 file you would like to execute (see Figure 11.27). In my case, this was:

c:\windows\system32\cmd.exe /C c:\redirect.cmd c:\web01-ram.ps1

Because a VM’s pre-power-on script executes after the plan-level PowerCLI command step, you can be pretty sure that its settings will take precedence over those made by any earlier script.
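Since most failures of these command steps come down to a mistyped path, it can help to check a command string before pasting it into the Add Step dialog box. The following is a minimal sketch of such a check, written in Python rather than PowerCLI purely to illustrate the logic. The function name check_command_step, the list of checked extensions, and the FILES_ON_SERVER inventory are all hypothetical; on a real SRM server you would pass os.path.exists instead of the stand-in on_server function. It assumes paths contain no spaces, as in the examples above.

```python
import ntpath

# File types this sketch checks for; an assumption, not an SRM requirement.
CHECKED_EXTENSIONS = (".exe", ".cmd", ".bat", ".ps1")


def check_command_step(command, exists):
    """Return a list of problems found in a command-step string.

    `exists` is a callable that reports whether a path is present, for
    example os.path.exists when run on the SRM server. Paths are assumed
    to contain no spaces, as in the examples in this chapter.
    """
    problems = []
    for token in command.split():
        extension = ntpath.splitext(token)[1].lower()
        if extension not in CHECKED_EXTENSIONS:
            continue  # switches such as /C are not file references
        if not exists(token):
            problems.append("file not found: " + token)
    return problems


# Hypothetical inventory of files on the SRM server, for demonstration only.
FILES_ON_SERVER = {
    r"c:\windows\system32\cmd.exe",
    r"c:\redirect.cmd",
    r"c:\web-ram-half.ps1",
}


def on_server(path):
    # Windows paths are case-insensitive, so compare in lower case.
    return path.lower() in FILES_ON_SERVER


good = r"c:\windows\system32\cmd.exe /C c:\redirect.cmd c:\web-ram-half.ps1"
typo = r"c:\windows\system32\cmd.exe /C c:\redirect.bat c:\web-ram-half.ps1"

print(check_command_step(good, on_server))
print(check_command_step(typo, on_server))
```

The check only confirms that the referenced files exist; it says nothing about whether the script itself runs cleanly, so testing from an interactive prompt is still worthwhile.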

Managing PowerCLI Authentication and Variables

So far I’ve focused on using PowerCLI with the -user and -password parameters, so the authentication settings are hardcoded in the PowerShell script file. It strikes me that most system administrators would prefer to use these scripts in a more secure manner. It would also give the SRM administrator more flexibility to be able to pass variables to the PowerCLI scripts. For example, say you want to reduce memory on a VM by specifying the vCenter, the VM, the memory allocation, and a log file location that confirms the script has executed.

Custom-recovery-plans- (27).jpg

Figure 11.27 Pre-power-on scripts allow for “exceptions to the rule” situations where every VM is configured the same way, except web01.

It’s a little-known fact that PowerCLI supports Active Directory authentication natively, and as such there is no need to specify the username and password credentials, so long as the account under which the script runs is a domain account with the correct privileges in vCenter to carry out the required task. For this to work in SRM an account must be created in Active Directory and granted privileges at the Recovery Site vCenter. As you will undoubtedly need the same rights at the Protected Site vCenter, this account could be used as part of any failback procedure you undergo, which would reverse or undo the changes made while you were in a failover situation. As you may recall, by default the SRM service account is the “local system,” and as such, for this “pass-through authentication” to be successful this will need to change, as shown in Figure 11.28. You can do this by locating the VMware vCenter Site Recovery Manager service in the Services MMC, and using the This Account option to replace the default account; by default, the process grants the selected user the right to log on as a service.

Of course, this service account under which the SRM scripts will execute needs to be granted privileges in vCenter in order for the scripts to run successfully. In my case, I allowed user srmnj the rights to run scripts only in the scope of the resource pool where the recovered VM would be located (see Figure 11.29). There are many ways to handle this delegation of privileges, such as using an existing group, or granting the service account privileges over more objects in the Recovery Site vCenter. The important point here is to ensure that the rights do cover the scope of your scripting activities, and that they are the least permissive available.

Custom-recovery-plans- (28).jpg

Figure 11.28 PowerCLI pass-through authentication requires a domain account, and permissions and rights in vCenter.

Custom-recovery-plans- (29).jpg

Figure 11.29 The srmnj account being given administrator privileges to the NYC_DR resource pool where New York’s VMs will be recovered

Example 5

In terms of the PowerCLI script that offers more flexibility, an example follows. The Param value allows the author of the script to define parameters that can be passed along the command line to PowerCLI. In this case, four parameters have been defined: the name or IP address of the vCenter, the VM to be modified, the amount of memory to allocate, and log file location for the script itself. This log file allows the administrator to check a text file to confirm the result of the script.

Param (

$vCenter,

$VM,

$Memory,

$LogFile

)

If ($LogFile) {

Start-Transcript $LogFile

}

if (!(Get-PSSnapin -Name VMware.VimAutomation.Core -ErrorAction SilentlyContinue)) {

Add-PSSnapin VMware.VimAutomation.Core

}

Connect-VIServer $vCenter

Set-VM -VM $VM -MemoryMB $Memory -Confirm:$false

Disconnect-VIServer * -Confirm:$false

If ($LogFile) {

Stop-Transcript

}

This script can be added to the Pre-power On Step for a VM. Remember, the full path to PowerShell.exe must be specified in the dialog box together with the path to the script and log file location, as shown in Figure 11.30. I can invoke the script on a per-VM basis using the following syntax. Notice in this case the PowerShell environment is being called directly, with the .ps1 file containing the necessary lines to load the VMware PowerCLI extensions.

C:\Windows\System32\WindowsPowerShell\v1.0\PowerShell.exe -File C:\SetMemory.ps1 -vCenter vcnj -VM web01 -Memory 512 -Logfile C:\SetMemoryLog.txt

Another good way to handle scripts is to hold the variables in a .csv file. This could prove quite useful if you want to reconfigure the memory settings of many VMs, and at the same time hold the original memory allocations of your VMs before the failover occurred (say, for use in a failback process). This would also allow you to use a spreadsheet application like Microsoft Excel to change the variables you needed. In my case, I created a .csv file in Excel, using columns for the VM name, failover memory, and failback memory. Column B is used whenever failover occurs between New York and New Jersey, and column C is used to reset the memory allocation in the event of a failback to the New York site (see Figure 11.31).

Custom-recovery-plans- (30).jpg

Figure 11.30 Passing parameters to the command line is a much more flexible approach to adjusting memory allocations.

Custom-recovery-plans- (31).jpg

Figure 11.31 Sadly, I’ve found the Mac version of Excel incorrectly formats .csv files.

Example 6

This .csv file can be called at the beginning of the Recovery Plan, resetting the memory allocation with a PowerCLI script similar to the one we saw earlier in this section. In this case, three extra parameters are added that allow the SRM administrator to specify the location of the .csv file and which memory column is being used in the script, either FailOverMemory or FailBackMemory. Because the script is being run just before the VMs are powered on, I’ve included a sleep statement to ensure that the VM is correctly recovered by SRM before changing its configuration.

Param (

$vCenter,

$CSV,

$Failover,

$Failback,

$LogFile

)

If ($LogFile) {

Start-Transcript $LogFile

}

if (!(Get-PSSnapin -Name VMware.VimAutomation.Core -ErrorAction SilentlyContinue)) {

Add-PSSnapin VMware.VimAutomation.Core

}

Connect-VIServer $vCenter

Start-Sleep -s 300

Import-Csv $CSV | Foreach {

If ($Failover) {

Set-VM -VM $_.VM -MemoryMB $_.FailOverMemory -Confirm:$false }

If ($Failback) {

Set-VM -VM $_.VM -MemoryMB $_.FailBackMemory -Confirm:$false }

}

Disconnect-VIServer * -Confirm:$false

If ($LogFile) {

Stop-Transcript

}

The SRM administrator can invoke this script with this syntax in the Add Script dialog box (see Figure 11.32):

C:\Windows\System32\WindowsPowerShell\v1.0\PowerShell.exe -File C:\SetMemoryAdvanced.ps1 -vCenter vcnj -CSV C:\MemorySizes.csv -Failover True -Logfile C:\SetMemoryCSVLog.txt

If you wanted to reverse the process, the script would be edited before running the Recovery Plan in failback mode. The big difference here is that I’m connecting to the location where the VMs are being recovered. In a failback process, this would be the vCenter at the Protected Site (in my case, the vcnyc vCenter at New York). Notice also that the variable -Failback True is being used in this case:

C:\Windows\System32\WindowsPowerShell\v1.0\PowerShell.exe -File C:\SetMemoryAdvanced.ps1 -vCenter vcnyc -CSV C:\MemorySizes.csv -Failback True -Logfile C:\SetMemoryCSVLog.txt

Custom-recovery-plans- (32).jpg

Figure 11.32 In a failback scenario, you need to replace -Failover with -Failback to reset the VMs’ memory to their original settings.
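The way the .csv file drives the two directions can be sketched outside PowerCLI. The Python below is an illustration only, not the script used in the Recovery Plan. The column names VM, FailOverMemory, and FailBackMemory are taken from the PowerCLI script above; the function name memory_plan and every value in SAMPLE_CSV are hypothetical, since the real values live in the author’s spreadsheet.

```python
import csv
import io

# Hypothetical contents; the real values live in the spreadsheet in Figure 11.31.
SAMPLE_CSV = """VM,FailOverMemory,FailBackMemory
web01,1024,2048
web02,1024,2048
db01,2048,4096
"""

COLUMN_FOR_DIRECTION = {
    "failover": "FailOverMemory",
    "failback": "FailBackMemory",
}


def memory_plan(csv_text, direction):
    """Return {VM name: memory in MB} for the requested direction."""
    column = COLUMN_FOR_DIRECTION[direction]  # KeyError on an unknown direction
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["VM"]: int(row[column]) for row in reader}


for name, megabytes in memory_plan(SAMPLE_CSV, "failover").items():
    print(name, megabytes)
```

Unlike the PowerCLI script, which takes separate -Failover and -Failback switches, this sketch takes a single direction argument, so it cannot be asked to apply both columns in one run.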

Finally, you might feel uncomfortable with SRM running these scripts automatically. Remember, you can just put message steps in the Recovery Plan and run these commands manually if you wish.

You may also want to think about the consequences for failback of using PowerCLI to modify VMs. Remember, in the case of the memory allocation you are making changes to the VM’s .vmx file. When you trigger a “reprotect” process as part of a failback, you reverse the replication direction, so changes accrued in the Recovery Site are replicated back to the Protected Site. As a consequence, in my example, when I failback to the Protected Site the VMs will not have their correct production memory allocations. To stop this I need a PowerShell script that undoes the changes made by the Recovery Plan. If you want to know a funny story about this, I once used my “halve memory” script over and over again in a failover and failback manner, without increasing the memory allocation during the failback process. After doing this a couple of times I found Windows VMs would no longer reboot because their memory allocation was just 4MB. I have described this to customers by quoting Sir Isaac Newton’s Third Law of Motion: “To every action there is always an equal and opposite reaction.” The moral of the story is that if you use PowerCLI to massively modify your VMs when they are recovered, make sure you have corresponding scripts that undo that process during failback.

As with all scripting, there’s an element of monitoring and maintenance that is beyond the scope of SRM to control. If a new VM is created that needs specialist scripting in order for it to work, you will need to tie a management process around enrolling it into the Recovery Plan, and either attach scripts to it or update the repositories that hold script data and information. For example, in the case of the .csv method of reducing the allocation of memory, you would need to add and remove VMs from the MemorySizes.csv file and keep it up-to-date with your environment.
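The arithmetic behind the “halve memory” story is easy to check. The short Python sketch below is an illustration only; the 4,096MB starting value and both function names are assumptions, as the original allocation is not stated. It compares running the halving script on every pass with no undo step against pairing each failover with a failback step that restores the recorded value.

```python
def halve_without_undo(start_mb, runs):
    """Apply the halving script on every run and never restore the value."""
    memory = start_mb
    for _ in range(runs):
        memory //= 2
    return memory


def halve_with_undo(start_mb, cycles):
    """Halve on failover, then restore the recorded value on failback."""
    memory = start_mb
    for _ in range(cycles):
        original = memory   # value recorded before the failover
        memory //= 2        # failover script halves the allocation
        memory = original   # failback script puts the original back
    return memory


# Assumed starting allocation of 4,096MB, for illustration only.
print(halve_without_undo(4096, 10))
print(halve_with_undo(4096, 10))
```

With the assumed starting point, ten unpaired runs are enough to bring the allocation down to 4MB, whereas the paired version ends each cycle where it started.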

Adding Command Steps to Call Scripts within the Guest Operating System

In PowerCLI 4.0 Update 1, a new cmdlet was introduced to allow you to call scripts within the guest operating system. The new cmdlet is called Invoke-VMScript. For Invoke-VMScript to work, Microsoft PowerShell must already be installed in the VM. Additionally, VMware Tools needs to be up-to-date with the ESX host to prevent a warning message from occurring when the script executes; although this warning will not stop the script from executing and completing, it could cause unexpected problems. The Invoke-VMScript cmdlet also works with batch (.cmd and .bat) scripts in Windows and bash shell scripts in Linux. By default, in Windows it assumes you will want to run a PowerShell script inside the VM; you can use the ScriptType parameter to specify the type of script you want to run.

Below is a syntax example where I used Invoke-VMScript to list a VM’s IP address settings directly from the Windows guest operating system running inside the VM. There are two important points here. First, despite authenticating PowerCLI against vCenter, Invoke-VMScript requires root credentials on the ESX host itself to function, and this could inhibit its use where no consistent password is used for the root account, or where it is company policy not to disclose it. Second, you clearly need to authenticate against the guest operating system running inside the VM for the script to work. Put bluntly, the bar set to authenticate with the Invoke-VMScript cmdlet is a high one.

Invoke-VMscript -VM db02 -ScriptText "ipconfig" -HostUser root -HostPassword password -GuestUser corp\administrator -GuestPassword vmware

Without the correct credentials, an authentication failure will occur. In the case of an authentication failure to the ESX host, PowerCLI will state the following:

"Insufficient permissions in host operating system"

This message is a little confusing, as we all know ESX is not an operating system but a virtualization hypervisor! In the case of an authentication failure to the guest operating system, PowerCLI will state the following:

"Authentication failure or insufficient permissions in guest operating system"

The Invoke-VMScript example above would cause the guest operating system to “echo” the results of a Windows ipconfig command back to the PowerCLI window (see Figure 11.33).

Custom-recovery-plans- (33).jpg

Figure 11.33 Using the Invoke-VMScript cmdlet to determine the IP address of a VM

Used in this simple way, the Invoke-VMscript cmdlet could be used to restart a service within one of the recovery VMs. Start by creating some kind of script within the VM:

@echo off

echo Stopping the Netlogon Service...

net stop netlogon

ipconfig /registerdns

echo Starting the Netlogon Service

net start netlogon

echo Success!!!

Then call this script with the Invoke-VMScript cmdlet:

Invoke-VMScript -VM db02 -ScriptText "c:\restartnetlogon.cmd" -HostUser root -HostPassword Password1 -GuestUser corp\administrator -GuestPassword vmware

Configuring IP Address Changes for Recovery Virtual Machines

One task you may wish to automate is changing an IP address within the virtual machine. Previously, VMware achieved this by calling Microsoft Sysprep from the Guest Customization Settings part of vCenter. With SRM 5.0, a new engine has been developed for changing the IP address of a VM. This new engine is significantly quicker than the previous Sysprep approach. In addition, there is a new graphical method of setting both the Protected and Recovery Site IP addresses. If you have many virtual machines you can also use a command-line utility that imports the IP configuration from a .csv file. This .csv file would need management and maintenance as the VM’s IP address could occasionally change, and the .csv file would need updating for any new VM that was added to a Protection Group and Recovery Plan in SRM.

The downside of this approach is that it is very administrator-intensive even with a .csv file. Additionally, you might find that although changing the IP address at the OS level works, applications continue to use the old IP address because it has been stored in some other configuration file. So it’s well worth considering other approaches which do not require a change in each and every virtual machine’s IP configuration. These approaches could include

• Retaining the existing IP address and redirecting clients by some type of NAT appliance which hides the IP address of the VM

• Using stretched VLANs so that virtual machines remain on the same network regardless of the physical location on which they are based

• Allocating the IP address by DHCP and Client Reservations

Creating a Manual IP Guest Customization

SRM 5.0 introduces a new graphical method of setting the Protected and Recovery Site IP addresses for the VM, as shown in Figure 11.34. This “update” feature can actually retrieve the IP address from a protected VM, and complete some (but not all) of the VM’s IP configuration. For this feature to work the VM needs to be powered on and VMware Tools must be running in the guest operating system. Then you need to follow these steps.

1. Locate a VM within your Recovery Plan, right-click, and choose the Configure option in the menu.

2. Select the virtual NIC in the Property column, and enable the option to “Customize IP Settings during recovery.”

3. Click the Configure Protection button, and click the Update button to retrieve the VM’s IP configuration (see Figure 11.34).

Review the IP data that is retrieved and add any additional parameters you feel need changing. Occasionally, I’ve seen the Update button not retrieve every single parameter; often this is caused by the VM not being properly configured.

Custom-recovery-plans- (34).jpg

Figure 11.34 The GUI for IP configuration can discover the VM’s IP settings.

4. Once the Protected Site is configured, click the Configure Recovery button and repeat this process for the Recovery Site IP settings. Once completed, the dialog box will show the IP settings used for both failovers and failbacks, so a VM’s IP settings are returned to their original state if a VM is returned to the Protected Site. The new IP configuration dialog box shown in Figure 11.35 can also deal with situations where the VM has multiple virtual NIC adapters.

Once you click OK, the VM will have an additional step called “Customize IP” (see Figure 11.36).

Custom-recovery-plans- (35).jpg

Figure 11.35 The new GUI allows the administrator to easily see the VM’s IP configuration.

Custom-recovery-plans- (36).jpg

Figure 11.36 The “Customize IP” step on the db01 virtual machine

Configuring Bulk IP Address Changes for the Recovery Virtual Machine (dr-ip-customizer)

For some time, VMware SRM has had a command-line utility called dr-ip-customizer.exe, which allows users to bulk-generate guest customization settings from a .csv file. This is better than having to manually run through the IP configuration for each and every VM, because manually typing an IP address for each VM can be time-consuming. Using a .csv file together with a spreadsheet tool like Microsoft Excel allows you to manage the system in a much more efficient way.

1. Open a command prompt on the Recovery Site SRM server.

2. Change the directory to the SRM bin folder, held at C:\Program Files (x86)\VMware\VMware vCenter Site Recovery Manager\bin.

3. Run this command:

dr-ip-customizer --cfg ..\config\vmware-dr.xml --out c:\nyc.csv --cmd generate --vc vcnj -i

4. After the first connection, if necessary, choose [Y] to trust the SRM server.

5. Provide the login details for the Recovery Site vCenter server.

When opened in Microsoft Excel the .csv file will look similar to the screen grab shown in Figure 11.37.

Custom-recovery-plans- (37).jpg

Figure 11.37 A VM is uniquely identified by its VM ID. Unprotecting and reprotecting a VM generates a new VM ID and placeholder object, and this can invalidate the settings in the .csv file.

The command creates a .csv file using the --out and --cmd switches. The --vc switch indicates which vCenter to connect to initially when building the .csv file. As you can see, this process does not retrieve the IP details of the VM at the Protected Site, but instead creates a .csv file that contains the VM ID. This VM ID value is referred to as a managed object reference (MOREF) number, which uniquely identifies the VM in the SRM inventory. You will see the VM appears twice: once for its reference in the Protected Site (managed by the vCenter vcnyc) and once again for the Recovery Site (managed by the vCenter vcnj). These object numbers are unique to the vCenter that you connected to when you first ran the dr-ip-customizer utility. So, when the administrator goes to “import” the modified .csv file it must be to the same vCenter that was used to generate the .csv file. If you create a .csv file from one vCenter, and subsequently try to apply it to another vCenter, you will receive a generic error message that states:

ERROR: The Protected VM ‘protected-vm-27575’ does not exist. Please check your CSV file to make sure that it is up to date. Also, please connect to the same VC server for which you generated the CSV file. The VM Ids are different in each site.

As you can see, the .csv is a simple file that maps the name of the virtual machine as it is known in SRM, with the VM ID. The VM Name column is merely there to assist you in mapping these “shadow” VMs to the actual VMs contained in the Recovery Plan.

The Adapter ID column is used to control how settings are applied. If the Adapter ID column is set to 0, this acts as a global setting that is applied to all network adapters within the VM. This setting cannot be used to change the IP address of an adapter; however, it can be used to globally set values that would be the same for all network interfaces, such as the DNS configuration. If a VM has multiple NICs it’s possible to specify each NIC by its number (1, 2, 3) and apply different settings to it. For example, suppose you wished to give two different NIC interfaces two different IP settings, together with more than one default gateway and DNS setting; you would simply add additional rows for that VM. If all your VMs have a single NIC and you just need to re-IP them you could use a much simpler configuration. In this case, I set the Adapter ID column to be 1 for every VM. I then used the fill series feature of Microsoft Excel to generate a unique IP address for each VM, and so on, as shown in Figure 11.38. (Note that in the figure, rows have been hidden in Excel for space considerations.)

This .csv approach can become quite complex if you are dealing with VMs with multiple NICs with multiple default gateway and DNS settings (see Figure 11.39). Such a configuration could apply to a pair of VMs that were being used in a Microsoft Cluster Service (MSCS) configuration. For example, my VM db01 has two NICs that required eight entries in the spreadsheet to complete its configuration (again, note that in the figure, rows have been hidden in Excel for space considerations).

Custom-recovery-plans- (38).jpg

Figure 11.38 A simple .csv file where every VM has just one NIC

Custom-recovery-plans- (39).jpg

Figure 11.39 A configuration where every VM has one NIC but is configured for multiple gateways (or routers) and multiple DNS servers

6. To process the .csv file you would use the following command:

dr-ip-customizer --cfg ..\config\vmware-dr.xml --csv c:\nyc.csv --cmd create

7. Provide the username and password to the vCenter, and accept the certificate of both the vCenter and SRM hosts if necessary. The create command generates the IP customizations for you based on the parameters of the .csv file. The dr-ip-customizer command supports a --cmd drop parameter, which removes IP customizations from vCenter, and a --cmd recreate parameter, which applies any changes made to the .csv file after the use of create and can be used to reconfigure existing settings.

You can open the resulting settings to double-check that the output is as you expect, but you should not make any changes there. I did this a few times while I was writing this to confirm that my .csv file was formatted correctly.
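The choice between create, drop, and recreate described in step 7 can be restated as a small decision function. This is just a memory aid under my reading of the text, with a hypothetical function name; it does not call the utility itself.

```python
def choose_cmd(has_existing_customization, want_removal=False):
    """Pick the --cmd value for the situation described in step 7."""
    if want_removal:
        # Remove IP customizations from vCenter.
        return "drop"
    if has_existing_customization:
        # The .csv changed after an earlier create.
        return "recreate"
    # First import of the .csv file.
    return "create"


print(choose_cmd(has_existing_customization=False))
print(choose_cmd(has_existing_customization=True))
print(choose_cmd(has_existing_customization=True, want_removal=True))
```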

Creating Customized VM Mappings

As you might remember, inventory mappings are not mandatory, but they are incredibly useful because without them you would have to map the network, resource pool, and folder on a per-virtual-machine basis. Occasionally, a virtual machine will fail to be added to the Recovery Site because SRM cannot map the virtual machine to a valid network, folder, or resource pool. VMs like this are flagged with the status message of “Mapping Missing.” This is a very common error and it’s usually caused by the VM’s network settings being changed without the new network being included in the inventory mappings. Alternatively, if you haven’t configured inventory mappings at all, you will have to decide on a customized mapping for each virtual machine. You should really resolve these errors from the Inventory Mappings location first, unless you have a VM like the one in Figure 11.40 which is unique and needs an individual per-VM mapping configured for it.

1. In SRM, select the Protection Group and click the Virtual Machines tab (see Figure 11.40).

2. Select the affected virtual machine and click the Configure Protection button.

As you can see, this VM was not automatically protected alongside the other VMs because the inventory mapping was missing a definition for the built-in “VM Network.” If you are carrying out a manual mapping like this the dialog box that appears when you configure the VM’s protection will indicate which components are missing the required parameter. This dialog box, shown in Figure 11.41, offers you the chance to modify the default mapping to another location if you wish.

Custom-recovery-plans- (40).jpg

Figure 11.40 The fs01 VM lacks a mapping for the VM Network port group. You can override the default inventory mapping with a custom per-VM mapping.

Custom-recovery-plans- (41).jpg

Figure 11.41 NIC 1 is not configured because “VM Network” was not included in the inventory mapping.

Managing Changes at the Protected Site

As you might now realize, SRM will need management and maintenance. As your Protected (production) Site is changing on a daily basis, maintenance is required to keep the Protected and Recovery Sites properly configured. One of the primary maintenance tasks is making sure newly created virtual machines that require protection are properly covered by one or more of your Recovery Plans. Simply creating a virtual machine and storing it on a replicated datastore does not automatically enroll it in your Recovery Plan. After all, not all VMs may need protection. If you follow this fact to its logical conclusion you could ask why you would create a new virtual machine on a replicated datastore if it does not require protection. Prior to vSphere 4 it was impossible to guide or restrict a user to only being able to select a certain datastore when he created a new virtual machine. There was a risk that a user could unintentionally place a VM on a replicated datastore when he shouldn’t have. Equally, there was a distinct possibility that he could store a new VM that did need protection on an unprotected volume. Since vSphere 4, it has been possible to set permissions on datastores within folders, so you can guide users to create VMs in the right locations.

Creating and Protecting New Virtual Machines

You might wrongly assume that as you create a new virtual machine—so long as it is created on the right replicated storage, in the right resource pool, in the right folder, and in the right network—it will automatically be “picked” up by SRM and protected by default. However, this is not the case. While creating a new virtual machine on a replicated datastore should ensure that the files of the virtual machine are at least duplicated at the Recovery Site, a new virtual machine is not automatically enrolled in the virtual machine Protection Group defined on the Protected Site. To enroll a new virtual machine, follow these steps.

1. At the Protected Site, select the Virtual Machine Protection Group, and select the virtual machine that is currently not protected.

2. Click the Configure All button, as shown in Figure 11.42.

So long as the VM’s settings are covered by the inventory mappings, the protection should complete automatically without further interaction. If, however, the VM’s settings fall outside the inventory mapping, you will be presented with a number of dialog boxes to manually set the location of the VM from a cluster, folder, or resource pool perspective.

The Configure All button allows you to protect multiple new VMs at once, whereas the Configure Protection button handles a single VM; both methods will add the virtual machine to the Recovery Site’s inventory. The VM will be enrolled in every Recovery Plan that is configured for the Protection Group where the VM resides.

Remember, simply “protecting” new VMs is not the end of the task; the next stage is to ensure that the VMs are correctly ordered in the Recovery Plan, and that any additional settings such as command scripts and messages are set correctly.
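Because enrollment is manual, it can help to reason about which VMs are candidates for protection. The sketch below works on plain Python data rather than a live inventory, so every name in it is hypothetical; it simply lists VMs that sit on a replicated datastore but are not yet protected.

```python
def unprotected_vms(vm_datastores, replicated_datastores, protected_vms):
    """VMs with files on a replicated datastore that are not yet protected."""
    replicated = set(replicated_datastores)
    return sorted(
        vm for vm, datastores in vm_datastores.items()
        if replicated & set(datastores) and vm not in protected_vms
    )


# Hypothetical inventory snapshot.
vm_datastores = {
    "fs01": ["FS"],
    "fs02": ["FS"],
    "test01": ["esx2_local"],  # local storage, not replicated
}
candidates = unprotected_vms(vm_datastores, ["FS", "DB"], protected_vms={"fs01"})
print(candidates)
```

Whether each candidate actually needs protection is still a decision for the administrator; as noted earlier, not every VM on a replicated datastore has to be protected.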

Renaming and Moving vCenter Inventory Objects

Despite the use of the linked mode feature, you can see that SRM depends highly on the operator correctly pairing and then mapping two separate vCenter inventories. These two vCenters do not share a common data source or database. So you might be legitimately concerned about what happens if vCenter inventory objects in either the Protected or Recovery Site are renamed or relocated. This has been a problem in some other management add-ons from VMware in the past; a notable example is VMware View.

Custom-recovery-plans- (42).jpg

Figure 11.42 Remember, new VMs are not automatically enrolled into the Protection Group and added to a Recovery Plan.

There are some rules and regulations regarding renaming various objects in vCenter. In the main, renaming or creating new objects will not necessarily “break” the inventory mappings configured earlier; this is because the mappings actually point to managed object reference (MOREF) numbers. Every object in the vCenter inventory is stamped with a MOREF value. You can consider these like SIDs in Active Directory: renaming an object in vCenter does not change the object’s MOREF value. The only exception to this is port groups on Standard vSwitches, which are not allocated a vCenter MOREF; in fact, their configuration and identifiers are not held by vCenter, but by the ESX host. If we examine the scenarios below we can see the effect of renaming objects in vCenter.
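To see why a mapping keyed on a MOREF survives a rename while a mapping keyed on a name does not, consider this toy model. It is not SRM's data structure; the MOREF-style strings and folder names are hypothetical values chosen for illustration.

```python
# Hypothetical Protected Site inventory: MOREF -> display name.
inventory = {"folder-101": "Finance", "resgroup-202": "Production"}

# Two ways a mapping could be stored (Protected Site -> Recovery Site).
mapping_by_moref = {"folder-101": "folder-901"}
mapping_by_name = {"Finance": "Finance-DR"}


def resolve_by_moref(moref):
    """Look up the mapping using the stable identifier."""
    return mapping_by_moref.get(moref)


def resolve_by_name(moref):
    """Look up the mapping using whatever the object is called right now."""
    return mapping_by_name.get(inventory[moref])


before = (resolve_by_moref("folder-101"), resolve_by_name("folder-101"))

# Rename the folder; its MOREF does not change.
inventory["folder-101"] = "Finance-NYC"

after = (resolve_by_moref("folder-101"), resolve_by_name("folder-101"))
print(before)
print(after)
```

The name-keyed lookup returns nothing after the rename, which is the same kind of breakage described for port groups that are identified only by their label.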

Managing Changes at the Protected Site

A number of changes can occur as part of day-to-day operational adjustments in your production environment. It’s important to know which of these can impact the SRM configuration.

Renaming Virtual Machines

This is not a problem. Protection Groups are updated to the new VM name, as are Recovery Plans. The same is true of the placeholder VMs in the Recovery Site. Previous versions of SRM required the administrator to unprotect and then reprotect a VM for the placeholder name to be updated. This is no longer the case.

Renaming Datacenters, Clusters, and Folders in the Protected Site

This is not a problem. The Inventory Mappings window automatically refreshes.

Renaming Resource Pools in the Protected Site

This is not a problem. The Inventory Mappings window automatically refreshes.

Renaming Virtual Switch Port Groups in the Protected Site

This depends on whether you are an Enterprise Plus customer with access to Distributed vSwitches (DvSwitches). If you are, there are no problems to report. All the VMs are automatically updated for the new port group name, and the inventory mappings remain in place.

It’s a very different story if you’re using Standard vSwitches (SvSwitches). Renaming a port group will cause all the affected VMs to lose the inventory mapping. The VMs do remain protected, and no obvious warning message appears in SRM. This is a bad outcome because someone could rename port groups on ESX hosts without understanding the consequences for the SRM implementation. Without a correct inventory mapping, the Recovery Plans would execute but they would fail for every VM that lacked a network mapping. This would create an error message when the Recovery Plan was tested or run. The nature of these error messages varies depending on whether it is the port group at the Protected Site or the port group at the Recovery Site that has been renamed. If the port group is renamed at the Recovery Site, by default SRM will use the Auto function and create a bubble network if a real VLAN and port group are not available. The VMs will power on but they will only be able to communicate with other VMs on the same ESX host on the same bubble network. The Recovery Plan will display the following warning:

“Warning – Network or port group needed for recovered virtual machine couldn’t be found at recovery or test time.”

It’s often difficult to spot that this situation has occurred, but you can generally see traces of it behind Recovery Plans and their network settings (see Figure 11.43). So, put very simply, renaming port groups on an SvSwitch should be avoided at all costs! If you have renamed port groups after configuring the inventory mapping, two main corrective actions need to take place to resolve the problem. First, a refresh of inventory mappings is required. This is a relatively simple task of revisiting the inventory mapping at the Protected Site, looking for the renamed port group(s), and establishing a new relationship. In Figure 11.44, the renamed port group (vlan14 had been renamed to vlan14-test) has no association with a port group in the Recovery Site.
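Since SRM gives no obvious warning, a simple audit that compares the current port group names with the names recorded in your mappings can help you spot a rename. This sketch assumes you have collected both lists by some other means; the Recovery Site port group names in it are hypothetical.

```python
def audit_network_mappings(protected_port_groups, network_mappings):
    """Compare current port group names with the names used in the mappings."""
    current = set(protected_port_groups)
    mapped = set(network_mappings)
    unmapped = sorted(current - mapped)  # exist now, but have no mapping
    stale = sorted(mapped - current)     # mapped, but no longer exist
    return unmapped, stale


# Hypothetical state after vlan14 was renamed to vlan14-test.
port_groups = ["vlan11", "vlan12", "vlan14-test"]
mappings = {"vlan11": "vlan51", "vlan12": "vlan52", "vlan14": "vlan54"}

unmapped, stale = audit_network_mappings(port_groups, mappings)
print("No mapping:", unmapped)
print("Stale entries:", stale)
```

A port group that appears in the first list alongside a similar name in the second list is a strong hint that a rename, rather than a deletion, has taken place.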

Custom-recovery-plans- (43).jpg

Figure 11.43 The effect that deleting port groups can have on inventory mappings

Custom-recovery-plans- (44).jpg

Figure 11.44 A rename of a port group can cause the loss of your inventory mappings.

Second, as you might know already, if you rename port groups on SvSwitches, the virtual machines configured at the Protected Site become “orphaned” from the port group, as shown in Figure 11.45. To create this example, I simply renamed the vlan14 port group to “vlan14-test.” You can see that if port groups are renamed while they are in use by VMs, the port group label is dimmed and italicized in the vSphere client.

This orphaning of the virtual machine from the virtual switch port group has been a “feature” of VMware for some time, and is not specifically an SRM issue. However, it does have a significant effect on SRM. It can cause the Protection Group process, which creates the placeholder/shadow virtual machines on the Recovery Site, to fail. Correcting this for each and every virtual machine using the vSphere client is very laborious. You can automate this process with a little bit of scripting, with the PowerCLI from VMware:

# Reattach every NIC still labeled 'vlan14' to the renamed port group 'vlan14-test'
Get-VM | Get-NetworkAdapter | Where-Object { $_.NetworkName -eq 'vlan14' } | Set-NetworkAdapter -NetworkName 'vlan14-test' -Confirm:$false

Custom-recovery-plans- (45).jpg

Figure 11.45 With SvSwitches, renaming a port group can cause a VM to become orphaned from the network when it is next powered on.

Managing Changes at the Recovery Site

So far we have examined changes at the Protected Site, and how they impact SRM. Now it’s time to turn the tables and look at the effect on an SRM configuration as changes take place at the Recovery Site. Generally, the impact of changes in the Recovery Site is less significant, but there are some noteworthy situations to be forewarned about.

Renaming Datacenters, Clusters, and Folders

This is not a problem. The Inventory Mappings window automatically refreshes.

Renaming Resource Pools in the Recovery Site

This is not a problem. The Inventory Mappings window automatically refreshes.

Renaming Virtual Switch Port Groups in the Recovery Site

Once again, I found there were no issues with renaming port groups at the Recovery Site if you are using DvSwitches. Any rename of a DvSwitch port group at the Recovery Site results in the inventory mapping being updated at the Protected Site. However, the same situation presents itself with SvSwitches in the Recovery Site as it did in the Protected Site. I renamed my port groups from vlan50-54 to vlan60-64. However, no amount of refreshes or restarts updated the Inventory Mappings window at the Protected Site. The window switched to “None Selected.” The only resolution was to manually remap the port groups.

Other Objects and Changes in the vSphere and SRM Environment

In my experience, other changes can take place in vSphere and SRM which cause the relationships we configure in SRM to break. For example, I’ve found that renaming datastores at the Protected Site that are covered by replication cycles can potentially cause issues. Suppose you rename a datastore at the Protected Site and a test is run before the next replication cycle has completed. The test fails because it expects to see the new name, but it is still being presented with the old name at the Recovery Site. Symptoms include “File not found” error messages when the test plan is executed; rather worryingly you can find your replicated datastore is empty! My solution was to simply enable the option to “Replicate recent changes to the recovery site”; this should force an update that carries the renamed datastore across.

Setting the renaming of datastores to one side, it’s worth stating that the SRM configuration stages do occur in a specific order for a reason, and each stage has a dependency on the preceding stage. You should configure SRM in the following order.

1. Pair the sites.

2. Configure the array manager.

3. Configure the inventory mappings.

4. Create the Protection Groups.

5. Create the Recovery Plans.
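Because each stage depends on the one before it, deleting and re-creating something in the middle of the chain means that the later stages may need attention too. The following sketch just encodes the order above as a list and reports which stages sit downstream of a change; treat it as a checklist aid with a hypothetical function name, not as a description of how SRM tracks dependencies internally.

```python
STAGES = [
    "Pair the sites",
    "Configure the array manager",
    "Configure the inventory mappings",
    "Create the Protection Groups",
    "Create the Recovery Plans",
]


def stages_to_revisit(changed_stage):
    """Return every stage that comes after the one that was re-created."""
    index = STAGES.index(changed_stage)
    return STAGES[index + 1:]


downstream = stages_to_revisit("Create the Protection Groups")
print(downstream)
```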

Say, for example, that you remove your Protection Group. In that case, the Recovery Plan has references to Protection Groups that don’t exist. If you create a new Protection Group, you have to manually go to the Recovery Plan and configure it to use the correct Protection Group. As deleting and re-creating configurations is a very popular (if unsophisticated) way to “fix” problems in IT generally, you must be very careful. You must understand the implications of deleting and re-creating components. For example, say you delete and re-create a Protection Group, and then tell your Recovery Plan to use it. You will discover that all priority/order settings in the plan are lost, and have been reset to the default. You will find all the virtual machines are rehoused into the priority 3 category for both the power down and power on of virtual machines. Additionally, other customizations to your Recovery Plan could be lost, such as any per-VM scripting or re-IP settings. This is deeply annoying if you have spent some time getting all your virtual machines to power on at the right time in the right order. As you can tell by the tone of my writing, I’ve found this out through my own bitter experience!

Lastly, a word of caution: As we have seen, SRM can accommodate most changes that take place. However, there is currently one significant attribute of a protected/production virtual machine that is not propagated to the Recovery Site: its memory allocation. If you increase or decrease the amount of memory allocated to a virtual machine after it has been covered by a Protection Group, the only way (currently) to fix this issue is to remove the protection of the affected virtual machine, and reprotect it; this causes the destruction of the virtual machine placeholder VMX at the Recovery Site, and its re-creation. The mismatch between the real .vmx file and the placeholder is not technically significant, and is largely a cosmetic irritation. When the plan is tested the amount of memory allocated to the VM from the Protected Site will be used. As we saw earlier, if you do want the Recovery Site VMs to have different settings, you are better off using PowerCLI to make those modifications at the point of recovery. So the moral of the story is this: View the casual removal of inventory mappings and Protection Groups with extreme caution.

Storage vMotion and Protection Groups

NOTE: Storage vMotion does not impact vSphere Replication configurations. Replication jobs in VR are automatically updated if the source VM is moved from one datastore to another.

In 2007, VMware Virtual Infrastructure 3.5 (VI3) offered a new feature called Storage vMotion. This allowed you to relocate the files of a virtual machine from one datastore to another, while the virtual machine was running, regardless of storage type (NFS, iSCSI, SAN) and vendor. At that time, Storage vMotion was carried out by using a script in the Remote CLI tools, which were downloadable from VMware’s website. Since vSphere 4 you can simply drag and drop and work your way through a wizard to complete the process. With vSphere 5, Storage vMotion still has implications for VMware Site Recovery Manager. Frequently, I found that after a VM had been added to or removed from a Protection Group, the Storage vMotion process did not automatically force a Recompute Datastore Groups event. Additionally, I found I had to use the Rescan Arrays option from within the Array Manager Wizard to force an update.

Basically, there are four scenarios.

Scenario 1

The virtual machine is moved from nonreplicated storage to replicated storage, effectively joining a Protection Group. In this case, the outcome is very straightforward; it’s as if a brand-new virtual machine has just been created. The Protection Group will have a yellow exclamation mark next to it, indicating the virtual machine has not been configured for protection.

Scenario 2

The virtual machine is moved from replicated storage to nonreplicated storage, effectively leaving a Protection Group, and as such is no longer covered by SRM. With this scenario the outcome is less than neat. Removing a virtual machine from a replicated LUN/volume to nonreplicated storage can result in an error message in the Tasks & Events tab and the virtual machine being listed in the Protection Group as “invalid,” as shown in Figure 11.46. In my case, I moved the fs01 VM from a replicated datastore to local storage. The full message read:

"Invalid: Cannot protect virtual machine 'FS01' because its config file '[esx2_local] fs01/fs01.vmx is located on a non-replicated or non-protected datastore"

Custom-recovery-plans- (46).jpg

Figure 11.46 To avoid this scenario, unprotect the VM in the Protection Group and then move it to its new location with Storage vMotion.

The “solution” to this issue is to select the VM and choose the Remove Protection option. If you wish to relocate a VM in this way, perhaps the best method is to remove the protection from it and then carry out the Storage vMotion. Once the Storage vMotion has completed, a refresh of the Protection Group would cause any reference to the VM to be removed.

Scenario 3

The virtual machine is moved from one replicated storage location to another replicated storage location; essentially, the virtual machine is moved out of the scope of one Protection Group and into the scope of another Protection Group. In this case, when the virtual machine moves from one Protection Group to another, it should be “cleaned out” from the old Protection Group. In my case, I moved the ss02 VM with Storage vMotion from one Protection Group to another. This generates a yellow exclamation mark on both Protection Groups because ss02 is still listed in two locations. If you try to protect the VM at the “destination” Protection Group it will fail, and the Tasks & Events tab will show an error (see Figure 11.47).

At the “source” Protection Group there will be an error similar to the one observed in scenario 2 (see Figure 11.48).

Custom-recovery-plans- (47).jpg

Figure 11.47 In previous releases a more visible pop-up box appeared with a friendly description. VMware may reinstate this functionality with SRM.

Custom-recovery-plans- (48).jpg

Figure 11.48 SRM believes the administrator is trying to protect the VM twice in two different Protection Groups.

Generally, selecting an “invalid” virtual machine and clicking the Remove Protection option will fix the problem, and will allow you to protect the VM in the destination Protection Group.

Scenario 4

The virtual machine is moved from one replicated storage location to another replicated storage location, but the SRM administrator has created one Protection Group containing many datastores. In this scenario, the Protection Group does not generate an error and everything works in SRM without a problem. For this reason, it might be tempting to decide that creating one “jumbo” Protection Group that contains all of your datastores would be the best way to stop any of these Storage vMotion woes. However, I think this will limit the flexibility that SRM is trying to deliver. Most of my customers prefer one Protection Group to contain one datastore, as this allows them to run Recovery Plans that failover and failback a certain subset of VMs. As soon as all the VMs are lumped together unceremoniously in one Protection Group you at one stroke remove a great deal of flexibility in the SRM product.
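The four scenarios can be summarized as a small classifier. This is only a sketch of the reasoning above, using hypothetical names: each argument is the Protection Group that covers the source or destination datastore, or None if the datastore is not replicated. The case where neither datastore is replicated is my own addition for completeness and is not one of the four scenarios.

```python
def classify_move(source_group, destination_group):
    """Return (scenario number, suggested action) for a Storage vMotion."""
    if source_group is None and destination_group is None:
        # Not one of the four scenarios: SRM is not involved at all.
        return None, "outside the scope of SRM"
    if source_group is None:
        return 1, "configure protection for the VM, as for a new VM"
    if destination_group is None:
        return 2, "remove protection first, then move the VM"
    if source_group != destination_group:
        return 3, "remove protection in the source group, then protect in the destination"
    return 4, "no action needed; both datastores are in the same Protection Group"


print(classify_move(None, "FS"))
print(classify_move("FS", None))
print(classify_move("FS", "DB"))
print(classify_move("ALL", "ALL"))
```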

Virtual Machines Stored on Multiple Datastores

Of course, it is entirely possible to store virtual machine files on more than one datastore. In fact, this is a VMware recommendation. By storing our boot VMDK, log VMDK, and data VMDK files on different datastores, we can improve disk I/O substantially by reducing the disk contention that could take place. Even the most disk-intensive virtual machine could be stored on a LUN of its own, and as such it would not face any competition for I/O at the spindle level. You will be pleased to know that SRM does support a multiple disk configuration, so long as all the datastores that the virtual machine is using are replicated to the Recovery Site and exist on the same storage array. What SRM cannot currently handle is a VM that has two virtual disks stored on two different storage arrays.
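The two conditions just described (every datastore replicated, and all of them on the same array) are easy to express as a check. The sketch below uses plain dictionaries with hypothetical datastore and array names; it mirrors the rule as stated here rather than any SRM API.

```python
def can_protect(vm_disk_datastores, datastores):
    """True if every datastore the VM uses is replicated and on one array."""
    used = [datastores[name] for name in vm_disk_datastores]
    all_replicated = all(info["replicated"] for info in used)
    same_array = len({info["array"] for info in used}) == 1
    return all_replicated and same_array


# Hypothetical datastore inventory.
datastores = {
    "DB": {"replicated": True, "array": "array-A"},
    "db-files": {"replicated": True, "array": "array-A"},
    "scratch": {"replicated": False, "array": "array-A"},
    "archive": {"replicated": True, "array": "array-B"},
}

print(can_protect(["DB", "db-files"], datastores))
print(can_protect(["DB", "scratch"], datastores))
print(can_protect(["DB", "archive"], datastores))
```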

The datastore location of a virtual disk is controlled when it is added into the virtual machine using the Add Hardware Wizard on the virtual machine. These virtual disks are presented seamlessly to the guest operating system, so from Windows or another supported guest operating system it is impossible to see where these virtual disks are located physically.

In the situation shown in Figure 11.49, I added another datastore, called “db-files,” to my environment, and then added a virtual disk to each of my DB virtual machines, placing them on the new datastore. All I did after that was to make sure the db-files datastore was scheduled to replicate at exactly the same interval as my volume called “virtual machines.” If you have existing Protection Groups, these will have yellow exclamation marks next to them to reflect the fact that the virtual machines are utilizing multiple datastores, and that they need updating to reflect this change (see Figure 11.50).

Custom-recovery-plans- (49).jpg

Figure 11.49 A 200GB virtual disk being placed on a different datastore from the boot OS virtual disk. This is a common practice to improve performance.

Custom-recovery-plans- (50).jpg

Figure 11.50 Protection Groups alert you to the existence of new datastores that are being replicated and used by VMs.

Notice how Figure 11.50 shows the DB datastore and db-files datastore as being “grouped” together; this is because the db01–db03 VMs have virtual disks on both datastores.

If your storage array vendor supports the concept of consistency groups, this configuration will look very similar to the situation where multiple LUNs/volumes have been placed into the same consistency group. In this respect, the Protection Group configuration can come to be almost a mirror image of the groups you create at the storage layer to ensure that LUNs/volumes are replicated on a consistent schedule.
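The way datastores end up “grouped” together can be thought of as finding connected sets: two datastores belong to the same group if any VM has disks on both. The sketch below illustrates that idea with a small union-find. It is my own illustration of the grouping concept, not SRM's actual Recompute Datastore Groups logic, and the VM-to-datastore layout is hypothetical.

```python
def datastore_groups(vm_datastores):
    """Group datastores that are linked by VMs spanning more than one of them."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path compression
            item = parent[item]
        return item

    def union(first, second):
        parent[find(first)] = find(second)

    for datastores in vm_datastores.values():
        for name in datastores:
            find(name)  # register every datastore, even standalone ones
        for other in datastores[1:]:
            union(datastores[0], other)

    groups = {}
    for name in parent:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical layout: the DB VMs span two datastores, fs01 uses one.
layout = {
    "db01": ["DB", "db-files"],
    "db02": ["DB", "db-files"],
    "fs01": ["FS"],
}
groups = datastore_groups(layout)
print(groups)
```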

Virtual Machines with Raw Device/Disk Mappings

In this chapter I began with one VMFS volume and LUN in the Protected Site. Clearly this is a very simplistic configuration which I deliberately chose to keep our focus on the SRM product. Here, I will go into more detail with more advanced configurations such as VMware’s RDM features and multiple disk configurations that more closely reflect the real-world usage of a range of VMware technologies. It’s perhaps worth reminding you that RDMs offer no real tangible improvement in performance. The fact that many VMware admins erroneously believe they do probably makes RDMs the most misunderstood feature in vSphere. RDMs are useful and, in fact, in some cases are mandatory for some clustering software that runs inside a VM. They are also necessary if your storage array has management capabilities that require the VM to have “direct” access to the storage array.

When I started writing this section, I faced a little bit of a dilemma: Should I repeat the storage section all over again to show you the process of creating a LUN/volume in the storage array, and then configuring replication and how to present that to ESX hosts? Following on from this, should I also document the process of adding an RDM to a virtual machine? In the end, I figured that if you, the reader, got this far in the book, you should be able to double back to the relevant storage chapter and do that on your own.

Assuming you’ve done that if necessary, I’ll now describe my RDM configuration process. On my Dell EqualLogic array I created a volume that was then added to the db01 virtual machine as an RDM. The thing to notice in Figure 11.51 is the vmhba syntax of the RDM in the Path ID column on the protected VM; it says the path is vmhba34:C0:T6:L0. SRM cleverly updates this mapping at the point of recovery so that it points to the correct target and LUN/volume.

For the moment, I want to concentrate on the specific issues of how SRM would handle this addition of new storage to the system, and how it handles the RDM feature. After creating the new volume/LUN, configuring replication, and adding the RDM to the virtual machine, the next stage is to make sure the array manager has correctly discovered the new RDM.

Custom-recovery-plans- (51).jpg

Figure 11.51 FC systems stick to the (L)UN value, but most iSCSI systems will use the (T)arget value as a way to present each LUN.

It is worth saying two critical facts about RDMs and SRM. First, the RDM mapping file itself must be stored on a VMFS volume covered by a replicated LUN. If you don’t do this, there simply won’t be an RDM mapping file available in the Recovery Site for the recovery VM to use. Second, SRM resolves hardware issues with RDM. RDM’s mapping files have three main hardware values: a channel ID (only used by iSCSI arrays), a target ID, and a LUN ID. These values held within the mapping file itself are likely to be totally different at the Recovery Site array. SRM fixes these references so that the virtual machine is still bootable, and you can still get your data. If you were not using SRM and carrying out your Recovery Plan manually, you would have to remove the RDM mapping file and add it to the recovery virtual machine. If you don’t, when the replicated virtual machine is powered on, it will point to the wrong vmhba path.

You might want to know what happens if you create a new virtual machine which contains an RDM mapping to a LUN which is not replicated or a VMDK that is not a replicated volume. If you try to protect that virtual machine, SRM will realize that you’re trying to protect a virtual machine which has access to a LUN/volume which is inaccessible at the Recovery Site. When you try to add such a virtual machine to the Protection Group the VM will display a message stating “Device Not Found: Hard Disk N”. If you try to configure protection to that VM the corresponding dialog box will show that the VM’s RDM mapping is pointing to a LUN or volume that is not being replicated (see Figure 11.52).

When you try to protect the VM a wizard will run to allow you to deal with this portion of the VM that cannot be protected. Ideally, you should resolve the reason the RDM or VMDK is not replicated, but the wizard does allow you to work around the problem by detaching the VMDK during execution of the Recovery Plan.

Custom-recovery-plans- (52).jpg

Figure 11.52 There is a problem. An RDM has been added to a VM but it is not being replicated to the Recovery Site.

Multiple Protection Groups and Multiple Recovery Plans

This section is quite short, but it may be the most important one to you. Now that you have a very good idea of all the components of SRM, it’s time for me to show you what a popular configuration might look like in the real world. It is perfectly possible—in fact, I would say highly desirable—to have many Protection Groups and Recovery Plans. If you recall, a Protection Group is intimately related to the LUNs/volumes you are replicating. One model for this, suggested earlier in the book, is grouping your LUNs/volumes by application usage so that they can, in turn, easily be selected by an SRM Protection Group. I’ve set up such a situation in my lab environment to give you insight into how such a configuration would look and feel. I don’t expect you to reproduce this configuration if you have been following this book in a step-by-step manner. It’s just here to give you a feel for how a “production” SRM configuration might look and feel.

Multiple Datastores

In the real world, you are likely to put your virtual machines in different datastores to reflect that those LUNs/volumes represent different numbers of disk spindles and RAID levels. To reflect this type of configuration I simply created five volumes called DB, FS, MAIL, VIEW, and VB. In Figure 11.53 I used the NetApp FAS2040 System Manager as an example.

So you could create volumes based on business units or applications—or indeed the applications used by each business unit. This would allow you to failover a particular application of a particular business unit while the other business units remain totally unaffected.

Custom-recovery-plans- (53).jpg

Figure 11.53 Here NetApp volumes are being created reflecting different VMs. It’s all about separation—separation allows for control.

Multiple Protection Groups

The storage changes outlined in this section were then reflected in the Protection Groups I created. I now had six Protection Groups reflecting the six types of virtual machines. When I created the Database Protection Group, I selected the datastore I created for that application. If we follow this to its logical conclusion, I end up creating another five Protection Groups for each of my replicated datastores (see Figure 11.54).

Custom-recovery-plans- (54).jpg

Figure 11.54 Over time, your Protection Groups will come to mirror your storage configuration.

Multiple Recovery Plans

These multiple Protection Groups now allow for multiple Recovery Plans; one Recovery Plan just for my mail environment, and so on. Also, in the case of complete site loss, I could create a Recovery Plan that included all my Protection Groups. At the end of this process, I would have a series of Recovery Plans that I could use to test each application set, as well as to test a complete Recovery Plan (see Figure 11.55).
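The layout described so far can be sketched as a small data model. This is plain Python, not an SRM API: it assumes one Protection Group per volume, uses only the five volume names listed earlier, and invents the group and plan names (apart from “DR Recovery Plan – All VMs”) purely for illustration.

```python
# Illustrative model only: nothing here calls or mirrors a real SRM API.
# Replicated volumes from the lab layout, one per application.
datastores = ["DB", "FS", "MAIL", "VIEW", "VB"]

# Assumption: one Protection Group per replicated datastore.
protection_groups = {f"{name} Protection Group": name for name in datastores}

# One Recovery Plan per application, each referencing a single group...
recovery_plans = {
    f"{ds} Recovery Plan": [group] for group, ds in protection_groups.items()
}

# ...plus one plan for complete site loss that references every group.
ALL_VMS = "DR Recovery Plan – All VMs"
recovery_plans[ALL_VMS] = list(protection_groups)

for plan, groups in recovery_plans.items():
    print(f"{plan}: {', '.join(groups)}")
```

Each application plan can be tested on its own, while the all-VMs plan exercises every Protection Group in a single run.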

It’s worth mentioning a scalability limitation of this configuration: The marketing around the new release of SRM claims that you can run up to 30 plans simultaneously. While this is technically true, it’s not without limitations, as you can see in Figure 11.56. Although a Protection Group can be a member of many Recovery Plans, it can only be used by one Recovery Plan at any one time. In my configuration in Figure 11.55, I would be able to run each of my “application” based Recovery Plans simultaneously, because no two of them share a Protection Group. However, if I ran the “DR Recovery Plan – All VMs” that references all of my Protection Groups, I would find that my “application” based Recovery Plans would be unusable until that plan had completed. Their status would change to indicate “Protection Groups In Use.”
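The rule that a Protection Group can belong to many Recovery Plans but be used by only one at a time can be illustrated with a short model. This is a sketch in plain Python and not how SRM is implemented; the “Running” and “Ready” labels, the group names, and the `plan_status` function are assumptions made for the example.

```python
# Illustrative model only; none of these names come from an SRM API.
ALL_VMS = "DR Recovery Plan – All VMs"
PLANS = {
    "Mail Recovery Plan": {"Mail"},
    "Database Recovery Plan": {"Database"},
    ALL_VMS: {"Mail", "Database"},
}


def plan_status(plans, running):
    """Map each plan name to a status, given the set of plans now running."""
    # Every Protection Group referenced by a running plan counts as in use.
    in_use = set().union(*(plans[name] for name in running))
    status = {}
    for name, groups in plans.items():
        if name in running:
            status[name] = "Running"  # assumed label
        elif groups & in_use:
            status[name] = "Protection Groups In Use"
        else:
            status[name] = "Ready"  # assumed label
    return status


# The application plans do not block each other...
print(plan_status(PLANS, {"Mail Recovery Plan"}))
# ...but the all-VMs plan blocks every application plan.
print(plan_status(PLANS, {ALL_VMS}))
```

A plan is blocked as soon as it shares even one Protection Group with a running plan, which is why a plan that references every group locks out all the others.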

Custom-recovery-plans- (55).jpg

Figure 11.55 A good storage layout at the array level leads to a good Protection Group layout. This pays dividends when creating Recovery Plans.

Custom-recovery-plans- (56).jpg

Figure 11.56 Once a datastore is in use, no other Protection Group, and therefore, Recovery Plan, can use it.

The SRAs in SRM currently only allow one snapshot to be created per datastore. As a consequence, the datastore becomes locked exclusively for the duration of the plan. This would stop two separate SRM administrators from running the Mail Recovery Plan at the same time.

If you selected one of these Recovery Plans and looked at its Summary tab you would get a more detailed explanation (see Figure 11.57).

This occurs because currently the SRA does not allow for multiple snapshots to be taken of the same datastore. Many storage vendors support this functionality, but for the moment, the SRA does not leverage this capability. So, once a Recovery Plan is tested and the LUN/volume snapshot is presented, the Protection Group that backs this datastore does not itself become available until the plan has been cleaned up.
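To make the lifecycle concrete, here is a minimal sketch of the one-snapshot-per-datastore behavior as an exclusive lock that is taken when a test starts and released only at cleanup. It is a plain Python model under that assumption; the `SnapshotLocks` class and its method names are hypothetical and do not correspond to anything in the SRA or SRM.

```python
# Illustrative model only: a hypothetical lock table, not SRA/SRM code.
class SnapshotLocks:
    def __init__(self):
        self._owner = {}  # datastore name -> plan holding its one snapshot

    def holder(self, datastore):
        """Return the plan holding the datastore's snapshot, or None."""
        return self._owner.get(datastore)

    def start_test(self, plan, datastores):
        """Take the snapshot lock on every datastore, or fail without change."""
        busy = [d for d in datastores if d in self._owner]
        if busy:
            raise RuntimeError(f"{plan}: datastores in use: {', '.join(busy)}")
        for d in datastores:
            self._owner[d] = plan

    def cleanup(self, plan):
        """Release every datastore held by the plan."""
        self._owner = {d: p for d, p in self._owner.items() if p != plan}


locks = SnapshotLocks()
locks.start_test("Mail Recovery Plan", ["MAIL"])
try:
    # A second administrator starting the same plan is refused.
    locks.start_test("Mail Recovery Plan", ["MAIL"])
except RuntimeError as err:
    print(err)
locks.cleanup("Mail Recovery Plan")
locks.start_test("Mail Recovery Plan", ["MAIL"])  # succeeds after cleanup
```

The lock is keyed on the datastore rather than the plan, so it blocks both a second run of the same plan and any other plan that touches the same datastore.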

As you can see, the most powerful and sensible way to use SRM is to make sure the virtual machines that reflect big infrastructure components in the business are separated out at the storage level. From an SRM perspective, it means we can separate them into logically distinct Protection Groups, and then use those Protection Groups in our Recovery Plans. This is far more flexible than having one flat VMFS volume and just one or two Recovery Plans, and then trying to use Recovery Plan options such as “Recover No Power On Virtual Machines” to control what is powered on during a test. The goal of this section was not to try to change my configuration, but just to illustrate what a real-world SRM configuration might look and feel like. I was able to make all of these changes without powering off the virtual machines, by using Storage vMotion to relocate them to the new LUNs/volumes.

Custom-recovery-plans- (57).jpg

Figure 11.57 The Recovery Plan is unavailable because the Protection Group that backs it is in use elsewhere.

The Lost Repair Array Managers Button

If you have used SRM in previous releases you might wonder what has become of the Repair Array Managers button. This button didn’t repair the storage array so much as let you repair the settings the Recovery Site used to communicate with that array. It allowed the SRM administrator to deal with array configuration settings even in disaster mode. For example, it could be used in the following scenarios:

• The first IP used to communicate to the array is good, but the first controller is unavailable. When the SRA tries to use the second controller it fails because the SRM administrator typed in the wrong IP address, or indeed failed to specify it.

• An individual at the Recovery Site changed the IP address used to communicate to the Recovery Site storage array without informing the SRM administrator.

• An individual at the Recovery Site changed either the username or the password used to authenticate to the array.

In SRM 5.0 the Repair Array Managers button has been deprecated. SRM 5.0 departs significantly from the original design we saw in previous releases, and the redesign has made the feature redundant.


For me, this is one of my biggest chapters in the book because it really shows what SRM is capable of and perhaps where its limitations lie. One thing I found a little annoying is that there’s no drag-and-drop option to reorder virtual machines in a priority list; clicking those up and down arrows for each and every virtual machine gets tedious fast. I found it irritating with just ten virtual machines, never mind hundreds.

Nonetheless, hopefully this chapter gave you a good background on the long-term management of SRM. After all, virtual machines do not automatically become protected simply by virtue of being stored on replicated VMFS volumes. Additionally, you saw how other changes in the Protected Site impact the SRM server, such as renaming datacenters, clusters, folders, networks, and datastores—and how, for the most part, SRM does a good job of keeping that metadata linked to the Recovery Site. It’s perhaps worth highlighting the dependencies within the SRM product, especially between Protection Groups and Recovery Plans.

The fact that we cannot yet back up our Recovery Plans to a file worries me: major changes at the Protected Site, such as unprotecting a VM or even deleting a Protection Group, can lead to a damaged Recovery Plan with no quick and easy way to restore it. As you might have seen, deleting Protection Groups is a somewhat dangerous thing to do, despite the relative ease with which they can be re-created. It unprotects all the virtual machines affected by that Protection Group and removes them from your Recovery Plans. Re-creating those Protection Groups does not put the virtual machines back in their original location in the plans, which forces you to re-create all the settings associated with your Recovery Plans. What we could really use is a way to export and import Recovery Plans so that those settings are not lost. Indeed, it would be nice to have a “copy Recovery Plan” feature so that we can create any number of plans from a base, to work out all the possible approaches to building a DR plan.

Finally, I think it’s a shame that events such as cold migration and Storage vMotion still do not integrate seamlessly with SRM. Hopefully, you saw that a range of different events can occur, which SRM reacts to with various degrees of automation. However, as you will see in the next chapter, it’s possible to configure alarms to tell you if a new virtual machine is in need of protection.