Scripting Site Recovery


Originating Author

Michelle Laverick


Video Content [TBA]


Version: vCenter SRM 5.0

One of the interesting ironies or paradoxes that came to my mind when writing this book was what if, at the point of requiring my DR plan, VMware’s Site Recovery Manager failed or was unavailable? Put another way, what’s our Recovery Plan for SRM? Joking aside, it’s a serious issue worth considering; there’s little point in using any technology without a Plan B if Plan A doesn’t pan out as we expected.

I would like to give a special acknowledgment to four individuals who helped directly with this chapter, with specific reference to the PowerShell aspect. Thank you to Carter Shanklin of VMware, who was happy to answer my emails when he was product manager for VMware PowerShell. Additionally, thank you to Hal Rottenberg, whom I first met via the VMware Community Forum. Hal is the author of Managing VMware Infrastructure with PowerShell. If you wish to learn more about the power of PowerShell, I recommend watching and joining the VMware VMTN Community Forum, and purchasing Hal's book. Thank you to Luc Dekens of the PowerShell forum, who was especially helpful in explaining how to create a virtual switch with PowerShell. Finally, thank you to Alan Renouf of VMware, whom I have known for some time through the London User Group. Alan has been especially helpful in this edition of my book on SRM, both in this chapter and in Chapter 11, Custom Recovery Plans. Alan and Luc, together with Glen Sizemore, Arnim van Lieshout, and John Medd, recently collaborated to produce VMware vSphere PowerCLI Reference: Automating vSphere Administration, a book which sits on my bookshelf and which I highly recommend you read.

Given that the key to any Recovery Plan is replication of data to a Recovery Site, the most important element is already taken care of by your storage array, not by VMware's SRM. Remember, all that SRM is doing is automating a very manual process. Of course, this is not the case if you are using vSphere Replication (VR). So there are really two main agendas behind this chapter: to show how to do everything SRM does manually, in case our Plan A doesn't work out, and to show how incredibly useful SRM is in automating this process. Hopefully you will see in this chapter how hard life is without SRM. Like any automated or scripted process, you don't really see the advantages until you know what the manual process is like.

With that in mind, I could have started the book with this chapter as Chapter 1 or 2, but I figured you would want to dive deep into SRM, which is the topic of this book, and save this content for the end. I also figured that structuring the chapters the way I have would give you an idea of what SRM is doing in the background to make your life much easier. The big advantage of SRM to me is that it grows and reacts to changes in the Protected Site, something a manual process would need much higher levels of maintenance to achieve.

As part of my preparation for writing this chapter, I decided to delete the Protection Groups and Recovery Plans associated with our bidirectional configuration from my Recovery Site (New Jersey). I did this to create a scenario where it appeared there was no SRM configuration in the Recovery Site. The manual recovery of virtual machines will require some storage management, stopping the current cycle of replication, and making the remote volume a primary volume and read-writable. While SRM does this automatically for you using the SRA from your storage vendor, in a manual recovery you will have to do this yourself. This is assuming you still have access to the array at the Protected Site, as with a planned execution of your DR plan. Additionally, once the replication has been stopped, you will have to grant the ESX hosts in the Recovery Site access to the last good snapshot that was taken.

Once the ESX hosts are granted access to the volumes, they will have to be manually rescanned to make sure the VMFS volume is displayed. Based on our requirements and the visibility of the LUN, we will have the option either to not resignature or to force a resignature of the VMFS volume. After we have handled the storage side of things, we will have to edit the VMX file of each virtual machine and map it to the correct network. Then we will be able to start adding each virtual machine into the Recovery Site, each time telling the vSphere client which cluster, folder, and resource pool to use. In an ideal world, some of this virtual machine management could be scripted using VMware's various SDKs, such as the Perl Scripting ToolKit, the PowerCLI, or the vCenter SDK with a language of your choice (VB, C#, and so on). I intend to use PowerCLI for VMware as an example. As you will see, even with scripting, the process is laborious and deeply tedious. Critically, it's quite slow as well, and this will impact the speed of your recovery process. Think of all those RTOs and RPOs…

There are many ways to script a recovery process—as many ways as there are system administrators. But the basics remain the same: At some stage, storage must be presented to the hosts at the Recovery Site, and VMs must be registered and powered on. I've seen many different examples of doing this. One interesting example was presented at the London User Group in 2011. Our guest, Gabrie van Zanten, presented a solution he built using PowerCLI that would export the current configuration of an existing site (clusters, resource pools, folders, and VM names) to a .csv file, and then at the DR location this would be "imported" to the system, effectively creating a mirror image of the production location. I thought this was a very interesting approach to take. In this chapter, I won't try to emulate Gabrie's configuration; I mention it as an example of how many different approaches there are to this issue. Instead, I will be focusing on the core basics that allow for recovery to take place.
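
To give a flavor of the export side of such an approach, here is a minimal sketch in PowerCLI. This is my own illustration, not Gabrie's actual code: I am assuming vcny.corp.com is the Protected Site vCenter, that a C:\dr folder already exists, and the column names are simply my choices:

# Sketch only: capture each VM's name, .vmx path, resource pool, and folder
connect-viserver vcny.corp.com -username corp\administrator -password vmware
get-vm | select-object Name,
    @{N='VMXPath';E={$_.ExtensionData.Config.Files.VmPathName}},
    @{N='ResourcePool';E={$_.ResourcePool.Name}},
    @{N='Folder';E={$_.Folder.Name}} |
    export-csv c:\dr\inventory-mappings.csv -notypeinformation

A matching import sketch appears later in this chapter, in the section on adding virtual machines to the inventory.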

Scripted Recovery for a Test

One of the main tasks to consider if you are planning for a totally scripted DR solution concerns handling storage. The backbone of all DR work is replication, and if your ESXi hosts cannot access the storage and register a VM to the system, your manual DR plan will fail at the first hurdle. With ESXi's rich support for different types of storage, this adds a layer of complexity, especially if you are using all three storage protocols that are supported.

Managing the Storage

Without an SRA, we will have to engage more with the vendor’s tools for controlling snapshots and replication. This area is very vendor-specific, so I refer you to your vendor’s documentation. In general terms, this means taking a snapshot at the Recovery Site array of the LUNs/volumes currently being protected, and then presenting that snapshot volume to the ESX hosts. In the case of block-based storage, the ESX hosts would need to have their HBA rescanned and the VMFS volume resignatured to make it accessible to the ESX hosts. In the case of NFS, it would merely be the process of making sure the snapshot was an NFS export and accessible to the hosts, and then mounting the NFS export to the hosts. In terms of this book, it would be too much to explain how this is done for each storage vendor I have available in my lab. But as an example, on a Dell EqualLogic system this would be achieved by taking a clone of the most recent replica (see Figure 15.1).

When the clone volume replica wizard runs, you assign the replica a new name and allow access to it from the ESX hosts (see Figure 15.2).

At this stage, the storage management is dealt with and the next step is to allow the ESX hosts to see the storage. Of course, there is always the possibility that storage vendors provide their own PowerShell or remote CLI utilities that will allow you to automate this stage as well.


Figure 15.1 Taking a clone of the most recent replica on a Dell EqualLogic system


Figure 15.2 Modifying the ACL on the Access tab of the volume to allow access to the replica from the ESX hosts
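
Incidentally, if you want a quick command-line check that the cloned volume is visible to a host once the rescan (covered next) has completed, something like the following PowerCLI would do; the naa prefix shown here is purely illustrative, and yours will differ:

# Look for the newly presented block device on one host
connect-viserver vcnj.corp.com -user corp\administrator -password vmware
get-vmhost esx3.corp.com | get-scsilun -luntype disk | where {$_.CanonicalName -like "naa.6006048c*"}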

Rescanning ESX Hosts

You should be more than aware of how to do a rescan of an ESX host either from the GUI or from the CLI, but what you are looking for is a new VMFS volume to appear. This rescan has to be done once per affected ESX host, and it would be quite laborious to do this via the vSphere client (and again each time you make a mistake at the storage layer). Of course, you could log in with PuTTY and use esxcfg-rescan, or the vCLI's esxcfg-rescan.pl script. Personally, I prefer to use PowerCLI. The following snippet of PowerCLI rescans all the ESX hosts in vCenter, which is much more efficient from a scripting perspective:

connect-viserver vcnj.corp.com -user corp\administrator -password vmware
get-vmhost | get-vmhoststorage -rescanallhba

The syntax of the above snippet of PowerShell is relatively easy to explain. Get-vmhost retrieves the names of all the ESX hosts in the vCenter that you authenticated to, and this is piped to the get-vmhoststorage cmdlet, which then rescans all the ESX hosts. Get-vmhoststorage supports a -rescanAllHba switch, which does exactly what you think it does.
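
In a larger environment you may not want to rescan every host in vCenter. A small variation, assuming your recovery hosts live in a dedicated cluster (the name "NJ Cluster" here is just from my lab, not something the chapter prescribes), would be to narrow the scope first:

# Rescan only the hosts in the recovery cluster, rather than every host
get-cluster "NJ Cluster" | get-vmhost | get-vmhoststorage -rescanallhba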

Resignaturing VMFS Volumes

Now we need to mount the VMFS volume to the ESX hosts. From a PowerCLI perspective, no cmdlets currently exist to carry out a resignature of a VMFS volume. However, it is possible to dive into the SDK, as I sketch at the end of this section. As we saw in earlier chapters, we can use the Add Storage Wizard to resignature VMFS volumes, and both the Service Console and the vCLI contain esxcfg-volume and vicfg-volume commands, which allow you to list LUNs/volumes that contain snapshots or replicated data. From the vCLI you can use the following command to list such LUNs/volumes:

vicfg-volume.pl -l --server=vcnj.corp.com --username=corp\administrator --password=vmware --vihost=esx3.corp.com

This will produce the following output:

VMFS UUID/label: 4da70540-50ec1cd9-843d-0015600ea5bc/dell-eql-virtualmachines
Can mount: Yes
Can resignature: Yes
Extent name: naa.6006048c68e415b4d1c49e2aac2d4b4a:1 range: 0 - 100095 (MB)

With the esxcfg-volume and vicfg-volume commands you have the choice of either mounting the VMFS volume under its original name, or resignaturing it. To mount the volume without a resignature you would use:

vicfg-volume.pl -m 4b6d8fc2-339977f3-d2cb-001560aa6f7c dell-eql-virtualmachines --server=vcnj.corp.com --username=corp\administrator --password=vmware --vihost=esx3.corp.com

If you did want to carry out a resignature, you would simply replace the -m (mount) switch with the -r switch, like so:

vicfg-volume.pl -r 4b6d8fc2-339977f3-d2cb-001560aa6f7c dell-eql-virtualmachines --server=vcnj.corp.com --username=corp\administrator --password=vmware --vihost=esx3.corp.com

If you want to rename the VMFS volume as SRM does after a resignature, you can use this piece of PowerShell. In this case, the PowerCLI searches for my resignatured volume called "snap-17ab25ce-dell-eql-virtualmachines," and renames it to "dell-virtualmachines-copy":

set-datastore -datastore (get-datastore *dell-eql-virtualmachines) -name dell-virtualmachines-copy
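
As for diving into the SDK, the fragment below sketches how a resignature could be driven through get-view and the vSphere API's HostStorageSystem object. Treat it as illustrative rather than production-ready: I am assuming the host can already see the replicated LUN and that only one unresolved VMFS volume is found:

# Find unresolved (snapshot/replica) VMFS volumes on the host and
# resignature the first one, using the vSphere API rather than a cmdlet
$vmhost = get-vmhost esx3.corp.com
$storsys = get-view $vmhost.ExtensionData.ConfigManager.StorageSystem
$unresolved = $storsys.QueryUnresolvedVmfsVolume()
$spec = new-object VMware.Vim.HostUnresolvedVmfsResignatureSpec
$spec.ExtentDevicePath = @($unresolved[0].Extent | % {$_.DevicePath})
$storsys.ResignatureUnresolvedVmfsVolume($spec)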

Mounting NFS Exports

Of course, your environment might not be using block-based storage and VMFS; you may be using NFS instead. To some degree, automating this is much easier because you do not have the issue of potentially resignaturing VMFS volumes. In the example below I use the get-vmhost cmdlet to retrieve all the hosts in the vCenter, and then use a foreach loop to say that for each ESX host ($vmhost) found in the collection, carry out the NFS mounting process using the new-datastore -nfs cmdlet:

connect-viserver vcnj.corp.com -user corp\administrator -password vmware
foreach ($vmhost in (get-vmhost))
{
    new-datastore -nfs -vmhost $vmhost -nfshost 172.168.3.89 -path /vol/cloneofreplica-virtualmachines -name virtualmachines
}

Creating an Internal Network for the Test

It is part of my standard configuration on all my ESX hosts that I create a port group called "internal" which is on a dedicated switch with no physical NIC uplinked to it. However, you might wish to more closely emulate the way SRM does its tests of Recovery Plans, and create a "testbubble" network. Creating virtual switches in PowerCLI is very easy, and again we can use the logic of a foreach loop to apply this to all the ESX hosts.

connect-viserver vcnj.corp.com -username corp\administrator -password vmware
foreach ($vmhost in (get-vmhost))
{
    $vs = new-virtualswitch -vmhost $vmhost -name "recovery-vswitch1"
    $internal = new-virtualportgroup -virtualswitch $vs -name "testbubble-1 group"
}

IMPORTANT NOTE: Remember, when you're dealing with port groups on Standard vSwitches (SvSwitches), the label (in this case, "testbubble-1 group") is case-sensitive.

Adding Virtual Machines to the Inventory

The next stage involves interrogating the datastores that were mounted earlier in this exercise. If you were doing this manually, it would involve browsing the datastores and looking for the VM directories that contain .vmx files, and then right-clicking each one and adding it to the inventory. While using the Add to Inventory Wizard, you would need to select a vSphere cluster, ESX host, folder, and resource pool location for the VM. Remember, you would have to repeat these steps for every virtual machine that needs to be recovered. This is made more complicated by the fact that there is no "inventory mapping" as there is in SRM, so you will be forced to work out the correct resource pool and folder for each VM yourself. I think it's in this area where Gabrie's approach becomes interesting. If we had a record of all the VMs from the Protected Site, a .csv file that listed all the VMs together with their correct resource pool and folder location, life would be much simpler. Essentially, this .csv file would become not unlike the SRM database that holds the inventory mapping data. (I sketch this idea at the end of this section.)

It is possible to automate the process of adding a virtual machine to an ESX host (not a cluster) using the command-line ESX host tool called vmware-cmd; unfortunately, this tool cannot handle vCenter metadata, such as folder location and resource pools. Perhaps a better approach is to use some PowerCLI. Once we know the path to the VM, we can think about trying to register a VM. There is a New-VM cmdlet which we can use to handle the full registration process, including the ESX host, resource pool, and folder location in the vCenter inventory, like so:

connect-viserver vcnj.corp.com -username corp\administrator -password vmware
new-vm -vmhost esx3.corp.com -vmfilepath "[dell-virtualmachines-copy] fs05/fs05.vmx" -resourcepool "file servers" -location "file servers"

Remember, this registration process would have to be repeated for every VM needing recovery, on every recovered VMFS volume, which would be very time-consuming. A better method is to use PowerCLI's ability to query datastores for files ending with ".vmx" and then pipeline the results to the new-vm cmdlet.

connect-viserver vcnj.corp.com -username corp\administrator -password vmware
dir 'vmstores:\vcnj.corp.com@443\nj datacenter\dell-virtualmachines-copy\*\*.vmx' | % {new-vm -vmhost $vmhost -vmfilepath $_.datastorefullpath}

If your datastore and VM placement were more discrete, the same command could contain the -resourcepool and -location parameters as well:

connect-viserver vcnj.corp.com -username corp\administrator -password vmware
dir 'vmstores:\vcnj.corp.com@443\nj datacenter\db-copy\*\*.vmx' | % {new-vm -vmhost $vmhost -vmfilepath $_.datastorefullpath -resourcepool "DB" -location "DB"}

Of course, there are many ways to approach this complex registration process. For example, Luc Dekens has a rather good function called Register-VMX which he explains in detail on his site:

http://www.lucd.info/2009/12/02/raiders-of-the-lost-vmx/

You might find that you have a blend of different VMs in the same datastore, but different VMs need to be located in the correct resource pool and folder. If that’s the case, you could register the VMs first, and then move them with:

connect-viserver vcnj.corp.com -username corp\administrator -password vmware
move-vm -vm db* -destination (get-folder -name db)
move-vm -vm db* -destination (get-resourcepool -name db)
move-vm -vm fs* -destination (get-folder -name "file servers")
move-vm -vm fs* -destination (get-resourcepool -name "file servers")
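
This is also where the .csv export sketched at the start of this chapter could pay off. The fragment below is hypothetical: it assumes a file with Name, VMXPath, ResourcePool, and Folder columns, and that any datastore names embedded in the paths have been adjusted for the post-resignature names:

# Register each VM from the .csv and drop it into its recorded location
connect-viserver vcnj.corp.com -username corp\administrator -password vmware
import-csv c:\dr\inventory-mappings.csv | % {
    new-vm -vmhost esx3.corp.com -vmfilepath $_.VMXPath -resourcepool $_.ResourcePool -location $_.Folder
}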

Fixing VMX Files for the Network

One of the critical tasks SRM automates is the mapping of the VM to the correct network port group on a vSwitch. If you have added your virtual machines into vCenter (the previous task), you can automate this property change (as I mentioned previously) with PowerCLI. The commands required to achieve this depend very much on whether you are working with SvSwitches or Distributed vSwitches (DvSwitches). With SvSwitches it is merely an issue of searching for the VMs that have a certain port group label, such as "vlan11," and replacing it with the desired port group, say, "testbubble-1 group":

get-vm | get-networkadapter | sort-object -property "networkname" | where {$_.networkname -eq "vlan11"} | set-networkadapter -networkname "testbubble-1 group" -confirm:$false

If, on the other hand, you are working with the DvSwitch, it is a little bit trickier. Currently, there is a gap between the functionality of the cmdlets for SvSwitches and DvSwitches. The piece of PowerCLI shown above simply won’t work currently for a manually recovered VM which was configured for a DvSwitch from a different vCenter, because a VM that is configured for a DvSwitch holds unique identifiers for the DvSwitch at the Protected Site, and these simply don’t exist at the Recovery Site. When this VM is manually recovered without SRM because no inventory mapping process is in place, the VM will lose its connection to the DvSwitch as it now resides in a new vSwitch. Essentially, the VM becomes orphaned from its network configuration. This shows itself as an “invalid backing” for the network adapter.

You might be interested to know that the aforementioned Luc Dekens (of http://lucd.info fame) has a whole series of articles on handling DvSwitches using PowerCLI. He's even gone so far as to write his own functions (which behave just like regular cmdlets) to address this functionality gap in the PowerCLI. I wouldn't be surprised if there are new cmdlets in the next release of the PowerCLI to address this limitation. For the moment, unfortunately, it's perhaps simpler to have one vCenter for both sites. In this case, the ESX hosts in the DR site would share the same switch configuration as the ESX hosts in the primary site. However, a word of warning: Such a configuration runs entirely counter to the structure of VMware SRM, which demands a different vCenter for each Protected and Recovery Site. So, if you cook up your manually scripted solution and you later decide to adopt SRM, you will have a great deal of pruning and grafting to do to meet the SRM requirements.

Specifically, Luc has been working on DvSwitch equivalents of the SvSwitch cmdlets called get-networkadapter and set-networkadapter. Luc’s get-dvswnetworkadapter and set-dvswnetworkadapter functions are much easier to use. To use his functions, create or open your preferred PowerShell profile. If you don’t know what PowerShell profiles are or how to create them, this Microsoft Web page is a good starting point:

http://msdn.microsoft.com/en-us/library/bb613488%28VS.85%29.aspx
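
For what it's worth, one quick way to create a profile if you don't yet have one is shown below; this is standard PowerShell, nothing PowerCLI-specific:

# Create the per-user profile file if it's missing, then open it for editing
if (!(test-path $profile)) { new-item -itemtype file -path $profile -force }
notepad $profile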

Next, visit Luc’s website at the location below and then copy and paste his functions into the profile:

http://lucd.info/?p=1871

Using these functions, you can run commands such as the example below to set every VM that begins with ss0* to use vlan55:

get-dvswnetworkadapter (get-vm ss0*) | set-dvswnetworkadapter -networkname "vlan55" -StartConnected:$true

I’m sure Luc will carry on improving and extending the features of his functions, and I heartily recommend his series of articles on PowerCLI and DvSwitches.
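
With the networking repaired, the one remaining step that SRM would normally sequence for you is powering on the recovered VMs. A simple version with no priority ordering (my sketch, reusing the db* and fs* names from earlier) might be:

# Power on all recovered VMs; -runasync avoids waiting on each one in turn
get-vm db*, fs* | start-vm -runasync -confirm:$false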

Summary

As you can see, the manual process is very labor-intensive, which is only to be expected given the word manual. You might have gotten the impression that this issue can be fixed by some whiz-bang PowerShell scripts. You might even have thought, "This sucks; why do I need SRM if I have these PowerShell scripts?" However, it's not as simple as that, for two main reasons. First, there's no real support for this home-brewed DR; and second, you can test your scripts all you want, but your environment will change, those scripts will go out of date, and they will need endless reengineering and retesting.

In fact, the real reason I wanted to write this chapter is to show how painful the manual process is, to give you a real feel for the true benefits of SRM. I know there are some big corporations that have decided to go down this route, primarily because of a number of factors. First, they have the time, the people, and the resources to manage it—and do it well. Second, they were probably doing this manually even before SRM came on the scene. At first glance their manual process probably looked more sophisticated than the SRM 1.0 product; they might feel that way about SRM 4.0. Personally, I think as SRM evolves and improves it will become increasingly harder to justify a home-brewed configuration. I think that tipping point will probably come with SRM 5.0.