Hyper-V R2ality: A Simple Plan – Hyper-V Recovery Manager Preview
Because what could possibly go wrong?
I found a number of different articles and tutorials useful in my experiments. As ever there isn’t really a “killer” article – and you do have to rather stitch together the various parts. But in the interest of helping folks I’ve gathered those resources here:
Deployment Guide for Hyper-V Recovery Manager:
Configure Windows Azure Hyper-V Recovery Manager:
Manage vault certificates:
Windows Azure: Backup Services Release, Hyper-V Recovery Manager, VM Enhancements, Enhanced Enterprise Management Support:
Recommendation: I wouldn’t recommend using Azure HRM if you haven’t yet deployed Windows Hyper-V 2012 R2. I couldn’t get the networking functioning with the previous release. Oh, and don’t bother trying to upgrade from R1 to R2; like many Microsoft Windows “upgrades” it’s horrible. I think you’re better off with the old faithful clean install. More about this in my next post!
- I had poor experiences with the previous release of SCVMM/Hyper-V. The recent R2 release did actually work. Sadly, I think to get the most out of Azure HRM you are talking about an upgrade (don’t even think about that!) or a new R2 configuration. Network mapping doesn’t seem to work without the R2 release.
- HRM is a “Hyper-V Replica Broker”-only solution, so you cannot use array-based replication with it…
- SCVMM is the weakest link, and occasionally needs reboots to work with HRM
- Expect to bounce between various interfaces to monitor things – Hyper-V Replica, Failover Manager, SCVMM and Azure HRM
- Although you can export the result of jobs (Excel only) you cannot export the recovery plans…
- HRM is really a glorified start-up list – basically starting VMs up in the right order – with manual steps and scripts being called – other features reside elsewhere (re-IP in Hyper-V Manager, for example) or don’t exist at all…
Aside: One thing I’ve noticed in the time I’ve spent with virtual DR technologies is the innumerable terms we use for the same darn thing – such as the Production, Source, Protected and Primary Site, and the Target, Destination and Recovery Site. Sadly, the Azure HRM preview suffers from this – so watch out for these terms and don’t get them muddled up!
In my previous article on Microsoft Hyper-V Replica I had a configuration where replication was occurring between two clusters within the same “System Center Virtual Machine Manager” (SCVMM) environment. To check out the Microsoft preview of their “Azure Hyper-V Recovery Manager” (Azure HRM) I needed to reconfigure my environment. One of the requirements of HRM is two SCVMM instances in different datacenters – for that reason I created a second SCVMM instance, and moved my cluster over there.
So here I have two SCVMM instances, one for New York and the other for New Jersey. This is a similar configuration to the one I have been using for a while with VMware’s Site Recovery Manager. This pruning and grafting process looks straightforward, but it did take a reboot and a re-install of the SCVMM agent to make the Windows Hyper-V “lab13” and “lab14” servers communicate properly. Incidentally, despite the name “GoldR2” both the Windows Hyper-V and SCVMM are using the same release. My plan at some later stage is to attempt an upgrade from the Service Pack 1 version to the recently released R2 build.
This requirement of needing two SCVMM instances is not unlike VMware SRM, which also requires a vCenter at each site – what makes the technologies different is that Azure HRM also requires a “cloud” configuration in both sites. It is, however, possible to have one SCVMM with two clouds defined, and set up Azure HRM that way.
As we saw in the previous post, Windows Hyper-V Replica adds replication and very simple options for failover and failback. This is all managed from the Hyper-V Manager or Failover Manager, and not from SCVMM. Indeed, it leads to situations where SCVMM is clueless about the process that’s happening behind its back. It’s a pretty typical example of Microsoft Management Jive (MMJ!), where one management tool causes another to be upset or confused. What I was looking for was to see if the much trumpeted “Azure Recovery Manager” added much to the underlying replication provided by Windows Hyper-V Replica.
That’s the important aspect of Azure HRM to bear in mind. This is NOT disaster recovery TO the cloud. As the diagram from Microsoft illustrates, all that Azure HRM offers is an orchestration layer held within Azure. The recovery still needs at least two sites connected with enough bandwidth to keep up with the “churn” of disk changes taking place inside the protected VMs.
Inside the Cloud Washing Machine
Creating a cloud in SCVMM is very easy – almost too easy. Some people might call it cloud washing, which would be a little unfair. This term is generally used to describe taking an existing technology and rebadging/remarketing it as a “cloud solution”. It’s not unlike the predilection of some third parties in the previous decade to append the letter “v” to an existing technology, in order to give the impression it’s specifically designed for virtualization. Of course cloud has now given way to “software-defined”, and I doubt it will be long before Microsoft announces the “Microsoft Software-Defined Operating System” or “MS-DOS” for short. 🙂
You can create the Microsoft version of a cloud in SCVMM from the “VMs and Services” view.
The wizard that runs allows you to select the Windows Hyper-V cluster and Logical Networks to add to the cloud. There are some options that require pre-configuration in the “Fabric” layer beforehand if you want to specify them. For example there’s support for third-party load-balancers, which need to be added into the “Fabric” view before they can be selected from here.
Note: This library path is “owned” by the cloud being defined. It cannot be reused by any subsequent cloud that’s created – this means each “cloud” must have its own VMM Library Share. This will become significant later on, as Azure HRM is dependent on the VMM Library for holding any scripts that are called during the running of a recovery plan. So it means for each cloud you create you need a VMM Library location populated with your scripts.
Note: It’s this part of the wizard that perhaps makes this a more “cloudy” configuration – in the sense that the previous parts were merely the process of creating a container object that pointed at resources at the virtual layer. The “capacity” element allows the administrator to take the total compute resources and make a “sub-allocation”, by disabling the “Use Maximum” options and assigning capacity to this cloud.
Configuring Azure Hyper-V Recovery Manager for the First Time
Creating an Azure HRM Vault
There’s a run-once setup routine for enabling Azure HRM for the first time once your Azure account has been activated for the preview. Firstly, you must create what Microsoft calls a “vault” in the “Recovery Services” pane of the Azure portal.
This opens a wizard that allows you to define a “vault” to be used by Azure HRM, from which you can specify a name, and location for where it is stored. Currently three locations are available – West US, East Asia and West Europe.
Once the vault has been created you can select it, and you will be greeted with a “Quick Start” welcome screen.
Upload a Certificate File (.cer)
As you can see there, the first thing that is required is a certificate. There is one certificate required per vault – so this isn’t a certificate that comes from either SCVMM instance; it’s used to verify you trust the vault and its contents. Microsoft outlines in detail the certificate requirements in this article:
These include the following requirements:
- You can use any valid SSL certificate that is issued by a Certification Authority (CA) that is trusted by Microsoft (and whose root certificates are distributed via the Microsoft Root Certificate Program). For more information, see Microsoft article 931125.
- Alternatively you can use a self-signed certificate that you create using the Makecert.exe tool.
- The certificate should be an X.509 v3 certificate.
- The key length should be at least 2048 bits.
- The certificate must have a valid ClientAuthentication EKU.
- The certificate must be currently valid, with a validity period that does not exceed three years. You must specify an expiry date, otherwise a default setting that is valid for more than three years will be used.
- The certificate should reside in the Personal certificate store of your Local Computer.
- The private key should be included during installation of the certificate.
- To upload the certificate to the portal, you must export it as a .cer format file that contains the public key.
- Each vault only has a single certificate associated with it at any one time. You can upload a certificate to overwrite the current certificate associated with the vault at any time.
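To make that list a little easier to tick off, here is a minimal Python sketch of the checks as I read them. It does not parse a real certificate: `CertSummary` and `vault_cert_problems` are hypothetical names of my own, and the values would have to be typed in by hand from the certificate's details. The "three years" limit is approximated as 3 x 365 days, which is an assumption of the sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# OID for the Client Authentication extended key usage.
CLIENT_AUTH_EKU = "1.3.6.1.5.5.7.3.2"


@dataclass
class CertSummary:
    """Hypothetical hand-entered summary of a certificate's properties."""
    x509_version: int
    key_bits: int
    ekus: tuple
    not_before: datetime
    not_after: datetime
    has_private_key: bool


def vault_cert_problems(cert, now):
    """Return a list of requirement violations; an empty list means none found."""
    problems = []
    if cert.x509_version != 3:
        problems.append("not an X.509 v3 certificate")
    if cert.key_bits < 2048:
        problems.append("key shorter than 2048 bits")
    if CLIENT_AUTH_EKU not in cert.ekus:
        problems.append("missing Client Authentication EKU")
    if not (cert.not_before <= now <= cert.not_after):
        problems.append("not currently valid")
    # Three years approximated as 3 x 365 days (assumption of this sketch).
    if cert.not_after - cert.not_before > timedelta(days=3 * 365):
        problems.append("validity period exceeds three years")
    if not cert.has_private_key:
        problems.append("private key not installed")
    return problems


# Example: a 1024-bit certificate issued for five years.
cert = CertSummary(3, 1024, (CLIENT_AUTH_EKU,),
                   datetime(2013, 10, 1), datetime(2018, 10, 1), True)
print(vault_cert_problems(cert, datetime(2013, 11, 1)))
```

The example certificate is flagged twice, once for the short key and once for the over-long validity period, which are the two settings I had to change on the request template.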
Those are quite hefty requirements and I imagine the main thing folks might struggle with is understanding and applying all these attributes so the certificate is properly accepted by the service. In my case I was fortunate enough to have an Enterprise Root CA on my network, and so from my SCVMM server I was able to use the MMC to make a request. The “details” and “properties” of the “computer” certificate request template can be adjusted to meet the requirements of the Azure HRM service.
Critically, this means increasing the number of bits used by the certificate, and ensuring the private key is included in the enrolment process. These settings are held within the “private key” tab.
Once you click Enroll, this generates a certificate and stores it in the “Personal” store of the local computer:
Once the certificate has been issued, it can be exported into the .cer format. In this case you don’t export the private key, so the exported file contains only the public key.
Using the “Manage Certificate” option allows you to upload the certificate file.
Subsequent SCVMM hosts that are added to Azure HRM will need this certificate imported into their certificate store. This could be done using Active Directory, but you can also export the certificate into .pfx format including the private key – and then import it into the certificate store of the destination SCVMM host.
Export the Certificate using the MMC focused on “Computer”:
Import the Certificate with the MMC focused on “Computer”:
Download and Install the Azure HRM Provider
Despite being an online service, Azure HRM does require that a “provider” plug-in be installed into the SCVMM server. This is the “DRP” component that you sometimes see in architecture diagrams.
This component extends the UI of the SCVMM console to add the ability to “Enable Replication” on a VM. This is singularly absent in the basic Windows Hyper-V Replica model, where the Hyper-V Manager triggers protection. For “VMMHRMProvider_x64.exe” to install, you must be running either SCVMM R2 or the SP1 version with “cumulative update 3” or later installed. Without these versions the provider .exe install will not complete. The most recent cumulative update is version 6, which was made available on August 8th, 2013. Remember, if you’re behind on Windows Updates you will often have to do two rounds of it – one to get the updates that allow you to install further updates.
If you are not using Windows Update to patch the SCVMM servers then these cumulative updates can be downloaded individually.
When you finally get Windows Hyper-V 2012 patched up to the hilt (and believe me it does take time, updating, re-updating and rebooting again and again) you should eventually be confronted with this welcome screen from the HRM Provider setup.
IMPORTANT: As the “Getting Started” screen says, you must stop the SCVMM service before proceeding.
After the file copy, and configuring your proxy server settings as befits your environment, you’ll be asked to browse for the certificate used to secure the “vault”. Clicking the browse button will allow you to select the certificate from the local computer store:
Next you can specify a friendly name for the SCVMM, and register it with the “vault” on Azure.
This should complete the registration process, and restart the SCVMM service when you close the setup program.
Once the registration process has completed you should see references to these friendly names under the “resources” menu, and “servers” like so in Azure HRM:
Configure Protection Setting for VMM Clouds
During the registration process the clouds that are present in the SCVMM configuration should be “published” to the “vault”. The reference to these clouds should appear under “Protected Items” in the Azure Portal like so:
In my simple configuration there are only two clouds. So the New York cloud will be regarded as the “Primary” with New Jersey as the “Target”. According to Microsoft:
“A cloud can only belong to a single cloud mapping—either as a primary or a target cloud.”
On the surface this would appear to preclude the ability to reverse the replication and fail back to the original site. This is not the case: once a planned or unplanned failover has occurred, a secondary process is triggered which allows the administrator to “commit” the switch of the sites, and reverse the replication.
Once the pairing has been made one cloud becomes the “Protected Cloud” and the other becomes the “Recovery Cloud”.
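The one-mapping-per-cloud rule quoted above is easy to model. The following is only a sketch of the constraint as Microsoft words it, not of how Azure HRM enforces it; `validate_cloud_mappings` is a made-up helper and the New Jersey cloud name is a placeholder of mine.

```python
def validate_cloud_mappings(mappings):
    """Check a list of (primary_cloud, target_cloud) pairs.

    Raises ValueError if any cloud appears in more than one mapping,
    whether as a primary or as a target.
    """
    seen = {}
    for primary, target in mappings:
        for cloud in (primary, target):
            if cloud in seen:
                raise ValueError(
                    f"{cloud} already belongs to mapping {seen[cloud]}"
                )
            seen[cloud] = (primary, target)
    return True


NY = "CorpHQ-NewYork-Cloud"
NJ = "NewJersey-Cloud"  # placeholder name for illustration

print(validate_cloud_mappings([(NY, NJ)]))

try:
    # A second, reversed pairing reuses both clouds and is rejected.
    validate_cloud_mappings([(NY, NJ), (NJ, NY)])
except ValueError as err:
    print("rejected:", err)
```

Under this rule a separate reverse pairing of the same two clouds is not allowed, so reversal has to go through the “commit” step after a failover rather than through a second mapping.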
This does seem unnecessarily complicated – for instance in VMware Site Recovery Manager (SRM) any pairing between two sites is automatically bi-directional, and SRM’s failback routines will even assist in inverting the failover process (aka failback) by reversing the replication from one storage array to another, or by utilizing vSphere Replication.
The “pairing” process is relatively straightforward. The administrator clicks the arrow next to the name of the cloud.
Next select the option to “Configure Protection Settings”, and in the subsequent page select the “Target Location” and “Target Cloud”. This will expand the page to show a great many options for how replication is to be managed. These are divided into two category types called “Replication Location & Frequency” and “Replication Settings”.
Replication Location & Frequency Explained:
The “Copy Frequency” controls how often replication takes place. This setting is only applied to VMs running on Windows Hyper-V 2012 R2; for all other editions this setting is ignored. The frequencies are pre-set to 30 seconds, 5 minutes or 15 minutes, and there is no “slider bar” to control the replication (or copy) frequency. Whilst on the surface it might be impressive to allow replication to occur every 30 seconds, you have to wonder how many customers would have the appropriate bandwidth to make this serviceable – and indeed have a use case that demands this level of frequency. I imagine in most cases if customers do need two sites to be kept in a synchronous state they are probably using dark-fibre and array-based replication, rather than hypervisor-based replication.
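A bit of back-of-the-envelope arithmetic shows why the 30-second option is demanding. The sketch below assumes a single burst of changed data that has to be shipped before the next copy cycle starts, and it ignores compression and protocol overhead; the 500 MB burst and 10 Mbit/s link are invented figures, not measurements from my lab.

```python
def transfer_seconds(delta_mb, link_mbit_per_s):
    """Seconds needed to ship delta_mb megabytes over the given link."""
    return delta_mb * 8 / link_mbit_per_s


def keeps_up(delta_mb, link_mbit_per_s, interval_s):
    """True if the delta can be sent before the next copy cycle begins."""
    return transfer_seconds(delta_mb, link_mbit_per_s) <= interval_s


# A 500 MB burst of changes over a 10 Mbit/s WAN link takes 400 seconds.
for interval_s in (30, 300, 900):
    print(interval_s, keeps_up(500, 10, interval_s))
```

With those numbers the link only keeps up at the 15-minute setting; the 30-second and 5-minute cycles would fall behind.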
As with the Hyper-V Replica settings you can also configure the creation of rollback “Recovery Points”, indicating their number and how frequently they are taken. In my case I was only allowed a maximum of 15 recovery points, taken every hour (or less frequently). With Windows Hyper-V 2012 R2 this number has increased to 24 recovery points.
Note: One thing I later discovered when I experimented with Azure HRM with the R2 release of Windows Hyper-V 2012 was that the number of additional recovery points remained capped at 15. I’m not sure why that is – R2 is meant to support 24 recovery points taken once per hour – which would allow for a maximum of 24 levels of undo per day.
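The difference between the two limits is simple arithmetic: the number of recovery points multiplied by the gap between them gives the rollback window. A tiny sketch, with a helper name of my own choosing:

```python
def rollback_window_hours(recovery_points, hours_between_points=1):
    """How far back you can roll, given the number of points and their spacing."""
    return recovery_points * hours_between_points


print(rollback_window_hours(15))  # hourly points, 15 allowed
print(rollback_window_hours(24))  # hourly points, 24 allowed
```

So with hourly points, 15 recovery points cover 15 hours of undo, whereas 24 would cover a full day.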
Replication Settings Explained:
The “Data Transfer Compression” option is used to compress the data stream before it is replicated to the target location. Despite the obvious CPU overhead of any compression, the bandwidth saving most likely outweighs that penalty. CPU cycles are plentiful after all, when compared to WAN bandwidth.
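To get a feel for that trade-off, here is a small sketch using Python’s standard `zlib` module. It is a stand-in only: I don’t know which compression algorithm Hyper-V Replica uses, and the sample payload is deliberately repetitive, so real VM disk changes will compress differently.

```python
import time
import zlib


def compression_report(payload, level=6):
    """Compress payload with zlib and report the sizes and time taken."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    return {
        "original_bytes": len(payload),
        "compressed_bytes": len(compressed),
        "ratio": len(compressed) / len(payload),
        "seconds": elapsed,
    }


# Invented, highly repetitive sample data (assumption for illustration).
sample = b"INSERT INTO orders VALUES (42, 'widget');\n" * 50_000
report = compression_report(sample)
print(report)
```

The time spent compressing is the CPU cost; the ratio is the share of WAN bandwidth you still have to pay for.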
The port number controls what TCP port is used to allow the replication to occur. I must admit I found the use of 8084 a bit confusing considering that Windows Hyper-V Replica Brokers normally support either HTTP/80 or HTTPS/443. These port settings are configured when you add the “Windows Hyper-V Replica Broker” role to an existing Windows Hyper-V Cluster.
I was curious to know what the effect of leaving this port assignment untouched would be. I assumed unless the Azure HRM service was told of the correct port number it would fail…
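One quick way to see which port a broker is really listening on is to try a plain TCP connection from the other site. This is a generic sketch using Python’s standard `socket` module; it only proves the port accepts connections, not that Hyper-V Replica is correctly configured behind it.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `port_is_open("replica-broker.example", 8084)` against `port_is_open("replica-broker.example", 80)` would show which of the two configurations is live; the hostname here is a placeholder, not a name from my lab.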
The “Replication Method” allows for replication across the network as controlled by the “Replication Start Time”. Alternatively, the “Offline” method allows you to pre-populate the target with copies of the VMs to be replicated to reduce the bandwidth required to complete the first synchronization.
Once you’re done with this page, the “Save” button at the end will retain your configuration to the “vault”. It’s important to note that you have a one-time chance to select the right target from the list. If you don’t, you have to remove the references to the clouds altogether – and add them back in. Microsoft explains that at this stage three processes are happening in the background:
- Firewall rules used by Hyper-V Replica are configured, and the ports required for replication traffic are opened.
- Certificates required for replication are installed.
- Hyper-V Replica settings are configured on the Brokers
The confirmation of this option starts a lengthy “job” which carries out the necessary configuration of the “Hyper-V Hosts”.
It’s interesting to consider the number of different tiers of management interaction taking place to push these settings down. The Azure HRM service is speaking to the SCVMM servers, which are speaking to the Windows Hyper-V Hosts in each cluster – which in turn are communicating with the Windows Hyper-V Replica Brokers where these settings fundamentally reside. Perhaps it’s the sheer complexity of all those layers that led to a failure of this job when I approved it.
Looking at the error details on the job, the Azure HRM did flag-up a misconfiguration on the New Jersey cloud.
I’d been a bit slapdash in the configuration of the New Jersey cloud, and that had caused this error. By mistake I’d enabled VMware ESX as the “Capability Profile” instead of Windows Hyper-V.
I decided to investigate how the Hyper-V Replica Broker had been reconfigured – and pretty quickly I could see the problem. The role had terminated and failed to start correctly. Sadly, however, this didn’t resolve my problem. Ironically, the system did pass Windows Hyper-V R2 hosts – despite the fact neither of my SCVMM hosts actually has an R2 host!!!
I was left with one final error to resolve. Unfortunately, the help was less than helpful.
You see GoldR2 isn’t actually a “server” I can log into and use “winrm helpmsg hresult” against. It’s the Hyper-V Replica “role” within the Windows Hyper-V failover cluster. It’s more akin to a process that runs under the pseudo-identity of a host on the network.
Further investigation of the Hyper-V Replica Broker role did show that it was running on the New York cluster with the right configuration (HTTPS/8084) – but on the New Jersey cluster it was still operating using my original configuration (HTTP/80).
As you can see it’s as if the Azure HRM hasn’t communicated via New Jersey’s SCVMM down to the Windows Hyper-V cluster to carry out the reconfiguration of the Hyper-V Replica Broker. I discovered that not only had the protocol not been changed to HTTPS/8084, but also that the certificate hadn’t been installed either. It was tricky to work out what had gone wrong – given that the SCVMM of New Jersey had successfully registered – and was reporting SCVMM information as well as the unique information from the “Provider”. Perhaps there was a problem with communication between the Azure HRM and the SCVMM – or maybe there was an issue with my certificate after all, even though the status told me everything was hunky-dory?
So in the end I fixed it. Using the time-honoured Windows Administrator approach.
Reader, I rebooted the SCVMM in New Jersey – and tried again.
By now I’d spent all day getting this far.
Configure Network Mappings
Network mapping is a concept I’m familiar with from using VMware’s Site Recovery Manager. The idea is a simple one. The networks at SiteA may use different VLANs/subnets to SiteB. When a VM is brought up in a different site its configuration will need to be adjusted. Not only will it need to be attached to the right network, it will also need its IP address reconfigured at the same time. There are considerable risks in this process – especially during a test of the DR plan, if two nodes with the same identity are put on the network at the same time. This mapping is done to the “VM Networks” within both “clouds”. What makes this aspect interesting is a significant and welcome change to networking that comes with Windows Hyper-V 2012 R2. In the “bad old days” (last year) the VLAN settings were configured on a per-VM basis like so:
With the onset of the R2 release, VLAN definitions now reside within the “Network Site” configurations within the “Logical” switch.
It’s an infinitely more “Logical” way of handling VLAN definitions. It’s an approach that VMware has had since about 2003. You see Microsoft IS catching up with VMware – one slow decade at a time. 🙂
I think because of this change the mapping process will look considerably different in R2 when compared to the first release of Windows Hyper-V 2012. Once I’ve finished playing with Azure HRM, my next project is to upgrade my Windows Hyper-V environment to R2. [ED. Good luck with that Mike!].
Joking apart, my main concern here is with those using the first release of Windows Hyper-V 2012. There just seems to be NO way of mapping a VLAN in one site to a VLAN in another. That’s sure to create anxiety around whether using Windows Hyper-V 2012 R2 will be a requirement to get the full functionality out of the service.
You’ll find the network mappings are held under “Resources” and “Networks” in the Azure Portal. When you use it make sure you have the “Source” (or Primary) cloud selected first, and the “Target” location second. You select the VM network you wish to map, and click the “Map” icon at the bottom of the screen.
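Conceptually a network mapping is nothing more than a lookup table from source VM networks to target VM networks. The sketch below is my own illustration of that idea, with invented network names; it says nothing about how Azure HRM stores the mappings internally.

```python
# Invented VM network names for the two sites (assumptions for illustration).
NETWORK_MAP = {
    "NY-Production-VLAN10": "NJ-Production-VLAN20",
    "NY-DMZ-VLAN30": "NJ-DMZ-VLAN40",
}


def target_network(source_network, mapping=NETWORK_MAP):
    """Return the mapped target VM network, or None if there is no mapping."""
    return mapping.get(source_network)


def unmapped_networks(source_networks, mapping=NETWORK_MAP):
    """List the source VM networks that have no target mapping yet."""
    return [name for name in source_networks if name not in mapping]


print(target_network("NY-Production-VLAN10"))
print(unmapped_networks(["NY-Production-VLAN10", "NY-Test-VLAN50"]))
```

Any VM sitting on a network that shows up in the unmapped list is a candidate for coming up at the recovery site with a disconnected virtual NIC.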
Enable Replication for Virtual Machines
Phew! Nearly Ready. Despite the existence of the Azure Portal, replication is enabled on VMs not from it, but from the SCVMM. This is a bit of a “first” for Microsoft as previously SCVMM had no clue what a Hyper-V Replica was. If you look at existing VMs that were residing on a Hyper-V cluster you will find the option to enable replication is unavailable.
This is because the existing VMs are not associated with any cloud; they were initially created at the Microsoft “virtualization layer”, not the Microsoft “cloud layer”. This is despite the fact that the “cloud”, when it was defined, was pointed at an existing cluster of Hyper-V Hosts. To fix this problem you need to right-click each VM, open its properties, and set the “cloud” entry under the “General” tab:
This has the effect of moving the VM(s) into the scope of the cloud, and thus enabling the “Enable Replication” option.
One tricky problem I had was the existing Windows Hyper-V Replica settings I had configured before touching the Azure HRM. As you might recall from my previous post there’s quite a significant clean-up process required if you disable Hyper-V Replica on the properties of a VM. Stale and orphaned objects in SCVMM, Hyper-V Manager and Failover Manager need to be cleaned out to return the VM to its state before replication was enabled. I was hoping the all-new integration with SCVMM would make this simpler. But it seems despite repeated attempts to clean out my previous administration, SCVMM was convinced Hyper-V Replica was enabled elsewhere. But it wasn’t.
Note: On a “clean” Hyper-V machine that had never been protected before this message did not appear. So it looks as if I hadn’t successfully cleaned up my previous experiments with Windows Hyper-V Replicas.
Undeterred, I acknowledged the warning and clicked yes. The job did complete, although SCVMM gives little information about the progress of the replication itself. I was going to say “surprisingly little”. But as time goes on, nothing about using SCVMM surprises me anymore. Things are becoming depressingly familiar. You could say my normally high expectations are being slowly eroded and ground down by Microsoft technology. In fairness to Microsoft you can right-click the VM, and check its replication status like so:
Once this first initial replication has completed, the replicated objects do appear in the target SCVMM environment, in my case – the New York VMs have appeared in the New Jersey cloud.
Additionally, the Azure HRM does show the total number of VMs protected in the “Dashboard” view, together with a list of VM names in “Protected Items, +Vault Name, +Virtual Machines” View
Using Azure Hyper-V Recovery Manager
Azure HRM really has very simple features currently – a basic test and failover of an individual VM on an ad hoc basis, and the ability to create what Microsoft calls a “Recovery Plan”. The basic functionality is little more than what you get with Windows Hyper-V Replicas on their own when controlled by Hyper-V Manager. In other words no orchestration or recovery plan to speak of. It’s really the second feature, “recovery plans”, that I think most Microsoft customers might be most interested in. It’s interesting that most of the official online documentation totally ignores this basic functionality, I guess because it’s pretty much of no use in most production environments.
You can access the basic functionality of the Azure HRM by selecting the “Protected Items” menu, then selecting the protected cloud (in my case CorpHQ-NewYork-Cloud) that then lists the protected VMs in that cloud. At the bottom of the Azure portal there are buttons that allow you to carry out a “Failover” and “Test Failover” like so:
Triggering a test failover requires selecting the VM, and clicking the button. The administrator will then be asked to configure what sort of network functionality will be offered. I found this a bit odd given the “mapping process” that was undertaken earlier. The Administrator is offered three choices – none, use existing or create automatically.
Alternatively, the “Create Automatically” option allows you to select the Logical Networks defined, and have Azure instruct SCVMM to auto-generate “VM Networks” for you, and delete them when the test has finished.
I decided to opt for the “Create Automatically” option to see how that worked. Sadly, it fell at the first hurdle, as it seemed to fail to create the necessary network components for the VM. I did get a VM to power on at the target site called New Jersey. But the virtual network card was not attached at power on despite the fact that Azure HRM status indicated everything had been successful.
So Azure HRM in the job status said the networking had been created successfully:
But looking at the “Environmental Details” that was clearly not the case:
Sure enough if I looked at the properties of the recovered VM, its virtual NIC was not connected to any network at all.
I decided to click “Complete Test”, which issues a clean-up instruction – and allows the administrator to make a record of the results.
On a positive note, I found this “clean up” process did work and it is a massive improvement on using Windows Hyper-V Manager on its own – which creates orphaned objects all over the place that have to be manually cleaned out afterwards.
Given this initially poor experience with “automatic” testing, I decided to give the manual process a go, where you manually select the “VM Network” you want the VM to be connected to. Perhaps I’m misunderstanding the feature – because this didn’t work either. The VM came up configured for the “VM Network”, but not connected and not set for a VLAN either. Maybe Azure HRM is expecting Windows Hyper-V 2012 R2, or assumes the administrator will intercede and complete the networking component. As ever with Microsoft technologies it is so hard to work out sometimes – is this broken or by design?
Having poked at the manual methods of doing failovers, I thought I would move on to using the recovery plan feature. My theory was if that didn’t work out for me – I would try upgrading my environment into the R2 release, and see whether that improved matters. To tell you the truth I was a bit scared about doing that upgrade, I wanted to try and get the most out of this build before embarking on that process.
Note: I found my networking was as broken with recovery plans as it was with manual failovers. It was difficult to really see what the problem was – although I was concerned that mapping functionality wasn’t really working as advertised. I decided that I would have to attempt an R2 upgrade to see if that was any better an experience.
The big brother of manual failover/failback is a recovery plan. Given my woes with getting the networking piece in place I decided to create a simple plan that contained just one VM to see if the networking side of things was any better with the “network mapping” in place.
Recovery Plans are made under the “Recovery Plans” menu (there’s no surprise there!). You give the plan a name and then specify the source and target SCVMM servers. Incidentally, the interface does default to making the source/target the same SCVMM – so it doesn’t look like the Azure HRM picks up on the “Protected” and “Recovery” cloud association made earlier.
Azure HRM Recovery Plans do support “groups”, which allow you to gather VMs together. In my case I created a simple recovery plan that included just one VM, to see if I could resolve this networking issue.
Sadly, I had the same problem again with networking. I decided to persist with the configuration as it was so I could examine the other features available within a recovery plan, before embarking on an upgrade to Windows Hyper-V 2012 R2.
Azure HRM recovery plans support “groups”, and these groups allow you to gather like-minded VMs together. This is pretty much typical of most virtual DR automation technologies – and it’s there to allow you to bring up VMs in the right order for their service dependencies. Azure HRM doesn’t have a specific “VM Dependencies” feature like VMware’s Site Recovery Manager (SRM), and Azure HRM’s groups are more akin to VMware SRM’s “Start-up Priorities”, of which there are five levels. There are some other features that Azure HRM groups lack, such as per-VM settings that allow for a start-up delay controlled by time or by the availability of a service.
I must say there was considerable “jive” between the various layers of the cake when it came to enabling replication for my multi-tier application. I would enable replication in SCVMM and the status would indicate the job had completed successfully.
But within a couple of minutes, the setting to enable replication would become available again, with “Disable Replication” disabled. There would be no information why replication had been untimely stopped.
Other tools involved in enabling and monitoring replication would indicate that replication was not enabled at all. Snooping around with the Failover Manager and Hyper-V Manager showed the “shadow” VMs generated when replication is enabled were not present, and the status on my VMs indicated replication was not enabled.
As for Azure HRM, which monitors (but does not drive) replication, it showed that some, but not all, of my VMs were being replicated.
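When several consoles disagree like this, it helps to write down what each one claims and look for the VMs where the stories differ. The sketch below does that by hand: none of the tools expose their status in this form as far as I know, and the VM names and values are invented for illustration.

```python
def conflicting_vms(reports):
    """Find VMs whose replication status differs between tools.

    reports maps a tool name to {vm_name: replication_enabled}. A VM that a
    tool does not list at all is recorded as None for that tool.
    """
    all_vms = set().union(*(status.keys() for status in reports.values()))
    conflicts = {}
    for vm in sorted(all_vms):
        views = {tool: status.get(vm) for tool, status in reports.items()}
        if len(set(views.values())) > 1:
            conflicts[vm] = views
    return conflicts


# Hand-collected status from each console (invented example values).
reports = {
    "SCVMM": {"web01": True, "app01": True, "db01": True},
    "Hyper-V Manager": {"web01": True, "app01": False, "db01": False},
    "Azure HRM": {"web01": True, "app01": False},
}
for vm, views in conflicting_vms(reports).items():
    print(vm, views)
```

It doesn’t tell you which tool is right, but it does at least narrow down which VMs to go poking at.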
The trouble with these multiple pains of glass is none of them really told me why or what the problem was. And that got me thinking.
I pretty much take the piss out of the concept of “single pains of glass” on a daily basis, as being the hollowed out vacuous marketing term it has become. BUT, if you are working with a product with multiple pains of glass, then doubt does creep in. Out of all the management UIs you have at your disposal – which one is the font of all truth that can be trusted above the others when they jive with each other? What dangers exist when you go round the back of one pain of glass, to do your administration with another? And if you do have a problem which pain of glass do you use to try to fix the problem?
Looking at the Windows Hyper-V Replica role on the cluster indicated that the role was running at both locations without a problem. I decided to do the old Windows Admin trick of restarting those roles to see if that would fix my issue. Sadly, that had no effect. In the end I decided to enable replication where it “lives” using the Hyper-V Manager. This seemed to flag up an issue with the certificates – as if the Windows Hyper-V Replica broker had lost its association with the certificates loaded up into Azure HRM.
This was despite double-checking the Windows Hyper-V Replica broker to see if there were any problems. Both the primary and target Hyper-V Replica Brokers were correctly configured for port 8084 and to use the certificate.
I decided to see if re-enabling the replication would fix my problems; sure enough, that job failed just like the first time.
The last time that happened I rebooted the SCVMM host; this time I tried to be less intrusive and just restarted the SCVMM service instead. This had no effect.
So Reader, I rebooted it… Ta-dah! Success!!!
After a couple of hours troubleshooting I was able to create a recovery plan with multiple groups like so:
You need to be a little careful about the order in which you create these groups. Although VMs can be moved between groups – that involves a lot of clicking – and there’s no method to move one group ahead of another – say, making group 3 start before group 1. Additionally, although a VM can be moved from one group to another, all the VMs within a group start simultaneously – so there’s no boot order within the group – or boot delays on a per-VM basis.
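The behaviour described above can be sketched as a toy model in Python. This is only an illustration of how the groups appear to behave from the outside (append-only group order, VMs movable between groups, everything in a group started as one wave); the class, the method names and the DB/APP VM names are hypothetical, and none of this is Azure HRM's actual implementation or API.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryPlan:
    """Toy model: groups start in creation order, VMs in a group start together."""
    groups: list = field(default_factory=list)  # each entry: (group name, [vm names])

    def add_group(self, name, vms):
        # Groups can only be appended; the model offers no way to reorder them,
        # mirroring the limitation described in the text.
        self.groups.append((name, list(vms)))

    def move_vm(self, vm, target_group):
        # Moving a VM between groups is allowed. Check the target exists first
        # so a typo in the group name cannot drop the VM from the plan.
        targets = [vms for name, vms in self.groups if name == target_group]
        if not targets:
            raise KeyError(target_group)
        for _, vms in self.groups:
            if vm in vms:
                vms.remove(vm)
        targets[0].append(vm)

    def start_order(self):
        # One "wave" per non-empty group; no ordering or delay inside a wave.
        return [sorted(vms) for _, vms in self.groups if vms]


plan = RecoveryPlan()
plan.add_group("Group 1", ["DB01", "DB02"])    # hypothetical database tier
plan.add_group("Group 2", ["APP01"])           # hypothetical application tier
plan.add_group("Group 3", ["WEB01", "WEB02"])  # web tier
waves = plan.start_order()
print(waves)
```

The point of the sketch is that the only ordering lever is the group a VM sits in, so it pays to create the groups in dependency order (databases first, web servers last) from the start.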
There’s an important implication to consider here with recovery plans in Azure HRM. By default the relationship between one cloud and another is unidirectional, and the recovery plan is only useable for failover from the primary to the target site. If the administrator wishes to have SiteB protected by SiteA (SiteA<SiteB) they would need a total of four clouds for a two-site environment. That to me seems a bit clunky, especially when you consider that Hyper-V Replicas natively support bidirectional replication – yet in SCVMM you need multiple clouds just to make the relationship work.
Adding a Manual Action:
Manual actions can be added to the recovery plan, between groups of VMs. However, there’s no method of pausing a recovery plan between one VM and another. Ostensibly a manual action allows the plan to be paused for some human oriented task to be completed that cannot be automated.
When the recovery plan is run, the plan stops and leaves you with a “Complete Manual Action” button that must be clicked before the next part of the plan can start.
Scripts are held within the “VMM Library” of the respective SCVMM enrolled into Azure HRM. These are then added into the recovery plan. Currently, Azure HRM only supports PowerShell .ps1 script files, as there is a limitation on the script types supported by the VMM Library – it only supports .ps1 and .sql files. This is in marked contrast to VMware Site Recovery Manager, which supports ANY scripting engine so long as it is installed on the SRM Server and is specified in the path to the script. With this said, I imagine it is probably just as easy to call other scripting engines from within a PowerShell script.
PowerShell may have never been run on the SCVMM server, and so the administrator does have to adjust the “Execution Policy” that allows scripts (either trusted or untrusted/signed or unsigned) to run. Typically, people cheat and just use the lowest form of security available, rather than have the hassle of using certificates to sign off their scripts.
Then a script can be created using the .ps1 extension, and uploaded to the Library. In my case I made a simple script that pinged the database VMs, and saved the results of that ping to a text file:
ping 10.17.107.61 > c:\dbup.txt
ping 10.17.107.62 >> c:\dbup.txt
ping 10.17.107.63 >> c:\dbup.txt
Note: You’ve got to admit I rock at scripting. 🙂
This script can then be uploaded into the VMM Library using the “Import Physical Resource” button. You merely browse for the script (by clicking the “Add Resource” button) and then browse for the path of the VMM Library:
Once the import process has finished you can then see the PowerShell script in the VMM Library:
In the Recovery Plan you then specify the “relative” path to the script file. So as my simple “pingtest.ps1” was in the “root” of the VMM Library, I just needed to specify the filename. Had there been a scripts subdirectory, I would have had to type /scripts/pingtest.ps1.
Exporting Job Results:
Azure HRM does support the export of the results of a recovery plan to an Excel file. It doesn’t support the export of the recovery plan steps themselves – just the results, and the only format supported is Excel. This is somewhat limiting, as I’m used to being able to export both the recovery plan and its results into a Word, Excel, HTML or CSV format. But at least the results of a recovery plan can be exported, which is vital in being able to prove to auditors that successive testing of the plan has taken place. The issue I see with this is that all my VMs were labelled as “completed” despite the fact none of them connected to the network. So you would need to be a bit cautious with this: just because a job is marked as “completed” doesn’t mean it was a “successful” recovery.
Planned and Unplanned Failovers
By now I was beginning to run out of stuff to play with without actually trying a failover for real. When you select a recovery plan, and choose to failover – you get the option to indicate if it is a planned or unplanned event.
There are no real options here, except to confirm that you accept your direction of travel from the Primary to the Target location – in my case New York to New Jersey.
Once the failover has completed – the administrator is given the option to “commit” the changes to the system. The administrator needs to be a bit careful as this rolls up all the “snapshots” or “checkpoints” accrued with the VM replicas at the Target location, in my case New Jersey.
Once this “commit” process has completed, a second stage of “reversing replication” becomes available:
This reverse replication does not change the “personality” of the clouds themselves – the two clouds maintain their status of protected and recovery, and the recovery plan is available for both failover and failback processes. When you come to run the recovery plan to carry out the failback, you will see the direction will take the VMs back to their original location – in my case from New Jersey, back to New York:
Windows Hyper-V 2012 R2 Experiences:
Like a fool I did attempt an upgrade from Windows Hyper-V 2012 R1 to Windows Hyper-V 2012 R2 [More about this in a subsequent post!]. It ended up an utter mess. So I decided to totally level the environment and start with R2 from scratch. There was one rather interesting outcome from this. I didn’t bother to “clean-up” the previous “vault” with all those Azure HRM settings. Of course, now the original environment that was wedded to that vault doesn’t exist. So I thought I would try cleaning that out of Azure HRM. No such luck.
I was able to delete the recovery plan, and remove the protected VMs. But Azure HRM won’t allow me to remove the previously registered SCVMMs, and as a consequence won’t allow me to remove the “protected” items – the references to the two clouds.
As a consequence I’m unable to delete the original vault. As a workaround I just created a new vault called “Corp-Inc”, and chose to ignore the old one.
What I think is interesting is that my destruction of my original configuration is not unlike a disaster where everything gets wiped off the face of the earth. However, Azure HRM persisted in thinking everything was still there. I’m sad to say this has been a repeated experience using Microsoft technologies. Stuff that shouldn’t be there is still there, and you can’t remove it!
One good experience came around the whole Hyper-V Replica Broker requirement. I found in my second attempt that merely pairing the Recovery Cloud to the Protected Cloud resulted in two Hyper-V Replica Brokers being created and defined on my respective Failover Clusters. I think this “out of the box” experience was a good one. With one caveat – I had remembered to give appropriate AD rights to allow the clustering service to create computer accounts. Without that, the roles would not have started….
At last I was able to get to the “network mappings” part of the workflow. I have two VLANs, 1060 and 1390, and I planned to map New York’s 1060 to New Jersey’s 1390. I’m pleased to say that Windows Hyper-V 2012 R2’s new “VLAN” functionality makes this process make more sense than previously…
In case you don’t know (or haven’t read my previous blogposts), the “Logical Switch” has new functionality which I personally think makes networking easier in R2. I will remind you that VMware has been doing it this way since 2003. Just sayin’.
When I tried the mapping process I got an odd message in the Azure HRM window. It’s one that really doesn’t make sense to me:
“One of more the subnets in the primary VM network are not present in the recovery network. For the missing subnets, the replica virtual machine will be attached to the first subnet in the recovery network.”
I really have no clue what Microsoft means by this. Firstly, it’s totally logical that subnets in the Primary Site wouldn’t be present in the Recovery Site. After all they are two totally DIFFERENT locations that wouldn’t necessarily share the same network, unless of course they were stretched VLANs. Normally, this mapping process (along with a re-IP) is necessary because the VMs are being recovered to a location where they don’t normally reside. The second part of the statement is even more confusing – surely the whole point of this mapping process is to allow the administrator to ensure the VMs come up on the desired network, not any old one or the first one it finds? This warning message came up whatever I selected in the interface. I decided to ignore this message and just see what happened. Maybe it’s one of those Microsoft warnings – which is actually benign, but scares the bejesus out of people when they read them?
Despite the warning – the job was allowed to start.
It also completed, although it did “skip” a part that was called “Network Attach”. So I wasn’t really sure whether the job was 100% successful or only partially successful.
So I tried un-mapping and mapping again. I still got the same “warning”, but at least the job appeared to give a better outcome.
It’s actually quite difficult to see if the mapping process is working properly. The “shadow” objects created in the recovery cloud aren’t editable. However, there is a workaround. It’s possible to look at the network settings on these “shadow” objects in Hyper-V Manager. So here on “WEB01” I could see its network had been mapped away from VLAN1060 to VLAN1390, which was my intention.
Enabling Replication & Creating Recovery Plans
It was now time to enable protection on my VMs and also create a recovery plan. I notice in SCVMM R2 the UI for enabling protection is different than in the SCVMM R1 release. There’s a right-click option on the VM called “Manage Protection” that in turn allows you to enable it:
Note: The pull-down list here is a bit curious. The frequency is set within the Azure HRM and cannot be controlled from here. So I guess you could say that Azure is enforcing a policy. But one that’s a bit inflexible – all VMs contained in the cloud are replicated at the same frequency, regardless of whether that’s entirely necessary. That could see you setting up multiple clouds to allow for different frequency rates. I dunno. Maybe this is a bug? It’s just a preview after all…
So the good news – drumroll. I was able to get my recovery plan to successfully bring up the network on the VM at the recovery cloud. Woohoo. At last! It appears as if these “mappings” are only used during the Planned or Unplanned Failover. You are not asked for any network settings when you run these types (Planned/Unplanned) of the Recovery Plan. Once the network mapping is in place it cannot be selected during a test of a recovery plan.
Despite being on the latest and “greatest” R2 release, “management jive” is still a problem. For instance my VMs are shut down at the production location called New York. But the SCVMM in New York still thinks they are powered on – Hyper-V Manager however, is right in telling me they have been powered off… Sigh…
It might seem a trivial GUI refresh issue – and try as I might I have struggled to get SCVMM to update itself. But I feel this is important – as when you are bouncing VMs backwards and forwards in testing it is SO easy to get lost as to where the VM is, and which system is currently running the VM. In the end to keep track of where my VM was and what it was doing – I used Hyper-V Manager.
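One way to keep your bearings when the consoles disagree is simply to write down what each one says and compare. The Python sketch below does that comparison; it assumes you have noted the states by hand (it does not query SCVMM, Hyper-V Manager or Azure HRM), and the function name, state strings and sample data are hypothetical.

```python
def find_disagreements(reports):
    """Return the VMs whose state differs between tools.

    reports maps a tool name to {vm name: state}. A VM that a tool does not
    list at all is treated as "unknown" for that tool, which also counts as
    a disagreement.
    """
    all_vms = set()
    for states in reports.values():
        all_vms.update(states)

    conflicts = {}
    for vm in sorted(all_vms):
        seen = {tool: states.get(vm, "unknown") for tool, states in reports.items()}
        if len(set(seen.values())) > 1:
            conflicts[vm] = seen
    return conflicts


# Hand-recorded observations (illustrative values only).
reports = {
    "SCVMM": {"WEB01": "running", "DB01": "off"},
    "Hyper-V Manager": {"WEB01": "off", "DB01": "off"},
}
conflicts = find_disagreements(reports)
for vm, seen in conflicts.items():
    print(vm, seen)
```

It won't tell you which console is right, but it does make the disagreement explicit before you act on stale information.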
There are a number of conclusions and tips that I can draw from my experiences with using this preview of Azure Hyper-V Recovery Manager. Firstly, if replication appears broken or Azure HRM isn’t working, your old friend rebooting does appear to fix problems. It seems like the weakest link is the relationship between SCVMM and Azure HRM.
This is Microsoft’s first real foray into the world of virtual disaster automation and I’ve got one big recommendation for them. Guys, you will need a LOT more features than this to attract customers. Even a fully featured super rich product like VMware SRM has to roll the rock up hill. Here’s why. Customers are forever comparing the price of products against using scripting tools to automate these sorts of steps. Of course comparing so-called “free” against purchasable software or subscription services always puts the vendors on the back foot, as customers are forever forgetting that nothing is for free. Scripting isn’t free – it comes with costs associated with development, maintenance and support.
The other issue Microsoft is up against is the vast array of tools that can be installed to Windows from third party vendors. These are often availability tools that have been stretched out across sites. In my experience it’s precisely in these geo-scenarios that you discover their limitations. Then there is our new friend the “stretched cluster” or geo-clusters. Customers often forget the not inconsiderable hardware and network requirements to make this work. If Microsoft wants to be successful with their virtual disaster automation project, then they need a lot more features to make this attractive to customers. Don’t imagine you can simply compete on price. The market isn’t shaped that way. People won’t spend the money if the features aren’t there. It’s hard enough to convince people to spend on BC/DR at the best of times.
I could at this stage beat Microsoft up by comparing VMware’s SRM to Azure HRM. I don’t think that’s terrifically helpful. I don’t see the products as competing. Azure HRM offers DR automation to Windows Hyper-V, and VMware SRM offers DR automation to vSphere. So you have to have already chosen the platform before you end up selecting Azure HRM or VMware SRM. With that said I think there are a number of features Microsoft needs to triage as soon as possible. These are:
Azure HRM doesn’t appear to be able to re-IP VMs. Despite the use of stretched VLANs, NATing and changing routing tables – most customers (I mean more than 50%!) to this day are re-IP-ing their VMs when they are failed over to the disaster recovery location. You can argue all you like about whether this is a good thing – this is the hard reality. A disaster automation tool that can’t re-IP is DOA in my book.
Now in fairness, although Azure HRM doesn’t have a re-IP process, Hyper-V Replica does. You will find it on the properties of the VM being replicated within Hyper-V Manager.
However, I found these settings didn’t get applied when I ran the recovery plan from Azure HRM. I found the original IP address was maintained. Perhaps that’s because the network was never successfully tested…
Fortunately, using R2 I was able to test this feature, and I did make it work. However, I discovered an important limitation – the dialog box above only holds the “failover” IP address, not a failback address as well. That means you could failover from a 10.x.x.x address to a 192.x.y.z address at the recovery site, but if you did a failback, the IP would remain as 192.x.y.z. To fix this you would need to keep a record of all the original IP addresses, and then probably use some sort of PowerShell script to change the failover IP address back to the original IP prior to doing the failback process. That might sound reasonable until you compare it to something like VMware SRM, which can retain both the source and target IP addresses.
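The record-keeping half of that workaround can be sketched in a few lines of Python. This covers only the bookkeeping (which address each VM should be put back to before a failback) and leaves the actual change of the failover IP to whatever script or tool you use; the function name, VM names and the 192.168.0.x addresses are hypothetical.

```python
def failback_plan(original_ips, failover_ips):
    """Work out which address each VM should return to on failback.

    original_ips: {vm: production IP recorded before the first failover}
    failover_ips: {vm: IP the VM currently has at the recovery site}
    Returns (plan, missing), where missing lists VMs with no recorded
    original address; these need manual attention before failing back.
    """
    plan = {}
    missing = []
    for vm, recovery_ip in failover_ips.items():
        if vm in original_ips:
            plan[vm] = {"from": recovery_ip, "to": original_ips[vm]}
        else:
            missing.append(vm)
    return plan, sorted(missing)


# Recorded before failover (DB03 was deliberately left out to show the gap).
original_ips = {"DB01": "10.17.107.61", "DB02": "10.17.107.62"}
# What the VMs were given at the recovery site (illustrative addresses).
failover_ips = {
    "DB01": "192.168.0.61",
    "DB02": "192.168.0.62",
    "DB03": "192.168.0.63",
}
plan, missing = failback_plan(original_ips, failover_ips)
print(plan)
print("No original IP recorded for:", missing)
```

The `missing` list is the important part: any VM that was protected after the record was taken would otherwise come back to production still holding its recovery-site address.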
I decided to look at the SCVMM “IP Pools” that appear on the properties of a Logical Network. These worked for this scenario in a much better way than the Hyper-V Failover IP. Clearly, these need to be set up on the Logical Switch to begin with, and you must deploy VMs from a template in the VMM Library; when you do this, the option to use a “Static IP” for the VM is exposed.
Note: Incidentally, it does seem that once you’re using IP pools, using a static MAC address becomes mandatory. Fortunately, there is a built-in MAC Address Pool in the “Fabric” portion of SCVMM that takes care of this requirement.
I found that so long as there was an IP pool associated with the recovery VLAN, it was possible to bounce the VM to and from each site, with the IP Pool handling the re-IP configuration process.
I know VMware have defined this concept of “software-defined” everything, but hardware isn’t going to evaporate. Many customers have made significant investment in array-based replication (ABR). They like it. They want it. They like the acceleration, scalability and synchronised replication it offers. While for some customers hypervisor-based replication (HBR) is a good fit, for some customers it isn’t. Microsoft desperately needs a program like VMware’s “Storage Replication Adapter” (SRA), which is supported by a wide array of storage vendors in the market place. SRM is one of the few disaster automation tools that support both ABR and HBR. Some of the vendors in the space like to poo-poo ABR support, and ramble on about the politics of working with the storage teams, and getting firmware support in place. But here’s the real reason they bang on about their HBR. It’s because they don’t have ABR support, and would find it monumentally difficult to build the necessary relationships with the storage vendors to pull it off with any credibility. Only companies like VMware or Microsoft who have the industry influence and pulling power could. VMware has been doing ABR and has had a program of storage vendor support from day one. In fact, VMware SRM started its life as an ABR solution, and vSphere Replication was added in version 5.0.
Microsoft really needs to get a handle on what I call “management jive”, where you have 2-3 different management UIs that don’t report the same information. There’s insufficient feedback to the administrator, in SCVMM especially, to monitor their replication. Customers will find themselves bouncing between Hyper-V Manager and/or Azure HRM to find out what the status of their replication is.
And finally… a compliment… I’m actually quite impressed/intrigued that Microsoft has chosen to offer disaster automation as an Azure SaaS solution. It strikes me as an innovative way to deliver management via a SaaS application. What could be more “cloud” like? That should mean they could innovate new features (and god do they need to!) rapidly, and be able to seamlessly offer a public cloud as a possible target for recovery. I’m sure that must be their ultimate plan – to offer disaster recovery TO the cloud. I’m going to give them the benefit of the doubt and assume they have chosen this approach as a faster route to market than adding this automation layer to SCVMM – rather than as a method of taking ownership of customers’ VMs, and locking them into Azure forever…