image01
A scene from Blade Runner where Harrison Ford has to work out who are the real humans, and who are the replicants. Sometimes choosing the right software can feel like that. You’re surrounded by replicants who look like the real deal, but really they are clapped-out robots. Some of the replicas are very seductive on the outside, but deadly on the inside. Buyer beware!

As part of my travails in Hyper-V Reality series I decided to take a look at Windows Hyper-V Replicas. The equivalent feature in VMware is vSphere Replication. This is a replication process that is software based, and is agnostic when it comes to the underlying storage. There are a couple of ways that Windows Hyper-V Replicas can be set up – with or without Failover Clustering; secured or non-secured. My initial goal was to look at the most “production” style configuration. In my configuration the setup will be between two clusters and secured. That’s because if you are going to replicate, you are probably smart enough to know that, theoretically, the network carrying the replication could be non-secure. There were times when I was writing this post that I wished I hadn’t given myself this constraint. There is plenty of online help that talks about setting up Hyper-V Replicas without clustering and non-secured using the Hyper-V Manager – and little documentation about the certificate requirements. I do make a rod for my own back sometimes!

As ever my “documentary” style that records every horrible moment of my time with Microsoft Windows Hyper-V can get a bit long winded. So once again here’s the “Edited Highlights”

Edited Highlights:

  • In both R1 and R2 of Hyper-V Replica Broker I found I had to handle its requirement for a computer account object in Active Directory. Simply enabling the role does NOT handle this for you. You’ll need to consult this article first before enabling the role:
    http://blogs.technet.com/b/askpfeplat/archive/2012/12/10/why-adding-hyper-v-replica-connection-broker-fails-in-failover-cluster-manager.aspx
  • Hyper-V Replica, when used in a Failover Cluster environment, uses a “Replica Broker”. Communication can be non-secured (HTTP) or secured (HTTPS). The certificate enrolment process is fine – if you are the kind of person who enjoys going to the dentist for a root canal. I imagine most folks won’t touch this with a barge pole, however large – and will choose instead to use a site-to-site VPN configuration if the replication traffic is going over Internet pipes rather than dedicated leased lines…
  • Hyper-V Replica Broker is NOT integrated in any shape or form with SCVMM. This means you are forced to use management tools normally associated with an SMB style deployment – such as Hyper-V Manager or Failover Cluster Manager. You will see situations where SCVMM holds references to objects that don’t exist anymore when you turn off replication. If you are an SCVMM user you will need to toggle between Hyper-V Manager and SCVMM.
  • Management Jive: Choose your weapon and stick to it – for day-to-day management I wouldn’t recommend oscillating between Hyper-V Manager and Failover Cluster Manager – they tend to jive with each other, with the Right-Manager not knowing what the Left-Manager is up to. If you prefer pretty pictures and pretty dialog boxes, Failover Cluster Manager is more pleasing on the eye when managing this process than Hyper-V Manager.
  • Hyper-V Replica Log (HRL) – this is basically a log that holds a record of the changes occurring in the virtual disks. If the network is down the HRL can become full, triggering a full resynchronisation during out-of-office hours.
  • Hyper-V Replicas are not supported within a cluster – only between clusters, or systems that are not clustered. This might prove challenging to an SMB that has a single-site, single-cluster configuration – and yet wants to use replication internally as a rapid recovery method rather than using backups only
  • Incorrect Restore: Off-site backups can cause corruption in the “initial copy” process. A VM01 being protected by a Hyper-V Replica must be restored from a catalog containing VM01. You can’t use one backup (VM02) to carry out an initial copy for another (VM01).  This is true of vSphere Replication as well, and is intended to protect an administrator from overwriting VMs by accident
  • There’s no bulk method of importing large numbers of VMs that have been copied to removable media. Your only option is to use PowerShell.
  • Prepare yourself for a lot of per-VM settings – protecting a VM is a per-VM process, and modifying any settings is a per-VM process too
  • Hyper-V Replica does not use a special icon or label in Hyper-V Manager or Failover Cluster Manager to identify a replica VM. As such it is very easy to get a copy of the VM confused with the real VM.
  • Hyper-V does a re-IP process, but only for failover (not for failback). If you were using Windows Hyper-V Replica in conjunction with SCVMM you’d be better off using IP Pools.
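
On the point about replicas having no special icon: one hedged way to tell the copies from the real VMs is to ask Hyper-V itself. This is a minimal sketch, assuming the Hyper-V PowerShell module that ships with Windows Server 2012 and that it is run on a host holding the VMs.

```powershell
# List every VM on this host together with its replication role.
# ReplicationMode shows values such as None, Primary or Replica.
Get-VM |
    Sort-Object ReplicationMode, Name |
    Format-Table Name, State, ReplicationMode, ReplicationState, ReplicationHealth -AutoSize
```

Anything listed as Replica is a copy, whatever it looks like in the GUI.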

Enabling the Windows Hyper-V Replica Broker

Enabling replication for a cluster involves the setup of a “Windows Hyper-V Replica Broker” on the properties of the cluster itself. This broker acts as a “Client Access Point” – in other words two brokers from two different clusters communicate with each other. In turn they handle situations such as when a VM is moved from one Windows Hyper-V host to another. Without the broker a VM could be moved, and replication would then cease. The replication can be both inbound and outbound or bi-directional if you prefer that phrase. This enabling of the Windows Hyper-V Replication Broker is not done from SCVMM, but from the Failover Cluster Manager itself. You’ll find the option to configure this role on the +Role node of the cluster itself.

image02

You should find the option for the “Hyper-V Replica Broker” amongst the list of other roles such as DHCP Server and File Server.  I guess that’s showing Failover Clustering’s heritage as being primarily a service that attempts to deliver high availability to application services.

image03

In my case I’m using DHCP on the cluster to ease the configuration of IP addresses – but in production environments each Windows Hyper-V Replica Broker would need a NETBIOS name and static IP address. This means a computer account object is created in Active Directory, in the same OU where your Windows Hyper-V computer accounts exist.

image04
The Windows Hyper-V Replica Broker is one of those weird Microsoft structures. It has a name, computer account and IP address – and you can even assign a certificate to it. But it doesn’t exist – it’s not a physical or virtual machine. It’s more like a process that has the attributes of a conventional operating system.
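
For those who prefer a script to the wizard, the same role can, as far as I know, be built from the FailoverClusters PowerShell module. This is a sketch rather than a tested recipe: the name Gold-HRB01 and the address 10.17.107.43 are the ones this lab ends up with, and using the address as a static IP is my assumption (the lab itself used DHCP). Presumably it needs the same Active Directory rights as the wizard does.

```powershell
Import-Module FailoverClusters

# Name and address of the broker's client access point.
# Lab values; the static address is an assumption, swap in your own.
$BrokerName = "Gold-HRB01"
$BrokerIP   = "10.17.107.43"

# Create an empty clustered role with a network name and an IP address.
Add-ClusterServerRole -Name $BrokerName -StaticAddress $BrokerIP

# Add the broker resource to that role and make it depend on the network name.
Add-ClusterResource -Name "Virtual Machine Replication Broker" `
    -Type "Virtual Machine Replication Broker" -Group $BrokerName
Add-ClusterResourceDependency "Virtual Machine Replication Broker" $BrokerName

# Bring the role online.
Start-ClusterGroup $BrokerName
```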

Sadly, for me this process failed at the first hurdle. Unfortunately, the Failover Cluster Manager doesn’t really give any details why. It just said failed. No, **** Sherlock as my Dad is fond of saying.

image05

So it was time to do some Googling (I’m sorry I don’t “Bing”). Apparently, this failure can happen due to insufficient privileges on the OU where the broker is being added. That’s despite being logged on as a domain user. It appears as if the clustering service is responsible for adding in the role.  Sure enough some digging about in Event Viewer flagged up this as a culprit.

image06

The workaround is to “pre-stage” Active Directory with a valid computer account or give the cluster object computer account privileges to the OU. Weighing this requirement in the balance – it seemed to me that pre-staging the OU was a more secure approach than giving a service access to an entire OU. This involves creating a computer account (in my case called Gold-HRB01) and then giving the cluster computer account (in my case Gold) full-control.
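
The pre-staging step can also be sketched in PowerShell, assuming the ActiveDirectory module (RSAT) is installed. The OU path below is hypothetical, and this only creates the account; granting the cluster account (Gold) full control over it is still done on the object's Security tab.

```powershell
Import-Module ActiveDirectory

# Hypothetical OU path; use the OU that holds your Hyper-V computer accounts.
$OU = "OU=HyperV,DC=corp,DC=com"

# Pre-stage the broker's computer account. It is created disabled so that
# the cluster can claim and enable it when the role first comes online.
New-ADComputer -Name "Gold-HRB01" -Path $OU -Enabled $false

# Confirm the object exists and where it landed.
Get-ADComputer -Identity "Gold-HRB01" |
    Select-Object Name, Enabled, DistinguishedName
```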

Sadly, this method failed as well.

image07

So I undid that work – and tried the other method. That was quite a bit more convoluted. It involved locating the required OU and running the AD “Delegation Wizard” to grant rights to the OU. That required custom privileges, as the ability to add computer accounts to a specific OU isn’t a generic privilege or “common task”. In this case a computer account object was created by the Failover Clustering service – however, the role still didn’t come online….

At this point the resources available on the web started to dry up, so I was left to my own (de)vices. It struck me that I should be able to ping the gold-hrb01 by its IP address and hostname. I was able to ping 10.17.107.43, but the ping by hostname was unresponsive. A quick check of DNS indicated there was no host A-name record.  I had no idea whether or not the role should update DNS, so for the hell of it I added a record manually.  Unfortunately, that made no difference at all.
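
The same checks can be run from a PowerShell prompt. This is a sketch, assuming the DnsClient and DnsServer modules found on Windows Server 2012; the zone name corp.com is hypothetical, and the last command has to run on the DNS server (or be pointed at it with -ComputerName).

```powershell
# Does the broker's name resolve at all? Returns nothing if no record exists.
Resolve-DnsName -Name "gold-hrb01" -ErrorAction SilentlyContinue

# Can the address be reached? Same test as the ping by IP.
Test-Connection -ComputerName "10.17.107.43" -Count 2

# Add the missing A record by hand. "corp.com" is a hypothetical zone name.
Add-DnsServerResourceRecordA -ZoneName "corp.com" -Name "gold-hrb01" `
    -IPv4Address "10.17.107.43"
```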

[ASIDE: In my second cluster I didn’t bother doing this – and the role did start after some kicking. So I didn’t get the impression that name resolution to the NETBIOS name of the Windows Hyper-V Replica Broker was a requirement. However, I did notice A-name record was created automagically for me when I tried this on a second cluster…]

 image08

Somehow. Don’t ask me how. It worked. As if by magic. This seemed to be triggered by reboots and/or the moving of the role from one node in the cluster to another.

 image09

It was never really clear what actually made things sit up and dance. Like many Windows problems it just seemed to start working despite the fact I’d made no configuration changes whatsoever. This was the kind of thing I used to dread when I was a Microsoft Certified Trainer (MCT) – I’d spend days, weeks even, in the ’90s trying to get to the bottom of issues like this – so I could definitely know how/why something would work. Perhaps the “walking away and leaving it alone” approach could be a factor here. There’s an aspect of Failover Clustering where, once you have N number of failures, the system just gives up trying. Kind of like me sometimes. 🙂

image10

I decided to give this configuration a shot on my second cluster and see if it happened again. Sure enough it did. But hey, at least it consistently fails – and that in my book is a plus! Reproducible errors are at least open to SR requests. Of course, I had to remember to make sure this second cluster had the rights to create computer accounts in the OU. It did appear to be the case that a reboot (and not the move of the role) made the Windows Hyper-V Replica Broker come alive. Who’d a thunk it eh? A reboot fixes a problem in Windows. Times are a changin’.

As far as I could fathom – a computer account; a DNS A-name record; a move or a reboot – or a combination of all three can get this role started. Good Luck!

Generate Computer Certificates (Optional)

Once the Windows Hyper-V Replica Broker has started you can configure replication. This option is located on the right-click of the Broker called “Replica Setting”.

 image11

In a simple configuration the Kerberos protocol together with HTTP/80 can be used to authenticate replication. This is fine if your replication is happening within a site, and therefore the traffic is on a known and trusted network, or if your primary connection from SiteA to SiteB is protected by some other means, say a site-to-site VPN. If on the other hand you’re using the Internet, which is non-secure, Windows Hyper-V Replica Broker would be configured to use certificate-based authentication, and encrypt the data with HTTPS/443. The requirements for these certificates are quite specific, and if those attributes are not included then the certificate will be rejected.

This sort of stuff is the bane of my life to tell you the truth. I’ve been managing Enterprise and Stand-alone Root CA’s ever since Microsoft IIS 4.0. But despite the familiarity I still find the number of different tools, cipher suites and attributes bewildering. Sadly, I don’t think VMware has an unblemished record on this issue either. I think the whole industry needs to wake up and realise that certificate security needs to be much more automated than it is, and easier to use – how about software-defined certificates? 🙂 Incidentally, these certificate requirements are merely the tip of the iceberg. The blog post “Hyper-V Replica – Prerequisites for certificate based deployments” outlines in fine detail the exact requirements. For example it contains such statements as:

For a SAN certificate, set the Subject Alternative Name’s DNS Name to the primary server name (e.g.: primary1.contoso.com). If the primary server is part of a cluster, the Subject Alternative Name of the certificate should contain the FQDN of the HVR Broker (install this certificate on all the nodes of the cluster).

Any questions?

[Aside: This was an old gag of mine back when I was a Microsoft Certified Trainer, and I’d just read out some gobbledygook from a training manual or KB article. It always made the guys laugh…. But only once they realised I wasn’t remotely serious…]

The “See details” of this dialog box is quite important. Let’s open that bad boy up shall we?

 image12

Gulp. So not only does the Windows Hyper-V Replica Broker need a certificate with specific attributes, but so do ALL the nodes of the cluster. Additionally, all the nodes in the cluster need the Windows Hyper-V Replica Broker certificate installed. Once again I was rather relieved I only had a handful of Windows Hyper-V Hosts to handle.

One solution to this onerous piece of administration might be a “wildcard” certificate, where *.corp.com would protect all the nodes in the corp.com domain. A related option is a “SAN certificate”, where SAN stands for “Subject Alternative Name”. This is a special attribute within a certificate that lists additional host names, and it’s what allows a single certificate to be used on more than one host. I was unsure whether such a configuration would be possible or supported with Windows Hyper-V Replica Broker. After some strenuous googling, I was able to unearth this article:

Geek of All Trades: Certificates made easy: Part 2 – additionally this article is helpful as well – Requesting Hyper-V Replica Certificates from an Enterprise CA.

It outlines the somewhat involved (for the uninitiated) process of defining your own custom certificate template, and then enrolling that into Active Directory – for use by all your nodes in the cluster.

When confronted with this configuration I imagine many customers think – blow this for a game of soldiers. That’s precisely what I did. I chalked this up to experience – one of those Microsoft: “you could do this all with Microsoft technologies – but only a nutter would do so – in the real world there are much simpler methods of securing communications than going crazy with certificate enrolments.” It’s a bit like how you could use Windows Backup to back up your data – but only saddos on Microsoft Official Curriculum courses do that.

image13
It’s by no means a requirement that Windows Hyper-V servers have to be in a cluster, although you’d have to be a bit bonkers not to. That’s what “Allow replication from specified servers” enables – replication to non-clustered Windows Hyper-V servers.

When this is enabled for the first time you will receive a warning about the firewall needing to be reconfigured on every node in the cluster. For the sake of a quiet life I tend to turn off the Microsoft Firewall.
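
If you’d rather leave the firewall on, the inbound rule can be switched on across the cluster in one go. This is a sketch, assuming PowerShell remoting is enabled on the nodes (the default on Windows Server 2012) and that the built-in rule names are unchanged.

```powershell
Import-Module FailoverClusters

# Built-in inbound rule for Kerberos/HTTP replication.
# For certificate-based replication the rule is
# "Hyper-V Replica HTTPS Listener (TCP-In)" instead.
$RuleName = "Hyper-V Replica HTTP Listener (TCP-In)"

# Enable the rule on every node of the local cluster.
Get-ClusterNode | ForEach-Object {
    Invoke-Command -ComputerName $_.Name -ScriptBlock {
        Enable-NetFirewallRule -DisplayName $using:RuleName
    }
}
```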

image15
Incidentally, I found this warning never appears again, even if you don’t check the “Please don’t show me this again” box. Notice the direction – this is allowing “inbound traffic” so VMs can be replicated to this cluster. As I wanted to replicate from A to B, and B to A, I just repeated the configuration on both sides.

This enables the replica feature on all the hosts in the affected cluster – so there’s no need to “touch” the individual Windows Hyper-V servers. That’s why if you crank up Windows Hyper-V Manager (the other management tool alongside SCVMM that you can use to manage Windows Hyper-V) you’ll find the options for “Replication Configuration” are dimmed. It’s already got these settings from its membership of the Windows Hyper-V cluster…

Enabling Replication on a VM

Currently in Windows Hyper-V you cannot enable replication on a group of VMs, as you can do with vSphere Replication. That means if you have many VMs to enable you will have to consider the use of PowerShell for bulk administration tasks. The other important fact to mention is replication of a VM cannot be managed using SCVMM. I find this a bit shocking really. SCVMM is supposed to be Microsoft’s premier tool for managing virtualization – but a feature like Windows Hyper-V Replica isn’t managed there.

So if you’re a visually orientated person (and let’s face it most Windows Admins are) you will be limited to enabling replication with either the Hyper-V Manager or Failover Cluster Manager.

With Hyper-V Manager:

Right-click the target VM, and select Enable Replication

 image17

With Failover Cluster Manager:

Under “Roles” locate your VM in the flat-list, right-click and select Replication and Enable Replication

image18

With PowerShell:

 Enable-VMReplication -VMName "web01" -ReplicaServerName "goldhrb01" -ReplicaServerPort 80 -AuthenticationType Kerberos -RecoveryHistory 4 -VSSSnapshotFrequency 4

The graphical methods, such as they are, both present the same UI, and both ask the administrator to specify the name of the Windows Hyper-V Replica Broker that will be used as the DESTINATION. So in my case my “Gold” cluster is replicating to the “GoldR2” cluster.

Note: Incidentally, both of these clusters are actually using Windows Hyper-V Server 2012. I plan to upgrade this environment to R2 shortly. I want to examine what the upgrade process is like with Microsoft’s Virtualization software. Yes, I know I’m a glutton for punishment. 🙂

image19

The connection parameters merely validate how the Windows Hyper-V Replica Brokers communicate with each other, and with the respective Windows Hyper-V Hosts. The only setting of note is the option to “Compress the data that is transmitted across the network” that is enabled by default. Presumably the money spent on CPU cycles is well spent, compared to buying additional bandwidth or needlessly wasting bandwidth across the WAN.

image20
I must say it becomes pretty tiresome to have to endlessly input this into the Enable Replication wizard for each VM…
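
The repetition can be scripted away. A hedged sketch that loops the single-VM PowerShell command from earlier over a list of VMs: it assumes it is run on the host that owns the VMs, and web01 and web02 are simply the lab VMs in this post. I’ve left the VSS snapshot setting out to keep it short.

```powershell
# VMs to protect; web01 and web02 are the lab VMs used in this post.
$VMNames = "web01", "web02"

foreach ($Name in $VMNames) {
    # Same destination and authentication as the single-VM command, per VM.
    Enable-VMReplication -VMName $Name `
        -ReplicaServerName "goldhrb01" `
        -ReplicaServerPort 80 `
        -AuthenticationType Kerberos `
        -CompressionEnabled $true `
        -RecoveryHistory 4

    # Kick off the first full copy across the network.
    Start-VMInitialReplication -VMName $Name
}
```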

In the next part of the wizard you are able to exclude virtual disks that are not required. A common use of this in both Windows Hyper-V Replica and vSphere Replication is to exclude the guest operating system swap file, by moving it away from the boot disk onto its own virtual disk, to save on both bandwidth and disk space.

 image21

Windows Hyper-V Replica supports use of snapshots to generate recovery points, and VSS to create application-consistent snapshots. This was functionality that was added to vSphere Replication in the recent vSphere 5.5 release.

image22
Warning: Be very careful with enabling the option “Replicate incremental VSS every…” If you make this too frequent it can seriously degrade performance.

Finally, you can set up the method of the first sync or “initial replication”. This is a full copy of the VM, followed by an incremental replication of just the differences. As with vSphere Replication it’s possible to run this “initial sync” across the network, or pre-populate by exporting the target VM to removable media – and shipping that to the destination cluster, and importing into the system. Some people call this a “Sneaker Net” (a term that will cause confusion in Europe where people don’t wear sneakers!!!) or “Offline Data Transfer”. Care needs to be taken with the third option to use a restore of a backup. You can’t just take any old VM, restore it and let Windows Hyper-V Replica work out the differences – it literally must be a backup of the VM to be replicated. This option is probably only of interest to folks already shipping the backup offsite, where the same site is being used as the replication destination. For service providers it’s pretty useless as the customer is unlikely to be using the same backup vendor as the service provider.
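
The three choices map, as far as I can tell, onto switches of the Start-VMInitialReplication cmdlet. This is a sketch only: the $Method variable is my own illustration, and E:\Seed is a hypothetical path on the removable disk.

```powershell
$VMName = "web01"
$Method = "Network"   # illustrative switch: "Network", "Media" or "Backup"

switch ($Method) {
    # Send the full copy across the network straight away.
    "Network" { Start-VMInitialReplication -VMName $VMName }

    # Export the full copy to removable media for shipping.
    "Media"   { Start-VMInitialReplication -VMName $VMName -DestinationPath "E:\Seed" }

    # A backup of this same VM has already been restored at the destination.
    "Backup"  { Start-VMInitialReplication -VMName $VMName -UseBackup }
}
```

For the removable media route, the matching step at the destination is Import-VMInitialReplication with -VMName and -Path, which is also the only way to import VMs in bulk.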

image23

This initialization process can be monitored from Hyper-V Manager or Failover Manager. It’s perhaps easier to see in Hyper-V Manager because in Failover Manager it’s tucked away out of sight.
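
If neither GUI shows you enough, the same information is available from PowerShell. A small sketch, run on the host that owns the VM:

```powershell
# Current state and health of every replication relationship on this host.
Get-VMReplication | Format-Table -AutoSize

# Replication statistics for one VM.
Measure-VMReplication -VMName "web01" | Format-List *
```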

 image24

I wasn’t quick enough to capture the replication process in Failover Manager. So I disabled replication on my VM (success) and then enabled it again. Sadly, I got this error message:

 image25

I’m not quite sure why Windows Hyper-V is so unhappy. I think the disabling of the replication causes the situation, by not removing the VM that’s created at the destination. It leaves a stale reference behind. So in the screen grab below “Hyper-V-VM01” is the “shadow” (to borrow a VMware term) VM that was replicated from the Gold cluster to the GoldR2 cluster:

 image26

I decided to manually clean out this replicated VM, and try again. There’s an obvious mistake that could happen – deleting the wrong VM. Thankfully, the replica VM is stopped and powered off – whilst my genuine VM is powered on. Unfortunately, I found that I had no management of this VM from within SCVMM, as all the options were disabled/dimmed.

image27

Then I remembered!!! There’s no management of Windows Hyper-V Replicas from SCVMM. I discovered I could remove the stale reference in Failover Cluster, but that still didn’t resolve the issue – as Hyper-V Manager was convinced the replica VM still existed. It wasn’t until I’d removed the stale object from Hyper-V Manager that I was able to start replication again.

Phew. All that to capture a lousy graphic about a lousy task status – that will teach me not to do that again!

image28

IMPORTANT: I later discovered that this inability to remove the replica objects once replication has been disabled is “by design” but to what purpose remains unclear at this time. Not only do you need to remove the object in Failover Manager – but you also have to remove it from Hyper-V Manager as well. I’ve also witnessed management jive in SCVMM as the replica object is listed in the inventory but it is marked as “missing”. The only option is then to delete this orphaned object.

Fun-ctionality of Hyper-V Replica

Of course the whole point of having a Hyper-V Replica is to be able to test the failover, and carry out planned and unplanned failovers. For me this is the difference between a “soft test” and a “hard test”. See it a bit like a fire drill. Perhaps once a week on Monday at 10am the fire bell rings. No one panics. But once a quarter or twice a year the bell doesn’t ring at the usual time – and it keeps on ringing – folks start to look at each other sheepishly and then start to file out slowly. The quarterly or twice-yearly fire drills are the only real way to test that a recovery works. Everything else is just validating the software is functional. Of course even with the hard test there are sub-categories. A planned failover is a graceful process that’s likely to produce fewer unexpected problems than a genuine disaster. So the proof of the pudding is in the eating.

Test Failover:

A test failover can be triggered on a destination replica object. If you’re not seeing the option, the chances are you’re accidentally selecting the source VM in the production location. Of course this means you require login rights at the destination location to trigger the process. That means full access to either the Windows Hyper-V host or the Failover Cluster that contains the replica VM.

image30

In this screen grab we can see the VM called web01 is powered off, and has a history of “Recovery Points”.  Selecting the “Test Failover” presents a dialog from where you can select either the latest recovery point or select a previous one from the history.

image31

This creates another VM called “web01 – test” which is not powered on. This takes a matter of seconds to create, so I’m figuring this must merely be a pointer to one of the “Recovery Points”

As there are no network mappings in Windows Hyper-V Replica, you may need to remap the network of the VM. You will need to ensure that you don’t map this VM to the same network as the real VM otherwise IP conflicts and NETBIOS conflicts will occur. In my case I double checked the original VM’s settings, and patched it into a different VLAN. I was surprised to see the Windows Error Recovery appear. I’m not sure if that’s Hyper-V Replica associated, or coming from this VM being powered on/off in a dirty fashion too many times.

image3

When you’re done, you can go back to the original VM and select “Stop Test Failover” in the menus.
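
The test failover cycle can also be driven from PowerShell. This is a minimal sketch, assuming it is run on the host (or cluster node) that holds the replica of web01.

```powershell
# Create the test copy from the latest recovery point.
# The copy is left powered off, as in the GUI.
Start-VMFailover -VMName "web01" -AsTest

# Show the replica and its test copy side by side.
Get-VM -Name "web01*" | Format-Table Name, State, ReplicationMode -AutoSize

# When finished testing: stop the test and discard the test copy.
Stop-VMFailover -VMName "web01"
```

Start-VMFailover also accepts a specific recovery point through its -VMRecoverySnapshot parameter if the latest one isn’t the one you want to test.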

 image32

image34

This means despite the many “Recovery Points” accessible you can only test one at any time. Selecting “Stop Test Failover” does destroy this temporary test VM. As with my experience when I was first setting up and troubleshooting Hyper-V Replicas, this destroys the VM from Hyper-V Manager and the Failover Cluster – however, SCVMM is left with an orphaned object in the inventory.

 image38

A refresh of the cluster did not clean out this object – and right-clicking the VM and refreshing it does not help either. I guess the error associated with this is understandable – you cannot refresh an object that shouldn’t really exist.

image39

Your only option here is to manually delete this duff entry in SCVMM. I was curious to know what would happen if I forgot to do this. What I discovered is that the test VM is successfully created – but that it generates multiple stale/orphaned objects in SCVMM. Nice!

 image40

 Planned Failover:

A planned failover is a situation where the protected and recovery sites are both available, and you have the foresight to see the potential disaster coming. Perhaps there is going to be some planned power maintenance work in your locale, and for environmental reasons you don’t have a backup generator. This is something I’ve seen a lot of in metropolitan areas, as some local government authorities are uncomfortable with gallons and gallons of diesel fuel in a tanker. Generally, this sort of failover will need senior management approval.

In this case you right-click your VM, and choose “Failover” in the menus.

image41

You must power down the original VM in the production location otherwise you will see this error message.

image42

This has the result of “moving” the VM from one cluster to another. It does “leave behind” the reference to the original VM. It also leaves behind the history of “Recovery Points” which allows you to rollback the state of the “moved” VM. The process itself changes the menu options on the properties of the “moved” VM like so:

 image43

As ever with this sort of virtualization management tool from Microsoft there are no real alerts or alarms telling the Windows Admin what they should be doing next. If you check the “View Replication Health” though, it does indicate that despite both sites being available – no replication is currently in place.

image45

The reversal of the replication triggers the big “Enable Replication” wizard – most of the options are pre-populated so it becomes a “next-next-next-finish” affair. This then triggers a brand new “initial replica” process.
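
For comparison, the planned failover and the reversal can be sketched as a short PowerShell sequence. The two host names are hypothetical placeholders for a node in each cluster, and I’m assuming remote management of both from one console.

```powershell
$VMName      = "web01"
$PrimaryHost = "gold-node1"     # hypothetical node in the production cluster
$ReplicaHost = "goldr2-node1"   # hypothetical node in the recovery cluster

# Production side: power down the VM, then send across any last changes.
Stop-VM -Name $VMName -ComputerName $PrimaryHost
Start-VMFailover -VMName $VMName -ComputerName $PrimaryHost -Prepare -Confirm:$false

# Recovery side: fail over, reverse the direction, then power on.
Start-VMFailover -VMName $VMName -ComputerName $ReplicaHost -Confirm:$false
Set-VMReplication -VMName $VMName -ComputerName $ReplicaHost -Reverse
Start-VM -Name $VMName -ComputerName $ReplicaHost
```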

 image46

On the plus side I did find with the planned failover that I wasn’t left with duff objects left in SCVMM. That’s because when the replication is inverted the old VM left behind after the “move” is re-used as a target replica VM for failback. But as you can see – despite the fact that both VMs could be very similar the “Reverse Replication” process requires a full initialisation, and the previous history “Recovery Points” is lost as well. That’s a bit of a shame.

Unplanned Failover:

For this test I went into my production cluster and hard powered off all my Windows Hyper-V Hosts. This should be enough to simulate a hard failure. As this is all software based this should be sufficient. If I was dealing with array-based replication I could bring down the storage – and another method is to stop the networking between the two sites. One method of doing that is disabling a route between sites – that’s how I used to do it in my VMware Site Recovery Manager books. Everything was in a single rack with New York City connected to New Jersey via router. By shutting down the router I could emulate the loss of an entire site. It’s also handy for seeing how replication tools handle an extended network outage.

One of the disconcerting aspects of SCVMM is how poorly it reacts to a “server is dead” scenario. In my case I had powered off all the hosts in the cluster.  There’s nothing really in SCVMM to tell you that the hosts were in a dead state. I mean literally these servers are powered off. Even the VMs on the cluster are marked as “running”, although admittedly they do have a red X on the icon.

image49

If you’re in the “Fabric” view and select a cluster – you will see that the hosts “Need Attention” and that the Cluster Service is “Running”. The hosts certainly do need attention – they are dead, and the cluster service cannot be running, as all the hosts in the cluster are dead as well.

 image50

The Failover options are just the same within Hyper-V Manager and Failover Manager – so unlike some DR solutions, there’s nothing that allows you to indicate whether the process is a “planned” or “unplanned” scenario. It did seem to take longer to carry out the “Failover” process as it searched for a site/cluster that wasn’t available.

 image51

This dialog box took about 30 seconds per VM to process until the VM was powered on.
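
The unplanned version, sketched in PowerShell, is shorter because there is no production side left to talk to. Assume this is run on the surviving host that holds the replica.

```powershell
$VMName = "web01"

# Fail over to the latest recovery point and power the VM on.
Start-VMFailover -VMName $VMName -Confirm:$false
Start-VM -Name $VMName

# Only once you are happy with the recovered VM: commit the failover.
# This discards the other recovery points, so there is no rolling back.
Complete-VMFailover -VMName $VMName -Confirm:$false
```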

As a test I decided to create some important data after the successful failover of the VMs. I find the best place to save really important data on a server is on the desktop in notepad, especially if the server is a web-server connected to the Internet. 🙂

 Screen Shot 2013-11-18 at 13.53.23

Of course there’s little point to trying to “Reverse Replication” to the Windows Hyper-V Replica Broker that’s on a cluster that’s dead to the world – but I thought I would give it a shot anyway. If you do try it (after a bit of wait as it tries to speak to the broker that’s dead) you will see the option to attempt it there:

image52

I decided to run through this wizard – because it would let me – on just one of my VMs, with the hope that when the down cluster was brought back online then auto-magically it would trigger the replication. Sadly, that isn’t an option. Although you can progress through the wizard – the Hyper-V Replica Broker must be online.

image53

Think about this – it’s probably not a bad thing that you can’t automatically set up these relationships. Think about the bandwidth that could be chewed up. Generally, a failback process is more complicated than merely sticking the datacenter in reverse gear. What if you need to provision a brand new datacenter for example? Still with that said, if it shouldn’t be done, or can’t be done – why does Microsoft have a pointless wizard that allows me to progress when it knows it can’t contact the Hyper-V Replica Broker?

Looking at what happened when the down cluster was brought back online goes some way to explaining why an automatic process could cause problems. It wasn’t correctly executed. Due to the defaults in Windows Hyper-V Failover Clustering, all the VMs were powered back on again, even though they were out of date and didn’t include my important data. I didn’t get any IP conflicts, as these VMs were both configured as DHCP clients – they could both ping the default gateway. But I was surprised not to see “duplicate name exists on the network” error messages. Looking at the original VMs I could see that both were trying to resume replication but failing:

image54

So I now had two copies of the same VM on the same network. It was relatively easy to know which was which – because one had up-times of a couple of hours – whereas the others were in minutes. Plus the old VMs didn’t have my important data on their desktop!

Simply reverting the replication in this state doesn’t work – despite the fact that the Hyper-V Replica Broker at the original site is now available.

image55

I thought this could be because the VMs at the original site were powered on – I powered web01 and web02 down, and tried again. Sadly, even when the old VMs were powered off, although the “Reverse Replication” wizard was usable, it presented me with a “Kerberos Authentication” error that I hadn’t had when the failover was a planned one.

 image56

I decided to hit Google again – and pulled up this blogpost:

The Case of the Unexplained Windows Server 2012 Replica Kerberos Error: 0x8009030C 0x00002EFE that seemed to describe the same error. It’s a brave blogpost that documents the real time process of troubleshooting the error – including all the dead-end rabbit holes needed to eventually trace the issue. This post at the very end highlighted that hardening in Windows/Active Directory could be the cause of the problem. The trouble was, I know for a fact that no such hardening has taken place on my domain. I must admit I did wonder whether it was at all worthwhile trying to resolve this problem. Given that a reverse replication is a “full initialisation” of the VM, it would seem simpler merely to destroy the old VM and have done with it.

But I thought that, given the menus specifically have a “Reverse Replication” option, it should work. The funny thing was it was Friday evening when this was happening. I decided, as life is short, to leave this issue until Monday morning. On Monday I gave the “Reverse Replication” another shot – and lo and behold it worked! Kind of. It worked on the first VM, but not on the second. I went through the wizard two or three times on the second VM – and then it worked. As if by magic.

This is one of the less reassuring aspects of Microsoft technologies. Generally, when something is broken in VMware, it is properly busted. And you know that until you resolve the problem it will stay that way. Still to this day with Microsoft technology there’s that element of – if we turned the engine off, got out of the car, and then all got back in again – it would somehow work. In my case I left the car on the drive over the weekend, and on Monday miraculously it started… even though I had to turn over the engine a couple of times. Ever own a car where you were never 100% sure it would get you to work in the morning? The sense of relief when it does work is palpable.

Anyway, from here the process was relatively simple. The VMs recovered in the DR location were now up and running. I left them for a couple of hours to make sure they had sync’d back to the production locale. And then I carried out a planned failover, and reversed the replication again to get myself back where I started.

I was pleased to see this time the planned failover did work smoothly… Phew!

image57

Notes from the Small Print

All vendors are at this game – flagging up just the positive aspects of their technologies – never drawing folks’ attention to the downsides or merely absent functionality. I guess it’s understandable; after all, the role of the vendor is always to cast a positive light upon their software, and try to minimise any negatives. The customer’s role in this game of “cat and mouse” is to spot these casual omissions, and the glossing over of key facts – and decode them into something meaningful that they can use to determine if the solution passes muster. I don’t think many vendors are guilty of deliberate deception, but in a desperate bid to make a solution appear more than it is there’s always the temptation to overegg the message.

There are a number of gotchas to be aware of with Hyper-V Replicas, and I dare say these might apply to any “host-based replication” solution, as it’s sometimes referred to in the industry.

(1) Hyper-V Replica holds a “Hyper-V Replica Log” or HRL. This is a change log of block changes within a 5min period. This means if the same block is modified more than once within that 5min period only one replication event occurs – the most recent. It also means that if there is a significant outage on the WAN then eventually the HRL will get so full that it makes no sense to try and get the destination/source in sync. Instead a full copy will be initiated, of the same type that was used to set up replication for a VM in the first instance. This is triggered when the HRL reaches 50% of the size of the original VHD file. The bigger the VHD the longer it’s going to take for the HRL to reach the 50% threshold.
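To get a rough feel for that 50% threshold, here is a back-of-an-envelope sketch in Python. It is my own arithmetic, not anything that ships with Hyper-V: the function names are made up, and it assumes (a simplification) that unsent changes pile up in the HRL at a constant rate during an outage.

```python
def hrl_exceeds_threshold(hrl_bytes: float, vhd_bytes: float,
                          threshold: float = 0.5) -> bool:
    """Return True once the change log is at least `threshold` of the VHD size."""
    return hrl_bytes >= threshold * vhd_bytes


def hours_until_full_resync(vhd_gb: float, backlog_gb_per_hour: float,
                            threshold: float = 0.5) -> float:
    """Hours of outage before the backlog reaches the threshold.

    Assumption: the log grows at a constant rate, which real workloads
    rarely do. Treat the answer as a rough guide only.
    """
    if backlog_gb_per_hour <= 0:
        return float("inf")  # nothing is changing, so the log never fills
    return threshold * vhd_gb / backlog_gb_per_hour


if __name__ == "__main__":
    # Hypothetical VM: 200 GB VHD, 4 GB of unsent changes per hour.
    print(hours_until_full_resync(200, 4))  # 25.0
```

So in this made-up case a WAN outage of roughly a day would push a 200 GB VM into a full copy, whereas a VM with a much bigger VHD and the same churn would ride it out for longer.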

IMPORTANT: This 5min frequency was not configurable in Windows Hyper-V 2012. However, in Windows Hyper-V 2012 R2 it is an option that is configurable. Additionally, the maximum number of Recovery Points has been increased from 15 to 24 – allowing for one day’s worth of recovery points.

http://technet.microsoft.com/en-us/library/jj134172.aspx

This HRL is transferred every 5mins from the source to the destination – so one thing that will be tricky is working out how much “churn” there is in the data – and whether the volume of changes accrued in the HRL can be transmitted across the wire in a timely fashion without the DR site becoming out of sync with the Production site. All replication technologies have this issue baked into them – it’s called the Laws of Physics. And for the record it isn’t just latency and bandwidth that can be a factor – it’s the quality of that link as well. A lossy link with many dropped packets and retransmits can equally undermine replication traffic. The classic recommendation when trying to work out this churn is to use backups or snapshots as a measure of data accrual over a given time – divide the amount of changed data by the time between one backup window or snapshot and the next – and you have a crude average of how much data is being touched in that period.
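That crude average is easy enough to sketch out. The Python below is only an illustration of the arithmetic: the function names, the 70% link efficiency figure and the example numbers are all my own assumptions, and an average will hide the peaks that actually hurt you.

```python
def churn_per_cycle_bytes(changed_bytes: float, interval_seconds: float,
                          cycle_seconds: float = 300) -> float:
    """Average bytes changed per replication cycle, from one backup delta."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return changed_bytes / interval_seconds * cycle_seconds


def transfer_seconds(payload_bytes: float, link_mbit_per_s: float,
                     efficiency: float = 0.7) -> float:
    """Seconds to push a payload over the link.

    `efficiency` is an assumed fudge factor for protocol overhead,
    contention and retransmits on a less-than-perfect link.
    """
    usable_bits_per_s = link_mbit_per_s * 1_000_000 * efficiency
    if usable_bits_per_s <= 0:
        raise ValueError("link speed and efficiency must be positive")
    return payload_bytes * 8 / usable_bits_per_s


def keeps_up(changed_bytes: float, interval_seconds: float,
             link_mbit_per_s: float, cycle_seconds: float = 300,
             efficiency: float = 0.7) -> bool:
    """True if an average cycle's changes can be sent before the next cycle."""
    payload = churn_per_cycle_bytes(changed_bytes, interval_seconds, cycle_seconds)
    return transfer_seconds(payload, link_mbit_per_s, efficiency) <= cycle_seconds


if __name__ == "__main__":
    # Hypothetical: a nightly incremental backup shows 20 GB changed in 24 hours.
    day = 24 * 60 * 60
    print(keeps_up(20e9, day, link_mbit_per_s=10))  # a 10 Mbit/s link
    print(keeps_up(20e9, day, link_mbit_per_s=1))   # a 1 Mbit/s link
```

If the answer is only just “yes” on the average, I would assume the answer is “no” during the busy part of the day.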

Some might say that the HRL default values are a bit over ambitious. If the two sites do become out of sync (caused by the HRL being too far behind) – then a forcible sync only happens between the hours of 6.30pm-6.00am. The assumption is that folks aren’t about and there is less network contention. The Hyper-V Replica does have a monitoring process, but alerts aren’t triggered unless you have 20% or more missed events in a period of 12 hours.
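To make those two defaults a little more concrete, here is a small Python sketch. It simply models the numbers quoted above (a 6.30pm-6.00am window and a 20% threshold over 12 hours); the function names are mine and this is not how Hyper-V implements it internally.

```python
from datetime import time

# Default resync window as described above: 6.30pm through to 6.00am.
RESYNC_START = time(18, 30)
RESYNC_END = time(6, 0)


def in_resync_window(now: time, start: time = RESYNC_START,
                     end: time = RESYNC_END) -> bool:
    """True if `now` falls inside the window, which may wrap past midnight."""
    if start <= end:
        return start <= now < end
    # Window wraps midnight: late evening OR early morning both count.
    return now >= start or now < end


def should_alert(missed_cycles: int, expected_cycles: int,
                 threshold: float = 0.20) -> bool:
    """True if the share of missed replication cycles reaches the threshold."""
    if expected_cycles <= 0:
        return False
    return missed_cycles / expected_cycles >= threshold


if __name__ == "__main__":
    # 12 hours of 5min cycles is 144 cycles.
    print(in_resync_window(time(23, 0)))   # True
    print(in_resync_window(time(12, 0)))   # False
    print(should_alert(29, 144))           # True
    print(should_alert(28, 144))           # False
```

In other words, with a 5min cycle you could miss 28 replication events in 12 hours (well over two hours’ worth) without an alert being raised.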

(2) Topologies – Windows Hyper-V Replicas support many replication directions such as:

  • Unclustered
  • Between one cluster and another cluster
  • Between a cluster and unclustered system

What’s not widely advertised is you cannot do replication within a cluster. You’d be surprised how much replication takes place within a site, as a faster method of recovering the VM than conventional backups allow for. The bandwidth is (almost) free and unlimited – so there’s plenty of scope for synchronous snapshots (with array-based replication). With Windows Hyper-V Replicas if you only had one cluster, you would have to build an unclustered target or build another cluster just to get the replication working.

(3) Something I mentioned before, but it does bear repeating for those folks who are innate skimmers of long blogposts! One method of pre-populating the initial copy process is with your offsite backups – assuming you do that, and the offsite location is the same locale that you’re replicating to. It makes perfect sense to use a backup to quickly build a copy of the VM you want to replicate at the DR location – and then have the Windows Hyper-V Replica Broker synchronize the data. This does work – but you have to be very careful. The VM you get from the backup catalog must be the SAME VM you intend to replicate. You can’t grab the backup job for VM01 and use it to complete the initial copy for VM02. You will definitely get corruption if you don’t restore VM01 to synchronize with VM01. So if you intend to use this method be prepared to have to restore a lot of data – potentially containing information that is the same from one VM to another.

(4) There’s a “Sneaker Net” method of pre-populating the initial copy using removable media such as a hard drive in a caddy or a large USB memory stick. The important thing to remember here is that the removable media must be mounted at the Hyper-V host where the DESTINATION “shadow” VM has been located. The process involves right clicking the shadow VM and then using the “Import Initial Replica” option to trigger the import from the removable media, followed by a sync using the HRL. There’s no “bulk” way of doing this for many VMs from the GUI – each one must be done individually by hand. Enjoy!

(5) Prepare yourself for a lot of “per-VM” settings. Despite Microsoft’s rather good record for policy engines (the much-maligned “Group Policy Object” system for example), bulk settings and policies are rather absent from the management layer in Windows Hyper-V. Say for example a prolonged network outage has caused the replication process to fail. Each VM would need to have synchronization restarted, either immediately or staggered using a schedule.

image58

It is possible to trigger a resynchronization from PowerShell, using the Get-VMReplication cmdlet to find the affected VMs (and, as far as I can tell, piping them into Resume-VMReplication with its -Resynchronize switch). Increasingly, I feel Microsoft’s use of PowerShell is indicative of poor management UIs. This might be an extreme thing to say, but because of the deficiencies in the UI, it’s now more efficient to use PowerShell to manage the Windows Operating System. That’s somewhat ironic for a product that promises to “leverage your existing Windows management skills” by sending you back to a black background and white text environment. The joke I have going on twitter right now – is the next marketing buzz word to come out of Redmond is likely to be Microsoft Software-Defined Operating System – or MS-DOS for short. 🙂

(6) There’s no such thing as a free lunch – and although the software that carries out VM replication is free – the resources required to drive it are certainly not. So what sort of overhead does Hyper-V Replica bring to the table? Each replication job will cost around 50MB RAM per VM. The CPU penalty is so low it’s barely worth mentioning, but it’s in the sub 3% region. The HRL is processed every 5mins, and therefore incurs a storage penalty – how much depends very much on your rate of churn. Think of a very conservative number, and add some more for bad luck. Additionally, with entries being written to the HRL, Microsoft reckons the total amount of IOPS will be in the range of 1.5x the normal amount. The creation of recovery points incurs a storage penalty too. As it is you can only have a maximum of 24 recovery points anyway, which occur once an hour. So it might work out more efficient and more flexible just to have a backup than it is to have such a small window of recovery points.
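If you want to turn those rules of thumb into numbers for your own estate, a few lines of Python will do it. This is just my own sizing sketch using the figures quoted above (around 50MB RAM per VM and roughly 1.5x IOPS); the function name and the example estate are made up, and the storage cost of the HRL and recovery points is left out because it depends entirely on churn.

```python
def replica_overhead(vm_count: int, baseline_iops: float,
                     ram_mb_per_vm: float = 50,
                     iops_multiplier: float = 1.5) -> dict:
    """Rough extra RAM and IOPS for replicating `vm_count` VMs.

    The defaults are the ballpark figures quoted in the text, not
    measured values; swap in your own numbers if you have them.
    """
    total_iops = baseline_iops * iops_multiplier
    return {
        "extra_ram_mb": vm_count * ram_mb_per_vm,
        "total_iops": total_iops,
        "extra_iops": total_iops - baseline_iops,
    }


if __name__ == "__main__":
    # Hypothetical estate: 40 replicated VMs generating 2000 IOPS between them.
    print(replica_overhead(40, 2000))
```

For that made-up estate you would be budgeting about 2GB of extra RAM and another 1000 IOPS on the source storage, before any recovery points are taken into account.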

(7) On the plus side Hyper-V Replica does a fair job of monitoring the status of the replica, which is available on the properties of each VM that is being replicated.

image59

(8) Hyper-V Replica does have the ability to re-IP VMs natively for a situation where the VM is being recovered to a different site under a different IP scheme. It’s less clear how you change the VLAN configuration from one site to another – as there isn’t a “mapping” process.

 image60

Having looked at this in some detail with an evaluation of Azure Hyper-V Recovery Manager, I would suggest that using IP Pools at either site is probably the better way of managing this situation for the time being.

Finally, Windows Hyper-V Replica is more comparable to vSphere Replication than to, say, a product like VMware Site Recovery Manager. When Windows Hyper-V Replica or vSphere Replication is used on its own you get relatively modest automation. At best you will get a process that allows you to failover and failback VMs in a planned/unplanned mode – and the ability to control some of the aspects of the replication itself – such as frequency, snapshots, point-in-time rollbacks and retention times.

 image61

But that’s a long way short of a truly fully functional automation layer such as VMware Site Recovery Manager. Of course when vendors compare themselves it’s hardly ever apples to apples. They will compare their entry-level tech to an enterprise product – and scoff at the price – totally ignoring features they don’t support. And I guess I could be easily accused of that now. It’s really unfair to compare Microsoft Hyper-V Replica Broker to VMware Site Recovery Manager; a fairer comparison would be to vSphere Replication. It would be fair to say that Microsoft do have a feature at this “layer” that VMware does not. Hyper-V Replicas have a built-in method of re-IP-ing a VM, whereas vSphere Replication does not. To get that feature you would need VMware Site Recovery Manager.

VMware SRM is one of the few DR automation technologies that support BOTH host-based replication (HBR) and array-based replication (ABR). The HBR people completely neglect their lack of support for ABR, choosing to pooh-pooh a feature they don’t have. That’s a classic approach – when you don’t have a feature – dismiss it as irrelevant. And of course, it works in the other direction – they will flag a unique feature that they have – and bang on about it being the be-all and end-all. There they go comparing oranges to apples.

So in my next blog post I will be looking at the Preview of Azure Hyper-V Recovery Manager. This is an online subscription service that uses the Hyper-V Replica infrastructure, but adds a layer of automation to the process including recovery plans…