IMPORTANT:

One thing I learned after setting up my storage is that disks that wind up being used as Cluster Shared Volumes (CSVs) must be basic disks, not dynamic disks. Fortunately, I hadn’t set up my partition as a dynamic disk. But I do find it interesting that Microsoft’s own clustering system is incompatible with part of its own disk system. Not so “dynamic” after all, then?

image1

http://technet.microsoft.com/en-us/library/jj612869.aspx
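If you want to check for this from PowerShell before building the cluster, here’s a rough sketch. It assumes (and this is an assumption on my part, not something from the documentation above) that dynamic disks show up with Logical Disk Manager (LDM) partition types in Win32_DiskPartition, whereas basic disks don’t.

```powershell
# Hedged sketch: list any disks that look "dynamic" (LDM partitions).
# Assumes Win32_DiskPartition reports LDM partition types on dynamic disks.
Get-CimInstance -ClassName Win32_DiskPartition |
    Where-Object { $_.Type -like '*Logical Disk Manager*' } |
    Select-Object DiskIndex, Name, Type
```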

Edited Highlights:

  • Many of the problems with storage and Windows Failover Clustering just disappeared when I scrapped my “R2 Preview” install, wiped the hosts and re-installed with the GA release available to TechNet subscribers
  • UPDATE: The best way to set up the storage is to present it to all hosts, format it and assign a drive letter – then ensure the drive letter is the same on all hosts in the cluster – AND then create the cluster, and add the disks/volumes as CSVs (a rough PowerShell sketch of the storage-prep side of this follows the list). Since I wrote this post, I’ve had feedback saying that you should NOT do this… To be honest, I get different people telling me different things. As I state later in the blogpost – it’s never been 100% clear to me what state the storage should be in before building the cluster. Online? Online to all hosts? Partitioned? Formatted? Drive letters irrelevant?
  • VMs must have the “Availability” option set on them before they can be deployed to a cluster…
  • Ensure your VM path defaults point to C:\ClusterStorage\VolumeN to stop a situation where the VM’s files are stored locally…
  • You might like to create the cluster BEFORE you enable Windows Hyper-V – that way you can specify the paths during the enablement of the role.
  • Recommended: create the cluster with Failover Cluster Manager, not SCVMM. The manager gives better feedback and information
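For what it’s worth, here’s a rough PowerShell sketch of the storage-preparation half of that first storage bullet – bringing the shared disk online, partitioning it and formatting it with a chosen drive letter on a host. Treat the disk number, drive letter and label as placeholders for your own environment; the cluster-creation and CSV steps are sketched further down the post.

```powershell
# Hedged sketch: prepare a shared disk before building the cluster.
# Disk number, drive letter and label are placeholders.
Set-Disk -Number 2 -IsOffline $false          # bring the LUN online on this host
Set-Disk -Number 2 -IsReadOnly $false
Initialize-Disk -Number 2 -PartitionStyle GPT # leave it as a BASIC disk
New-Partition -DiskNumber 2 -UseMaximumSize -DriveLetter V |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel 'CSV01' -Confirm:$false
```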

Introduction

In my previous post I went through the appropriate hoop-jumping with the Microsoft Failover Clustering validation wizard. By now I was more than ready to actually create my first cluster. If you recall, I had trouble with the Validate Configuration wizard. It would not accept my Microsoft NIC team as offering proper redundancy to avoid isolation and split-brain – and additionally it kept on flagging the boot disk as being a potential problem. These were both false alarms – crying wolf, if you like. To over-react to them would be like someone who looks at the Event Log of a Windows instance and is horrified to see errors and warnings. I guess as an industry we have come to accept those false alarms from Microsoft for what they are. With that said, it does rather worry me that validation of those settings is, to some degree, a requirement for being supported by Microsoft. This is something that is quite baldly stated in the “Create Cluster…” wizard:

image2

Am I the only one here who finds the statement odd: “No. I do not require support from Microsoft for this cluster, and therefore do not want to run the validation tests…”? I mean, who needs clustering but then doesn’t need it supported? I wondered what the legal ramifications of this second radio button were. If you selected it and a problem occurred, could Microsoft hold it against you? If you select the top radio button, and the verification has warnings or errors – however benign – could Microsoft use your “misconfiguration” to say that you are not supported? Perhaps I could be accused of scare-mongering here – but I can’t help feeling that if Microsoft “Failover Clustering” was better executed there wouldn’t be a need for all this validation and second-guessing of support.

image3

What I found is that because my validation resulted in red and yellow comments, Microsoft regarded the cluster as unsupportable. At the end of the test, the wizard will NOT allow you to proceed UNLESS you select the second radio button. I did discover that you could run the Validate Cluster wizard separately, and even when you get a failed report you can still choose to create a cluster – and this worked. I’m not sure what the support status is if you take this approach – I imagine it’s assumed you don’t want a supported configuration.
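The command-line equivalent, for reference, is the Test-Cluster cmdlet from the FailoverClusters module – a minimal sketch below, with placeholder node names. It produces the same style of report as the wizard, and nothing stops you running New-Cluster afterwards even if the report is ugly (the support caveat presumably still applies).

```powershell
# Hedged sketch: run cluster validation from PowerShell (node names are placeholders).
Import-Module FailoverClusters
Test-Cluster -Node 'HYPERV01','HYPERV02'
# Even with warnings or failures in the report you can still go on to run New-Cluster.
```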

One of the more annoying aspects of the validation wizard is how different passes of the wizard give different results. On Friday I had a successful pass on the storage.

image4

This Monday morning I ran the same validation check and my heart sank. Back had come the same goddamn error messages. It’s hard to imagine what’s gone wrong in the time I’ve been away…

image5

I’m not sure quite what’s causing this problem. Looking at the alerts on the server, it reports problems with “GoldCluster” and “GoldCluster01”. These are abortive attempts to create clusters made weeks ago, which are still being reported by the event system. My fear is that, like a lot of Windows technologies, if it works first time you’re good to go – but if you have problems, the clean-up process isn’t up to muster and you’re lumbered with a lot of gunk left behind from previous configurations. I’m just hoping these aren’t going to be fatal, and that I won’t have to wipe these boxes down and start all over again to get a “clean installation”.

[This is precisely what I ended up doing! I wiped the Windows Hyper-V R2 Preview and installed the Windows Hyper-V R2 GA available to TechNet/MSDN subscribers. With that, many of my storage problems went away…]

Creating a Cluster using Failover Cluster Manager

There are two ways to create a cluster for use with Windows Hyper-V: you can either create the cluster from the dedicated “Failover Cluster Manager” or use System Center Virtual Machine Manager. The former is a generic interface that could be used to manage a cluster protecting any Windows service. There’s perhaps “too much” UI here for the average virtualization administrator to have to worry about, but it is robust in the sense that it gives plenty of information about what your configuration is doing. A more “virtualization-centric” view is provided by SCVMM. The trouble with it is that if something does go wrong in the configuration, the feedback to the SysAdmin is pretty basic. The assumption is that everything is in place and works first time. As we’ve seen, that’s a mighty big assumption to make with MSCS. I decided to use both methods so I could compare and contrast the experience.

First up – the Failover Cluster Manager.

In the “Access Point for Administering the Cluster” step I was surprised to see only one network range available from my management network (172.168.3.0) when actually there’s a range for 192.168.3.0 as well. This “access point” is the IP address for managing the cluster itself. In classic clustering, say for Microsoft SQL or Exchange, it would be the IP address that clients use to connect – with node01 and node02 hidden behind the cluster IP.
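If the wizard only offers one range, the PowerShell route lets you state the access point address explicitly. A minimal sketch, where the cluster name, node names and the address itself are placeholders rather than my actual values:

```powershell
# Hedged sketch: create the cluster with an explicit access point address.
# Cluster name, node names and the IP address are placeholders.
New-Cluster -Name 'GOLD-CLUSTER' -Node 'HYPERV01','HYPERV02' `
            -StaticAddress 192.168.3.50
```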

image6
NOTE: It’s interesting that Failover Clustering still shows the legacy of a previous era through its NetBIOS name requirements. If you’re unsure about the list of characters and words that Microsoft deems reserved, consult this KB article – http://support.microsoft.com/kb/909264?wa=wsignin1.0

The second part of the wizard, “Confirmation”, ensures that all storage compatible with Failover Clustering is added to the cluster.

image7

Despite the very disconcerting warnings from the Validate Configuration wizard and the Create Cluster wizard, it does appear that the storage is available and reserved by the cluster itself.

image8

With that said, I had brand-new networking warnings in the “Validation Report”. These involved some cryptic references such as “Com – System_Object”. Sadly, the pre-creation validation didn’t flag up this issue; it only became apparent after I had created the cluster itself.

image9

These issues were eventually resolved by scrapping my initial network configuration, which used teamed networks with multiple IP addresses, and opting for a flat network with no network resiliency.

image10

It was only after making these network changes that I began to see the correct IP range for management. I’d love to know the logic that controls this selection of interfaces. I understand the wizard will only show networks that are valid across all the nodes in the cluster, as listed under the “Networks” node.

image11

Adding a CSV Volume to the Cluster

Despite these errors I thought it was best to press on with adding my storage to the cluster. I didn’t want to get bogged down in resolving every error report I might get, especially as it was increasingly difficult to see whether they were significant. Don’t get me wrong – I believe in validation – but I start to lose confidence when you start getting false positives in such reports. I can’t help thinking that if Failover Clustering was better designed, I wouldn’t be getting bogged down in all these validation reports. Just sayin’.

Disks can be brought under the control of the cluster by using the Failover Cluster Manager, expanding the cluster, navigating to +Storage, +Disks and then right-clicking the disk. One thing that’s just never been clear to me, despite reading all the documentation, is what state the disks should be in to successfully add them into the Failover Cluster Manager. Should they be online or offline? On just one server, or on all of them? Should the disks be partitioned and formatted?
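The PowerShell equivalent of that right-click dance is sketched below. It assumes the disk is already visible to the cluster nodes, and that the default resource name it picks up is something like “Cluster Disk 1” – both assumptions, not guarantees.

```powershell
# Hedged sketch: claim available disks for the cluster and turn one into a CSV.
Get-ClusterAvailableDisk | Add-ClusterDisk
Add-ClusterSharedVolume -Name 'Cluster Disk 1'   # resource name is an assumption

# The CSV mount point should then appear under C:\ClusterStorage\
Get-ClusterSharedVolume |
    Select-Object -ExpandProperty SharedVolumeInfo |
    Select-Object FriendlyVolumeName
```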

One of the frustrating aspects of trying to resolve this is that once storage is claimed by the Failover Cluster Manager, other tools such as Computer Management start to break, because the resource is owned by another part of the Microsoft Windows platform.

 image13

From the Failover Cluster Manager you should be able to add the SAN/iSCSI disks to the cluster.

image14

Creating the cluster should update SCVMM so that the new cluster name appears in the list.

 image15

Creating a VM on a Cluster

I would have to say my first experience of creating a VM on a Microsoft Failover Cluster wasn’t a happy one. That’s because the books I was reading didn’t flag up a requirement on the VM, and also my extensive dabbling with the network had caused “non-compliance” to take place.

The first error I received was when creating a VM for the first time – this is the error message that came up in the jobs view:

 image16

 The Recommended Action was a bit misleading because there was nothing wrong with the CSV. The problem was me trying to create a “non-highly available” VM on a “Clustered Resource”. That’s not allowed. What the SysAdmin must remember to do is enable this option on the properties of the VM or template.

 image20

That’s right. Unlike VMware vSphere clusters enabled for HA and DRS, where any VM placed on the cluster gets protection, Windows Hyper-V makes this a per-VM setting which, if it’s not enabled, will cause the error message above. The other thing you need to be careful about is exactly where the VM’s files are located. Windows Hyper-V and SCVMM allow you to set different locations for the configuration files and the larger virtual disks that make up the VM. If you are not careful these can wind up on non-shared local storage – causing “live migrate” errors at a later stage.

image21
Note: Yes, I know there are innumerable places where these defaults can be altered – but none of them seem to take effect in all cases. Perhaps it’s a situation where all the defaults need to be changed in all locations before they take effect.

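Here’s a hedged sketch of how I’d pin those defaults down from PowerShell on each node, plus the FailoverClusters cmdlet that makes an existing VM highly available. The CSV path and VM name are placeholders, and this only sets the Hyper-V host default – not every place SCVMM keeps its own copy of the setting.

```powershell
# Hedged sketch: point the Hyper-V defaults at the CSV on each cluster node
# so new VM files don't land on local storage. Path and VM name are placeholders.
Set-VMHost -VirtualMachinePath 'C:\ClusterStorage\Volume1' `
           -VirtualHardDiskPath 'C:\ClusterStorage\Volume1'

# Make an existing VM highly available (the "Availability" setting, in effect)
Add-ClusterVirtualMachineRole -VMName 'TestVM01'
```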
Another area that gave me warnings was “non-compliance” on the logical switch. In truth, during the writing of this series of articles I had to monkey about with the networking significantly, mainly in a desperate bid to get network redundancy whilst also meeting the Microsoft validation tests. I was able to deploy a new VM, but SCVMM gave me one of those “w/info” errors. In case you don’t know, SCVMM tasks occasionally complete but appear with the phrase “with info”. Nine times out of ten these indicate a wider problem, so you shouldn’t really ignore them. It was this one that caught my eye.

 image22

A quick Google (not Bing!) brought up a blog article called Fixing non-compliant virtual switches in System Center 2012 Virtual Machine Manager by J.C. Hornbeck. In it, J.C. explains a common problem that can occur:

This can be caused by creating logical switches on the nodes out of order, thus the settings in the logical switch may not be in sync. Another possibility is that changes were made to the virtual switch out of band (i.e. directly in Hyper-V), thus the settings may be different for each node. Fortunately for us, a new feature of System Center 2012 Virtual Machine Manager Service Pack1 (VMM) will warn us if settings among switches are different.

Sure enough, when I checked out the +Fabric, +Logical Networks and “Hosts” view, I could see that the networking had a problem.

 image23

My hosts were only “partially compliant” because one of the Logical Switches was “Not Compliant”. I’m pretty sure I didn’t touch the networking options directly on the host using Hyper-V Manager, but I was making changes to the Logical Switch and its associated dependencies. Why this non-compliance happened I will probably never know. Fortunately, there is a “remediate” button you can click to rectify this “jiving” within the management layer.

 image24

I understand this “remediate” option was introduced as part of Service Pack 1 for SCVMM. The remediation did take some time to resolve the issue. In fact I had a Windows Hyper-V host claim it was “partially compliant” even though all the components within it were compliant!

image25

Perhaps this was some sort of “GUI refresh” problem. After SCVMM was left alone for half an hour, the error messages went away. Phew!

Revalidating a Cluster

One thing I noticed is that once a cluster is under the scope of SCVMM, there is an option to validate it – and this appears to ignore any validation you might already have done with the Failover Cluster Manager. A right-click on the cluster will give you the chance to validate it from SCVMM, and I was disappointed that this generated issues as well.

image26

The validation process generates reports that are available on any of the nodes in the cluster, so I decided to check one out. The first warning was about the cluster configuration itself – specifically the state of the quorum disk. Fortunately, this warning did come with instructions on how to reconfigure this on the cluster itself. (Note: this cannot be changed from SCVMM.)
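For reference, the quorum can be inspected and changed from the FailoverClusters module – a minimal sketch below, assuming a disk witness resource called “Cluster Disk 1” (the name is a placeholder for whatever your quorum disk is actually called):

```powershell
# Hedged sketch: check and reconfigure the quorum (witness disk name is assumed).
Get-ClusterQuorum
Set-ClusterQuorum -NodeAndDiskMajority 'Cluster Disk 1'
```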

 image27

Next there was a whole host of storage warnings. Most of these appear to be “crying wolf”, because the validation process cannot validate CSVs that are online. Ironically, you have to take the CSV offline for the test to work.

 image28

Creating a Cluster: With SCVMM

SCVMM comes with an “Uncluster” option which removes the cluster altogether and leaves the Windows Hyper-V nodes in SCVMM. It also (allegedly!) cleans up and removes the cluster configuration that might have been initiated by Failover Cluster Manager on the Windows Hyper-V node(s) themselves. Despite this option, it doesn’t entirely “clean up” the environment. Disks that have been partitioned and formatted as part of the creation of a cluster are left alone – this is intended to prevent data loss on the volumes themselves. In fact I had a number of failures because of the previous cluster configuration that had been tattooed onto the Windows Hyper-V hosts. That required me to go into “Disk Management” on the hosts, destroy the existing partition tables created earlier, and then rescan the other Windows Hyper-V hosts to make sure they were seeing the same disk environment.
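The clean-up I ended up doing by hand in Disk Management can also be scripted – a cautious sketch below, with the disk number as a placeholder. Clear-Disk is destructive, so the confirmation prompt is deliberately left on.

```powershell
# Hedged sketch: wipe the stale partition table left by the old cluster and
# rescan storage on the other hosts. DESTRUCTIVE - disk number is a placeholder.
Clear-Disk -Number 2 -RemoveData -Confirm:$true
Update-HostStorageCache    # roughly the 'Rescan Disks' action in Disk Management
```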

image29

The main reason I thought I would look at Windows Hyper-V clustering from the SCVMM perspective is that, for me, doing this sort of configuration normally begins with the virtualization management layer – in my case VMware vCenter. From one management tool (you could call it a single pain of glass if you like :-p) you do all the management. I wanted to see how SCVMM compared to vCenter in this respect – for a more “virtualization-centric” viewpoint than a generic clustering interface.

You can trigger the creation of a Windows Hyper-V cluster from the “Fabric” view using the big yellow “create” button.

 image30

You give the cluster a name, and then add in the nodes you want to be part of the cluster configuration.

image31
Note: You’ll see there is an option to “Skip cluster validation tests”. Given what I’ve seen so far, I wouldn’t recommend enabling this – just in case. The Failover Cluster validation wizard produced a lot of interesting error messages when I first started looking at this, and I don’t think ignorance is a state of bliss where MSCS is concerned!
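For completeness, VMM has a PowerShell equivalent of this wizard, Install-SCVMHostCluster. The sketch below is an assumption-heavy outline – the host names, cluster name and Run As account are placeholders, and I haven’t verified every parameter against the 2012 R2 module, so treat it as a pointer rather than a recipe.

```powershell
# Hedged sketch: create the cluster from the VMM side. All names are placeholders.
$nodes = 'HYPERV01','HYPERV02' | ForEach-Object { Get-SCVMHost -ComputerName $_ }
$runAs = Get-SCRunAsAccount -Name 'Domain Admin'   # assumed Run As account name
Install-SCVMHostCluster -ClusterName 'GOLD-CLUSTER' -VMHost $nodes -Credential $runAs
```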

The clustering wizard interrogates the storage – and makes it available to the cluster – this is actually a small window, and you’ll need to scroll in order to see the “CSV” option.

image32

I noticed there was an option to create an external virtual switch, but as I had already handled this earlier, I was able to ignore it altogether. As ever, the main view of SCVMM gives you really no indication of what’s going on – which means an excessive amount of toggling between the “VMs and Services”, “Fabric” and “Jobs” views. I know the “Tasks” view in vCenter sometimes gets some flak for the depth of information presented there – but at least it’s an easy way to monitor the progress of your administration, and it easily flags up potential errors or problems.

image34

The creation of the cluster seemed to take much longer (nearly 10 minutes), but I imagine that’s a consequence of both making the cluster and validating it afterwards. Despite repeated attempts, I was never able to successfully create a cluster using SCVMM.

image35

At this point I’d spent some weeks trying to stand-up Microsoft Clustering – and I must admit a couple of times I’d lost the will to live with the whole darn thing. It was time to move on and look at something else.

Conclusions:

Looking back on this process I began to see that there might be a more efficient order of steps than the ones I’d originally undertaken. It strikes me that it’s probably easier to configure clustering in advance of enabling the Windows Hyper-V role itself. That way, when you come to enable Windows Hyper-V for the first time, you can properly select the default storage locations.

image37

The other issue I was struck by is the innumerable different ways storage is referenced. You have your disk numbers in Computer Management (Disk 0/1/2), and each of these is addressed by a drive letter (C:, V:, Q:). When Failover Cluster Manager claims these disks they get appended additional identities – the “Cluster Disk” name – as well as the CSV mount name, in my case C:\ClusterStorage\Volume1

image38

It does seem to be the case that these numberings are controlled by the order in which the disks are made into CSVs. The screen grab above shows the sort of “jive” I’m talking about – Cluster Disk 2 is known as C:\ClusterStorage\Volume1. It’s a small matter, and one corrected by relabelling the “Cluster Disk 2” friendly label – it’s just that I don’t feel I have time for this sort of tidying-up after doing my management.
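If you do want to tidy the names up, my understanding (an assumption, not something I’ve dug into deeply) is that the cluster disk’s Name property can simply be assigned from the FailoverClusters module:

```powershell
# Hedged sketch: relabel the cluster disk so it matches its CSV mount point.
# Both names here are assumptions based on the screen grab above.
(Get-ClusterSharedVolume -Name 'Cluster Disk 2').Name = 'CSV - Volume1'
```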

I started off talking about the bad old days of clustering back in the NT4/2000 era, with a genuine hope that things had got better and improved. I’m sorry to say that I find MSCS much the same as ever. It’s important to remember why clustering from VMware became so popular: it’s a way of protecting VMs that is OS-neutral, and it should be much easier to set up than clustering inside the guest operating system. I’ve used many different availability tools that run inside Windows, and in truth they can all be “interesting” to set up and manage. And I think that’s really at the heart of the issue with Microsoft Failover Clustering when it’s used with Windows Hyper-V: it has all the history, pedigree and legacy of a clustering system that was originally designed for protecting services, not VMs. As I sat back and looked at my experiences, it was salutary to think about how much hard work the Microsoft experience had been. All I wanted was for a VM to restart on another host in the event of failure.

Is that so hard?