A week or so ago I was experimenting with running ESX inside a vCloud Director vApp, a process I dubbed “vINCEPTION” to describe the way ESX can virtualize itself (often referred to as vESX or “nesting”). I was trying out different ways of getting the VMs running on top of vESX to talk to the outside world, such as guest VLAN tagging. Anyway, a couple of days later I cranked up my vSphere Client, only to see lots of red exclamation marks on my hosts.

In the all-new vSphere Web Client, the Alarm pane flagged these up like so:

Yikes!

On closer inspection, I found that every host had alarms relating to an inaccessible VLAN/network and to the MTU values:

It didn’t take me long to find the culprit: a portgroup I’d created with no VLAN tagging. Normally, that setup would be used with a physical switch that hadn’t been VLAN’d at all. The trouble was that the vSphere Distributed Switch I had created the portgroup on was fully VLAN’d up. This alarm would normally be triggered by something like me typing in VLAN 400 when VLAN 400 wasn’t defined on the physical switch at all.

I knew that DvSwitch “Health Status” is a feature configured from the all-new vSphere Web Client, so it was time to go hunting there. I was helped by the fact that I hadn’t touched my other DvSwitch in weeks. That “Infrastructure” switch handles only my internal traffic (management, HA, HA heartbeat, IP storage, vMotion and so on…).

Looking at the “Virtual Datacenter DvSwitch”, I could see 18 warnings. Checking dvUplink1/2 on vmnic2/3, I could see a VLAN trunk using ID 0, which was flagged as “not supported”.
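To make the comparison concrete, here is a toy Python model of what a VLAN health check is looking for: each portgroup’s VLAN ID must be one that the physical switch port actually carries. It is a sketch under stated assumptions, not VMware’s implementation. The names `PhysicalTrunk` and `vlan_health_check` are hypothetical, a portgroup with no VLAN is modelled as ID 0 (untagged), and the physical port is assumed to pass untagged frames only if it has a native VLAN. The trunked VLANs 101 to 103 and the “Production”/“Test” portgroups are made up; VLAN 400 mirrors the example above.

```python
from __future__ import annotations

from dataclasses import dataclass

# Toy model of a VLAN health check. Hypothetical names; NOT VMware's code.
UNTAGGED = 0  # assumption: a portgroup with no VLAN ID is modelled as ID 0


@dataclass(frozen=True)
class PhysicalTrunk:
    """VLANs a physical switch port will carry (hypothetical config)."""
    allowed_vlans: frozenset[int]
    native_vlan: int | None = None  # None = port carries tagged frames only

    def carries(self, vlan_id: int) -> bool:
        if vlan_id == UNTAGGED:
            # Untagged frames only get through if the port has a native VLAN.
            return self.native_vlan is not None
        return vlan_id in self.allowed_vlans


def vlan_health_check(portgroups: dict[str, int],
                      trunk: PhysicalTrunk) -> list[str]:
    """Return one warning per portgroup whose VLAN the trunk won't carry."""
    warnings = []
    for name, vlan_id in sorted(portgroups.items()):
        if not trunk.carries(vlan_id):
            label = "ID0 (untagged)" if vlan_id == UNTAGGED else f"VLAN {vlan_id}"
            warnings.append(f"{name}: {label} not supported by physical trunk")
    return warnings


if __name__ == "__main__":
    trunk = PhysicalTrunk(allowed_vlans=frozenset({101, 102, 103}))
    portgroups = {"Production": 101, "Test": 400, "vINCEPTION": UNTAGGED}
    for warning in vlan_health_check(portgroups, trunk):
        print(warning)
```

Running it prints one warning for the VLAN 400 portgroup and one for the untagged vINCEPTION portgroup, roughly the two situations described here, while the correctly trunked portgroup stays quiet.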

A quick look at the vINCEPTION portgroup showed that no VLAN ID was specified. That was deliberate: I was trying to turn off VLAN tagging at the physical ESX host level and have the vESX host do it instead…

There are two easy fixes: delete the portgroup if it isn’t in use, OR edit the VLAN field and enter a VLAN value.
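The two fixes can be sketched against an even simpler toy model, in which a portgroup is healthy when its VLAN ID is in the set of VLANs the physical trunk carries. Again, this is a hypothetical illustration rather than any vSphere API: `untrunked` and `fix_portgroup` are made-up names, ID 0 stands for “no VLAN”, and the trunk is assumed to carry only the made-up VLANs 101 to 103 (so untagged traffic is not carried).

```python
from __future__ import annotations

# Hypothetical sketch of the two fixes. ID 0 = "no VLAN"; a portgroup is
# healthy only if its VLAN ID is in the set the physical trunk carries.


def untrunked(portgroups: dict[str, int], trunked: set[int]) -> set[str]:
    """Names of portgroups whose VLAN the physical trunk does not carry."""
    return {name for name, vlan in portgroups.items() if vlan not in trunked}


def fix_portgroup(portgroups: dict[str, int], name: str, in_use: bool,
                  vlan_id: int | None = None) -> dict[str, int]:
    """Return a new mapping with one portgroup fixed.

    Fix 1: if the portgroup isn't in use, delete it.
    Fix 2: otherwise, give it a VLAN ID that the trunk carries.
    """
    fixed = dict(portgroups)  # copy, so the original config is untouched
    if not in_use:
        del fixed[name]
    elif vlan_id is None:
        raise ValueError("an in-use portgroup needs a VLAN ID")
    else:
        fixed[name] = vlan_id
    return fixed


if __name__ == "__main__":
    trunked = {101, 102, 103}
    pgs = {"Production": 101, "vINCEPTION": 0}
    print(untrunked(pgs, trunked))  # {'vINCEPTION'}
    print(untrunked(fix_portgroup(pgs, "vINCEPTION", in_use=False), trunked))  # set()
    print(untrunked(fix_portgroup(pgs, "vINCEPTION", in_use=True, vlan_id=102), trunked))  # set()
```

Returning a copy rather than editing in place keeps the before and after states easy to compare, much like re-checking the Health Status view after making the change.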

After doing that, all was right with the world. Naturally, I like a world of green ticks!

Conclusions:

So what’s so great about this? EVERYTHING! The vSphere platform is getting to be like a good bottle of wine: as it gets older, it matures. Problems that would once have been buried under layers of configuration settings, and been difficult to locate, are now exposed immediately. These “Health Check” style features actually helped me diagnose a potential problem (granted, one of my own creation). I hope and expect to see more of these sorts of diagnostic checks in the future. After all, we’re only human, with monkey brains…