Houston, we have a problem

This week I had a major problem. My vCenter Server Appliance just stopped working for no apparent reason. I couldn’t log in via the web or vSphere client. All I got was an error about not being able to connect to the Lookup Service, which would appear after the logon attempt: “Failed to connect to VMware Lookup Service – https://vcnyc.corp..com:7444/lookupservice/sdk”.

Screen Shot 2013-07-06 at 05.08.52

Things started to look worrying when I was able to sniff around the appliance management web page (the one on port 5480). Everything looked to be running; there was just one thing – the DB was at 100%.

Screen Shot 2013-07-06 at 04.48.33

In my experience anything utilised to 100% is generally a problem, especially when it comes to storage. In my case I was using the local PostgreSQL database which is stored on the appliance itself – and it had run out of space. I’m prepared to admit that might have been my fault. The vCSA scales to some 5 ESX hosts and some 50 VMs, but there have been times I’ve had 9 hosts. I can’t honestly say either way whether I’ve had more than 50 VMs (although right now I have nearly 60 VMs and templates), but I doubt it. I don’t have enough RAM for that! It also turns out that I could have dialled down the retention of data in the database to keep it skinny.

Sure enough, with a quick google I was able to find folks in the vCommunity who’d had the same experience. I decided to open an internal Bugzilla ticket with our support folks, and my worst fears were confirmed. There are a couple of resolutions to this problem:

  1. Hit the big reset button and zap the database…
  2. Create a new disk, partition and copy the DB files to the new location…
  3. Increase the size of the VMDK, and use a re-partitioning tool to increase the disk space

Clearly, (1) is not a terrifically good option – I’ve got vCD, vCAC, vDP, VR, vCC and View all in some way configured for the vCenter, and zapping the database would pull the rug from under the configuration of all these other components. Option (2) seems a bit of a faff to me. And option (3) is what I would do if I was experiencing this issue on any other system.

Houston, we have a solution

The first job was to identify which disk and which partition the database is located on. I worked that out by interpreting some of the stuff the support guys had asked me to do. They’d had me run:

du -h /storage/db/

Using the mount command I worked out that the DB was on /dev/sdb3 – the second disk (b), third partition (3):

Screen Shot 2013-07-06 at 05.12.18
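For reference, the rest of that detective work only needs a couple of standard commands – a minimal sketch, assuming the DB path is /storage/db as it was on my appliance:

# Which filesystem/device does the DB path live on? df answers directly…
df -h /storage/db

# …and mount shows the full list of mounted devices; on mine the answer was /dev/sdb3
mount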

The plan was to increase the 2nd VMDK, then snapshot the VM, and then use a tool like gparted to grow /dev/sdb3. If something went wrong I could revert the VM and try option (2). The order of this is important. You cannot snapshot a VM and then increase the size of the disk, but you can increase the size of the disk, snapshot – and then repartition. The snapshot will protect you from the situation where the repartition goes belly-up.

TIP: I recommend gracefully shutting down the vCenter in this situation. It has an 8GB memory allocation, so taking a snapshot whilst powered on means creating a memory file of 8GB. The vCenter is inaccessible anyway, and if you’re using gparted you have to boot from a DVD .ISO to get exclusive access to the disk. So the order would be (I’ve sketched the command-line equivalent of steps 1 and 3 after the list):

1. Shutdown

2. Increase disk size

3. Snapshot

4. Attach DVD .iso containing gparted

5. Use gparted to resize the partition.
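If you’d rather drive steps 1 and 3 from the ESXi host’s shell than from the vSphere Client, something like this works – a rough sketch only, and the VM ID (42 here) is hypothetical; vim-cmd vmsvc/getallvms tells you the real one:

# Find the vCSA's VM ID on the host
vim-cmd vmsvc/getallvms

# Step 1: graceful guest shutdown (needs VMware Tools running in the appliance)
vim-cmd vmsvc/power.shutdown 42

# Step 3: take the snapshot AFTER the disk has been grown in step 2
vim-cmd vmsvc/snapshot.create 42 pre-gparted "before repartitioning /dev/sdb3" 0 0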

That first step was easy – all I needed to do was increase a spinner. How much by? Well, the VMDK is thin provisioned and sits on a volume with 1379GB of free space, so I decided to crank it up from 60GB to 200GB.

Screen Shot 2013-07-06 at 07.34.13
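For what it’s worth, the same increase can be made from the ESXi host’s command line with vmkfstools instead of the spinner – a sketch under the assumption that the second disk (the one holding /storage/db) is the _1.vmdk, and the datastore path below is purely illustrative. Note that this only works while the disk has no snapshots, which is exactly why the grow comes ahead of the snapshot in the ordering above:

# Step 2: grow the second virtual disk to 200GB (path/filename are hypothetical)
vmkfstools -X 200G /vmfs/volumes/datastore1/vCSA/vCSA_1.vmdk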

6. Next we boot to the DVD. gparted is a very simple utility – switch to the disk (sda to sdb), right-click the partition, select Resize/Move, drag the partition to take up the new free space, click OK and click Apply…

Screen Shot 2013-07-06 at 07.32.40

7. A reboot of the appliance brought the vCenter up again, and the DB had enough space…

Screen Shot 2013-07-07 at 10.54.49
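If you prefer the appliance’s console to the 5480 page, a quick look at the same path confirms the new headroom:

# The filesystem holding the DB should now show plenty of free space
df -h /storage/db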

Houston, how can we stop this happening again…

I think there are a couple of ways to avoid this:

  • Don’t use the embedded DB with the vCSA, and always use an external DB where it may be easier to monitor and manage the disk space… The difficulty here is that the only external DB supported by the vCSA is Oracle. So whilst a home lab might get away with Oracle XE, that’s probably not supported in production – and a pricey option as well.
  • Wait for a bug fix from VMware. Apparently this is a known issue with the embedded DB, and it will be resolved in future releases.
  • Personal recommendation: I would recommend increasing the disk space and changing your retention settings to make sure the DB is being purged of stale data… (a quick command-line check is sketched below the list).
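And as a belt-and-braces measure, a trivial check run from the appliance’s shell now and again will show whether the DB is creeping up on its partition once more – again assuming the /storage/db path from earlier:

# Which parts of the embedded DB are biggest right now?
du -sh /storage/db/*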

Database Retention Settings and Statistics Intervals.