I am trying to hold my frustration at bay while I wait for VMware support to get back to me and figure out what the h*** happened yesterday that has sent my vROPS 6.0.1 cluster down for the count for close to 24 hours now.
Here is a recap of what happened up to the point where I realized that the cluster was in what I would call an inconsistent state. I spent most of yesterday cleaning up by removing a number of old, unused VMs. Among those were a couple of powered-off VMs that I did not think much of before deleting them.
About 1½ hours after deleting the last VMs, I got an error in vROPS saying that one adapter instance could not collect information about the aforementioned powered-off VMs. I looked in the environment information tab to see if they were still listed along with some of the others I had deleted. But no, they weren’t there. Hmm.
Then I thought they might still be listed in the license group I had defined. I went over to look, and to my horror this was the first sign that something was wrong: none of my licenses were in use?! In the license groups view all my hosts were suddenly shown as unlicensed, and my license group, which normally has around 1800 members, was empty. What? Editing the license group showed that the 1800 members, including the hosts shown as unlicensed, were listed as “Always include”, so how come they weren’t licensed?
At this point I began suspecting that the cluster was entering a meta state of existence. Looking at the Cluster Management page, I missed a critical piece of info at first, but more on that later. Everything appeared to be up and running, so I went to the Solutions menu with the intent of testing the connection to each vCenter server. But doing so caused an error saying that the currently selected collector node was not available. But the cluster had just told me everything was up? So I tried each of the 4 nodes, but none worked. Okay, what do I do? I tried removing an adapter instance and adding it again. Big mistake. I couldn’t re-add it with the same name, so I had to make up a new name for the same old vCenter.
That still did not work. Then I went back to Cluster Management and decided to take one of the data nodes offline and then online again to see if that fixed it. While waiting at “Loading” after initiating the power off, I suddenly got an error saying it was unable to communicate with the data node. Then the page reloaded and the node was still online. Unsure what to do, I stared at the screen, only to suddenly see the message “Your sessions has expired” and then get booted back to the login page.
When I logged back in, I now saw only half of the environment, because the old adapter that I had removed and re-added under another name was not collecting. It just stated Failed.
At this point I decided to take the car home from the office. I was not sure what to do and needed a few hours to get some distance from it. Back home I connected to the cluster again and looked at Cluster Management again. Then I spotted the (or at least “a”) problem.
Below is a screen print of what it normally looks like:
And here is what it looked like now:
Notice the slight problem that both HA nodes reported as being Master? That cannot be good. What else was there to do but power off the entire cluster and bring it online again?
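As an aside, two nodes both claiming to be Master is the kind of condition a small script could flag before a human spots it. Below is a minimal sketch of such a check. Everything in it is an assumption for illustration: the node names, the role strings, and the list-of-dicts layout are made up and are not taken from the vROPS API or from anything the Cluster Management page exports.

```python
# Hypothetical sketch: flag a cluster where more than one node claims the
# Master role. The data layout and role names are invented for illustration
# and do not reflect any real vROPS API.

def nodes_claiming_role(nodes, role="MASTER"):
    """Return the names of all nodes whose reported role matches `role`."""
    return [n["name"] for n in nodes if n["role"].upper() == role.upper()]


def has_duplicate_master(nodes):
    """True if more than one node reports itself as Master."""
    return len(nodes_claiming_role(nodes, "MASTER")) > 1


# What a healthy 4-node HA cluster might look like (hypothetical values).
normal_cluster = [
    {"name": "node1", "role": "MASTER"},
    {"name": "node2", "role": "REPLICA"},
    {"name": "node3", "role": "DATA"},
    {"name": "node4", "role": "DATA"},
]

# The situation described above: both HA nodes report as Master.
broken_cluster = [
    {"name": "node1", "role": "MASTER"},
    {"name": "node2", "role": "MASTER"},
    {"name": "node3", "role": "DATA"},
    {"name": "node4", "role": "DATA"},
]

if __name__ == "__main__":
    for label, cluster in (("normal", normal_cluster), ("broken", broken_cluster)):
        masters = nodes_claiming_role(cluster)
        status = "PROBLEM" if has_duplicate_master(cluster) else "ok"
        print(f"{label}: masters={masters} -> {status}")
```

A check like this only detects the symptom; it says nothing about why both nodes believe they are Master.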
About 30 minutes later the cluster was back online and I started to get alerts again. A lot of alerts. Even alerts that it had previously cancelled back in Easter week. But okay, monitoring was running again. So I decided to leave it at that and pick it up again this morning.
Well, still no dice: things were still not licensed. Dammit. So I opened a ticket with VMware. While uploading log bundles and waiting, I tried different things to get it to work, but nothing helped. Then suddenly my colleague said he couldn’t log into vROPS with his vCenter credentials. What? I had been logged in as Admin while trying to fix this, so I hadn’t tested my vCenter account. It did not work either. Or at least not when using the user@doma.in notation. Using DOMA\user it worked; at least I could log in and see everything from the adapter that I re-added yesterday. Not the other one. What?
By this time a notification event had popped up in vROPS, and clicking it gave me “1. Error getting Alert resource”. What? Now pretty desperate, I powered off the cluster again and then back on. This fixed the new error of alerts not showing. At least for 30 minutes. Then suddenly some alerts showed the error again.
Trying to log in with vCenter credentials did not work at all now. This is escalating! I tried setting the login to a single vCenter instead of all vCenters. Previously I had only been able to see the contents of the re-added vCenter adapter, so I tried the one I could not see anything from. DOMA\user worked and I could see the info from it. Success, I thought. Logging back out and trying against the re-added vCenter did not work with DOMA\user, but user@doma.in worked? But when inspecting the environment I was seeing the data from the other vCenter? What?
Right now I am uploading even more logs to VMware. I will update this when I figure out what the h*** went wrong here.