vROps down for the count

While I try to hold my frustration at bay and wait for VMware support to get back to me, I am still trying to figure out what the h*** happened yesterday that has sent my vROps 6.0.1 cluster down for the count for what is now close to 24 hours.

A recap of what happened up to the point of realizing that the cluster was in what I would call an inconsistent state: I spent most of the day yesterday cleaning up by removing a number of old, unused VMs. Amongst those were a couple of powered-off VMs that I did not think much of before deleting them.

About 1½ hours after deleting the last VMs I got an error in vROps about one adapter instance not being able to collect information about the aforementioned powered-off VMs. I looked in the environment information tab to see if they were still listed along with some of the others I had deleted. But no – they weren’t there. Hmm.

Then I thought they might still be listed in the license group I had defined. I went over to look, and to my horror this was the first sign something was wrong – none of my licenses were in use?! Looking in the license groups view, all my hosts were suddenly shown as unlicensed, and my license group, which normally has around 1800 members, was empty. What? Editing the license group showed that the 1800 members, including the hosts showing as unlicensed, were listed as “Always include” – so how come they weren’t licensed?

At this point I began suspecting that the cluster was entering a meta state of existence. Looking at the Cluster Management page I missed a critical piece of info at first, but more on that later. Everything was up and running, so I went to the Solutions menu with the intent of testing the connection to each vCenter server. But doing so caused an error that the currently selected collector node was not available? But the cluster just told me everything was up? So I tried every one of the 4 nodes, but none worked. Okay, what do I do? I tried removing an adapter instance and adding it again. Big mistake. You can’t re-add it with the same name, so I had to make up a new name for the same old vCenter.

That still did not work. Then I went back to Cluster Management and decided to take one of the data nodes offline and then online again to see if that fixed it. While waiting at “Loading” after initiating the power off, I suddenly got an error saying it was unable to communicate with the data node. Then the page reloaded and the node was still online. Unsure what to do, I stared at the screen only to suddenly see a message, “Your session has expired”, and then be booted back to the login screen?

When logging back in I could only see half of the environment, because the old adapter that I had removed and re-added under another name was not collecting. It just stated Failed.

I decided to drive home from the office. I was not sure what to do and needed a few hours to get some distance from the problem. Back home I connected to the cluster and looked at Cluster Management again. Then I spotted the problem (or at least “a” problem).

Below is a screen print of what it normally looks like:

Correct

And here is what it looked like now:

Wrong

Notice the slight problem that both HA nodes reported as being Master? That cannot be good. What to do other than power off the entire cluster and bring it online again.

About 30 minutes later the cluster was back online and I started to get alerts again. A lot of alerts. Even alerts that it had previously cancelled back in Easter week. But okay – monitoring was running again. So I decided to leave it at that and pick it up again this morning.

Well, still no dice – things were still not licensed. Dammit. So I opened a ticket with VMware. While uploading log bundles and waiting I tried different things to get it to work, but nothing. Then suddenly my colleague says he can’t log into vROps with his vCenter credentials. What? I had been logged in as admin while trying to fix this, so I hadn’t tested my vCenter account. But it did not work. Or at least not when using user@doma.in notation. Using DOMA\user it worked – at least I could log in and see everything from the adapter that I re-added yesterday. Not the other one. What?

By this time a notification event popped up in vROps; clicking it gave me “1. Error getting Alert resource”. What? Now pretty desperate, I powered off the cluster again and then back on. This fixed the new error of not showing alerts. At least for 30 minutes. Then suddenly some alerts showed this error again.

Trying to log in with vCenter credentials did not work at all now. This is escalating! I tried setting the login source to a single vCenter instead of all vCenters. Previously I had only been able to see the contents of the re-added vCenter adapter, so I tried the one I could not see anything from. DOMA\user worked and I could see the info from it. Success – I thought. Logging back out and trying it against the re-added vCenter did not work with DOMA\user, but user@doma.in worked? But when inspecting the environment I was seeing the data from the other vCenter? What?

Right now I am uploading even more logs to VMware. I will update this when I figure out what the h*** went wrong here.

 

Disabling “One or more ports are experiencing network contention” alert

From day one of deploying vRealize Operations Manager 6.0 I had a bunch of errors in our environment on distributed virtual port group ports. They were listed with the error:

One or more ports are experiencing network contention

Digging into the exact ports that were showing dropped packets resulted in nothing. The VMs connected to these ports were not registering any packet drops. Odd.

It took a while before any info came out, but it apparently was a bug in the 6.0 code. I started following this thread on the VMware community boards and found that I was not alone in seeing this error. In our environment the error was also only present when logged in as the admin user. vCenter admin users were not seeing it, so this pointed towards a cosmetic bug.

A KB article was released about the bug, stating that the alert can be disabled, but it does not describe exactly how to disable it. The alert is disabled by default in the 6.0.1 release, but if you installed 6.0 and upgraded to 6.0.1 without resetting all settings (as I did not), the error is still there.

To remove the error, log in to the vROps interface and navigate to Administration, then Policies and lastly Policy Library, as marked in the image below:

Policy

Once in the Policy Library view, select the active policy that is triggering the alert. For me it was Default Policy. Once selected, click the pencil icon to edit the profile as shown below:

Edit

On the Policy Editor, click step 5 – Override Alert / Symptom Definitions. In Alert Definitions, click the drop-down next to Object Type, fold out vCenter Adapter and click vSphere Distributed Port Group. Two alerts will now show. Next to the “One or more ports are experiencing…” alert, click the arrow by State and select Local with the red circle with a cross, as shown below.

Local Disabled

I had a few issues with clicking Save after this. I do not know exactly what fixed it, but I had just logged in as admin when it worked. This disables the alert! Easy.


Default Host Application error

Last week I was called up by one of our Windows admins. He had some issues with a VM running Windows and IIS. As we were talking he also casually mentioned another error he was seeing that was “caused by VMware”. I was a bit sceptical, as you might imagine 🙂

He was seeing this error when he attempted to browse the IIS web page by clicking the link available in the IIS Manager:

Default Host Application Error

Notice the VMware icon at the bottom. This is an error from VMware Tools! What? As any sane person would do, I consulted Google. And got a hit here – https://communities.vmware.com/message/2349884

The third reply post gave me the answer. It seems that when installing VMware Tools it might associate itself with HTTP and HTTPS links. This would then cause a click on the link in IIS Manager to call VMware Tools, which is unable to service the request. The fix is pretty straightforward.

Go to Control Panel, then Default Programs and then Set Associations. Scroll down to the Protocols section and locate HTTP and HTTPS. Make sure these are set to your browser of choice – in the image below I set them back to Internet Explorer (he was a Windows sysadmin after all 🙂). If the association is wrong, it will be set to Default Host Application, as shown for TELNET.

Fix
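
If you want to verify the association from a console instead of clicking through Control Panel, the effective protocol handlers can also be read from the registry. The snippet below is only a sketch of that idea – the paths are the standard protocol handler keys, and the output will of course vary per machine:

# Show which command is currently registered for the http and https protocols.
# If this points at a VMware Tools executable instead of a browser, the association is wrong.
foreach ($proto in "http", "https") {
    $key = "Registry::HKEY_CLASSES_ROOT\$proto\shell\open\command"
    $cmd = (Get-ItemProperty -Path $key -ErrorAction SilentlyContinue).'(default)'
    "{0,-6} -> {1}" -f $proto, $cmd
}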

Working with Tags

The last couple of days I have been working with PowerCLI and vCenter Tags to see if I could automate my way out of some things regarding tracking which sys admins are responsible for a given VM.

Tagging and creating tags manually is not really my cup of tea (we have 1000+ VMs, 40+ sys admins and even more people beyond that who could be tagged), so some automation would be required.

Next, pre-creating all the tags was not something I would enjoy either, as maintaining the list would suck in my opinion. Also, tags are local to each vCenter, so if you, like us, have more than one vCenter, then propagating tags to the other vCenters is also something to keep in mind.

I added a bunch of small functions to my script collection to fix some things. The first thing I ran into was “How do I find which vCenter a given VM object came from?”. Luckily the “-Server” option on most cmdlets accepts the vCenter server name as a string and not just the connection object, so the following will get the vCenter of a given object by splitting the Uid attribute:

$object.Uid.Split("@")[1].Split(":")[0]

Splitting at “@” and taking the second part removes the initial part of the string, so it now starts with the FQDN followed by more information. Then splitting at the “:” just before the port number and taking the first part results in the FQDN of the vCenter. This may not work in all cases, but it works for our purpose.
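
Wrapped up as one of the small helper functions mentioned above, it could look something like this (the function name is just an illustration, not something PowerCLI provides):

# Sketch: return the vCenter FQDN that a PowerCLI object was retrieved from,
# assuming the Uid follows the usual "/VIServer=domain\user@vcenter.fqdn:443/..." format.
function Get-ObjectvCenter {
    param(
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $Object
    )
    process {
        $Object.Uid.Split("@")[1].Split(":")[0]
    }
}

# Example usage:
# Get-VM "SomeVM" | Get-ObjectvCenter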

Now, I needed this in my script because I was running into the problem of finding the correct Tag object to use with a given VM object in the “New-TagAssignment” cmdlet. However, it dawned on me that if I just make sure that the tag is present on all vCenter servers when I call “New-TagAssignment”, I don’t need the Tag object, just the name, and PowerCLI/vCenter will do its magic. Thus the following works perfectly:

$VM | New-TagAssignment "<TAGNAME>"
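
To make sure the tag really is present on every vCenter before assigning it, something along these lines could be used. This is only a sketch: the “SysAdmin” category name is made up for the example, and it assumes you are already connected to all your vCenter servers:

# Create the tag category and tag on every connected vCenter where they are missing.
foreach ($vc in $global:DefaultVIServers) {
    if (-not (Get-TagCategory -Name "SysAdmin" -Server $vc -ErrorAction SilentlyContinue)) {
        New-TagCategory -Name "SysAdmin" -Cardinality Multiple -EntityType VirtualMachine -Server $vc | Out-Null
    }
    if (-not (Get-Tag -Name "<TAGNAME>" -Category "SysAdmin" -Server $vc -ErrorAction SilentlyContinue)) {
        New-Tag -Name "<TAGNAME>" -Category "SysAdmin" -Server $vc | Out-Null
    }
}

# With the tag present everywhere, the assignment by name just works:
$VM | New-TagAssignment "<TAGNAME>"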

But in any case I now have a way of finding the vCenter name of a given vSphere object in PowerCLI 🙂

 

ffmpeg one-liner(s)

Hello there. I expect this to be one of the first posts that I will continue to update, mostly for my own reference. I have been in the process of converting some old video files for better support for Chromecast/DLNA and generally for my own streaming purposes.

One of the first problems I ran into was combining old files without re-encoding them. So I looked at the old trusty ffmpeg to do the job. Below I will, over time, add a list of ffmpeg one-liners:

Combine two .avi files and copy codecs:

ffmpeg -i "concat:part1.avi|part2.avi" -c copy complete.avi

 

Sysblog.dk is now a thing

Finally! I have had this domain on hand for almost a year with the intention of moving my virtualization blog to it. The name came to me one day and to my luck the domain was available!

I plan on continuing writing about virtualization and the things I work on at Aalborg University but also all the other interesting stuff I stumble upon in my daily life.

A few topics are already lined up regarding some Windows issues I encountered when I got my new laptop at work. So stay tuned!

vCenter Orchestrator 5.5.2.1 and SSO behind a load balancer

If you, like us in my organization, are crazy about HA solutions, you have probably looked at putting SSO behind a load balancer. This may look like a daunting task and troubleshooting may not be easy. But hey, we have an SSO server in each of our two sites maintaining the SSO service across the entire platform 🙂 Hurray!

Now, this is not something I just recently configured. Reconfiguring an existing environment to point at a new SSO server seems like something I would avoid – easier to just move ESXi hosts to new vCenter servers in a new setup. No, we were among the first movers on SSO in HA. We installed vSphere 5.1 and configured SSO for HA as described in KB2034157. vSphere 5.1 SSO had a host of other problems, so only 4 months after installing vSphere 5.1 and moving production to this setup we upgraded to vSphere 5.5, and VMware was spot on with new documentation, as there were major changes in SSO and some URLs changed, such as the /sts URL. For people like us the reconfiguration of the load balancer was described in KB2058838. Easy!

We have now been running this setup for about 12 months and it has been working well for us. Upgrading has been a bit tricky, but having only applied vCenter patches twice in the period this was okay. We are running vCenter 5.5 U2b today, so we have access to the new VMRC client for when Chrome stops supporting NPAPI.

But here is where the title of the post comes in. I recently (well, it is almost two months ago now) upgraded our vCenter Orchestrator with the latest security patches, pushing us to 5.5.2.1. Following this a problem occurred: I could no longer log in via the client! This is a pretty serious problem. So I started debugging. Tried re-registering with the SSO. No problem. Test login in the configuration interface -> works. Login via the client still fails. What the hell?

I then started browsing the vCO server.log file and looking at what happened when logins failed. Here is what I found – 3 of these for every login:

2014-11-27 09:52:33.716+0100 [http-bio-0.0.0.0-8281-exec-5] WARN {} [RestTemplate] GET request for "https://<sso-lb-fqdn>:7444//websso/SAML2/Metadata/vsphere.local" resulted in 404 (Not Found); invoking error handler
2014-11-27 09:52:33.717+0100 [http-bio-0.0.0.0-8281-exec-5] WARN {} [RetriableOperation] Exception handled during retry operation with message: 404 Not Found
2014-11-27 09:52:33.717+0100 [http-bio-0.0.0.0-8281-exec-5] INFO {} [RetriableOperation] Retries left: [2]. Sleeping for [3] seconds before the next retry attempt.

Now, these indicate that vCO cannot talk to the SSO. Well, I just re-registered it and tested the login? How could this be? At this point I started a support case with VMware. After over a month of back and forth, support started looking into why there was a double slash “//” after the port number, thinking that the SSO registration was somehow wrong. At the same point I realized something. Looking at the URL, the vCO server was using a different URL than the one configured in the configuration interface. What? And thinking back to the load balancer configuration, I quickly realized that the problem was as simple as the /websso URL that vCO uses when logging in via the client not being allowed through the load balancer configuration that VMware provided in the KB above. At some point between the vSphere 5.5 release and now, some products (including vCAC/vRA) started using /websso instead of /sts.

From here I spent about two weeks asking VMware how I should configure this, without getting real answers. Finally, last week I got a paper describing how to configure an F5 load balancer for SSO when using vCAC. Now, this would have been good if I could reverse engineer the approach the F5 load balancer was using and configure that in the Apache load balancer. But no, those two configurations are completely different. So I decided to test something very simple: copy the configuration block for /sts and rename everything to /websso. And guess what. So far it works! Here is how it looks:

###################################################################################
# Configure the websso for clustering

ProxyPass /websso/ balancer://webssocluster/ nofailover=On
ProxyPassReverse /websso/ balancer://webssocluster/

Header add Set-Cookie "ROUTEID=.%{BALANCER_WORKER_ROUTE}e; path=/websso" env=BALANCER_ROUTE_CHANGED
<Proxy balancer://webssocluster>
 BalancerMember https://<sso-node-1-fqdn>:7444/websso route=node1 loadfactor=100
 BalancerMember https://<sso-node-2-fqdn>:7444/websso route=node2 loadfactor=1
 ProxySet lbmethod=byrequests stickysession=ROUTEID
</Proxy>

vCAC: Playing Around

I have for a while wanted to write a bit about vCloud Automation Center (vCAC) or, in case you haven’t heard, vRealize Automation (vRA) as it is being rebranded. My organization upgraded to vCloud Suite Standard licenses 2 years ago during the promo, so we have access to the Standard edition of vCAC.

 

So what would it be able to do for us? We have been looking at providing some kind of private cloud solution to users, but the form and shape of this has yet to be defined. So for the moment I am just testing out what I can do, how it is done, how to configure access and the like. So far I am impressed, but still a bit confused about tenants, fabric and business groups, reservations, entitlements, catalogs etc.

 

What I have learned so far from fooling around is:

  1. Native AD support is only available on the default “vsphere.local” tenant for some reason. This means that if you want to use a different tenant for your AD, you need to define each domain in your AD forest to be able to use users from all domains. A bit impractical.
  2. Renaming a cluster under vCenter that has been added as a compute resource in vCAC causes odd problems. Suddenly my tests were failing with “CloneVM: Object reference is not set to an instance of an object”, and I was unable to find anything else. Thinking my template had moved, I started data collection on the compute resource again. This failed with no error message to be found. Then I remembered that I had renamed the cluster. The solution was to remove the reservations and then the cluster from the Fabric Group, add it back under the new name, recreate the reservation, and then I was able to go again 🙂
  3. Adding users to a business group as users does not give them access to the actual resources. You need to entitle them, which to me seems a bit redundant. Perhaps I have not yet seen the light on why this is smart.

 

I will hopefully write more about this at a later time. My next goal is to get vCO running as an endpoint, but due to some odd login problems with vCO 5.5.2.1 I can’t use my vCO at the moment!

Microsoft NLB and the consequences

Hello All

I am not usually one to bash certain pieces of technology over others, at least not in public. I know which things I prefer and which I avoid. But after having spent the better part of a work day cleaning up after Microsoft Network Load Balancing (NLB), I have to say that I am not amused!

We are currently working on decommissioning an old HP switched network and moving all the involved VMs and hosts to our new Nexus infrastructure. This is a long process, at least when you want to minimize downtime, and the two switching infrastructures are very different. Now, I am a virtualization administrator with responsibility for a lot of physical hardware as well, so for the last month or two I have been planning this and the coming weeks’ work of moving from the old infrastructure to the new.

Everything was ready: a layer 2 connection was established between the infrastructures, allowing seamless migration between them, interrupted only by the path change and, for the physical machines, unplugging a cable and reconnecting it to a new one. No IP address changes, no gateway changes. Just changing uplinks. And it worked – a VM would resume its connection when it moved to a host with the new uplink. Perfect!

Then disaster struck. Our Exchange setup started creaking and within 20 minutes had ground to a halt. Something was wrong, but only on the client access layer. We quickly realized that the problem was that one of the 4 nodes in the NLB cluster running the CAS service had been moved to the new infrastructure. I hadn’t noticed it because they all still responded to ping and RDP, but the NLB cluster was broken.

The reason: we use NLB in multicast mode. That means that on our old Catalyst switches/routers we had a special configuration that converted the unicast cluster IP to a multicast MAC that was sent in the direction of the old infrastructure. This is a static configuration, so when we started changing the locations of the CAS servers, it broke. Hard! Within an hour we had stabilized by moving two of the 4 nodes together onto the same ESXi host on the new network and changing the static configuration on the Catalyst switch. But that left two nodes on the old HP networks unable to help take the load.
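
For reference, the “special configuration” is roughly the kind of static ARP and MAC entries shown below. This is only a sketch with made-up addresses, VLAN and interface names, and the exact commands vary with switch platform and IOS version, but it illustrates why the setup is static: the multicast MAC is tied to specific ports by hand.

! Map the NLB cluster IP to its multicast cluster MAC (for NLB multicast mode: 03bf + cluster IP in hex).
arp 10.100.1.99 03bf.0a64.0163 ARPA
! Statically point the multicast MAC at the ports/uplinks where the NLB nodes live.
mac address-table static 03bf.0a64.0163 vlan 100 interface GigabitEthernet1/0/1 GigabitEthernet1/0/2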

We have spent the entire morning planning what to do and how to test it. None of us had thought of NLB as a problem, but had we remembered this static multicast MAC configuration we might have avoided this.

My takeaway from this: avoid special configurations. Keep it as standard as possible. If you need to configure something in a non-standard way, you should stop and reconsider whether you are doing it correctly.