vCenter Orchestrator 5.5.2.1 and SSO behind a load balancer

If, like my organization, you are crazy about HA solutions, you have probably looked at putting SSO behind a load balancer. This may look like a daunting task, and troubleshooting may not be easy. But hey, we have an SSO server in each of our two sites, and together they maintain the SSO service across the entire platform 🙂 Hurray!

Now this is not something I just recently configured. Just reconfiguring an existing environment to use a new SSO server seems like something I would avoid; it is easier to move the ESXi hosts to new vCenter servers in a new setup. No, we were among the first movers on SSO in HA. We installed vSphere 5.1 and configured SSO for HA as described in KB2034157. vSphere 5.1 SSO had a host of other problems, so only 4 months after installing vSphere 5.1 and moving production to this setup we upgraded to vSphere 5.5. VMware was spot on with new documentation, as there were major changes in SSO and some URLs changed, such as the /sts URL. For people like us, the reconfiguration of the load balancer was described in KB2058838. Easy!

We have now been running this setup for about 12 months and it has been working well for us. Upgrading has been a bit tricky, but since we have only applied vCenter patches twice in that period this was okay. We are running vCenter 5.5 U2b today, so we have access to the new VMRC client for when Chrome stops supporting NPAPI.

But here is where the title of the post comes in. I recently (well, it is almost two months ago now) upgraded our vCenter Orchestrator with the latest security patches, pushing us to 5.5.2.1. After this a problem occurred: I could no longer log in via the client! This is a pretty serious problem, so I started debugging. Tried re-registering with the SSO. No problem. Test login in the configuration interface -> works. Login via the client still fails. What the hell?

I then started browsing the vCO server.log file to see what happened when logins failed. Here is what I found: three of these for every login attempt:

2014-11-27 09:52:33.716+0100 [http-bio-0.0.0.0-8281-exec-5] WARN {} [RestTemplate] GET request for "https://<sso-lb-fqdn>:7444//websso/SAML2/Metadata/vsphere.local" resulted in 404 (Not Found); invoking error handler
2014-11-27 09:52:33.717+0100 [http-bio-0.0.0.0-8281-exec-5] WARN {} [RetriableOperation] Exception handled during retry operation with message: 404 Not Found
2014-11-27 09:52:33.717+0100 [http-bio-0.0.0.0-8281-exec-5] INFO {} [RetriableOperation] Retries left: [2]. Sleeping for [3] seconds before the next retry attempt.

Now these indicate that vCO cannot talk to the SSO. But I had just re-registered it and tested login, so how could this be? At this point I opened a support case with VMware. After more than a month of back and forth, support started looking into why there was a double slash “//” after the port number, thinking that the SSO registration was somehow wrong. At the same point I realized something. Looking at the URL, the vCO server was using a different URL than the one configured in the configuration interface. What? Thinking back to the load balancer configuration, I quickly realized that the problem was as simple as this: the /websso URL that vCO uses when logging in via the client was not allowed through the load balancer configuration from the VMware KB mentioned above. At some point between the vSphere 5.5 release and now, some products (including vCAC/vRA) started using /websso instead of /sts.
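A quick way to see which SSO paths vCO is actually requesting is to tally the failed RestTemplate requests in server.log. The following is only a minimal Python sketch, assuming the warning lines look exactly like the excerpt above; the function name failed_requests is my own and not part of vCO.

```python
import re
from collections import Counter

# Matches the RestTemplate warning format seen in the vCO server.log excerpt:
#   [RestTemplate] GET request for "<url>" resulted in 404 (Not Found); ...
# Assumption: other vCO versions may format this line differently.
PATTERN = re.compile(
    r'\[RestTemplate\] (\w+) request for "([^"]+)" resulted in (\d{3})'
)


def failed_requests(lines):
    """Count (status, url) pairs for requests that returned a 4xx or 5xx status."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match and match.group(3)[0] in "45":
            counts[(int(match.group(3)), match.group(2))] += 1
    return counts
```

Feeding it the open log file, for example `failed_requests(open("server.log", errors="replace"))`, gives a counter keyed by status and URL, which makes a path like /websso stand out next to the /sts path you expected.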

From here I spent about two weeks asking VMware how I should configure this, without getting real answers. Finally, last week I got a paper describing how to configure an F5 load balancer for SSO when using vCAC. Now this would have been good if I could have reverse engineered the approach the F5 load balancer was using and configured that in the Apache load balancer. But no, those two configurations are completely different. So I decided to test something very simple: copy the configuration block for /sts and rename everything to /websso. And guess what? So far it works! Here is how it looks:

###################################################################################
# Configure the websso for clustering

ProxyPass /websso/ balancer://webssocluster/ nofailover=On
ProxyPassReverse /websso/ balancer://webssocluster/

Header add Set-Cookie "ROUTEID=.%{BALANCER_WORKER_ROUTE}e; path=/websso" env=BALANCER_ROUTE_CHANGED
<Proxy balancer://webssocluster>
 BalancerMember https://<sso-node-1-fqdn>:7444/websso route=node1 loadfactor=100
 BalancerMember https://<sso-node-2-fqdn>:7444/websso route=node2 loadfactor=1
 ProxySet lbmethod=byrequests stickysession=ROUTEID
</Proxy>
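To check that the new block really lets /websso through, the metadata URL from the log can be requested both via the load balancer and directly on each node. This is a rough Python sketch under a few assumptions: the host names are placeholders you fill in yourself, the helper names metadata_url and probe are mine, and certificate verification is switched off only because the SSO nodes may use self-signed certificates.

```python
import ssl
import urllib.error
import urllib.request


def metadata_url(host, port=7444, tenant="vsphere.local"):
    """Build the websso SAML metadata URL that vCO requested in the log."""
    return f"https://{host}:{port}/websso/SAML2/Metadata/{tenant}"


def probe(url, timeout=10):
    """Return the HTTP status code for a GET on url (for example 200 or 404)."""
    # Assumption: self-signed certificates, so verification is disabled.
    # Do not do this where the certificates can be validated properly.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is what we want.
        return err.code
```

Calling `probe(metadata_url(host))` for the load balancer FQDN and for each SSO node should give the same status everywhere. A 404 from the load balancer but not from the nodes points at the proxy configuration rather than at SSO itself.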