Follow-up: Upgrading SSO 5.5

I wanted to do a follow-up on my previous post about upgrading SSO 5.5 and cover the resolution of the undesirable situation we had ended up in, but first I'd like to cover the process leading up to the problematic upgrade.

Preparation phase:

We started preparing shortly after getting wind that vSphere 5.5 was going to be released and that the SSO had been reworked to work better with AD. We had submitted 3-4 bugs related to SSO and AD (we run a single forest with multiple subdomains in a parent-child trust, imagine the problems we were having!), and the resolution to most of them soon became "upgrade to vSphere 5.5". So we started researching the upgrade. Having an HA setup with a load balancer in front, and the goal of doing this to our production environment, meant that we read a lot of documentation and KB articles as well as talking to VMware professionals.

At VMworld I talked personally with three different VMware employees about the process and was told every time that it was possible to upgrade the SSO and Web Client Server components and leave the Inventory Service and vCenter services for later.

So after having read a lot of this, we found some issues in the documentation that made us contact VMware directly for clarification. First off, reading through the upgrade documentation we stumbled upon this phrase:

[Screenshot: excerpt from the vSphere 5.5 upgrade documentation]

As we read it, we had to enter the load-balanced hostname when upgrading the second node in the HA cluster, which seemed illogical. This was also different from KB2058239, which states the following:

[Screenshot: excerpt from KB2058239]

So we contacted VMware support and got a clarification. The following response was given by email and later confirmed by phone:

In my scenario I was not able to use Load balancer’s DNS name and gave me a certificate error and the installation, however the installation went through by providing the DNS name of the primary node. This is being validated by my senior engineer and I will contact you tomorrow with any further update from my end.

We were still a bit unsure of the process and decided to revive the old test environment for the vSphere 5.1 configuration that was running at the time.

Testing phase:

It took a while to get it running again, as it was being redeployed from an OVA that could not be deployed directly on our test cluster (the virtual hardware version was too new), so it first had to be converted to a VMX and then run through VMware vCenter Converter Standalone to downgrade the hardware version. This worked fine and the machines booted. However, the SSO was not running, so back to figuring that out. As it turns out, when the hardware of the machine changed, its machine ID changed with it. So we had to run the following command on both nodes and restart the services as described in this KB:

rsautil manage-secrets -a recover -m <masterPassword>
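
Restarting the services afterwards is just a couple of commands on each Windows node; a minimal sketch, assuming the SSO service still carries the default 5.1 display name (verify against your own install before running):

net stop "vCenter Single Sign On"
net start "vCenter Single Sign On"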

And presto, it was running again! We then performed the upgrade of the SSO and Web Client. But as we had no indication that the vCenter services were going to be hit by this, no testing of them was done (we were, after all, told that this was no problem by several different people at this point). The upgrade went mostly smoothly, with only two warnings: 1) Warning 25000 – an SSL error which should be ignorable, and 2) the OS had pending updates that needed to be installed before proceeding.

But we ran into the AD problem that I described in the previous post linked at the start. The SSO was contacting a domain controller, presumably on port 3268, to get a Global Catalog connection, and this was failing as the server it was communicating with was not running that service. We got a solution after a week, but it seems the problem had temporarily solved itself before the fix was given by VMware. So over the phone I agreed with VMware technical support that it was safe to proceed with our production environment, and that if we got the same error again we should open a severity 1 support request as soon as possible.
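
If you suspect the same issue, a quick sanity check is to verify from the SSO server that the Global Catalog port actually answers on the DC it is talking to; a minimal sketch, with dc01.example.com as a stand-in for your own domain controller:

telnet dc01.example.com 3268

If the connection is refused, that DC is most likely not running the Global Catalog service.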

Launch in 3, 2, 1…:

Following the above phases we proceeded with the upgrade. We scheduled a maintenance window of about 3 hours (it took 1 hour in the test environment). On the day we prepared by snapshotting the SSO and Web Client servers (5 in total), took backups of the RSA database just in case, and then proceeded. Everything seemed to work. The first node was successfully upgraded, and then the second. After upgrading we reconfigured the load balancer as described in KB2058838, and it seemed to just work out of the box. Until we had to update the service endpoints. The first went well. As did the second. But the third was giving a very weird error: a name was missing. But everything was there; the file was identical to the others except for the text changed to identify the groupcheck service. Then we saw it: copypasta had struck again, as we had missed a single "[" at the beginning of the file, rendering the first block unparsable. Quick fix!
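
To illustrate what bit us: the service registration files you feed to the SSO lookup service tooling are split into bracketed blocks, roughly like the sketch below (the key names and values here are from memory and purely illustrative, so treat them as assumptions rather than a template). Lose the opening "[" on the first line and the whole first block becomes unparsable.

[service]
friendlyName=The group check interface of the SSO server
description=The group check interface of the SSO server
version=1.5
type=urn:sso:groupcheck

[endpoint0]
uri=https://sso.example.com:7444/sso-adminserver/sdk/vsphere.local
protocol=vmomi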

The first Web Client server was then upgraded to check everything, and then we saw it: the vCenters were not connecting to the lookup service, so the Web Client server was not seeing them! Darn it! We hadn't tested this. We figured that restarting the vCenter service would solve the problem. It didn't. In fact, it turned out to make things worse. As the service was starting, it failed to connect to the SSO and decided to wipe all permissions! That is a really bad default behaviour! And the service didn't even start; it just shut down again. Digging a bit, we found that the vCenter server had hard-coded the URL for the "ims" service in the vpxd.cfg file, and as this had changed from "ims" to "sts", it was of course not working. The KB said to update the URL, so we did. This helped a bit: the service was now starting, but not connecting correctly. We could, however, now log in with admin@System-Domain via the old C# client (thank god that still works!).

It was here we called VMware technical support, but too late. After getting a severity 1 case opened, we were informed that we would get a callback within 4 business hours (business hours being 7 AM to 7 PM GMT). At this point it was 5:45 PM GMT. So we waited until 7 PM, but no call unfortunately. So we went home and came back early the next morning. At about 7:45 AM the callback came; I grabbed it and was then on the phone for 3 hours straight.
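
For the curious, the piece of vpxd.cfg in question looked roughly like the fragment below. This is a hedged reconstruction from memory, with sso.example.com standing in for our load-balanced SSO name, not a verbatim copy of our config:

<sso>
  <sts>
    <!-- 5.1-style STS endpoint that vCenter had hard-coded -->
    <uri>https://sso.example.com:7444/ims/STSService?wsdl</uri>
    <!-- after the 5.5 upgrade it needs the new "sts" path instead, e.g.
         https://sso.example.com:7444/sts/STSService/vsphere.local -->
  </sts>
</sso>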

The problem was that it is not possible to upgrade the SSO and Web Client server and leave vCenter and the Inventory Service behind. The last two needed to be upgraded as well, and the process for recovering the situation was to upgrade the vCenter servers and their Inventory Services. This was a bit problematic for the vCenter that had been "corrupted" by the restart and our attempts at fixing the config files; we ended up uninstalling the vCenter service and reinstalling it on that server. The other vCenter, which had not been touched, we simply upgraded without problems.

Aftermath:

When I hung up the phone with VMware technical support we were back online. Only the last Web Client server needed upgrading, which was easily done. But the one vCenter had lost all permissions; we are still recuperating from that, but as this was just before the Christmas holidays, only critical permissions were restored.

So after a thing like this one needs to lean back and evaluate how to avoid it in the future, so here are the lessons I took from it:

  1. Test, test, test, test! We should have tested the upgrade against the vCenter servers as well, which would have caught this in the testing phase!
  2. Upgrade the entire management layer. Instead of trying to keep the vCenters on 5.1 we should have just planned on upgrading the entire management layer.
  3. Maintenance during business hours. We set our maintenance window outside normal business hours, which meant we had to wait 12 hours for support from VMware.