ESXi disconnecting right after connecting

This morning was just one of those mornings! Got up, checked my phone, 22 alerts from vCops. Damnit 🙁

So I got into work and could see that the alerts were centered around two server rooms located close to each other, using the same uplinks to our core network. I suspected a network error at first. Talked to a guy in networking: “Oh, didn’t you know? There was a planned power outage in that area”. Well, that explains it. Further debugging showed that only the network equipment was affected by the power interruption; the servers and storage had continued to run.

So I suspected that the reason the hosts were still not responding was that they had been without network for 4-6 hours. I chose to reconnect them, which worked... at first. Immediately after connecting, the hosts disconnected again. This happened with all of them. Strange.

I then remembered a KB article I had seen a while back: ESXi/ESX host disconnects from vCenter Server 60 seconds after connecting

Aah, so port 902 might be blocked. Checked – nope, open on both TCP and UDP. Hmm. Perhaps I needed to restart the management agents, but SSH was disabled. So I connected the old C# client directly to one of the hosts and enabled SSH. Still no luck; the network was filtering port 22. Beginning to panic a bit, PowerCLI came to mind: perhaps there is a way to restart the management agents from PowerCLI.

There is! See here. Not all of the agents can be restarted that way though, as far as I could tell, but I tried restarting the vpxa service, which luckily worked.
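For the curious, restarting vpxa from PowerCLI looks something like this (a sketch; the hostname is a placeholder and you will be prompted for the root password):

# Connect straight to the host, since vCenter keeps dropping it
Connect-VIServer -Server esx01.example.com -User root

# Find the vpxa (vCenter agent) service on the host and restart it
Get-VMHostService -VMHost esx01.example.com | Where-Object { $_.Key -eq 'vpxa' } | Restart-VMHostService -Confirm:$false

Restarting vpxa only restarts the agent that vCenter talks to; the running VMs are not affected.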

So a lot of cleanup of the configuration is in order now. Personal todo: 1) allow SSH from the management networks to the hosts, 2) fix/get access to iLO/DRAC/IPMI on the hosts, and 3) get an answer as to why I was not informed about the power work being done. And a bonus: figure out why all 8 hosts, spread across 3 clusters, have access to a 20 MB LUN that no one knows about, and why, while the vpxa service was down on 5 hosts, two hosts complained that they had lost access to that specific LUN.

Work work work.

Update on Updating WordPress and News

A while back I wrote a short article about how to update WordPress when your FTP server does not accept passive FTP connections. Since then I had not tried to update anything, until now. And the fix didn’t work. Or rather it did, but there was still a permission problem on the files. Even chmod 777 on the files didn’t fix it for some reason.

So I started looking for a solution and found this instead:

http://www.hongkiat.com/blog/update-wordpress-without-ftp/

It describes how to switch the update method to direct instead of FTP. This fix worked better from the start and gave me more direct error messages. So now I knew that a chmod 777 would fix my problem while updating. So: chmod -R 777 on the WordPress root dir while updating, then set the permissions back, and voilà, my installation was updated. Perfect!
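For reference, the change the article describes boils down to a single line in wp-config.php (placed above the “stop editing” comment); sketched here from memory:

// wp-config.php: write update files directly as the web server user instead of going through FTP
define( 'FS_METHOD', 'direct' );

With that in place WordPress skips the FTP credentials screen entirely.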

And now to the news part. I haven’t been very active lately with this blog. My company has been dealing with budget cuts and organizational changes in this newly formed IT unit – not the best of starts within a year of its creation. That has put a hold on many new projects. I have been working on consolidating older vSphere setups onto new hardware and software, which isn’t very interesting work, but it has to be done. Hopefully sometime next week I will write a post about how to handle CBT errors after migrating from one storage array to another; in one of the old setups two machines were failing in odd ways. More on that later.

In other news I finally settled on a better name for the blog and will be implementing it as soon as possible. Be prepared, I think it is quite awesome 🙂

Follow-up: Upgrading SSO 5.5

I wanted to do a follow-up on my previous post about Upgrading SSO 5.5 and cover the resolution of the undesirable situation we had ended up in, but first I’d like to cover the process leading up to the problematic upgrade.

Preparation phase:

We started preparing shortly after getting wind that vSphere 5.5 was going to be released and that the SSO had been reworked to work better with AD. We had submitted 3-4 bug reports related to SSO and AD (we run a single forest with multiple subdomains in a parent-child trust, imagine the problems we were having!) and the resolution to most of them was, in essence, to upgrade to vSphere 5.5. So we started researching the upgrade. Having an HA setup with a load balancer in front, and the goal of doing this in our production environment, meant that we read a lot of documentation and KB articles as well as talking to VMware professionals.

At VMworld I talked personally with three different VMware employees about the process and was told every time that it was possible to upgrade the SSO and Web Client Server components and leave the Inventory Service and vCenter services for later.

After reading a lot of this we found some issues in the documentation that made us contact VMware directly for clarification. First off, reading through the upgrade documentation we stumbled upon this phrase:

[Image: DocumentationError1 – excerpt from the vSphere 5.5 upgrade documentation]

As we read it, we had to enter the load-balanced hostname when upgrading the second node in the HA cluster, which seemed illogical. This also differed from KB2058239, which states the following:

[Image: DocumentationError2 – excerpt from KB2058239]

So we contacted VMware support and got a clarification. The following response was given by email and later confirmed by phone:

In my scenario I was not able to use Load balancer’s DNS name and gave me a certificate error and the installation, however the installation went through by providing the DNS name of the primary node. This is being validated by my senior engineer and I will contact you tomorrow with any further update from my end.

We were still a bit unsure of the process and decided to revive the old test environment for the vSphere 5.1 configuration that was running at the time.

Testing phase:

It took a while to get it running again, as it was being redeployed from an OVA that could not be deployed directly on our test cluster (the virtual hardware version was too new), so it first had to be converted to VMX format and then run through VMware vCenter Converter Standalone to downgrade the virtual hardware. This worked fine and the machines booted. However, the SSO was not running, so it was back to figuring that out. As it turns out, when the hardware of the machine changed, its machine ID changed as well. So we had to run the following command on both nodes and restart the services, as described in this KB:

rsautil manage-secrets -a recover -m <masterPassword>

And presto! It was running again. We then performed the upgrade of the SSO and Web Client. But as we had no indication that the vCenter services would be affected, no testing of them was done (we had, after all, been told by several different people at this point that this was not a problem). The upgrade went mostly smoothly, with only two warnings: 1) Warning 25000 – an SSL error which should be ignorable, and 2) that the OS had pending updates that needed to be installed before proceeding.

But we ran into the AD problem that I described in the previous post linked at the start. The SSO was contacting a domain controller, presumably on port 3268, to get a Global Catalog (GC) connection, and this was failing as the server it was communicating with was not running that service. We got a solution after a week, but it seems the problem had temporarily solved itself before the fix was given by VMware. So over the phone I agreed with VMware technical support that it was safe to proceed on our production environment, and that if we got the same error again we should file a severity 1 support request as soon as possible.
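If it comes back, the first thing to check would be whether the SSO server can even reach the Global Catalog port on the domain controller it has picked, for example with something like this from the SSO node (requires PowerShell 4 or later; the DC name is a placeholder):

# Is the domain controller answering on the Global Catalog port (3268)?
Test-NetConnection -ComputerName dc01.department.example.com -Port 3268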

Launch in 3. 2. 1…..:

Following the above phases we proceeded with the upgrade. We scheduled a maintenance window of about 3 hours (it took 1 hour in the test environment). On the day we prepared by snapshotting the SSO and Web Client servers (5 in total), took backups of the RSA database just in case, and then proceeded. Everything seemed to work: the first node was successfully upgraded and then the second. After upgrading we reconfigured the load balancer as described in KB2058838 and it seemed to just work out of the box. Until we had to update the service endpoints. The first went well, as did the second, but the third gave a very weird error: a name was missing. Yet everything was there; the file was identical to the others except for the text identifying the groupcheck service. Then we saw it: copypasta had struck again and we had missed a single “[” at the beginning of the file, rendering the first block unparsable. Quick fix!

The first Web Client server was then updated to check everything, and then we saw it: the vCenters were not connecting to the lookup service, so the Web Client server was not seeing them! Darn it! We hadn’t tested this. We figured that restarting the vCenter service would solve the problem. It didn’t. In fact it turned out to make it worse: as the service was starting it failed to connect to the SSO and decided to wipe all permissions! That is a really bad default behaviour! And the service didn’t even start, it just shut down again.

Digging a bit, we found that the vCenter server had hard-coded the URL for the “ims” service in the vpxd.cfg file, and as this had changed from “ims” to “sts” it was of course not working. The KB said to update the URL, so we did (sketched below). This helped a bit: the service was now starting but not connecting correctly. We could, however, now log in with admin@System-Domain via the old C# client (thank god that still works!).

It was here we called VMware technical support, but too late. After getting a severity 1 case opened we were informed that we would get a callback within 4 business hours (business hours being 7 AM to 7 PM GMT). At this point it was 5:45 PM GMT. We waited until 7 PM, but unfortunately no call came. So we went home and came back early the next morning. At about 7:45 AM the callback came; I grabbed it and was then on the phone for 3 hours straight.
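The vpxd.cfg edit mentioned above amounted to something like the sketch below. The install path is the assumed default and the plain string replacement is only illustrative; the exact new STS URL should come from the KB.

# Stop the VMware VirtualCenter Server service before touching the file
$cfg = 'C:\ProgramData\VMware\VMware VirtualCenter\vpxd.cfg'
Copy-Item $cfg "$cfg.bak"
# The SSO endpoint path changed from "ims" to "sts" in 5.5
(Get-Content $cfg) -replace '/ims/', '/sts/' | Set-Content $cfg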

The problem was that it is not possible to upgrade the SSO and Web Client server and leave vCenter and the Inventory Service on the old version. The last two needed to be upgraded as well, and the way to recover the situation was to upgrade the vCenter servers and their Inventory Services. This was a bit problematic for the vCenter that had been “corrupted” by the restart and our attempts to fix the config files; we ended up uninstalling the vCenter service and reinstalling it on that server. The other vCenter, which had not been touched, we simply upgraded without problems.

Aftermath:

When I hung up the phone with VMware technical support we were back online. Only the last Web Client server needed upgrading, which was easily done. But the one vCenter had lost all its permissions; we are still recovering from that, but as this was just before the Christmas holidays only critical permissions were restored.

After a thing like this one needs to lean back and evaluate how to avoid it in the future, so here are my lessons from this:

  1. Test, test, test, test! We should have tested the upgrade with the vCenter servers as well to have caught this in the testing phase!
  2. Upgrade the entire management layer. Instead of trying to keep the vCenters on 5.1 we should have just planned on upgrading the entire management layer.
  3. Maintenance in business hours. We set our maintenance window outside normal business hours, causing us to have to wait 12 hours for support from VMware.

WordPress management

Hello everyone

Today’s topic is a bit different than usual, but to make sure that I remember this small fix I decided to write a small blog post about it. I run a number of WordPress installations for various people, and recently I set up a new one on a web server behind a very nasty firewall. This firewall blocks passive FTP connections, so when trying to automatically update plugins and other things requiring FTP in WordPress, I hit a wall: after entering credentials the page would just hang at “loading”. I think I spent about an hour debugging what was going on. Nothing in the web server logs, and the login was successful according to the FTP server logs. What was going on?

Then it struck me! The FTP server did not allow passive FTP connections to upload; they could connect but not upload. So was WordPress using passive FTP?

Turns out, yes it does. And it is hard-coded. Not very nice, WordPress! Following this post: http://wordpress.org/support/topic/define-ftp-active-mode-in-wp-config I found the file:

includes/class-wp-filesystem-ftpext.php

and it indeed contained the line:

@ftp_pasv( $this->link, true );

which hard-codes the use of passive FTP. Changing it to false solved my problem! It would be nice if this setting could be set via wp-config.php, as other FTP options can.
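So the whole fix was flipping that one flag (keeping in mind this is a core file, so the next WordPress update will overwrite the change):

// class-wp-filesystem-ftpext.php: use active mode instead of the hard-coded passive mode
@ftp_pasv( $this->link, false );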

Current problems with vSphere Single Sign On

Hello again

Today I’m going to highlight some of the bugs/problems we have run into with vSphere Single Sign On (SSO), introduced in vSphere 5.1.

Using Active Directory groups from a Multi-Domain Forest with Parent-Child trust:

We are using an Active Directory (AD) forest that consists of a single root domain and around 20 subdomains, a single level down. That is, we have a root AD domain like example.com, and one level below it we have all the other domains, mostly of the form department.example.com. All users exist in one of the subdomains; none are placed in the root domain, which contains only service accounts, groups and servers.

The SSO, as of version 5.1 U1b, only supports external two-way trusts (see: KB:2037410), so at first you think it won’t work. You try adding a group from AD and you are able to do so, but trying to authenticate with a user from that group doesn’t work. Adding the user itself instead of the group works. VMware has stated to us that this is a problem with the transitivity of parent-child trusts, which does not exist in external two-way trusts. I suspect that this is also related to another “bug” that we have noticed, which I will explain below.

Workaround: use local SSO groups with AD users. This works just fine.

Users with the same sAMAccountName in different domains in the same forest cannot be used:

We discovered last week that we were unable to add users to permissions or SSO groups if the SSO already knew of a user with the same sAMAccountName but in a different domain. VMware confirmed this as a bug, which will be fixed in an upcoming release. The bug only occurs in child domains in an AD forest.

Users are presented in the SSO with the wrong domain when added:

This is a small bug, but when adding a user (or group) for the first time, the user is added as username@example.com rather than the correct username@department.example.com, even though the correct name is shown while adding and despite selecting the correct domain in the “Add user” dialog. The result is that the first time the user logs in nothing is shown, despite the user having permissions. Having the user log out and then removing and re-adding the user solves the problem, as if the SSO figures out that the user is in the subdomain on first login. I suspect this has something to do with the first problem as well.

Cannot remove user from SSO group:

Despite KB:2037102 stating that the bug was solved in vSphere 5.1 U1a, we are experiencing it on this version as well. We have not reported this yet, but I am doing so tomorrow morning. For some users no error is shown but the user is not removed; for others an error is shown saying that the principal could not be removed. In the imsTrace log a Java exception shows that the principal cannot be removed because it does not exist in the group, even though the Web Client shows the user.

 

This post may be updated in the future :)