Microsoft NLB and the consequences

Hello All

I am not usually one to bash certain pieces of technology over others, at least not in public. I know which things I prefer and which I avoid. But after having spent the better part of a work day cleaning up after Microsoft Network Load Balancing (NLB), I have to say that I am not amused!

We are currently working on deprecating an old HP switched network and moving all the involved VMs and hosts to our new Nexus infrastructure. This is a long process, at least when you want to minimize downtime, and the two switching infrastructures are very different. I am a virtualization administrator with responsibility for a lot of physical hardware as well, so for the last month or two I have been planning this week's and next week's work of moving from the old infrastructure to the new.

Everything was ready: a layer 2 connection was established between the infrastructures, allowing seamless migration between them, interrupted only by the path change and, for the physical machines, the actual unplugging of a cable to be reconnected to the new switch. No IP address changes, no gateway changes. Just changing uplinks. And it worked; a VM would resume its connections when it moved to a host with the new uplink. Perfect!

Then disaster struck. Our Exchange setup started creaking and within 20 minutes had ground to a halt. Something was wrong, but only on the client access layer. We quickly realized that the problem was that one of the 4 nodes in the NLB cluster running the CAS service had been moved to the new infrastructure. I hadn't noticed it because they all still responded to ping and RDP, but the NLB cluster was broken.

The reason: we use NLB in multicast mode. That meant that on our old Catalyst switches/routers we had a special configuration: a static ARP entry resolving the cluster's unicast IP to its multicast MAC address, with that MAC statically pointed in the direction of the old infrastructure. Because this configuration is static, it broke as soon as we started changing the locations of the CAS servers. Hard! Within an hour we had stabilized by moving two of the 4 nodes onto the same ESXi host on the new network and changing the static configuration on the Catalyst switch. But that left two nodes on the old HP network unable to help take the load.
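For the curious, the special configuration looked roughly like the lines below. This is only a sketch with made-up values (cluster IP, MAC, VLAN and interface are placeholders, and the exact command syntax varies with switch platform and IOS version); the point is the static ARP entry resolving the cluster IP to its multicast MAC, plus the static MAC table entry pinning that MAC to a specific port:

  ! Hypothetical example: NLB cluster IP 10.1.1.10 on VLAN 100, multicast MAC 03bf.0a01.010a
  arp 10.1.1.10 03bf.0a01.010a ARPA
  mac address-table static 03bf.0a01.010a vlan 100 interface GigabitEthernet1/0/1

When the cluster nodes moved to ports behind the new infrastructure, that last line kept steering the traffic towards the old one.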

We have spent the entire morning planning what to do and how to test it. None of us had thought of NLB as a problem, but had we remembered this static multicast MAC configuration we might have avoided this.

My takeaway from this: avoid special configurations. Keep it as standard as possible. If you need to configure something in a custom way, you should stop and reconsider whether you are doing it correctly.

Veeam NFC stream error and missing symlinks

Today my colleague, who handles our Veeam installation, was diagnosing an error we were seeing sporadically. The error was this (names removed):

Error: Client error: NFC storage connection is unavailable. Storage: [stg:datastore-xxxxx,nfchost:host-xxxx,conn:vcenter.server.fqdn]. Storage display name: [DatastoreName].

Failed to create NFC download stream. NFC path: [nfc://conn:vcenter.server.fqdn,nfchost:host-xxx,stg:datastore-xxxxx@VMNAME/VMNAME.vmx].

Now this error indicates that it failed to get a connection to the host via the NFC stream (port 902). Or so I thought. We had seen sporadic problems with vCenter heartbeats over the same port, so that was what we expected. It turns out that some of the hosts in the cluster were missing the "datastore symlink" in /vmfs/volumes.
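Before we found the real cause, the quickest way to rule port 902 in or out was simply to test it from the backup server towards each host. A minimal PowerShell sketch (the host names are placeholders):

  # Test the NFC port (TCP 902) from the backup server to each host
  "esx01.example.local", "esx02.example.local" | ForEach-Object {
      Test-NetConnection -ComputerName $_ -Port 902 | Select-Object ComputerName, TcpTestSucceeded
  }

In our case the port was fine, which is what pointed us towards the hosts themselves.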

When running "ls -1a /vmfs/volumes" the result was not the same on each host: 4 of the 8 hosts were missing a symlink and two others had a wrongly named symlink. I recalled that when I was creating the datastores I used PowerCLI to rename them several times in rapid succession, because my script had slight errors when constructing the correct datastore names. It seems that this left some of the datastores on some hosts either with no symlink or with a wrongly named one.
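If I ever have to do a round of renames like that again, I will do them one at a time and give vCenter a moment in between. A rough PowerCLI sketch (the name mapping is obviously made up):

  # Rename datastores one at a time instead of hammering vCenter with rapid changes
  $renames = @{ "old-datastore-01" = "CL01-SAN-LUN01"; "old-datastore-02" = "CL01-SAN-LUN02" }
  foreach ($pair in $renames.GetEnumerator()) {
      Get-Datastore -Name $pair.Key | Set-Datastore -Name $pair.Value
      Start-Sleep -Seconds 30   # let vCenter and the hosts settle before the next rename
  }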

Fortunately the fix is easy:

  1. Enter Maintenance Mode
  2. Reboot host
  3. ?????
  4. Profit

That is it! 🙂
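If you would rather do the maintenance mode dance from PowerCLI than from the client, a minimal sketch could look like this (the host name is a placeholder, and -Evacuate assumes a DRS cluster that can take the VMs):

  # Put the host in maintenance mode, reboot it, then bring it back
  $esx = Get-VMHost -Name "esx05.example.local"
  Set-VMHost -VMHost $esx -State Maintenance -Evacuate | Out-Null
  Restart-VMHost -VMHost $esx -Confirm:$false
  # ...and once the host is back up:
  Set-VMHost -VMHost $esx -State Connected | Out-Null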

ESXi disconnecting right after connecting

This morning was just one of those mornings! Got up, checked my phone, 22 alerts from vCops. Damnit 🙁

So I got into work and could see that the alerts were centered on two server rooms located close to each other, sharing the same uplinks to our core network. I suspected a network error at first. I talked to a guy in networking: "Oh, didn't you know? There was a planned power outage in that area." Well, that explains it. Debugging further showed that only the network equipment was affected by the power interruption; servers and storage continued to run.

So I suspected that the reason the hosts were still not responding was that they had been without network for 4-6 hours. I chose to reconnect them, which worked... at first. Immediately after connecting, the hosts disconnected again. This happened with all of them. Strange.

I then remembered a KB article I had seen a while back: ESXi/ESX host disconnects from vCenter Server 60 seconds after connecting

Ah, so port 902 might be blocked. Checked – nope, open on both TCP and UDP. Hmm. Perhaps I needed to restart the management agents, but SSH was disabled. So I connected the old C# client directly to one of the hosts and enabled SSH. Still no luck; the network was filtering port 22. Beginning to panic a bit, PowerCLI came to mind. Perhaps there is a way to restart the management agents from PowerCLI.

There is! Here. Not all of them, though, as far as I could tell. So I tried restarting the vpxa service, which luckily worked.
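For reference, restarting vpxa from PowerCLI looks roughly like this; since vCenter could not manage the hosts at that point, you have to connect to the host directly (host name and credentials are placeholders):

  # Connect straight to the host and restart the vpxa management agent
  Connect-VIServer -Server "esx01.example.local" -User root -Password "********"
  Get-VMHostService -VMHost "esx01.example.local" | Where-Object { $_.Key -eq "vpxa" } | Restart-VMHostService -Confirm:$false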

So a lot of configuration cleanup is in order now. Personal todo: 1) allow SSH from the management networks to the hosts, 2) fix/get access to iLO/DRAC/IPMI on the hosts, and 3) get answers as to why I was not informed about the planned power work. And a bonus item: figure out why all 8 hosts, spread across 3 clusters, have access to a 20 MB LUN that no one knows about, and why, while the vpxa service was down on 5 hosts, two hosts complained that they had lost access to that specific LUN.
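The ACL part of item 1 belongs to the network team, but making sure the SSH service itself is running and starts with the host is easy to script. A quick PowerCLI sketch, assuming a connection to vCenter:

  # Start the SSH service on every host and make it start automatically with the host
  Get-VMHost | Get-VMHostService | Where-Object { $_.Key -eq "TSM-SSH" } | ForEach-Object {
      Start-VMHostService -HostService $_ -Confirm:$false | Out-Null
      Set-VMHostService -HostService $_ -Policy On | Out-Null
  }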

Work work work.

Update on Updating WordPress and News

A while back I wrote a short article about how to update WordPress when your FTP server does not accept passive FTP connections. Since then I had not tried to update anything, until now. And the fix didn't work. Or rather it did, but there was still a permission problem on the files. Even chmod 777 on the files didn't fix it for some reason.

So I started looking for a solution and found this instead:

http://www.hongkiat.com/blog/update-wordpress-without-ftp/

It describes how to switch the update method to direct instead of FTP. This fix worked better from the start and gave me more direct error messages, so I now knew that a chmod 777 would fix my problem while updating. So: chmod -R 777 on the WordPress root dir while updating, then set the permissions back afterwards, and voilà, my installation was updated. Perfect!
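For reference, the switch to the direct method boils down to a single line in wp-config.php. With this set, WordPress writes files directly as the web server user instead of going through FTP, which is why the permissions on the files matter:

  /* wp-config.php: use the direct filesystem method instead of FTP for updates */
  define('FS_METHOD', 'direct');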

And now to the news part. I haven't been very active on this blog lately. My company has been dealing with budget cuts and organizational changes in this newly formed IT unit; not the best of starts within a year of its creation. That has put a hold on many new projects. I have been working on consolidating older vSphere setups onto new hardware and software, which isn't very interesting work, but it has to be done. Sometime next week I will hopefully write a post about how to handle CBT errors after migrating from one storage array to another. We had one of the old setups where 2 machines were failing in odd ways. More on that later.

In other news I finally settled on a better name for the blog and will be implementing it as soon as possible. Be prepared, I think it is quite awesome 🙂

Follow-up: Upgrading SSO 5.5

I wanted to do a follow-up on my previous post about upgrading SSO 5.5 and cover the resolution of the undesirable situation we had ended up in, but first I'd like to cover the process leading up to the problematic upgrade.

Preparation phase:

We started preparing shortly after getting wind that vSphere 5.5 was going to be released and that the SSO had been reworked to work better with AD. We had submitted 3-4 bugs related to SSO and AD (we run a single forest with multiple subdomains in a parent-child trust, imagine the problems we were having!), and the resolution to most of them soon became "upgrade to vSphere 5.5". So we started researching the upgrade. Having an HA setup with a load balancer in front, and the goal of doing this to our production environment, meant that we read a lot of documentation and KB articles as well as talking to VMware professionals.

At VMworld I talked personally with three different VMware employees about the process and was told every time that it was possible to upgrade the SSO and Web Client Server components and leave the Inventory Service and vCenter services for later.

So after having read a lot of this we found some issues in the documentation that made us contact VMware directly for a clarification. First off, reading through the upgrade documentation we stumbled upon this phrase:

[Image: DocumentationError1 – excerpt from the upgrade documentation]

As we read it, we had to enter the load-balanced hostname when upgrading the second node in the HA cluster, which seemed illogical. This also differed from KB2058239, which states the following:

[Image: DocumentationError2 – excerpt from KB2058239]

So we contacted VMware support and got a clarification. The following response was given by email and later confirmed by phone:

In my scenario I was not able to use Load balancer’s DNS name and gave me a certificate error and the installation, however the installation went through by providing the DNS name of the primary node. This is being validated by my senior engineer and I will contact you tomorrow with any further update from my end.

We were still a bit unsure of the process and decided to revive the old test environment for the vSphere 5.1 configuration that was running at the time.

Testing phase:

It took a while to get it running again, as it was being redeployed from an OVA that could not be deployed directly on our test cluster (the virtual hardware version was too new), so it first had to be converted to a VMX and then run through VMware vCenter Converter Standalone to downgrade the hardware version. This worked fine and the machines booted. However, the SSO was not running. So back to figuring that out. As it turns out, when the hardware of the machine changed, its machine ID changed too. So we had to run the following command on both nodes and restart the services, as described in this KB:

rsautil manage-secrets -a recover -m <masterPassword>

And presto, it was running again! We then performed the upgrade of the SSO and Web Client. But as we had no indication that the vCenter services were going to be affected by this, no testing of them was done (we had, after all, been told by several different people at this point that this was not a problem). The upgrade went mostly smoothly, with only two warnings: 1) Warning 25000 – an SSL error that should be ignorable, and 2) a warning that the OS had pending updates that needed to be installed before proceeding.

But we ran into the AD problem that I described in the previous post linked at the start. The SSO was contacting a domain controller, presumably on port 3268, to get a Global Catalog (GC) connection, and this was failing as the server it was talking to was not running that service. We got a solution after a week, but it seems the problem had temporarily solved itself before the fix was given by VMware. So over the phone I agreed with VMware technical support that it was safe to proceed on our production environment, and that if we got the same error again we should open a severity 1 support request as soon as possible.
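A quick sanity check I will be keeping around for next time is to verify that the domain controller the SSO talks to actually answers on the Global Catalog port. A small PowerShell sketch (the DC name is a placeholder):

  # Check that the DC listens on the Global Catalog port (TCP 3268)
  Test-NetConnection -ComputerName "dc01.example.com" -Port 3268 | Select-Object ComputerName, TcpTestSucceeded
  # nltest /dsgetdc:example.com /gc can also be used to locate a DC that actually is a GC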

Launch in 3. 2. 1…..:

Following the above phases we proceeded with the upgrade. We scheduled a maintenance window of about 3 hours (it took 1 hour in the test environment). On the day we prepared by snapshotting the SSO and Web Client servers (5 in total), took backups of the RSA database just in case, and then proceeded. Everything seemed to work. The first node was successfully upgraded and then the second. After upgrading we reconfigured the load balancer as described in KB2058838, and it seemed to just work out of the box. Until we had to update the service endpoints. The first went well, as did the second. But the third was giving a very weird error: a name was missing. Yet everything was there; the file was identical to the others except for the text identifying the groupcheck service. Then we saw it: copy-paste had struck again, as we had missed a single "[" at the beginning of the file, rendering the first block unparsable. Quick fix!

The first Web Client server was then upgraded so we could check everything, and then we saw it: the vCenters were not connecting to the lookup service, so the Web Client was not seeing them! Darn it! We hadn't tested this. We figured that restarting the vCenter service would solve the problem. It didn't. In fact, it made things worse. As the service was starting it failed to connect to the SSO and decided to wipe all permissions! That is a really bad default behaviour! And the service didn't even start; it just shut down again.

Digging a bit, we found that the vCenter server had hard-coded the URL for the "ims" service in the vpxd.cfg file, and as this had changed from "ims" to "sts" it was of course not working. The KB said to update the URL, so we did. This helped a bit: the service now started, but it was still not connecting correctly. We could, however, now log in with admin@System-Domain via the old C# client (thank god that still works!). It was at this point we called VMware technical support, but too late. After getting a severity 1 case opened we were informed that we would get a callback within 4 business hours (business hours being 7am to 7pm GMT). At this point it was 5:45pm GMT. So we waited until 7pm, but unfortunately no call came. We went home and came back early the next morning. At about 7:45am the callback came; I grabbed it and was then on the phone for 3 hours straight.
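For context, the URL in question lives in the <sso> section of vpxd.cfg. From memory the change looked roughly like the lines below; the hostname is a placeholder and the exact paths should be taken from the KB for your version, so treat this as illustrative only:

  <!-- old 5.1-style STS endpoint (illustrative) -->
  <uri>https://sso.example.com:7444/ims/STSService</uri>
  <!-- new 5.5-style STS endpoint, roughly -->
  <uri>https://sso.example.com:7444/sts/STSService/vsphere.local</uri>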

The problem was that it is not possible to upgrade the SSO and Web Client server and leave vCenter and the Inventory Service behind; the last two needed to be upgraded as well, and the way to recover the situation was to upgrade the vCenter servers and their Inventory Services. This was a bit problematic for the vCenter that had been "corrupted" by the restarts and the attempts to fix the config files; on that server we ended up uninstalling and reinstalling the vCenter service. The other vCenter, which had not been touched, we simply upgraded without problems.

Aftermath:

When I hung up the phone with VMware technical support we were back online. Only the last Web Client server needed upgrading, which was easily done. But the one vCenter had lost all its permissions; we are still recovering from that, but as this happened just before the Christmas holidays only the critical permissions have been restored so far.

So after a thing like this one needs to lean back and evaluate how to avoid it in the future, so here are my lessons learned:

  1. Test, test, test, test! We should have tested the upgrade with the vCenter servers as well to have caught this in the testing phase!
  2. Upgrade the entire management layer. Instead of trying to keep the vCenters on 5.1 we should have just planned on upgrading the entire management layer.
  3. Maintenance in business hours. We set our maintenance window outside normal business hours, causing us to have to wait 12 hours for support from VMware.

WordPress management

Hello everyone

Today's topic is a bit different from usual, but to make sure I remember this small fix I decided to write a short blog post about it. I run a number of WordPress installations for various people, and recently I set up a new one on a web server behind a very nasty firewall. This firewall blocks passive FTP connections, so when trying to automatically update plugins and other things requiring FTP in WordPress, I hit a wall: after entering credentials the page would just hang at "loading". I think I spent about an hour debugging what was going on. Nothing in the web server logs, and logins were successful according to the FTP server logs. What was going on?

Then it struck me! The FTP server did not allow passive FTP connections to upload; they could connect but not upload. So was WordPress using passive FTP?

Turns out, yes it does. And it is hard-coded. Not very nice, WordPress! Following this post: http://wordpress.org/support/topic/define-ftp-active-mode-in-wp-config I found the file:

includes/class-wp-filesystem-ftpext.php

and it did indeed contain the line:

@ftp_pasv( $this->link, true );

which hard-codes the use of passive FTP. Now changing it to false solved my problem! It would be nice if this setting could be set via wp-config.php as other FTP options can.

Current problems with vSphere Single Sign On

Hello again

Today I’m going to highlight some of the bugs/problems we have run into with vSphere Single Sign On (SSO), introduced in vSphere 5.1.

Using Active Directory groups from a Multi-Domain Forest with Parent-Child trust:

We are using an Active Directory (AD) forest that consists of a single root domain and around 20 subdomains, all a single level down. That is, we have a root AD domain like example.com and directly below it all the other domains, mostly of the form department.example.com. All users exist in one of the subdomains; none are placed in the root domain, which contains only service accounts, groups and servers.

The SSO, as of version 5.1 U1b, only supports External Two-way trusts (see KB:2037410), so at first you would think it won't work at all. Yet you can try adding a group from AD and you are able to do so. Trying to authenticate with a user from that group, however, doesn't work. Adding the user itself instead of the group works. VMware has stated to us that this is a problem with the transitivity of Parent-Child trusts, which External Two-way trusts do not have. I suspect it is also related to another "bug" we have noticed, which I will explain below.

Workaround: use local SSO groups with AD users. This works just fine.

Users with same sAMAccountName in different domains in same forest cannot be used:

We discovered last week that we were unable to add users to permissions or SSO groups if the SSO already knew a user with the same sAMAccountName but in a different domain. VMware confirmed this as a bug, which will be fixed in an upcoming release. The bug only occurs with child domains in an AD forest.

Users are presented in the SSO with wrong domain when added:

This is a small bug, but when adding a user (or group) for the first time, the user is added as username@example.com rather than the correct username@department.example.com, even though the correct name is shown while adding and the correct domain is selected in the "Add user" dialog. The result is that the first time the user logs in, nothing is shown despite the user having permissions. Having the user log out and then removing and re-adding the user solves the problem, as if the SSO only figures out that the user belongs to the subdomain on first login. I suspect this has something to do with the first problem as well.

Cannot remove user from SSO group:

Despite KB:2037102 stating that the bug was solved in vSphere 5.1 U1a, we are experiencing it on this version as well. We have not reported it yet, but I am doing so tomorrow morning. For some users no error is shown but the user is not removed; for others, an error is shown saying the principal could not be removed. In the imsTrace log a Java exception shows that the principal cannot be removed because it does not exist in the group, even though the Web Client shows the user as a member.

This post may be updated in the future :)