Veeam NFC stream error and missing symlinks

Today my colleague, who handles our Veeam installation, was diagnosing an error we were seeing sporadically. The error was this (names removed):

Error: Client error: NFC storage connection is unavailable. Storage: [stg:datastore-xxxxx,nfchost:host-xxxx,conn:vcenter.server.fqdn]. Storage display name: [DatastoreName].

Failed to create NFC download stream. NFC path: [nfc://conn:vcenter.server.fqdn,nfchost:host-xxx,stg:datastore-xxxxx@VMNAME/VMNAME.vmx].

Now this error indicates that it failed to get a connection to the host via the NFC stream (port 902). Or so I thought. We have seen sporadic problems with vCenter heartbeats over the same port, so that was what we expected. It turns out that some of the hosts in the cluster were missing the “datastore symlink” in /vmfs/volumes.

When running “ls -1a /vmfs/volumes” the result was not the same on each host. 4 of 8 hosts were missing a symlink and two others had a wrongly named symlink. I recalled that when I was creating the datastores I used PowerCLI to change the datastore names several times in rapid succession, as my script had slight errors when constructing the correct names. It seems this left the datastores on some hosts with either no symlink or a wrongly named one.

Fortunately the fix is easy:

  1. Enter Maintenance Mode
  2. Reboot host
  3. ?????
  4. Profit

That is it! 🙂
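
If you have several hosts to cycle through, a minimal PowerCLI sketch along these lines can save a few clicks. The vCenter and host names below are placeholders, and it assumes DRS can evacuate the running VMs for you:

# Sketch: put each affected host in maintenance mode and reboot it so ESXi
# recreates the /vmfs/volumes symlinks. Host names are hypothetical.
Connect-VIServer vcenter.server.fqdn
$affectedHosts = "esx01.example.local", "esx02.example.local"
foreach ($name in $affectedHosts) {
    $vmhost = Get-VMHost -Name $name
    Set-VMHost -VMHost $vmhost -State Maintenance -Confirm:$false   # DRS migrates the VMs away
    Restart-VMHost -VMHost $vmhost -Confirm:$false                  # reboot while in maintenance mode
    # Wait for the host to reconnect and take it out of maintenance mode again
    # (left out here; a simple loop on its ConnectionState would do).
}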

ESXi disconnecting right after connecting

This morning was just one of those mornings! Got up, checked my phone, 22 alerts from vCops. Damnit 🙁

So I got into work and could see that the alerts were centered around two server rooms located close to each other, using the same uplinks to our core network. I suspected a network error at first. I talked to a guy in networking: “Oh, didn’t you know? There was a planned power outage in that area.” Oh, that explains it. Further debugging showed that only the network equipment was affected by the power interruption; servers and storage continued to run.

So I suspected that the reason the hosts were still not responding was that they had been without network for 4-6 hours. I chose to reconnect them, which worked... at first. Immediately after connecting, the hosts disconnected again. This happened with all of them. Strange.

I then remembered a KB article I saw a while back: ESXi/ESX host disconnects from vCenter Server 60 seconds after connecting

Ahh, so port 902 might be blocked. Checked – nope, open on both TCP and UDP. Hmm. Ahh, perhaps I needed to restart the management agents, but SSH was disabled. So I connected the old C# client to one of the hosts directly and enabled SSH. Still no luck, the network was filtering port 22. Beginning to panic a bit, PowerCLI came to mind. Perhaps there is a way to restart the management agents from PowerCLI.

There is! Here. Not all of them, though, as far as I could tell. So I tried restarting the vpxa service, which luckily worked.
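
For reference, it was roughly this (a sketch; the host name is a placeholder, and you connect straight to the host rather than to vCenter):

# Sketch: connect directly to an ESXi host and restart the vCenter agent (vpxa).
Connect-VIServer esx01.example.local -User root        # hypothetical host name, prompts for the password
Get-VMHostService -VMHost esx01.example.local |
    Where-Object { $_.Key -eq "vpxa" } |
    Restart-VMHostService -Confirm:$false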

So a lot of cleanup of the configurations is in order now. Personal todo: 1) allow SSH from the management networks to the hosts. 2) Fix/get access to iLO/DRAC/IPMI on the hosts. 3) Get answers as to why I was not informed about the power work being done. And a bonus: I need to figure out why all 8 hosts, spread across 3 clusters, have access to a 20 MB LUN that no one knows about, and why, while the vpxa service was down on 5 hosts, two hosts complained that they had lost access to that specific LUN.

Work work work.

Update on Updating WordPress and News

A while back I wrote a short article about how to update WordPress when your FTP server does not accept passive FTP connections. Since then I had not tried to update anything until now. And the fix didn’t work. Or rather it did, but there was still a permission problem on the files. Even chmod 777 on the files didn’t fix it for some reason.

So I started looking for a solution and found this instead:

http://www.hongkiat.com/blog/update-wordpress-without-ftp/

It describes how to switch the update method to direct instead of FTP. This fix worked better from the start and gave me more direct error messages. Now I knew that a chmod 777 would fix my problem while updating. So: chmod -R 777 on the WordPress root dir while updating, then set the permissions back, and voila, my installation was updated. Perfect!

And now to the news part. I haven’t been very active lately with this blog. My company has been dealing with budget cuts and organizational changes in this newly formed IT unit. Not the best of starts within the first year of its existence. That has put a hold on many new projects. I have been working on consolidating older vSphere setups onto new hardware and software, which isn’t very interesting work, but it has to be done. Sometime next week I will hopefully write a post about how to handle CBT errors after migrating from one storage array to another. We had one of the old setups where 2 machines were failing in odd ways. More on that later.

In other news I finally settled on a better name for the blog and will be implementing it as soon as possible. Be prepared, I think it is quite awesome 🙂

Follow-up: Upgrading SSO 5.5

I wanted to do a follow-up on my previous post about Upgrading SSO 5.5 and cover the resolution of the undesirable situation we had ended up in, but first I’d like to cover the process leading up to the problematic upgrade.

Preparation phase:

We started preparing shortly after getting wind that vSphere 5.5 was going to be released and that the SSO had been reworked to work better with AD. We submitted 3-4 bugs related to SSO and AD (we run a single forest with multiple subdomains in a parent-child trust, imagine the problems we were having!) and the resolution to most of them was, essentially, to upgrade to vSphere 5.5. So we started researching the upgrade. Having an HA setup with a load balancer in front, and the goal of doing this to our production environment, meant that we were reading a lot of documentation and KB articles as well as talking to VMware professionals.

At VMworld I talked personally with three different VMware employees about the process and was told every time that it was possible to upgrade the SSO and Web Client Server components and leave the Inventory Service and vCenter services for later.

After having read all of this we found some issues in the documentation that made us contact VMware directly for clarification. First off, reading through the upgrade documentation we stumbled upon this phrase:

[Screenshot: DocumentationError1]

As we read it, we had to enter the load-balanced hostname when upgrading the second node in the HA cluster, which seemed illogical. This was also different from KB2058239, which states the following:

[Screenshot: DocumentationError2]

So we contacted VMware support and got a clarification. The following response was given by email and later confirmed by phone:

In my scenario I was not able to use Load balancer’s DNS name and gave me a certificate error and the installation, however the installation went through by providing the DNS name of the primary node. This is being validated by my senior engineer and I will contact you tomorrow with any further update from my end.

We were still a bit unsure of the process and decided to revive the old test environment for the vSphere 5.1 configuration that was running at the time.

Testing phase:

It took a while to get it running again, as it was being redeployed from an OVA that could not be deployed directly on our test cluster (the virtual hardware version was too new), so it first had to be converted to a VMX and then run through VMware Standalone Converter to downgrade the hardware. This worked fine and the machines booted. However, the SSO was not running. So back to figuring that out. As it turns out, when the hardware of the machine changed, its machine ID changed too. So we had to run the following command on both nodes and restart the services as described in this KB:

rsautil manage-secrets -a recover -m <masterPassword>

And presto! It was running again. We then performed the upgrade of the SSO and Web Client. But as we had no indication that the vCenter services were going to be hit by this, no testing of them was done (we were, after all, told by several different people at this point that this was no problem). The upgrade went mostly smoothly with only two warnings: 1) Warning 25000 – an SSL error which should be ignorable, and 2) that the OS had pending updates that needed to be installed before proceeding.

But we ran into the AD problem that I described in the previous post linked at the start. The SSO was contacting a domain controller, presumably on port 3268, to get a GC connection, and this was failing as the server it was communicating with was not running that service. We got a solution after a week, but it seems that the problem had temporarily solved itself before the fix was given by VMware. So over the phone I agreed with VMware technical support that it was safe to proceed with our production environment, and that if we got the same error again we should open a severity 1 support request as soon as possible.

Launch in 3, 2, 1…:

Following the above phases we proceeded with the upgrade. We scheduled a maintenance window of about 3 hours (it took 1 hour in the test environment). On the day we prepared by snapshotting the SSO and Web Client servers (5 in total). We took backups of the RSA database just in case and then proceeded. Everything seemed to work. The first node was successfully upgraded and then the second. After upgrading we reconfigured the load balancer as described in KB2058838 and it seemed to just work out of the box. Until we had to update the service endpoints. The first went well. As did the second. But the third was giving a very weird error: a name was missing. But everything was there; the file was identical to the others except for the changed text identifying the groupcheck service. Then we saw it: copy-paste had struck again, as we had missed a single “[” at the beginning of the file, rendering the first block unparsable. Quick fix!

The first Web Client server was then updated to check everything, and then we saw it: the vCenters were not connecting to the lookup service, so the Web Client server was not seeing them! Darn it! We hadn’t tested this. We figured that restarting the vCenter service would solve the problem. It didn’t. In fact, it turned out to make it worse. As the service was starting it failed to connect to the SSO and decided to wipe all permissions! That is a really bad default behaviour! And the service didn’t even start; it just shut down again. Digging a bit, we found that the vCenter server had hard-coded the URL for the “ims” service in the vpxd.cfg file, and as this had changed from “ims” to “sts” it was of course not working. The KB said to update the URL, so we did. This helped a bit: the service was now starting but not connecting correctly. We could, however, now log in with admin@System-Domain via the old C# client (thank god that still works!).

It is here we called VMware technical support, but too late. After getting a severity 1 case opened we were informed that we would get a callback within 4 business hours (business hours being 7am to 7pm GMT). At this point it was 5:45PM GMT. So we waited until 7PM, but no call unfortunately. So we went home and came back early the next morning. At about 7:45AM the callback came, I grabbed it, and was then on the phone for 3 hours straight.

The problem was that it is not possible to upgrade the SSO and Web Client server and leave vCenter and the Inventory Service behind. The last two needed to be upgraded as well, and the way to recover the situation was to upgrade the vCenter servers and their Inventory Services. This was a bit problematic for the vCenter that had been “corrupted” by the restarts and the attempts to fix the config files; we ended up uninstalling the vCenter service and reinstalling it on that server. The other vCenter, which had not been touched, we simply upgraded without problems.

Aftermath:

By the time I hung up the phone with VMware technical support we were back online. Only the last Web Client server needed upgrading, which was easily done. But the one vCenter had lost all permissions; we are still recovering from that, but as this was just before the Christmas holidays only critical permissions were restored.

After a thing like this one needs to lean back and evaluate how to avoid it in the future, so here are my lessons learned:

  1. Test, test, test, test! We should have tested the upgrade with the vCenter servers as well to have caught this in the testing phase!
  2. Upgrade the entire management layer. Instead of trying to keep the vCenters on 5.1 we should have just planned on upgrading the entire management layer.
  3. Maintenance in business hours. We set our maintenance window outside normal business hours, causing us to have to wait 12 hours for support from VMware.

Upgrading SSO 5.5

*EDIT* After 3 hours on the phone with VMware Technical Support we are finally running again. Permissions on one vCenter were wiped and have to be recreated, and a bug with Win2k12 domain controllers has hit us, meaning that we had to add each domain as a separate identity source. But we are running again! *EDIT*

 

This is going to be a short post about my current experiences with upgrading the SSO component of vCenter. Short because it is still not working and I am waiting for VMware support to contact me.

So after VMworld I was pushing to upgrade to vSphere 5.5, at least the Web Client and SSO, as this would solve a lot of our AD problems. Planning began, and in November I revived my old test environment for the vSphere 5.1 upgrade. I dusted it off, got it running (more on that in a later post) and started the upgrade. The upgrade went fine – as such. After coming online again I could not enumerate users in SSO groups. I could search the AD via the new Integrated Windows Authentication identity source but not enumerate members of SSO groups. The web client would be “loading” for a LOOONG time and the following could be seen over and over again in the vmware-sts-idmd.log:

07:07:04,885 WARN   [LdapErrorChecker] Error received by LDAP client: com.vmware.identity.interop.ldap.WinLdapClientLibrary, error code: 81
07:07:04,885 WARN   [ServerUtils] cannot bind connection: [ldap://ADSERVERNAME, null]
07:07:04,885 ERROR  [ServerUtils] cannot establish connection with uri: [ldap://ADSERVERNAME]
07:07:04,885 ERROR  [ActiveDirectoryProvider] Failed to get GC connection to domain aau.dkLdap_sasl_bind failedServer Down

I purged dates and server names from the log. But in essence the SSO was attempting to make a Global Catalog (GC) connection to an AD server, it was selecting a domain controller that was not running the GC service, and when the connection attempt failed it decided that the server was down yet kept trying to contact it. Our AD forest consists of 1 root domain and about 20 subdomains, with about 55 domain controllers in total, of which a bit over half run the GC service. There is no setting to tell SSO which server in the AD forest to talk to. So nothing to change there.
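
Since SSO gives you no control here, the most useful thing I could do was map out which domain controllers actually advertise the GC service and check whether port 3268 answers on the one SSO kept picking. A rough PowerShell sketch using the plain .NET AD classes (the DC name is a placeholder):

# List every Global Catalog server the forest advertises.
$forest = [System.DirectoryServices.ActiveDirectory.Forest]::GetCurrentForest()
$forest.GlobalCatalogs | Select-Object Name, Domain, SiteName

# Check whether the GC port (3268) is reachable on a specific DC (hypothetical name).
$client = New-Object System.Net.Sockets.TcpClient
$client.Connect("ADSERVERNAME", 3268)
$client.Connected
$client.Close()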

I got the following from VMware technical support a few days later:

Procedure

1 Log in to the vSphere Web Client as administrator@vsphere.local or as another user with vCenter Single Sign-On administrator privileges.

2 Browse to Administration > Single Sign-On > Configuration.

3 On the Identity Sources tab, select an identity source and click the Set as Default Domain icon.

In the domain display, the default domain shows (default) in the Domain column.

Please restart the services and login to the web client/ vi client.

When I tried that, it seemed to work. I reverted and it still worked, leading me to think that it was not the above that fixed it but simply time, as the SSO would at some point select another server. I was assured over the phone by VMware technical support that customers who had seen the above error had fixed it this way.

But now, to tell you the truth, the above log snippet is from today. I went ahead and upgraded the SSO and Web Client yesterday and got stuck with this error after spending about 2 hours reconnecting/repointing a vCenter 5.1 to the new lookup service. That requires you to change the vpxd.cfg file, as the path to the STS service is hard-coded in that file and it points to the old /ims URL and not the new /sts one. It also caused a total wipe of permissions on the one vCenter I repointed, causing some inconsistent behavior in the web client. Looking under vCenter I cannot see it, but if I browse down through Administration -> Licensing and find it there, I can see all objects with the admin@System-Domain account. The only account left with permissions on the vCenter.
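
For anyone in the same spot, the edit itself is small once you find it. A heavily hedged sketch of the idea only: the path is the default for a vCenter 5.x install on Windows, the blunt string replace assumes /ims only appears where you expect it, and you should stop the vCenter Server service, keep a backup and check the KB before touching anything:

# Sketch only: back up vpxd.cfg and point the hard-coded lookup URL at /sts instead of /ims.
$cfg = "C:\ProgramData\VMware\VMware VirtualCenter\vpxd.cfg"   # assumed default path
Copy-Item $cfg "$cfg.bak"
(Get-Content $cfg) -replace '/ims', '/sts' | Set-Content $cfg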

I’m back at the office early this morning, slept like crap due to this. This is going to be a long day!

Since the last time

So it’s been a while since my last post, a lot has been going on. I have been through my first “employee performance interview” (Medarbejderudviklingssamtale or MUS in Danish). It was good and a lot of things were discussed in regard to this new organization. Some steps to increase my skills were also planned and I will get back to that later.

Since last time I attended VMworld Europe 2013 in Barcelona! It was an awesome conference as always and I took a lot of new things home with me. One of the things I did differently this year compared to the previous two years was to spend a lot more time in the Solutions Exchange. I focused primarily on storage vendors, as I have taken a fancy to new flash-accelerated and all-flash storage systems. So I think I visited every booth with even the slightest connection to storage.

I also had the chance to discuss some of the new technologies coming out of VMware, as well as the upgrade procedure for vSphere 5.5 when running with an SSO behind a load balancer. That was really useful and insightful and provided me with most of the information I need to perform an upgrade of the SSO and Web Client in our environment to vSphere 5.5 to relieve all the AD problems we have had. I will post a blog article on this later, as there are still some hiccups in the documentation and procedure that I need to test out and receive confirmation on from VMware support.

Our consolidation process has not moved that much. Shortly after returning from Barcelona I took part in a live migration of VMs between our data center and a remote server room across a distance of about a kilometer. Without going into details about how everything was connected, suffice to say that we had a single 10Gbit Ethernet connection between one of our data center routers and one of the server room’s routers. We also had a single FC connection between a storage array in the data center and the blade chassis in the server room. This allowed us to evacuate a single blade in the server room and move it to an identical blade chassis in the data center. After this we used another blade in the server room as the “transport host”: we vMotioned VMs onto it, as it could see both the data center and server room storage arrays, then Storage vMotioned the VMs to the data center array, and finally vMotioned them onto a host in the data center. Then, one by one, we evacuated all the blades in the server room and moved them to the data center. The process took about 2 days, including the move of a few physical hosts as well, and was all in all very successful.
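
The per-VM dance was simple enough that it is worth sketching in PowerCLI. The names are placeholders, and it assumes the transport host really can see both arrays as described above:

# Sketch of the three-step move for a single VM (hypothetical names).
$vm = Get-VM "SomeVM"
$transportHost = Get-VMHost "blade-transport.example.local"   # sees both storage arrays
$dcHost = Get-VMHost "blade-dc01.example.local"
$dcDatastore = Get-Datastore "DC-Array-DS01"

$vm = Move-VM -VM $vm -Destination $transportHost   # vMotion onto the transport host
$vm = Move-VM -VM $vm -Datastore $dcDatastore       # Storage vMotion to the data center array
$vm = Move-VM -VM $vm -Destination $dcHost          # vMotion onto a data center host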

We had a single error during the move, which caused an unexplained HA restart. The largest of the VMs (1 TB of storage spread across 4 different VMDKs) was set to change format to thin provisioned during the Storage vMotion. But at some point during the migration we got “an unexpected error” (that was the actual message from the vSphere client). 30 seconds later HA spontaneously rebooted the VM, even though Virtual Machine Monitoring was disabled and the host didn’t crash. Luckily the VM handled the reboot well and it occurred close to midnight with no users online.

Right now my colleagues are planning the consolidation of two other VMware installations, which will most likely be done with cold migrations. The number of VMs is small, and the fiber connections and licenses of these installations will not allow us to do a live migration. They are also planning a move similar to the one I worked on, which we hope to complete some time in December. I am working on a cold migration of a VMware installation as well, where most of the VMs will be reinstalled on a new cluster rather than migrated.

That was a status on what we are working on. And now back to the “I will get back to that later”: during the next month I will be working on a test installation of vCloud Automation Center to experiment with it and research whether this is something we can use in our organization. The initial tests will be confined to the infrastructure department, but if it works out it might be scaled up.

WordPress management

Hello everyone

Today’s topic is a bit different than usual, but to make sure that I remember this small fix I decided to write a small blog post about it. I run a number of WordPress installations for various people, and recently I set up a new one on a web server behind a very nasty firewall. This firewall blocks passive FTP connections, so when trying to automatically update plugins and the like (which requires FTP in WordPress), I was hitting a wall: after entering credentials the page would just hang at “loading”. I think I spent about an hour debugging what was going on. Nothing in the web server logs, and the login was successful according to the FTP server logs. What was going on?

Then it struck me! The FTP server did not allow passive FTP connections to upload; they could connect but not upload. So was WordPress using passive FTP?

Turns out, yes it does. And it is hard-coded. Not very nice, WordPress! Following this post: http://wordpress.org/support/topic/define-ftp-active-mode-in-wp-config I found the file:

includes/class-wp-filesystem-ftpext.php

and it did indeed contain the line:

@ftp_pasv( $this->link, true );

which hard-codes the use of passive FTP. Now changing it to false solved my problem! It would be nice if this setting could be set via wp-config.php as other FTP options can.

The Missing Pop Up

During the work of consolidating several VMware installations into a single platform I have to manage permissions. Most of the installations were maintained by a single person or a small group of people, so permissions were not that complex; they were also mostly the same: assign the Administrator role at the vCenter level to everyone involved.

When moving to the new platform this is not really possible. I, as part of the virtualization group, of course have administrative privileges on the two vCenter servers and everything below. But the old administrators still need their permissions, as they continue to maintain legacy systems. We decided to grant this access at the cluster level. This also required us to create a folder for each old installation in the VMs and Templates view, the Datastores and Datastore Clusters view, and the Networking view. On the clusters and these folders we granted Administrator rights.

Now comes the funny part. We had not noticed the problem of a missing popup until a persistent administrator was testing out his new rights. The cluster he was on was set to DRS Manual mode. This of course requires you to choose which host to start a VM on when powering it on. But he wasn’t getting the popup in the web client. At first I thought it was a web client problem, but it turned out to be something else.

He would choose the action as below, Power On:

[Screenshot: 01]

In my C# client I could see the following happening:

[Screenshot: 02]

But the VM wasn’t powering on:

[Screenshot: 03]

I tried powering it on manually, saw the popup in my own client, and wondered what was wrong since he didn’t get it. That is when I realized what the problem might be and tested it out. Powering on a VM and selecting which host to run it on when in DRS Manual mode is a Datacenter-level privilege, and as the user only had permissions on folders and clusters below the Datacenter, he was not seeing the popup. If I added his permissions on the Datacenter (without propagating), he would see the popup as below. This only happened in DRS Manual mode:

[Screenshot: 04]
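
So the workaround was to grant a role at the Datacenter level without propagation. In PowerCLI that looks something like this (the datacenter, account and role names are made up; use whatever role carries the privileges the old admins actually need):

# Sketch: add a permission at the Datacenter level without propagating it down the tree.
$dc = Get-Datacenter "Datacenter01"            # hypothetical datacenter name
$role = Get-VIRole "OldInstallationAdmin"      # hypothetical custom role
New-VIPermission -Entity $dc -Principal "DOMAIN\olduser" -Role $role -Propagate:$false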

Consolidation continued

A while back I wrote a post about the process of consolidation, describing the primary aspects of the process for us and the solutions we chose. We are a little way down the road and a lot of consolidation has already happened.

We have shut down (or emptied and left running) 5 vCenter servers and consolidated 8 clusters and 7 single hosts into our new vCenter setup. The process has been pain-free for the most part. In all of these migrations (6 different maintenance windows) we have only encountered a single problem, which I will describe a little later. This has given our little virtualization team quite the track record for performing well during maintenance windows 🙂

The problem we encountered was not in the actual process of disconnecting from one vCenter and connecting to another, but in the preparation phase. While we were migrating from VDS switches to VSS switches we needed to change port groups for a lot of VMs; PowerCLI to the rescue. This was automated using a translation table in the form of a hash table with the old VDS port group names as keys and the new VSS port group names as values. For each VM, and for each of the network adapters on the VM, look up the current port group name and find the new value in the hash table. Simple!
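
A rough sketch of what that script looked like (the port group names are examples only, and in reality you want some logging or a dry run before letting it loose on production VMs):

# Sketch: translate old VDS port group names to new VSS port group names per VM NIC.
$map = @{
    "dvPG-Prod-VLAN10" = "PG-Prod-VLAN10"      # example names
    "dvPG-Mgmt-VLAN20" = "PG-Mgmt-VLAN20"
}

foreach ($vm in Get-Cluster "OldCluster01" | Get-VM) {
    foreach ($nic in Get-NetworkAdapter -VM $vm) {
        if ($map.ContainsKey($nic.NetworkName)) {
            Set-NetworkAdapter -NetworkAdapter $nic -NetworkName $map[$nic.NetworkName] -Confirm:$false
        }
    }
}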

But. There is always a but. After changing some 60-70 port groups, the port group of a specific VM was changed and it seemed to be working fine. About 15 minutes later I got the support call that a website was down on one of the VMs. I started looking and could not see anything wrong (I’m not that fond of IIS web servers and their way of logging!). The network of the web VM had not been changed yet, so what was causing this 503 Service Unavailable error? And even more odd, it was only one out of the 17-20 websites on it that was not working.

I googled some things, no luck; grabbed a colleague to help look, no answer. Then Google paid off with a mention of the words “application pool”. I quickly looked up the application pool of the problem website and, sure enough, it was in a stopped state. I started it again and the site was running. But why had it stopped? The logs show nothing™. So I got to thinking, and the only explanation that made sense to me was that 15 minutes before the call I had changed the port group of the aforementioned VM, which was running the database for the website, a small MSSQL machine. The only logical thing I can see having happened is that changing the port group cut the connection to the database, and the application running in the pool handled it badly.

Now, looking back at it, this is a minor problem. One website out of 20 running on one VM out of hundreds. All in all the consolidation process is running well. We still have a few more vCenters and single hosts to gather under the new setup, and after that comes the process of phasing out old hardware. We have recently got 6 new blade servers to replace some older hardware, some of the most powerful we have had yet: 2×8-core Intel CPUs with 256 GB RAM and, if all goes well, a new storage controller dedicated to these machines and virtualization in general. But moving to new hardware, shutting down old servers, giving new IPs to production machines, all of this is not an easy process and requires coordination across the entire IT organization.