Veeam NFC stream error and missing symlinks

Today my colleague, who handles our Veeam installation, was diagnosing an error we were seeing sporadically. The error was this (names removed):

Error: Client error: NFC storage connection is unavailable. Storage: [stg:datastore-xxxxx,nfchost:host-xxxx,conn:vcenter.server.fqdn]. Storage display name: [DatastoreName].

Failed to create NFC download stream. NFC path: [nfc://conn:vcenter.server.fqdn,nfchost:host-xxx,stg:datastore-xxxxx@VMNAME/VMNAME.vmx].

Now, this error indicates that Veeam failed to get a connection to the host via an NFC stream (port 902). Or so I thought. We have seen sporadic problems with vCenter heartbeats over the same port, so that was what we expected. It turns out that some of the hosts in the cluster were missing the “datastore symlink” in /vmfs/volumes.

When running “ls -1a /vmfs/volumes” the result was not the same on each host: 4 of 8 hosts were missing a symlink, and two others had a wrongly named symlink. I recalled that when I was creating the datastores, I used PowerCLI to rename them several times in rapid succession, because my script had slight errors when constructing the correct datastore names. It seems this left the datastores on some hosts with either no symlink or a wrongly named one.
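
For context, the renaming pattern looked roughly like this (the datastore names here are made up). Every rename has to be picked up by each host, which then updates its own /vmfs/volumes symlink pointing at the datastore's UUID directory, and firing several renames back to back appears to be what left some hosts behind:

    # Hypothetical names; a rough sketch of what my script ended up doing.
    Connect-VIServer -Server vcenter.server.fqdn

    $ds = Get-Datastore -Name 'Cluster01-DS-01'
    # Each rename is quick on the vCenter side, but every host still has to
    # update its /vmfs/volumes/<name> symlink for the datastore's UUID folder.
    Set-Datastore -Datastore $ds -Name 'Cluster01_DS_01'
    Set-Datastore -Datastore $ds -Name 'CL01-DS-001'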

Fortunately the fix is easy:

  1. Enter Maintenance Mode
  2. Reboot host
  3. ?????
  4. Profit

That is it! 🙂
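
If you would rather do the fix from PowerCLI, a minimal sketch looks like this (the host name is made up, and it assumes DRS can evacuate the running VMs first):

    # Hypothetical host name; assumes DRS takes care of the running VMs.
    $esx = Get-VMHost -Name 'esx01.example.com'
    Set-VMHost -VMHost $esx -State Maintenance | Out-Null
    Restart-VMHost -VMHost $esx -Confirm:$false
    # ...wait for the host to come back up, then exit maintenance mode:
    Set-VMHost -VMHost $esx -State Connected | Out-Null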

ESXi disconnecting right after connecting

This morning was just one of those mornings! Got up, checked my phone, 22 alerts from vCops. Damnit 🙁

So I got into work and could see that the alerts were centered around two server rooms located close to each other, both using the same uplinks to our core network. I suspected a network error at first. Talked to a guy in networking: “Oh, didn’t you know? There was a planned power outage in that area.” Oh, that explains it. Further debugging showed that only the network equipment was affected by the power interruption; servers and storage continued to run.

So I suspected that the reason the hosts were still not responding was that they had been without network for 4-6 hours. I chose to reconnect them, which worked… at first. Immediately after connecting, the hosts disconnected again. This happened with all of them. Strange.

I then remembered a KB article I saw a while back: ESXi/ESX host disconnects from vCenter Server 60 seconds after connecting.

Aahh, so port 902 might be blocked. Checked – nope, open on both TCP and UDP. Hmm. Aahh, perhaps I need to restart the management agents, but SSH was disabled. So I connected the old C# client directly to one of the hosts and enabled SSH. Still no luck; the network was filtering port 22. Beginning to panic a bit, PowerCLI came to mind. Perhaps there is a way to restart the management agents from PowerCLI.

There is! Here. But not all of them, as far as I could tell. So I tried restarting the vpxa service, which luckily worked.
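
For reference, the vpxa restart boils down to something like this; the host name is a placeholder, and it assumes you connect PowerCLI straight to the host since vCenter keeps dropping it:

    # Connect directly to the host (hypothetical name); prompts for root credentials.
    Connect-VIServer -Server 'esx01.example.com'
    $esx = Get-VMHost -Name 'esx01.example.com'
    # Restart only the vpxa agent (the vCenter agent running on the host).
    Get-VMHostService -VMHost $esx | Where-Object { $_.Key -eq 'vpxa' } | Restart-VMHostService -Confirm:$false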

So a lot of configuration cleanup is in order now. Personal todo: 1) allow SSH from the management networks to the hosts. 2) Fix/get access to iLO/DRAC/IPMI on the hosts. 3) Get answers as to why I was not informed about the power work being done. And a bonus: figure out why all 8 hosts, spread across 3 clusters, have access to a 20 MB LUN that no one knows about, and why, while the vpxa services on 5 hosts were not working, two hosts complained that they had lost access to that specific LUN.
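
For the bonus item, a first pass could be a quick inventory of very small disk LUNs per host, something like this (the 32 MB cutoff is just an arbitrary filter to catch the 20 MB device):

    # List tiny disk LUNs on every host to track down the mystery 20 MB device.
    Get-VMHost | ForEach-Object {
        $esx = $_
        Get-ScsiLun -VmHost $esx -LunType disk | Where-Object { $_.CapacityMB -le 32 } |
            Select-Object @{N = 'Host'; E = { $esx.Name } }, CanonicalName, CapacityMB, Vendor
    }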

Work work work.