Today I found a couple of VMs in a cluster, that had HA with VM monitoring enabled, that were showing a “vSphere HA virtual machine monitoring error” with a couple of different dates.
Looking into the event log of the VM via vCenter I could see the following events about every 20 seconds:
This indicated that HA VM monitoring wanted to reset the VM but failed. I tried searching for answers on first Google but with no luck. I remembered that the setup had vRealize Log Insight installed and collecting data so perhaps Log Insight had more logs to look at.
Made a simple filter on the VM name and found repeating logs starting with this error from “fdm” which is the HA component on the ESXi host.
This error looks bad to me. Not being able to find the MoRef of a running VM? Hmm. I asked out on Twitter if anyone had seen this before with out much luck. Think about it over lunch I figured that maybe it was the host that was running the VM that had gotten into some silly state about the VM and not knowing it’s MoRef. So the quick fix I tried was to simply do a compute and storage migration of the VM to a different host and a different datastore to clean up any stale references to files or the VM world running. The event log of the VM immediately stopped spamming the HA message and returned to normal after migration.
I cannot say for sure what caused the issue and what the root cause is but that at least solved it.