## System logging not Configured on host

A few weeks ago I noticed a warning on some of our hosts in our HyperFlex clusters and wondered what was going on. It was only hitting Compute Only nodes in the clusters.

The warning is indicating that the Syslog.global.logDir is not set as per KB2006834. But when I looked via ssh on the host it was logging data and the config option was set so it was working – so why the warning?

Well it turns out to be something not that complicated to fix. The admin who set up the nodes set the option to:

[] /vmfs/volumes/<UUID>/logs/hostname

That is giving it an absolute path on the host like you would do with the ScratchConfig.ConfiguredScratchLocation option. This works but triggers the warning as if it was not set.

The fix is simple. Simply change it to use the DatastoreName notation as this:

[DatastoreName] logs/hostname

This immediately removed the warning and everything continued as it had before.

## Migrating VMs to an older ESXi version

Hi all, just a short post about a small task I had on my desk last week. Customer needed to migrate a 3 servers from current provider to one of our older platforms. Few issues to overcome. First off only had VPN access to the provider and access to the ESXi 6.0 web UI that was running the VMs. So how to export them without downtime? No way to do something like a Veeam replication as I only had the VPN connection. Had to do an export-import scenario. Clone them? Requires vCenter. Hmm.

So what did I do? Well I asked nicely and was allowed to deploy a temporary VCSA onto the host and add the host itself to the vCenter. This allowed me to clone the 3 VMs (after removing a metric f***ton of snapshots). Then I could export the cloned VMs as OVFs to a machine in our network. It was lucky that I could do this and did not need to do the entire operation in a service window. The last copy of the machines were not needed as it was more of a “configuration copy” than anything else. So while the customers systems were running we could move the cloned VMs.

Now came the tricky part that I did not foresee but quickly identified! I exported from a vCSA 6.7 U1 and an ESXi 6.0 host. This makes a SHA256 signed OVF. Trying to import this to a 5.5 vCenter fails as the 5.5 vCenter does not support SHA256. OVFtool has a nice feature where you can convert the OVF from SHA256 to SHA1 by making a new package with the following command:

ovftool --shaAlgorithm=SHA1 .\path\to\file.ovf .\path\to\destionation.ova

Simple! Converts the OVF to an OVA with SHA1 instead of SHA256. Import now works. Wait not it did not! The machines are VMX-11 which does not run on ESXi 5.5. What to do. Recommended approaches are to use VMware converter to convert the VM and downgrade the hardware version. I took a slightly more simple but probably also more unsported route.

VMX version is defined in the OVF som it was simple to open the OVF file and locate the “SystemType” parameter and change it from “vmx-11” to “vmx-10”. This however breaks the manifest files SHA256 hash of the OVF file. This is simple to fix aswell using “certutil” on Windows. Following command generates a SHA256 has for a file:

certutil -hashfile .\path\to\file.ovf sha256

Simple replace the SHA256 thumbprint in the manifest file with the one generated above. Rerun the SHA1 conversion above and import now works. My colleague who needed the VMs converted reported back later that day and confirmed machines were booting fine and running as they should so he could continue reconfiguration of the machines.

## vExpert 2018

Hi All

Just a quick update today to announce that I was able to get accepted as a vExpert 2018 in the second half as announced by VMware here: https://blogs.vmware.com/vexpert/2018/08/03/vexpert-2018-second-half-award-announcement/

After a long break in 2017 when I was at home with my twins I did not manage to get out much content to the community but after coming back in full force it is nice to be allowed back in to the program for my 4th consecutive year! I am very humbled to be let into this group of highly talented people who share so much great information with the community and the world.

As I mentioned in previous posts my role switched a bit when I changed jobs so I have a lot more areas to cover and less focus on VMware. I try and get as much time doing stuff with vSphere and hopefully vRO/vRA again soon but for now my focus is mainly developing our cloud platform along with my colleagues.

That is it for now – back into the machine room!

## vSphere HA virtual machine monitoring error on VM

Hi all

Today I found a couple of VMs in a cluster, that had HA with VM monitoring enabled, that were showing a “vSphere HA virtual machine monitoring error” with a couple of different dates.

Looking into the event log of the VM via vCenter I could see the following events about every 20 seconds:

This indicated that HA VM monitoring wanted to reset the VM but failed. I tried searching for answers on first Google but with no luck. I remembered that the setup had vRealize Log Insight installed and collecting data so perhaps Log Insight had more logs to look at.

Made a simple filter on the VM name and found repeating logs starting with this error from “fdm” which is the HA component on the ESXi host.

This error looks bad to me. Not being able to find the MoRef of a running VM? Hmm. I asked out on Twitter if anyone had seen this before with out much luck. Think about it over lunch I figured that maybe it was the host that was running the VM that had gotten into some silly state about the VM and not knowing it’s MoRef. So the quick fix I tried was to simply do a compute and storage migration of the VM to a different host and a different datastore to clean up any stale references to files or the VM world running. The event log of the VM immediately stopped spamming the HA message and returned to normal after migration.

I cannot say for sure what caused the issue and what the root cause is but that at least solved it.

## Simple network debugging – ESXi and VCSA

I ran into an old issue yesterday with ESXi 5.5 and vCenter disconnecting after 60 seconds. The issues is described in KB1029919 and is due to heartbeat UDP packets from ESXi not reaching vCenter on port 902 within 60 seconds or at all.

Now how to test this! Luckily the KB has a simple Python script that allows you to send UDP packets. Just edit the script and insert the IP of your vCenter server and run it. It will default send 10 packets of 100 bytes.

Now the KB mentions using Wireshark as an option on the vCenter side but when using VCSA that is not really an option. Luckily VCSA 6.5 comes with tcpdump installed! 5.5 and 6.0 don’t but worry not, VMware to the rescue. You can install tcpdump on 5.5 and 6.0 by following KB2084896. This gives you access to a client on ESXi to send UDP packets and a monitoring option on VCSA to see if the packets arrive.

All that needs to be done then is run the following:

#on the VCSA
tcpdump -i eth0 udp port 902 -vv | grep <srcIP>

#on ESXi
python udp_client.py

And then look at the results. You can lose the “| grep <srcIP>” if you want to see all packets but depending on your setup there may be many ESXi hosts sending heartbeat.

Today I had to check up on some license keys for a customer. There was not a complete state of keys, enterprise accounts and support contracts so I started looking into how to collect data.

This may be common knowledge to many but VMware has this nifty tool on my.vmware.com:

https://www.vmware.com/support/serialNumberTrack.portal

Here you plop in your key and it returns support contract, type and which EA number it is connected to.

To get all the keys of a vCenter the following piece of PowerCLI can be used:

$licman = Get-View (Get-View ServiceInstance).Content.LicenseManager$licman.Licenses | Select LicenseKey, EditionKey

This returns the list of keys on the vCenter easy to copy paste into the tracking tool. Nifty!

## IO Issues on Cisco C220 M4

Hi All

As promised more content is finally available! Unfortunately I cannot share any screenshots from this actual issue I worked on – you will have to take my word for it.

At my new job I was tasked with solving some issues that had been observed on a remote site on two Cisco C220 M4’s with local disks. The hosts run a set of redundant services but nothing is shared between them except network equipment. The issues were 1) some times around 04.00 AM the software running in the virtual machines would throw an alarm briefly and 2) powering on/off or snapshotting a VM would cause the software to throw alarms as well but on other VMs than the one being powered on/off or snapshotted. The event log on both hosts showed intermittent I/O latency warnings not connected to any of the above issues but nothing alarming.

Now one note though. The software running on these VMs is very latency sensitive so something like snapshots could potentially be a problem in any case but powering on VM A should not affect VM B unless the host is hurting for resources which is not the case.

Before diving in I asked out in the vExpert Slack if anyone had seen issues like this before or had any ideas of what to look for. James Kilby and Bilal Ahmed were quick to throw some ideas on the table. James suggested that vswp file being created on power on might cause the problem and Bilal suggested looking at the network. With those things in mind I started debugging.

First off I had already decided to update the ESXi version to latest 5.5 U3 + patches – it was a little out dated. Also decided to firmware upgrade the servers with the latest 3.0.3c release for Standalone C220 M4s. I had also found the latest supported drivers for the Cisco RAID controller (lsi-mr3) and the Intel NIC (igb) to rule out any compatibility issues. Also it was my hope that an update would remove some of the I/O latency warnings.

Now before upgrading anything I tested to see if the problem was still there. It was. Powering on a test VM with no OS and just 1 core and 4 GB RAM – instant alarms kicking off in the application. However powering off did not cause any noticeable problems. I proceeded to firmware update as my hope was this would solve the issues. Firmware upgrading on a remote site through a small connection is painful! It took a while. But once the first host was updated I proceeded to test if the issues were still there. There were. Damnit. Time to dig deeper. I tried out James’ idea of vswp being the problem and setting a 100% reservation seemed to solve the problem. However this was not a viable solution as this would only solve the problem if the powering on VM has a reservation. If anyone powered on a VM without it, it would still affect all other VMs, regardless of reservation on those.

I booted up our favorite debugging tool ESXTOP and vent into HBA mode and set delay down to 2 seconds. I then observed the Cisco RAID controllers behavior during power on operations and that freaked me out. It would happily do anything between 100 and 1000 IOPS at 5 to 150 ms while not powering on. The latency would spike high but nothing I was that scared of on a small set of local 10k disks. However when powering on a VM without reservation the HBA would stop doing any operations for upwards of 4 refreshes in ESXTOP (at 2 second intervals!). All indicators showing 0. No IO was passed so no latency was observed. This scared me a bit. Latest firmware and supported drivers. Damn. We weren’t seeing the same issue on another site with Dell servers but they also had SSDs instead of 10k disks. Was this the 10k disks not performing enough?

We had a short talk internally about what to do. My boss suggested that we looked at this Cisco bug:

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCut37134/?referring_site=bugquickviewredir

This bug references an issue for some C220 M4’s that were installed with a specific version of the 5.5 ESXi Cisco Custom ISO. It was not the ISO these hosts were installed with but the solution was to use a different driver than the one ESXi default selects for the Cisco RAID controller HBA – the lsi-mr3 driver. Instead it instructs to make sure that the megaraid-sas driver is installed and to remove the lsi-mr3 and lsi-msgpt3 drivers and reboot which will make the megaraid-sas the active driver. We decided to try this. Downloaded the latest support megaraid-sas driver for the server and remove the lsi-mr3 and lsimsgpt3 drivers. Reboot and wait.

After getting online again with one host we tested. Powered on the test VM – and nothing. No alarms. What a difference. Tried it again looking at ESXTOP. No drops in IO. It was now doing 8.000 IOPS @ 15ms no problem. Major difference. Mass powering on VMs 15 at a time had taken minutes before the actual power on task was done and the machines would start booting. It took seconds now.

So what is the moral here? Apparently it could benefit you as a Cisco UCS server user to use the megaraid-sas driver instead of the lsi-mr3 driver. Both are included on the Cisco Custom ISO but it defaults to the lsi-mr3 so you actively have to do something to change that.

## 2017 Update Post

Wauw. Almost a year has past since my last update – lots have happened. Not much time for write content.

So why the content drought? Well the primary reason is that on April 21st of this year, my wife gave birth to our two twin boys 5 weeks early and 3 weeks ahead of expected due date. With twins the birth is usually started 2 weeks (week 38) before normal because of slight increases in risk. However 3 weeeks ahead of that time the water broke and then you sorta have to go with the flow from there 🙂

That meant that I went on paternity leave for a looong time. The first 5 weeks we were on a neonatal intensive care unit at two different hospitals as the boys were not large enough to go home. During that time I was allowed to stay with my family and not have to work.

After the 5 weeks we came home and my real paternity leave started. I was luck enough to be able to take almost 5 months leave with pay by combining it with some holidays.

So from end April until October 1st I was off from work.. And had my hands full with the kids 🙂 Twins is a lot of work if anyone ever asks.

Just before returning to my job at the university October 1st I was offered a new job which i ended up accepting. I had one months notice at the university so went to work there handing over my assignments to other people and then on November 1st I started my new job as an IT System Consultant @ Lytzen IT.

Now getting a new job and having to get grips on everything also puts a dampener on things, content wise. I have been busy doing lots of exciting things and hopefully I will get some more time to do some updates. I have one thing lined up already for one of the next days hopefully.

Stay Tuned!

## vExpert 2017!

Again this year I have been granted the title of vExpert. This is to me a great honor and I have been looking forward to the announcements and hoping to be amongst the many talented people on the list.

Robert Jensen has again this year made a complete list of all Danish vExperts. The list is available here: http://www.robert-jensen.dk/danish-vexperts-2017/

And if you are reading this Robert, I work for Aalborg University and this is the blog url! 🙂

## Using tags for something interesting

Hello all

Thought I would try and share a bit of what I have been working on internally for a while now. As we are an organization that is still does not have all processes and data management in place I have been working on rewriting an Python based page I hacked together with a colleague to list information about initially virtual machines but it grew to include DNS, physical machines, NAT relations as well as relying on ARP data to resolve MAC to IP where needed. I finished version 1.0 of this new system written in Django 1.10 a few weeks back and am already working on a 1.1 release with a few feature requests.

Now the system grew out of a need to identify people and relations around virtual machines in our infrastructure and today heavily relies on Tags in the web client. We were already using tags to include VM’s in backup jobs in Veeam (something I can highly recommend doing!).

So what and how do we do it? We use 5 tag categories with the names Owner, Department, SysAdm, AppAdm and Service. These 5 tag types give a lot of information directly on the VM on who to contact and what service the VM belongs to. This information is then on a daily basis exported to the new system in the form of a CSV file at the moment. This CSV file along with a file from our NAT device, on from the datacenter routers with ARP, a list of DNS records, a CSV file from our Racktables installation for physical machines where the 5 tag categories are also defined and a list of systems with the SCOM agent installed.

The data is then imported into a data model defined in Django’s Object-relational mapping that tries to correlate some of the information from the different files. The end result is a web page where all systems and DNS records in theory are listed and can be searched, filtered, sorted etc. Where one can find a system(physical or virtual) based on the IP, DNS record (A, CNAME or MX), name, type, OS, you name it. This has some of the characteristics of a CMDB or CMS but instead of showing what it should be it is showing what it actually is at the time of the latest export. We have used this to help ourselves in the Infrastructure department as well as allowing some supports to find information to help route incidents and service requests to the correct groups in our service management system.

Below I have included a blurred screenshot from the system to show one of the views defined. A cut out from a simple system list. Note that the last 5 columns are the tag categories from vSphere/Racktables with the danish names we chose for them.

So that is an example of how we use tags in our day to day operations. We still in some cases miss the old Custom attributes (I know they are still in the API but not exposed in the web client) with the option of inputting variable information like expiry dates etc. Having to do something like that with tags would in my opinion be a mess (imagine a tag for every single date of expiry).

A bonus of doing this you can actually correlate this information with information from vRops via e.g PowerCLI. As an example one could send an email based on an alert in vRops to the addresses set via the SysAdm tags.