IO Issues on Cisco C220 M4

Hi All

As promised more content is finally available! Unfortunately I cannot share any screenshots from this actual issue I worked on – you will have to take my word for it.

At my new job I was tasked with solving some issues that had been observed at a remote site on two Cisco C220 M4s with local disks. The hosts run a set of redundant services, but nothing is shared between them except network equipment. The issues were: 1) sometimes around 04:00 the software running in the virtual machines would briefly throw an alarm, and 2) powering a VM on or off, or snapshotting it, would cause the software to throw alarms as well, but on other VMs than the one being powered on/off or snapshotted. The event log on both hosts showed intermittent I/O latency warnings that were not connected to either of the above issues, but nothing alarming.

One note, though: the software running on these VMs is very latency sensitive, so something like snapshots could potentially be a problem in any case. But powering on VM A should not affect VM B unless the host is hurting for resources, which was not the case.

Before diving in I asked in the vExpert Slack if anyone had seen issues like this before or had any ideas of what to look for. James Kilby and Bilal Ahmed were quick to throw some ideas on the table. James suggested that the vswp file being created at power-on might cause the problem, and Bilal suggested looking at the network. With those things in mind I started debugging.

First off, I had already decided to update ESXi to the latest 5.5 U3 plus patches, as it was a little outdated. I also decided to upgrade the server firmware to the latest 3.0.3c release for standalone C220 M4s. I had also found the latest supported drivers for the Cisco RAID controller (lsi-mr3) and the Intel NIC (igb) to rule out any compatibility issues. It was also my hope that an update would remove some of the I/O latency warnings.

Before upgrading anything I tested to see if the problem was still there. It was. Powering on a test VM with no OS, just 1 core and 4 GB RAM, instantly kicked off alarms in the application. Powering off, however, did not cause any noticeable problems. I proceeded with the firmware update, hoping it would solve the issues. Firmware upgrading on a remote site through a small connection is painful! It took a while. But once the first host was updated I tested again to see if the issues were still there. They were. Damn it. Time to dig deeper. I tried out James’ idea of the vswp file being the problem, and setting a 100% memory reservation on the test VM (which leaves essentially no swap file to create at power-on) seemed to solve it. However, this was not a viable solution, as it only helps if the VM being powered on has a reservation. If anyone powered on a VM without one, it would still affect all other VMs, regardless of any reservation on those.

I fired up our favorite debugging tool, ESXTOP, went into HBA mode and set the delay down to 2 seconds. I then observed the Cisco RAID controller's behavior during power-on operations, and that freaked me out. While not powering on, it would happily do anything between 100 and 1000 IOPS at 5 to 150 ms. The latency would spike high, but nothing I was that scared of on a small set of local 10k disks. However, when powering on a VM without a reservation, the HBA would stop doing any operations for upwards of 4 refreshes in ESXTOP (at 2-second intervals!). All indicators showed 0. No I/O was passed, so no latency was observed. This scared me a bit. Latest firmware and supported drivers. Damn. We weren’t seeing the same issue on another site with Dell servers, but they also had SSDs instead of 10k disks. Was it the 10k disks not performing well enough?
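If the per-refresh numbers are captured somewhere (for example exported from esxtop's batch mode), the stall pattern described above can be spotted automatically instead of by staring at the screen. Here is a minimal Python sketch, assuming you already have one IOPS value per refresh in a list. The function name `find_stalls` and the sample values are made up for illustration and are not measurements from these hosts.

```python
from typing import List, Tuple


def find_stalls(iops_samples: List[float],
                interval_s: float = 2.0,
                min_refreshes: int = 2) -> List[Tuple[int, float]]:
    """Return (start_index, duration_seconds) for every run of consecutive
    zero-IOPS samples that is at least min_refreshes long."""
    stalls = []
    run_start = None  # index where the current run of zeros began, if any

    for i, iops in enumerate(iops_samples):
        if iops == 0:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            length = i - run_start
            if length >= min_refreshes:
                stalls.append((run_start, length * interval_s))
            run_start = None

    # Handle a run of zeros that lasts until the end of the capture.
    if run_start is not None:
        length = len(iops_samples) - run_start
        if length >= min_refreshes:
            stalls.append((run_start, length * interval_s))

    return stalls


# Hypothetical capture: one value per 2-second refresh.
samples = [350, 820, 0, 0, 0, 0, 640, 120, 0, 410]
print(find_stalls(samples))
```

A single zero sample is ignored by default, since an idle controller can legitimately show 0 for one refresh; it is the multi-refresh gaps during a power-on that are interesting.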

We had a short talk internally about what to do. My boss suggested that we looked at this Cisco bug:

This bug references an issue on some C220 M4s that were installed with a specific version of the ESXi 5.5 Cisco Custom ISO. It was not the ISO these hosts were installed with, but the solution was to use a different driver than the one ESXi selects by default for the Cisco RAID controller HBA, the lsi-mr3 driver. Instead, it instructs you to make sure that the megaraid-sas driver is installed, remove the lsi-mr3 and lsi-msgpt3 drivers, and reboot, which makes megaraid-sas the active driver. We decided to try this. We downloaded the latest supported megaraid-sas driver for the server and removed the lsi-mr3 and lsi-msgpt3 drivers. Reboot and wait.
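The order of operations matters here: the replacement driver has to be present before the default ones are removed, and nothing changes until the reboot. A small Python sketch of that check is below. It only builds a list of steps from a list of installed VIB names and does not touch a host. The VIB names (`scsi-megaraid-sas`, `lsi-mr3`, `lsi-msgpt3`) and the helper `plan_driver_swap` are my assumptions for illustration, so verify the names against `esxcli software vib list` on your own host before running anything.

```python
from typing import Iterable, List

# Assumed VIB names; confirm them with `esxcli software vib list` on the host.
WANTED_DRIVER = "scsi-megaraid-sas"
UNWANTED_DRIVERS = ("lsi-mr3", "lsi-msgpt3")


def plan_driver_swap(installed_vibs: Iterable[str]) -> List[str]:
    """Build the list of steps needed to make megaraid-sas the active driver.

    Returns shell commands (and comment lines starting with '#') as strings.
    Nothing is executed here.
    """
    installed = set(installed_vibs)

    # Never remove the default drivers before the replacement is in place.
    if WANTED_DRIVER not in installed:
        return [f"# install {WANTED_DRIVER} first, then re-run this check"]

    steps = [
        f"esxcli software vib remove --vibname={name}"
        for name in UNWANTED_DRIVERS
        if name in installed
    ]
    if steps:
        steps.append("# reboot the host so megaraid-sas becomes the active driver")
    return steps


# Hypothetical list of installed VIB names.
for step in plan_driver_swap(["scsi-megaraid-sas", "lsi-mr3", "lsi-msgpt3", "net-igb"]):
    print(step)
```

An empty result means the host already has only the megaraid-sas driver left and there is nothing to do.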

After getting one host back online we tested. Powered on the test VM, and nothing. No alarms. What a difference. Tried it again while watching ESXTOP. No drops in I/O. It was now doing 8,000 IOPS at 15 ms, no problem. Major difference. Before, mass powering on VMs 15 at a time had taken minutes before the actual power-on task was done and the machines would start booting. It took seconds now.

So what is the moral here? Apparently it could benefit you as a Cisco UCS server user to use the megaraid-sas driver instead of the lsi-mr3 driver. Both are included on the Cisco Custom ISO, but it defaults to lsi-mr3, so you actively have to do something to change that.

Production Cluster Upgrade

During the spring of this year a few of my colleagues and I spent several months in meetings with storage solution providers and server hardware manufacturers to figure out if we should try something new for our VMware production clusters. We had a budget for setting up a new cluster, so we wanted to look at our options for trying something other than our traditional blade solution with a spinning-disk FC array, which we have been using for years.

On the storage side, we wanted to start leveraging flash in some way or form to boost intense workloads, so the storage solution would need to use flash to accelerate I/O. We also wanted to look at whether server-side flash could accelerate our I/O as well. This led us to the conclusion that we would like to avoid blades this time around. Rack servers would give us more flexibility with respect to disk slots, PCIe expansion and so on. Going with e.g. 1U servers, we would be sacrificing 6 additional rack units compared to 16 blades in a 10U blade chassis. Not huge in our infrastructure.
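The rack space trade-off is simple arithmetic: 16 servers at 1U each take 16U, against 10U for a chassis holding the same 16 blades. A tiny Python sketch of that calculation follows; the function name `extra_rack_units` is just something I made up for illustration.

```python
def extra_rack_units(servers: int, units_per_server: int, chassis_units: int) -> int:
    """Rack units used by standalone servers minus the units used by
    one blade chassis holding the same number of servers."""
    return servers * units_per_server - chassis_units


# 16 x 1U rack servers versus 16 blades in a 10U chassis.
print(extra_rack_units(servers=16, units_per_server=1, chassis_units=10))  # prints 6
```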

So we talked to a bunch of different storage vendors: some new ones like Nimble Storage, Tintri and Pure Storage, and some of the old guys like Hitachi and EMC. On the server side we talked to the regulars like Dell and HP, but also Hitachi and Cisco.

All in all it was a great, technically interesting spring, and by summer we were ready to make our decision. In the end we decided to go with a known storage vendor but a new product. We chose a Hitachi VSP G200, because its controller strength is on par with our existing HUS130 controllers but with smarter software and more cache. The configuration we went with was a tiered storage pool with a tier 1 layer consisting of 4 x 1.6 TB FMDs in RAID10. This gives us 3.2 TB of tier 1 storage, and from the tests we have run, this tier is REALLY fast! The second and last tier is a large pool of 10k 1.2 TB disks for capacity. In total we have just shy of 100 TB of disk space on the array. It is set up so all new pages are written to the 10k layer, but if data is hot it is migrated to the FMD layer within 30 seconds utilising Hitachi’s Active Flash technology. This feature takes some CPU cycles from the controller, but from what we see right now this is a good trade-off. We can grow to twice the current configuration in both capacity and performance, so we should be safe for the duration of the array's life.
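The tier 1 number follows from how RAID10 works: every block is mirrored, so only half of the raw capacity is usable. A minimal Python sketch of that calculation is below, assuming plain mirroring and ignoring any formatting or spare overhead the array may add. The helper name `raid10_usable_tb` is hypothetical.

```python
def raid10_usable_tb(disk_count: int, disk_size_tb: float) -> float:
    """Usable capacity of a RAID10 set: half the raw capacity,
    because every disk has a mirror partner."""
    if disk_count < 2 or disk_count % 2 != 0:
        raise ValueError("RAID10 needs an even number of disks (at least 2)")
    return disk_count * disk_size_tb / 2


# 4 x 1.6 TB FMDs, as in the tier 1 layer described above.
print(raid10_usable_tb(4, 1.6))  # 6.4 TB raw, half of it usable
```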

On the server side we chose something new to us. We went with a rack-server-based Cisco UCS solution: a cluster consisting of 4x C220 M4 with 2x E5-2650V3 CPUs and 384 GB memory each. We use a set of 10k disks in RAID1 for the ESXi OS (yes, we are very traditional and not very “Cisco UCS” like). The servers are equipped with 4x 10G in the form of a Cisco VIC 1227 MLOM and a Cisco VIC 1225 PCIe. As we were not really keen on setting up an SSD read cache (looking at vFlash for now) in production without trying it first, we also got a set of additional Cisco servers for some test environments. These are identical to the above, except that we went with C240 M4 for the additional PCIe slots, as some of my colleagues needed to test additional PCIe cards. Two of these servers got a pair of 400 GB SSDs to test out vFlash. If it works, we are moving those SSDs to the production servers for use.

We got the servers in late summer and put them into production about 2½ months ago, and boy, we are not disappointed. Some of our workloads have experienced 20-50% improvements in performance. We ended up installing ESXi 5.5 U3a and joining our existing 5.5 infrastructure due to time constraints. We are still working on getting vSphere 6.0 ready, so hopefully that will happen in early spring next year.

We have made some interesting configurations on the Cisco UCS solution regarding the network adapters and vNIC placement, so I will throw up something later on how this was done. We also configured AD login using UserPrincipalName instead of sAMAccountName, which was not in the documentation. Stay tuned for that as well. And finally, have a nice Christmas all!