2021 in Retrospect

Let’s start of with the easy stuff – my blogging has not been up to par this year. I have had way too little time to actually push any new content. This kind of bugs me a bit too much but the positive thing is it means I’ve been busy doing other stuff.

So what has happened in 2021? Coming into this year we had a major plan at work. Having been fed up with subpar performance of our existing HCI platform we had decided to purchase hardware to start converting all our old HCI platforms to vSAN. This would become one of the major tasks inside 2021.

I’d like to dive a bit further into this because of the magnitude (at least for me) of this task. Internally we have been running with 6 pods from one HCI vendor complemented by a few clusters using Netapp storage and some standalone nodes.

On top of this we implemented a simple 8-node stretched cluster on Cisco B200 M4 blades to run vSAN on. This was our first vSAN pod and it was built based on specs from the vSAN Ready Node configuration of B200 M4s but changing out some of the disk types with other supported models and more performant CPUs and more memory. This pod came to be based on a licensing optimization and would run only non-Windows based workloads.

We had an amazing experience with this pod which fueled our desire to switch the old HCI platform for vSAN as well. At the start of the year we had 8 2U nodes that were capable of being retrofitted for vSAN All-Flash. They were on the HCL and all components were as well. We actually only had to change a riser card to get additional NVMe slots as well as adding more NVMe caching devices.

Once we had this pod operational in a stretched cluster configuration (4+4) we started by emptying one of the existing HCI hybrid pods onto this new pod temporarily. Once emptied we could start by replacing the old 3.2 TB SAS SSD caching device and replace it with 2 1.6 TB NVMe devices instead. We could have reused the 3.2 TB SAS SSD and purchased an additional one but it was cheaper to replace it with the 2 NVMe drives instead. The hybrid pod had 12 8TB spinning disks in front so we needed a minimum of 2 disk groups to handle all the disks and with 2 NVMe slots in the back of the server the choice was easy.

We did performance testing on the new vSAN hybrid pod and my god it was fast compared to running the old HCI software. During the performance testing I managed to make several disk groups exit the cluster by running our performance workload for too long. I had a very good talk with VMware GSS about this and was recommended some changes to our test workload, primarily around duration, that would show a better picture. Our testing methodology is basically throw the worst kinds of workload we can at the pod and if performance is good enough we will have no issue running the workload we needed to put on the pod.

After migrating back the hybrid workload (and enjoying extra available capacity change to vSAN provided) we started migrating our most critical stretched workload to the new vSAN All-Flash pod. This process took forever. The primary thing was a thing I had not noticed before because it is usually not a problem. Our new vSAN All-Flash pod had been put into Skylake EVC mode because it was running 6200 series Xeon’s and would be supplemented with some 6100 series at a later point. Skylake being highest common denominator. However the old pod that we were migrating from was running on 6100 series Xeon’s without EVC mode enabled. One would think that Skylake native and Skylake EVC would be the same – but no, not the case as shown in KB76155.

This meant that about half of the 400 machines that needed to be moved would need to either be moved powered off (tough sale with the customers) or have a short maintenance window to update hardware version to 14 or 15 and then enable Per-VM EVC mode. Most of our customers were a breeze with minor service impacts to do this but one customer in particular was a bit rough which dragged the process on across the fall of this year.

But we finally managed to empty the old pod and power it of. Our next step was to reconfigure the released hardware to a vSAN certified configuration. We then proceeded to install it as a new vSAN pod and it became ready for production just 2 weeks ago. We’ll utilize this new pod to empty the next of our old HCI platforms so we can liberate the hardware from that pod for even more conversions. The process is simple but it does take time.

I have one outstanding issue that I need to solve in the new year. Some the older systems are Cisco C240 M4SX nodes. These only have internal SD boot as well as 24 drive slots in the front hooked up to a single RAID controller via 2 SAS expanders. With VMware deprecating SD/USB boot in the close future (KB85685) and vSAN not allowing non-vSAN disks on the same controller as vSAN disks we need to figure out how to boot these servers – if anyone has a solution I’m all ears! I could do some sort of iSCSI boot but I’d prefer not to!

On top of these conversions we also needed to manage all our normal operations as well as another major project that was started up in the late spring early summer. We needed to replace our vRA 7.6 install with VMware Cloud Director.

With vCD not really dying as was foretold years ago and vRA carrying a cost that vCD isn’t in our Cloud Provider licensing coupled with some usability issues with vRA from our customers we set out to test vCD in the summer and look through all the pain points of vRA to see how that compared in vCD.

Result was that we decided to roll out vCD in the fall and started the process of setting up a 10.3 production environment. We had done tests on 10.2.2 and upgraded the test to 10.3 before rolling the production environment out but yet we found good surprises!

First many machines were very easy to get imported but suddenly I had an issue where I could not import and move VMs into a vApp. I did some testing and found that if I created a new vApp I could move into that vApp. After a lot of debugging with our vTAM and GSS we found that one of our clients had deleted 2 VMs via vRA AFTER they had been imported into vCD and into that vApp. That stuck those two VMs in Partially Powered Off and blocked additional imports into the vApp.

We figured out with the help of GSS that we could run the following commands to be allowed to delete the VMs (you cannot delete a Partially Powered Off VM):

$vm = Get-CIVM <VMNAME>
$vm.Undeploy("force")

This allowed us to continue only to find the next bug. We found that some VMs would not be allowed to be moved into a vApp after Auto-import. They failed with an error about not being allowed to change bus or unit numbers while powered on – but why would it need to change those?

Turns out a bug was introduced in 10.3 (we didn’t see it in 10.2.2 at least) where VMs that had disks that weren’t in sequential unit numbers on the controllers would be forced to try to “correct” that. A unneeded operation. We opened a GSS case on it and managed to get a response that 10.3.1 fixed the issue – which it fortunately did, but it was an undocumented fix.

We have by December 1st powered down our old vRA platform and replacement with vCD has been completed. A few special machines still remain to be imported but we are 99% there which is a great feeling to end the year with.

Next year will be more vSAN conversions (we have a few Citrix pods and some disaster recovery pods to convert) as well as more vCD. We might have some NSX-T in the future as well which will likely challenge my networking skills a lot. We have been doing ACI networking for the last 4 years and I am finally at a point where I feel comfortable with the basic configurations of that platform but NSX-T just looks to have features that are easier to use by the looks of it.

This year was also the year I got my first VMware certification – VCP-DCV2021 in January. I also managed to get the vSAN Specialist badge in July making it a very good certification year for me.

Now that was a very long blog post and I hope you bear with me along it all. I have really had a lot of VMware under my nails this year but also mountains of networking and server operations. Hope I can have more time to dive into solutions in the new year.

Happy Christmas everyone and a good new year to you all!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.