2021 in Retrospect

Let’s start off with the easy stuff – my blogging has not been up to par this year. I have had way too little time to actually push any new content. That bugs me a bit, but the positive spin is that it means I’ve been busy doing other stuff.

So what has happened in 2021? Coming into this year we had a major plan at work. Having been fed up with the subpar performance of our existing HCI platform, we had decided to purchase hardware and start converting all our old HCI platforms to vSAN. This would become one of the major tasks of 2021.

I’d like to dive a bit further into this because of the magnitude (at least for me) of this task. Internally we have been running 6 pods from one HCI vendor, complemented by a few clusters using NetApp storage and some standalone nodes.

On top of this we implemented a simple 8-node stretched cluster on Cisco B200 M4 blades to run vSAN on. This was our first vSAN pod and it was built from the vSAN Ready Node specs for the B200 M4, but with some of the disk types swapped for other supported models, more performant CPUs and more memory. This pod came about as a licensing optimization and would run only non-Windows workloads.

We had an amazing experience with this pod, which fueled our desire to switch the old HCI platform to vSAN as well. At the start of the year we had 8 2U nodes that were capable of being retrofitted for vSAN All-Flash. They were on the HCL and so were all their components. We actually only had to change a riser card to get additional NVMe slots and add more NVMe caching devices.

Once we had this pod operational in a stretched cluster configuration (4+4), we started by emptying one of the existing HCI hybrid pods onto the new pod temporarily. Once emptied, we could replace the old 3.2 TB SAS SSD caching device with 2 x 1.6 TB NVMe devices instead. We could have reused the 3.2 TB SAS SSD and purchased an additional one, but it was cheaper to replace it with the 2 NVMe drives. The hybrid pod had 12 x 8 TB spinning disks in the front, so we needed a minimum of 2 disk groups to handle all the disks (a vSAN disk group maxes out at 7 capacity devices), and with 2 NVMe slots in the back of the server the choice was easy.

We did performance testing on the new vSAN hybrid pod and my god it was fast compared to the old HCI software. During the performance testing I managed to make several disk groups exit the cluster by running our performance workload for too long. I had a very good talk with VMware GSS about this and was recommended some changes to our test workload, primarily around duration, that would paint a more realistic picture. Our testing methodology is basically to throw the worst kind of workload we can at the pod – if performance is good enough then, we will have no issue running the workload we actually need to put on the pod.

After migrating the hybrid workload back (and enjoying the extra available capacity the change to vSAN provided) we started migrating our most critical stretched workload to the new vSAN All-Flash pod. This process took forever. The primary culprit was something I had not noticed before, because it is usually not a problem. Our new vSAN All-Flash pod had been put into Skylake EVC mode because it was running 6200 series Xeons and would be supplemented with some 6100 series nodes at a later point, Skylake being the lowest common denominator. However, the old pod that we were migrating from was running on 6100 series Xeons without EVC mode enabled. One would think that Skylake native and Skylake EVC would be the same – but no, that is not the case, as shown in KB76155.

This meant that about half of the 400 machines that needed to be moved would either have to be moved powered off (a tough sell with the customers) or have a short maintenance window to upgrade the hardware version to 14 or 15 and then enable Per-VM EVC. Most of our customers were a breeze, with only minor service impact, but one customer in particular was a bit rough, which dragged the process on across the fall of this year.
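For anyone facing the same migration, a quick PowerCLI query along these lines can show which VMs still need the hardware version bump before Per-VM EVC can be enabled – just a sketch, and the cluster name is a placeholder:

# List VMs on the old pod that are still below hardware version 14
Get-Cluster "OldCluster" | Get-VM |
    Select-Object Name, @{N="HwVersion"; E={ $_.ExtensionData.Config.Version }} |
    Where-Object { [int]($_.HwVersion -replace "vmx-","") -lt 14 }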

But we finally managed to empty the old pod and power it off. Our next step was to reconfigure the released hardware to a vSAN certified configuration. We then proceeded to install it as a new vSAN pod, and it became ready for production just 2 weeks ago. We’ll utilize this new pod to empty the next of our old HCI platforms so we can liberate the hardware from that pod for even more conversions. The process is simple but it does take time.

I have one outstanding issue that I need to solve in the new year. Some of the older systems are Cisco C240 M4SX nodes. These only have internal SD boot as well as 24 drive slots in the front hooked up to a single RAID controller via 2 SAS expanders. With VMware deprecating SD/USB boot in the near future (KB85685) and vSAN not allowing non-vSAN disks on the same controller as vSAN disks, we need to figure out how to boot these servers – if anyone has a solution I’m all ears! I could do some sort of iSCSI boot but I’d prefer not to!

On top of these conversions we also needed to manage all our normal operations as well as another major project that started up in late spring/early summer. We needed to replace our vRA 7.6 install with VMware Cloud Director.

With vCD not really dying as was foretold years ago, with vRA carrying a cost in our Cloud Provider licensing that vCD doesn’t, and with some usability issues our customers had with vRA, we set out to test vCD in the summer and go through all the pain points of vRA to see how they compared in vCD.

The result was that we decided to roll out vCD in the fall and started the process of setting up a 10.3 production environment. We had done our tests on 10.2.2 and upgraded the test environment to 10.3 before rolling out the production environment – and yet we still found some surprises!

At first many machines were very easy to import, but suddenly I hit an issue where I could not import and move VMs into a vApp. I did some testing and found that if I created a new vApp I could move them into that vApp. After a lot of debugging with our vTAM and GSS we found that one of our clients had deleted 2 VMs via vRA AFTER they had been imported into vCD and into that vApp. That left those two VMs stuck in Partially Powered Off and blocked additional imports into the vApp.

We figured out with the help of GSS that we could run the following commands to be allowed to delete the VMs (you cannot delete a Partially Powered Off VM):

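# Note (my annotation): Get-CIVM is a PowerCLI cmdlet, so an active Connect-CIServer session against vCD is assumed;
# the force undeploy clears the "Partially Powered Off" state so the VM can then be deleted.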
$vm = Get-CIVM <VMNAME>
$vm.Undeploy("force")

This allowed us to continue, only to find the next bug. We found that some VMs were not allowed to be moved into a vApp after auto-import. They failed with an error about not being allowed to change bus or unit numbers while powered on – but why would it need to change those?

It turns out a bug was introduced in 10.3 (we didn’t see it in 10.2.2 at least) where VMs whose disks weren’t on sequential unit numbers on their controllers would be forced through an attempt to “correct” that. An unneeded operation. We opened a GSS case on it and got a response that 10.3.1 fixed the issue – which it fortunately did, but it was an undocumented fix.

As of December 1st we have powered down our old vRA platform and the replacement with vCD is complete. A few special machines still remain to be imported, but we are 99% there, which is a great feeling to end the year with.

Next year will bring more vSAN conversions (we have a few Citrix pods and some disaster recovery pods to convert) as well as more vCD. We might have some NSX-T in the future as well, which will likely challenge my networking skills a lot. We have been doing ACI networking for the last 4 years and I am finally at a point where I feel comfortable with the basic configurations of that platform, but NSX-T looks to have features that are easier to use.

This year was also the year I got my first VMware certification – the VCP-DCV 2021 in January. I also managed to get the vSAN Specialist badge in July, making it a very good certification year for me.

Now that was a very long blog post and I hope you bore with me through it all. I have really had a lot of VMware under my nails this year, but also mountains of networking and server operations. I hope I can find more time to dive into solutions in the new year.

Happy Christmas everyone and a good new year to you all!

Getting my performance back in Workstation 16

Back in May of last year I was tripping over myself to get my hands on WSL2 with the new backend and improved performance. I wrote a few blog posts about it and even wrote my, to date, most viewed and commented post about it (WSL2 issues – and how to fix some of them).

Now the issue that hurt me the most at first was that Workstation 15.5 was not able to run with WSL2 installed, as installing WSL2 enables the Hyper-V features of Windows 10, which collide with Workstation.

The day after WSL2 released VMware pushed 15.5.5 which allowed Workstation to run even with Hyper-V enabled but at greatly reduced performance – just Google it and be amazed.

It does not really come as a surprise, as having Workstation (a virtualization engine) run on top of Hyper-V (also a virtualization engine) on top of hardware is not a recipe for performance!

As a result I have not been using my Windows 10 VM that much the last many months – until now!

I got my hands on a Workstation 16 Pro license and went in for an upgrade to see if any of the improvements in 16.1 would alleviate some of my performance issues. After completing the install, which prompted me to enable the Windows Hypervisor Platform, I spun up my Windows 10 machine from suspend. I quickly got a popup notifying me that I had “side channel mitigations” enabled, as shown below:

Now, from working with vSphere I know that many of the side channel mitigations can have a heavy impact on performance, so I updated my Windows 10 OS, shut it down and followed KB79832 as linked in the popup to disable the mitigations.

I powered on my VM again and could immediately feel the difference. I may not have the exact same performance I had with 15.5 on a non-Hyper-V enabled host, but it is a LOT better than it was. The major problem now seems to be the fact that my tiny i7-7600U dual core CPU can’t keep up! Dear Dell, when are you rolling out some Latitudes with Ryzen 7 5800Us??

vCloud Usage Meter 4.3 .local resolution issues

As part of our ongoing engagement with VMware we are required to operate vCloud Usage Meter to measure rental license usage for reporting back to VMware. We have been running an older build for a long time now waiting for the 4.3 release to come out because this new release could correctly measure vRealize Automation usage based on the Flex bundle Addon model rather than per OSI.

I got the appliance deployed just before the holidays but ran into several issues that I’d like to share with you.

The first issue I ran into actually prompted me to redeploy, because the migration of configuration from the old appliance ended in a bad state. It was caused by two things: 1) I was missing a Conditional Forwarder for a domain on the DNS servers the new appliance was using, and 2) systemd-resolved is a nightmare to work with!

I’d like to focus in on systemd-resolved. I really don’t like this piece of software, as it is insanely frustrating to troubleshoot. What it basically does is set the /etc/resolv.conf server to a local address on the server (127.0.0.53), and on that IP a daemon is listening for requests. If it can answer the request it does, otherwise it passes the request onwards as normal.

But – and this is the crucial part – it handles “.local” domains a bit differently. What it does exactly I cannot answer completely, but .local is used by services like Bonjour and mDNS. This is crucial because if you do not explicitly state that a .local domain needs to be resolved via actual DNS, systemd-resolved won’t do it.

To jump a bit – the new Usage Meter 4.3 appliance runs on Photon OS, which uses systemd. The older appliances use SLES, which doesn’t, and thus don’t have this issue. I had to do a lot of tinkering to get this working but managed by following this article: https://github.com/vmware/photon/issues/987 and making sure that both my required .local domains were present in the search path parameter and that the DNS servers were explicitly inserted into the 10-eth0.network config file.

I had to do both things otherwise it did not work. Search path can be configured correctly on deploy if you remember it. The DNS settings must be done after deployment but before running the migration script. Double check DNS resolution before attempting migration – it’ll save you headaches!
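For reference, a sketch of what such a 10-eth0.network config could look like – the addresses and domain names below are placeholders, not our real values:

[Network]
Address=192.168.10.50/24
Gateway=192.168.10.1
DNS=192.168.10.10
DNS=192.168.10.11
Domains=example.local otherexample.local

After editing the file, restart systemd-networkd and systemd-resolved (or simply reboot the appliance) for the settings to take effect.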

The appliance has been deployed and the config migrated, which left me with two errors – that old 5.5 vCenter that hadn’t been fixed yet and a currently unknown bug when registering a vRealize Automation 7.6 install – VMware support is investigating!

VMworld 2020 and General Announcements

Ohh it has been a while again since the last time I got to writing. Being busy with maintenance work is not really something that makes for great blog articles.

But last week I got to attend VMworld 2020! This year, due to the situation worldwide, it was a virtual setting, so for me it was two days in the home office watching a lot of great content on Kubernetes, NSX, vSAN and much more.

So many great things were announced. But the thing that struck me first was the acquisition of SaltStack. This is a major move to actually incorporate a configuration management system into the VMware portfolio, and it will certainly strengthen vRealize Automation in the future and hopefully also other parts of the ecosystem!

Another huge announcement was Project Monterey. Although I’m still trying to wrap my head around the use cases and opportunities this presents, I do like the idea very much! Being able to offload vSAN and NFV workloads to a SmartNIC is a great idea and I hope to see it evolve in the future.

This week also saw the GA release of several new versions of the core products from VMware. These were announced previously but I was not aware that they would be released so soon – but that is just the cherry on top!

First up is the release of vSphere 7 U1! The biggest new feature has got to be the ability to run vSphere with Tanzu, as well as new scalability maximums for VMs.

Along with vSphere 7 U1 there is of course also a vSAN 7 U1 release! Here, features like HCI Mesh, allowing you to share the vsanDatastore natively between vSAN pods, are among my top picks. Improvements to the file services of vSAN also landed, as well as the option to run only compression on vSAN instead of both compression and deduplication. Great features! For those running 2-node clusters or stretched clusters requiring a witness, a huge improvement has also landed, allowing a witness server to be shared by up to 64 clusters! Very nice!

Another feature also seems to have crept in, as detailed by John Nicholson: the option to run the iSCSI feature on stretched clusters. Again, a very nice feature to have included for those needing it.

The last bit of GA material that I wanted to comment on as well is the release of vRealize Automation 8.2. There are much needed improvements to the multi-tenancy of vRA as well as improvements to Infrastructure-as-Code workflows and Kubernetes.

It can be a daunting task to keep up with all the releases from VMware but their ability to push new releases and features never ceases to amaze me!

Working with Cisco PSS APIs

As I work for a Cisco Partner at the moment, I have been looking to get access to the Cisco PSS APIs, specifically to get coverage status for a Cisco device serial number.

If you have a Cisco account you can access the Device Coverage Checker online and check up to 20 serial numbers at a time. I have used this extensively. The same information can also be viewed if you have access to Intersight.

But I am looking to integrate with our DCIM tool, Netbox, to allow for an easy check of coverage via API calls. For us, those calls go to the PSS API endpoint SN2INFOv2.

Now of course this requires some sort of authentication, and Cisco has an intricate process for getting access, which boils down to creating a TAC case and requesting access.

Once you have access you need to create an application and grant that application access to the SN2INFOv2 APIs with “Client Credentials” privileges. This generates a Key and a Client Secret unique to the application, which are needed to get access.

Now here’s the problem. The Cisco API Developer portal has great documentation on the SN2INFOv2 API and how to format the request – but those endpoints need a token to be accessed. The token needs to be generated first, and it was not immediately clear how to do that.

I deciphered that I needed to do an OAuth2 login against cloudsso.cisco.com but could not find the documentation on how to format the request. I searched around to figure out how and found a reference to a different API that showed an example of how to do this.

The problem was it referenced a “Client ID” which I did not seem to have. So I guessed a bit and assumed that “Client ID” must be the “Key” I had, as the login required “Client ID” and “Client Secret” and I had “Key” and “Client Secret”.

So I formatted the GET request but got a 405 Method Not Allowed. Now I was a bit lost. But searching a bit more I fell upon a dodgy PHP developer forum, which I will not link to. There was, however, an example of a cURL request that showed me an approach. The request looked like this:

 curl -s -k -H "Content-Type: application/x-www-form-urlencoded" -X POST -d "client_id=..." -d "client_secret=..." -d "grant_type=client_credentials" https://cloudsso.cisco.com/as/token.oauth2

Now there was still a reference to “client_id”, but again I assumed it to be the “Key” I had, and would you know it – the API returned me an access token.

This access token needs to be passed on requests to the SN2INFOv2 API as:

curl -X GET -s -k -H "Accept: application/json" -H "Authorization: Bearer <TOKEN>" https://api.cisco.com/sn2info/v2/coverage/status/serial_numbers/<SERIALNUMBER>

And there you go! Easy to set up in Postman or Golang or Python or whatever you prefer!
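For example, a rough PowerShell sketch of the two calls could look like this – the placeholders obviously need your own Key, Client Secret and serial number:

# Step 1: exchange the Key (used as client_id) and the Client Secret for a bearer token
$body = @{
    client_id     = "<KEY>"
    client_secret = "<CLIENT SECRET>"
    grant_type    = "client_credentials"
}
$token = Invoke-RestMethod -Method Post -Uri "https://cloudsso.cisco.com/as/token.oauth2" -Body $body

# Step 2: call SN2INFOv2 with the returned access_token to get coverage status for a serial number
$headers = @{
    Accept        = "application/json"
    Authorization = "Bearer $($token.access_token)"
}
Invoke-RestMethod -Method Get -Uri "https://api.cisco.com/sn2info/v2/coverage/status/serial_numbers/<SERIALNUMBER>" -Headers $headers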

WSL2 issues – and how to fix some of them

I have been waiting in anticipation for WSL2 (Windows Subsystem for Linux) and on May 28th when the update released for general availability I updated immediately.

At first I was super hyped. WSL2 and the Ubuntu 20.04 image just worked and ran smoothly and quickly. Combined with the release version of Windows Terminal it was a real delight.

I also went and grabbed Docker Desktop for Windows, as it now has support for WSL2 as the underlying system. And joy, it just installed and worked. I am now capable of running Docker containers directly from my shell, instead of doing it the way I did before: having an Ubuntu VM running in VMware Workstation and connecting to it via docker-machine from my WSL1 Ubuntu image. A hassle to get working and not a very smooth operation.

Having the option to just start Docker containers is amazing!

But then I had to get some actual work done and booted up VMware Workstation to start a VM. And it failed. With a Device Guard error. I followed the guides and attempted to disable Device Guard to no avail. Then it dawned on me: WSL2 probably enables the Hyper-V role! And that is exactly what happened.

Hyper-V and Workstation (or VirtualBox for that matter) do not mix well – that is until VMware released Workstation 15.5.5 to fix this exact problem just the day after WSL2 released. Perfect timing!

A simple fix – just update Workstation to 15.5.5 and reboot, and WSL2 and Workstation now coexist just fine!

I played a bit more with WSL2 in the following days but ended up hitting some weird issues where networking would stop working in the WSL2 image. No real fixes to be found. Many point to DNS issues and the like. Just Google “WSL2 DNS not working” and look at the mountains of issues.

But I suspected something else, because DNS not working was just a symptom – routing out of the WSL2 image was not working. Pinging IPs outside the image did not work. Not even the gateway IP. And if the default gateway is not reachable, of course DNS is not working.

I found that restarting fixed the issue, so I got past it that way, but today it was back. I was very interested in figuring out what happened. And then I realized the potential problem and tested the fix. I was connected to my work network via Cisco AnyConnect. I tried disconnecting from VPN and testing connectivity in WSL again – now it worked. I connected to VPN again and connectivity was gone.

Okay – source found – what’s the fix? I found this thread on GitHub that mentions issues with other VPN providers even when not connected. Looking through the comments I found a reference to a different issue about the same problem, but regarding AnyConnect specifically.

I looked through the comments and found many fixes around changing DNS IPs and other things, but the fix that seemed to do the trick was running the following two lines of PowerShell in an elevated shell after connecting to the VPN:

Get-NetIPInterface -InterfaceAlias "vEthernet (WSL)" | Set-NetIPInterface -InterfaceMetric 1
Get-NetAdapter | Where-Object {$_.InterfaceDescription -Match "Cisco AnyConnect"} | Set-NetIPInterface -InterfaceMetric 6000

Those two lines change the interface metric so that the WSL interface has a higher priority than the VPN connection. This inadvertently also fixed an issue I had where local breakout was not working correctly when on VPN.

The downside of the fix is that it needs to be run every time you connect to the VPN. I implemented a simple PowerShell function in my profile so I just have to open an elevated shell and type “Fix-WSLNet”.
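For reference, a sketch of what such a profile function could look like – it just wraps the two lines from above:

function Fix-WSLNet {
    # Give the WSL virtual adapter the best (lowest) metric...
    Get-NetIPInterface -InterfaceAlias "vEthernet (WSL)" | Set-NetIPInterface -InterfaceMetric 1
    # ...and push the AnyConnect adapter far down the priority list
    Get-NetAdapter | Where-Object {$_.InterfaceDescription -Match "Cisco AnyConnect"} | Set-NetIPInterface -InterfaceMetric 6000
}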

That is all for now!

vRealize Orchestrator 8.1 (and others) announced!

I’m late to the party as usual but simply needed to write up a little quick post on this.

VMware announced a whole slew of new releases yesterday with the primary focus being on vSphere 7 and the new Kubernetes integrations it brings. I hope to get time to look more into Kubernetes on vSphere once it becomes available, as this is an area I have much interest in learning more about.

But the biggest thing for me as of right now is the announcements for vRealize Orchestrator 8.1!

I have really wanted to like the new HTML 5 interface that came in 7.6 but it had issues! No lie there. And as I have not had the time to test it in 8.0 yet I hope that 8.1 will bring back some of the glory to vRO!

Among the features I look forward to the most is the return of the “Tree View” to show a hierarchical sorting and bundling of related workflows. The tag-based approach used in 7.6 and 8.0 doesn’t really appeal to me. I like to be able to tag workflows, but not being able to sort and organize them in any other way is not optimal.

But that said. The absolute biggest wish on my wishlist for vRO has come true! To quote the announcement:

“Multiple Scripting Languages: PowerShell, Node.js, Python. Support for multiple scripting languages have been added: PowerShell, Node.js, and Python. This makes vRealize Orchestrator more accessible and easier to use for non-JavaScript users.”

Finally PowerShell will be directly available in vRO, not requiring a complicated setup using a Windows host and all of the double-hop authentication issues that arise from that. And we get Python as well! It’s almost Christmas!

I can’t and won’t go over all of yesterday’s announcements – other bloggers out there are already doing this and I’d like to give some credit to those working hard on it. For that reason I will point you all to Eric Siebert’s list of links to articles and announcements regarding vSphere 7 and related releases.

Take a look at the list here: http://vsphere-land.com/news/vsphere-7-0-link-o-rama.html

vRealize Orchestrator VC plugin version

I keep forgetting that this is a problem, so I might as well write it down for myself and anyone else stumbling upon it.

When using vRO, in my case 7.5 or 7.6, you might hit a problem where you are unable to add a vCenter instance running vCenter 6.7. The error is not very informative:

It doesn’t really scream out what the error is. But as I had seen the error before I had a hunch when my colleague was configuring vRO in our vRealize Automation platform.

On the vRO VMTN forum there is a post that contains the latest release of the vRO VC plugin – https://communities.vmware.com/docs/DOC-32872

Simply download the attached zip and unpack the vmoapp file. Log in to the vRO Control Center at https://<FQDNorIP>:8283/vco-controlcenter/ and select “Manage Plug-ins”. Under “Install plug-in”, click Browse, select the vmoapp file and upload it. Accept the EULA and install. After about 2 minutes vRO will have restarted and the plugin is updated.

vCenter instances can now be added 🙂

Updated udp_client.py for testing UDP heartbeats

A while back I stumbled on a set of KBs for testing UDP heartbeat connectivity between ESXi and vCenter. I wrote this article to describe how to do it.

Now today I had to do the same and went back to these KBs to find the script. This time however it was on newer 6.5 U2 hosts and not old 5.5 hosts. And as KB1029919 describes, it is only applicable to versions 4.0.x through 5.5.x of ESXi.

Why is this important? Because between ESXi 5.5.x and 6.5 U2 the included Python was updated from version 2 to 3. Some of you may know that there are many breaking changes in Python 3 compared to Python 2, and some of those were present in the original udp_client.py script.

So I took the time to fix the few issues the script had and uploaded a version to GitHub here. In the Python folder there is a version of udp_client.py that is Python 3 compatible, and I included the original script as udp_client-v2.py for reference.

The major change was on line 25: in Python 3, print is a function and has to be called with parentheses, keeping the “%” string formatting inside the call, as seen here:

original:
print "\nSent %s UDP packets over port %s to %s" % (numtimes,port,host) 

python 3:
print("\nSent %s UDP packets over port %s to %s", (numtimes, port, host)) 

After the syntax error was fixed, I found that there was a change to how “socket.sendto” works – it now expects a bytearray instead of a string. The simple fix was to introduce an int variable “datasize” set to 100 and change the “data” variable from “100” to “bytearray(datasize)” as seen here:

original:
data = "100" 

python 3:
datasize = 100       
data = bytearray(datasize) 

After this the script works on a 6.5 U2 host and I was able to verify UDP connectivity.

This also marks the first time I have my own public GitHub repository so – yay! 🙂

System logging not Configured on host

A few weeks ago I noticed a warning on some of our hosts in our HyperFlex clusters and wondered what was going on. It was only hitting Compute Only nodes in the clusters.

The warning indicates that Syslog.global.logDir is not set, as per KB2006834. But when I looked via SSH on the host it was logging data and the config option was set, so it was working – so why the warning?

Well, it turns out to be something not that complicated to fix. The admin who set up the nodes had set the option to:

[] /vmfs/volumes/<UUID>/logs/hostname

That is giving it an absolute path on the host, like you would do with the ScratchConfig.ConfiguredScratchLocation option. This works, but it triggers the warning as if the option was not set.

The fix is simple. Simply change it to use the DatastoreName notation like this:

[DatastoreName] logs/hostname

This immediately removed the warning and everything continued as it had before.
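If you have more than a few hosts to fix, the same change can also be made with PowerCLI instead of clicking through the UI – a quick sketch, where the cluster, datastore and folder names are just examples:

# Point Syslog.global.logDir at the datastore using the [DatastoreName] notation for every host in the cluster
Get-Cluster "HX-Cluster" | Get-VMHost | ForEach-Object {
    $logDir = "[DatastoreName] logs/$($_.Name)"
    Get-AdvancedSetting -Entity $_ -Name "Syslog.global.logDir" | Set-AdvancedSetting -Value $logDir -Confirm:$false
}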