This got me a bit confused so started looking into the filesystem of the ESXi host and discovered that the symlink /productLocker was pointing to a folder in red which usually means a broken link.
ProductLocker contains the files from VMware tools.
/productLocker is a symlink to a folder inside the /locker folder which inturn is a symlink to /store which symlinks to a partition on the boot volume. I tried changing into the /store and doing an ls which corrupted my terminal output.
Turns out the filesystem on the /store partition was corrupted. I check the two bootbank partitions which were okay and then realized that this host and it’s partner were both booting from SD cards which I hate working with!
As this was old 6.0 hosts – support was out of the question so I started looking around the internet for a possible fix that didn’t involve reinstalling, and to my luck I found a blog post from VirtualHackey from 2020 which detailed almost exactly the same situation as I was seeing.
The fix was simple – find and copy the actual content from the corrupted filesystem, format the partition and copy content back. He even describes with a link to another article how to locate the content. Unfortunately I could not find the files inside the filesystem.
So I had two options: format the partition and hope I could upgrade without the packages present or reinstall. That is until I realized that I had another site with identical hosts and ESXi 6.0 build. I checked those and found the /store partition in perfect state.
So I reached out to my TAM to ask if the method of formatting the partition and copying the content from a working host was viable. This was of course best effort support but reached out to see if the procedure perhaps was done in an old support case.
I got response that it ought to work otherwise a reinstall was necessary.
So – I tried it and it worked like a charm. I completed the upgrades without further hitches.
So shout out to VirtualHackey for providing the method to fix this problem – much appreciated!
Today VMware is releasing the next installment in their series on Multi-Cloud Briefings named “The Future of Cloud Management” which will cover new additions to VMware’s vision of multi-cloud management at new features added to VMware Aria.
If you haven’t seen any of the previous multi-cloud briefings I recommend checking out the Youtube channel where all the previous briefings can be watched.
Much of the portfolio of VMware Aria I have some experience with from the older on-prem solutions but you may, as I, not know what VMware Aria actually covers.
VMware Aria is VMware’s portfolio of tools for multi-cloud management (formerly known as vRealize Cloud Management) which was announced in August – check out this blog post for more information.
Many of the products and tools of Aria are rebranded from old products as well as addition of new features built upon that foundation.
First of you may have used products like vRealize Automation, Operations, Log Insight or Network Insight before. All of these products are part of Aria now and have been rebranded to Aria Automation, Aria Operations, Aria Operations for Logs and Aria Operations for Networks respectively. As well as now including Skyline and CloudHelath and CloudHelath Secure State, that latter two being rebranded to Aria Cost powered by CloudHealth and Aria Automation for Secure Clouds.
These tools have been integrated into a new product named Aria Hub (formerly Project Ensemble). Underneath Aria Hub sits Aria Graph which is the data source that powers the new features of the Aria portfolio.
Data from the above tools and solutions along with data that is pulled from the cloud providers that you want like Azure, AWS, VMConAWS etc is then collected (cool detail is that data is not duplicated but rather referenced via pointers) into an inventory in Aria Graph.
From Aria Graph, Aria Hub is then able to show you your cost, usage, problems, performance and possible security compliance issues based on the data across all cloud endpoints.
Via the Aria Hub UI you can look at different perspectives based on whether you are a business manager, application owner or operations engineer. You can customize the home page for your different groups of people so that they get the info they need, be it cost, performance or security.
You can select your SDDCs or your applications and drill down into the elements that make them up like VM instances, Kubernetes pods, networks, storage etc and look at consumption, performance, cost etc for the entire SDDC or application on all the way down the stack to the individual components that it is made of.
Aria Guardrails and Business Insights
VMware Aria is not just a rebranding of existing products but also introduces new features built on-top of those products like Guardrails and Business Insights.
Guardrails allows you to setup automatable policies for things like security, cost, performance etc that can be enforced on the different applications and SDDCs attached to your Aria hub.Being powered by and “everything-as-code” approach Guardrails includes a library of policy templates that can be imported and customized to your environments and allow for automatic remediation of things like making sure all your Kubernetes pods are attached to Tanzu Mission Control so that security policies inside TMC can be enforced and monitored.
Business Insights integrates with Guardrails and the other products inside Aria Graph to allow for AI/ML powered analytics to inform you of compliance issues and optimizations that can be useful – all available via Aria Hub and tailored for the specific class of user logged in.
One of the very exciting new features coming is Aria Migrations which will assist you in analyzing and planning migrations of applications running in your on-prem infrastructure to VMC. Currently it is the only migration type supported but the types of migrations will be expanded in the future.
Via Aria Hub you can plan the migrations by selecting the subset of resources you want to migrate and then Migrations will assist you in identifying any dependencies that are not in your scope that might impact your performance or security if migrated. You can then add these dependencies if needed.
You can then compare the expected TCO of keeping the application on-prem versus in the cloud and make decisions based on this – all powered by Aria Cost.
If you want to perform the migration you can continue to the planning of the migration where App Migrations will assist you in planning bundles of workloads to be moved and in what order.
After splitting up the migration each bundle of workloads can be planned across multiple migration steps.
All of the migration being powered by HCX and CodeStream allowing for testing of the steps, rerunning failed steps and monitoring the process.
But how does Aria know what entities are linked to each other you may ask? Well there are multiple ways like using flow information from Aria Operations for Networks to see who is communicating with who or what deployment the entity is a part of in Aria Automation.
When a link is detected users will be presented with the discovery of the entity and the link to an application and can then either confirm the link is correct or let it automatically be accepted. Very neat!
All about the APIs
All of this sounds excellent but how does it fit into your existing business? Well here’s the cool part. With Aria Graph at the base of everything you get a full GraphQL based API. Everything you can do in the UI can be done via GraphQL against Aria Graph so if your can write a piece code that can read or write to a GraphQL based data source you can integrate it to your existing tooling like Service Now.
The approach of allowing focusing on API to powered the UI is not a new but it is very nice to see VMware take this approach so seriously with Aria Hub and Aria Graph.
Final words and the freebie
If this sounds interesting to you I highly recommend heading over to the Aria Hub landing page and sigining up for the free trial and trying out the product.
Personally I am very excited to see where this is going – being a on-prem data center operations engineer this a big world for me to step into and a lot of information to digest.
One note for the security minded people. VMware Aria Hub is currently a SaaS offering which means your data will be located in a cloud. This might prohibit you from using the offerings. This is of course a shame but VMware have informed that they might on a +1 year timeframe look into an on-prem version of the product.
Is mentioned in part 1 this is part of a larger post that I never got around to finishing because it just grew to unmanagable size. This is part 2 and I will touch on my configurations of the UCS profiles that I use to run vSAN. Primarily this part is done because, despite Cisco having vSAN Ready Nodes, they lack a validated design of how to set it up – if you know of one please let me know!
Cisco UCS managed servers have a great advantage of being easy to make consistently configured while still maintaining easy options to update configurations across multiple servers. I have been working with UCS Manager for close to 8 years now and have had a lot of problems occur and learned alot about the function of the product through that. These “recommendations” are based on my personal preferences and borrows from different types of best practices and configurations that I have encountered over the years.
As I primarily work with M5 configurations today I will focus on that and inject points for some M4 stuff I have encountered and their fixes/workarounds
One thing that I, all though not necessary, do is to separate my clusters into separate Sub-Organizations inside of UCS Manager. This gives a nice clean look where I can make generic policies in the Root organiation and specific policies under each sub-organization.
Boot Drive configuration
On M5 (and M6) I use a Storage Profile define the OS LUN for ESXi. As all other disks are in JBOD mode nothing needs to be done other than confirm JBOD mode (which is default if you select a SAS HBA for the server). The Storage Profile consists of a Disk Group policy and a Local LUN definition.
The Disk Group policy I usually define is setting RAID Level to RAID 1 Mirrored and then flipping Disk Group Configuration to manual. I then defined disk number 253 and 254 as the constituents of the Disk Group as these are always the two disks on the M.2 HW RAID controller. Everything else I leave deafult.
With this Disk Group Policy in hand I create a Storage Profile and under the Local LUNs section I create a LUN. I normally call the LUN OS and set a Size of 32 GB. Auto Deploy is set and Expand To Available is checked and finally the Disk Group Policy is set.
I could set the 32 GB larger but given the Expand to Available is enabled it will automatically fill in the 240 GB RAID 1 volume or 960 if choosing the large boot drives.
For M4 I use a different method which will be mentioned in the Boot Policy section.
Next up to configure is networks. Here I have borrowed a bit of what Cisco HyperFlex does. Hyperflex is Cisco answer to vSAN and it works to some extent in a similar manner.
First thing to do is to allow for QoS to have the correct MTU settings so that I can utilize CoS Preserve on the upstream switches if need be. Below table shows the settings I use in my environments.
QoS System Class
Iuse Platinum for vSAN storage traffic, Gold for VM guest traffic, Silver for the ESXi management traffic and Bronze for vMotion interfaces. Note that both Bronze and Platinum allow MTU 9000 Jumbo frames to be used inside ESXi for optimum performance. Make sure the upstream switches from your Fabric Interconnects support MTU 9216.
I take these classes and create matching QoS Policies from. Simply use the same name and select the priority and I use all default settings otherwise. I need these policies when configuring vNICs.
I usually also create a Network Control Policy that allows CDP and LLDP both recieve and transmit, allow forged MAC and set the action to Link Down when an uplink fails. More on that later.
Before we start defining vNICs and LAN Connectivity Policies we need MAC addresses for the vNICs. UCS Manager allows you to define your own MAC addresses inside of the 00:25:B5 and then defining as much of the remaining as you want. You can easily just create a single pool and have UCS Manager assign MAC addresses from that pool but we borrow an idea from how Hyperflex designs their MAC pools.
What you do in Hyperflex is to select the 4th octet of the MAC as a prefix for a cluster e.g A1 so that start of each MAC is 00:25:B5:A1. That means you can identify a cluster in your network based on the 4th octet alone. Neat!
Next Hyperflex uses the 5th octet to define vNIC number and attached fabric. This means that vNIC1 will have A1 and vNIC2 will have B2. That means when setting it up you can match the 5th octet to a function. I use A1 and B2 for esxi management, vSAN on A3 and B4, guest traffic on A5 and B6, vMotion on A7 and B8 and any additional required NICs continue from there.
I create MAC pools to match a minimum of 8 vNics (2 mgmt, 2 vSAN, 2 guest and 2 vMotion). Then add 2 for NFS and 2 for virtual networking if needed.
With the MAC pools in hand and the policies from above I create a set of vNICs for ESXi. I prefer to have 2 for each function, one on fabric A and one on fabric B without fabric failover – ESXi can easily handle the failover and if I set it up like this ESXi can use both links from the server if need be and in case of a failure on one of the links I would rather see the vNICs go down and have ESXi handle the failover instead of it being transparent for ESXi.
Each vNic name is suffixed with the expected fabric so e.g. esxi-mgmt-a and esxi-mgmt-b. I set the “-a” as primary template in a Peer redundancy setup an “-b” to the secondary. This allows me to only update vlans and configuration on the “-a” vNIC and configuration will be in sync with the “-b” vNIC. The Template type is set to Updating to allow for adding things like additional vLANs to all servers using this vNIC without having to go through every profile. MTU needs to match the QoS policy selected and defined above. Select the matching MAC pool and set the Network Control Policy and done. Then repeat for each required vNIC.
I use the created vNICs to create a LAN Connectivity Policy which contains all the vNICs and setsthe adapter policy to VMWare (yes Cisco capitalizes it wrong 🙁 ). And that is it for networking for now. We will use the LAN Connectivity Policy when defining the Server Profile Template.
Server Profile Policies
I need a couple of Server Policies before we can create the Server Profile Template. First one I create is a Scrub Policy. This policy I generally make in the Root scope as I globally want scrub to be disabled for all types; Disk, BIOS, FlexFlash and Persistent Memory. I generally don’t want UCS to wipe settings unless specifically instructed to do so.
Next up is a Boot Policy. For M5 I define a Policy that uses Boot Mode UEFI and with Secure Boot enabled. Then I add a single boot option of type Local LUN using the LUN Name OS, which we defined in the Storage Profile previously.
If attempting to boot from an internal drive in an M4 as described in Part 1 some special options need to be set. Instead of using Local LUN select Embedded Disk and then modify the Uefi Boot Parameters option to set Boot Loader Name to “BOOTX64.EFI” and Boot Loader Path to “\EFI\BOOT\”. This is the only way I found to do UEFI secure boot on those drives.
I setup a Maintenance Policy for the Server Profile as well to set every action that might require reboots to “User Ack” which means that I need to manually approve any reboots of the host from profile changes. I also set the “On Next Boot” option to allow for easy firmware updating while updating ESXi. On Next Boot will apply any pending changes if the host reboots like when applying ESXi updates. Convenient!
Lastly I create a Host Firmware Package policy which sets the version of firmware to use in that cluster. As firmware packages can contain firmware for the SAS HBAs I want tight control as to which firmware is used. This also allows me to change the firmware level of the cluster in one step and then have pending changes for each host ready for when I’m ready to do the reboot to update firmware.
Server Profile Template
With all those profiles and things ready I can now create the template that each server will be instantiated from. This will be an updating template to allow for changes to be done consistently on all hosts and avoid configuration drift.
I usually just run through the wizard and select the policies created where applicable. As we don’t have any FC in our setup I usually don’t setup any vHBA’s. These can be added later given the Updating setting.
Only thing I do manually is to select the LAN Connectivity Policy to get the required vNICs for ESXi attached. Once added I complete the Wizard and go back into the network tab of the template to click “Modify vNIC/vHBA Placement”. I do this because the view to edit is easier to manage when access from there instead of in the wizard. I then manually place the vNICs in the order I want to force.
With all that there is now a profile template that can be used to produce identical ESXi hosts for vSAN usage. The profile even works on “compute only” nodes that don’t provide any storage to the system as long as they still use the M.2 HWRAID boot module. Very nice in my opinion.
Next up in part 3 I will go over some of my ESXi configurations that I prefer in the vSAN pods I run.
I have had parts of this post saved in draft for months without getting it finished because it was turning to a monster of a post if I tried covering it all. I finally found the drive to finish it when I realized that this was probably better if I split it in multiple posts instead of trying to include hardware considerations, UCS manager / standalone profile configurations and ESXi configurations into one single post.
So without further ado lets dive into the hardware part of VMware vSAN on Cisco UCS.
Please do note that these are my personal opinions and may or may not align with what you need in your datacenter solution. Most designs are individual for at specific use case and as such cannot be taken directly from here.
Cisco has a bunch of certified vSAN Ready nodes based on M4, M5 and M6 branches of servers. M3 isn’t supported as the hardware is both EOL and most of the controllers available for M3 models weren’t powerful enough for running vSAN workloads. The most common to use are Cisco’s C240 M5SX 2U models which allow for 24-26 drive bays total. For smaller deployments the C220 M5SX is also an excellent option with up 10 drives in 1U.
It is technically possible to run vSAN on other types of servers like the S3260 and B200 blades but they limit your options in terms of storage to compute ratio (S3260 being able to provide massive amounts of storage but little in compute and B200 being the opposite due to only having 2 disk slots).
One thing to note is if you plan on using NVMe storage options you need to focus on M5 and M6. M5 allows for up to 4 NVMe devices in U.2 format while M6 can support up to 24 NVMe devices. M4 only supports PCIe NVMe devices.
Cisco has traditionally been a network boot company and as such the primary local boot option on M3 and M4 is SD cards if you don’t want to waste disk slots on boot devices. On B200 M4 with only 2 disk slots SD card is currently the only option as the disk slots are needed for a caching and capacity disk. On all M5 and M6 models (B200 included) there is a new dedicated slot for inserting a UCS-M2-HWRAID controller which can fit 2 M.2 drives (either 240 or 960 GB) and can do actual RAID that ESXi supports. Do not use the UCS-MSTOR-M2 controller which fits the same slot and fits 2 M.2 as well but this only supports the onboard LSI-SW RAID from the Intel chipset and that is only supported by Windows and Linux and not ESXi. It is not that expensive – just by the HWRAID controller 🙂
Specifically on the C240 M4 if you choose a UCSC-PCI-1C-240M4 you can insert up to two drives internally in the server that are managed by the onboard controller. You won’t have RAID functionality but it beats SD card booting by miles!
My go to here is using M5 servers with a UCSC-MLOM-C40Q-03 (VIC 1387) in combination with 6300 series Fabric Interconnects. That provides 2x40G per server which pairs nicely if your upstream network is 40 or 100G. On M6 that would be UCSC-M-V100-04 (VIC 1477) that provides the same.
If you are using 6400 series Fabric Interconnects and a 25G infrastructure you might want to go with UCSC-MLOM-C25Q-04 (VIC 1457) on M5 and UCSC-M-V25-04 (VIC 1467) on M6 to give 4×10/25G connections instead. Depends on your infrastructure.
On M4 it is technically possible to use the UCSC-MLOM-C40Q-03 (VIC 1387) all though the UCSC-MLOM-CSC-02 (VIC 1227) adapter is way more common but only provides 2x10G connections. If you run a pure 10G infrastructure and continue to do so I recommend adding an additional UCSC-PCIE-CSC-02 (VIC 1225) to provide 2x10G. I see this combination primarily used with 6200 series Fabric Interconnects.
For blades the standard is UCSB-MLOM-40G-03 (VIC 1340) for M4 and UCSB-MLOM-40G-04 (VIC 1440) for M5 and M6. Both cards are 2x40G. These need to be paired with IOM’s in the blade chassis which can limit the speed of the vNICs presented. Usually you get 2x20G on IOM 2304 and 2208. Consult your Cisco vendor to confirm how to get optimum speeds for your setup.
Now the probably most crucial part of the any vSAN deployment – the controller. Albiet less important if you go for all-NVMe or even the new ESA option in vSAN 8 you need at SAS/SATA controller to handle your disks.
On C240 M4 this is usually UCSC-SAS12GHBA or UCSC-MRAID12G with a UCSC-MRAID12G-1GB cache module. Both are on the HCL but SAS HBA is prefferable over the RAID controller
On C220 and C240 M5 the only real options for vSAN are UCSC-SAS-M5 and UCSC-SAS-M5HD respectively. Primary difference is how many drives the controller is capable of utilizing which of course needs to be higher for the C240.
On the C240 M6 the option is CSC-SAS-M6T (UCSC-SAS-240M6) which allows for up to 16 disks but to be honest – if you are going for M6 nodes you should probably go for an M6N og M6SN for all NVMe configuration instead.
I won’t touch too much on this as various use cases and requirements need different numbers of disk groups and capacity devices. You use case may vary. We primarily use 3.8 TB Enterprise Value SATA SSD’s for capacity simply because they are fast enough and readily available to us. We aim to use NVMe caching devices if at all possible but if not we select a high endurance and performance SAS SSD for caching.
One note to have in mind. M4 only supports PCIe NVMe devices. On the C220 M5SX two front slots can be used for NVMe and on C220 M5SN all 10 slots can be NVMe. On the C240 M5SX slots 1 and 2 as well as 25 and 26 (on the rear) can be used for NVMe’s and on the C240M5SN bays 1-8 can be used for NVMe.
If you are retrofitting NVMe’s into existing C2x0 M5’s note that on the C220 M5 you need a CBL-NVME-220F to be able to use the front facing NVMe drives if not already present.
On the C240 M5 I recommend going for a UCSC-RIS-2C-240M5 which supports both 2xfront and 2xrear mounted NVMe’s if you remember to order a CBL-NVME-240SFF and UCSC-RNVME-240M5 to connect the front and rear slots respectively to the riser. This configuration allows you up to 4 NVMe caching devices while using SAS/SATA capacity drives up to 5 drives per group which can be a lot of disk and performance.
So those are the notes on hardware I have. I have not touched on CPU types and memory configurations at all as this is something that needs to match your workload. Somethings might need 3.0 Ghz base clock and no memory or loads of cores and memory. Pick something that matches the workload but I would recommend sticking to Xeon Gold CPU’s to get a good balance of performance and cores and selecting a configuration of 12 DIMMs for M5’s to get maximum memory bandwidth.
In the next article I’ll touch on the UCS Manager configurations that I use for vSAN.
It’s been too long since I could get around to blogging something relevant again. This year so far as just been sooo busy with continued migrations to vSAN from old HCI platforms, implementing network solutions and onboarding customers to our platform in general and lately making sure that we got the platforms onto vSphere 7 before 6.7 went EOL – I know late to join the game but given the many many issues with the earlier releases of 7.0 we opted to wait for 7.0 U3g for our most critical pods which ment late summer upgrades.
Now with that sorted we started having a bunch of fun problems – even as late adopters! I have had more VMware cases with GSS the last 4 weeks than almost the entire year. Primarily regarding vSAN itself and vSAN/Usage Meter problems.
Today I’m going to do a little write-up mostly for myself as I spent way too much time getting the correct info out of GSS regarding the calculation.
So the short story, we, as a VCPP partner, are required to upload our vSAN usage every month (the data is uploaded every hour) to calculate how much we need to pay. Pretty stanard solution for Cloud or Managed Service Providers. The gathering and upload of data is handled by an on-prem Usage Meter (UM) that collects data and uploads to vCloud Usage Insight (VUI). At the start of each month data is processesed and sent to VMware Cloud Provider Commerce Portal (VCP) for us to validate or adjust and then submit.
This month I was doing the validation part when I realized a lot of our usage had shifted around between the available license levels. I was confused – becasue with UM 184.108.40.206 and vSphere 7.0 U3h we were supported so data should be okay. My assumption was that the data moved from VUI to VCP was wrong but upon checking VUI I could see that data was wrong there as well. So now either our UM uploaded data incorrectly or VUI was processing incorrectly. My assumption was that VUI was at fault so opened a GSS case.
I will spare you the details of the case and it taking over a week to get to the bottom of but it was confirmed that there is a bug in 220.127.116.11 that is fixed in 4.6 – but not listed in release notes. Where if UM detects that a cluster is using a Shared Witness the uploaded data forgets to include the stretch cluster option causing. We aren’t using shared witness but inspection of the cluster-history.tsv file that can be downloaded from VUI confirmed that UM thought we were and we could make a direct connection between the time our vCenter was upgraded and the error starting to occur.
So that is a VMware error right? Their product is reporting incorrectly and thus data is processed incorrectly. Should be easy for them to fix? No. I was instructed to do the calculation manually and adjust numbers on the MBO in VCP.
I was linked the Product Detection Guide which states that the calculation should be:
average GB = (Sum of consumed storage capacity in GB per-hourly collections) / (hours in a month)
Okay – should be easy. And given the problem was feature detection and not actual consumption I could validate the calculation against the Monthly Usage Report by summing that usage up across all licenses types. Numbers should be the same – just differently split across license levels (Standard, Advanced or Enterprise).
So I imported the data into Excel and made a Pivot table that summed all collections of usage in MB per cluster and divided that number by 1024 to get GB and then again by 744 which is the hours in the month. Easy. Well no. That gave me a difference of 56TB of usage or close to 10%
Something was wrong with the calculation or the numbers in the report. GSS was vague for a while and at one point stating that the difference was caused by the calculation happening on bytes and not MB which could not really account for that amount of difference.
Finally a got the details from GSS or rather from the backend team supporting GSS. The calculation in the Product Detection Guide is an oversimplification of the actual calculation – it works because usually each measurement interval is 1 hour. but one of our pods had intervals of both 2, 3 or even up to 6 hours. The tsv file shows this.
So what is VMware actually doing? Well, as licensing is based on features used and hourly collections it is possible to change your license level up and down by the hour so calculation of usage is actually done for each collection interval and not across the entire month.
What is actually done is that each collection interval by first calculating a coefficient that is based on how long the interval is by taking the field in the tsv called “interval (Hours)” and dividing that by the hours of the month times 1024 like:
coefficient = "Interval (hours)" / (hours of month * 1024)
The 1024 is to convert the consumed storage from MB to GB and hours is of course not the same every month. Next the collected usage is measured against the vsanFInt field which defines which features are used – how to calculate that is detailed in the Product Detection Guide. This will place the usage in MB into either Enterprise, Advanced or Standard usage. The usage is then multiplied by the coefficient giving a GB usage per licens for the collection interval despite it’s length.
Finally you can just sum the usage after multiplying with the coefficient per license level to figure out how your usage is split for reporting.
Now that may be a mouthful to explain so I hope that if you need to do this you understand me otherwise please reach out and I’ll be happy to help. And all of this was simply a problem because of a bug in the vsanFInt UM was calculating.
Let’s start of with the easy stuff – my blogging has not been up to par this year. I have had way too little time to actually push any new content. This kind of bugs me a bit too much but the positive thing is it means I’ve been busy doing other stuff.
So what has happened in 2021? Coming into this year we had a major plan at work. Having been fed up with subpar performance of our existing HCI platform we had decided to purchase hardware to start converting all our old HCI platforms to vSAN. This would become one of the major tasks inside 2021.
I’d like to dive a bit further into this because of the magnitude (at least for me) of this task. Internally we have been running with 6 pods from one HCI vendor complemented by a few clusters using Netapp storage and some standalone nodes.
On top of this we implemented a simple 8-node stretched cluster on Cisco B200 M4 blades to run vSAN on. This was our first vSAN pod and it was built based on specs from the vSAN Ready Node configuration of B200 M4s but changing out some of the disk types with other supported models and more performant CPUs and more memory. This pod came to be based on a licensing optimization and would run only non-Windows based workloads.
We had an amazing experience with this pod which fueled our desire to switch the old HCI platform for vSAN as well. At the start of the year we had 8 2U nodes that were capable of being retrofitted for vSAN All-Flash. They were on the HCL and all components were as well. We actually only had to change a riser card to get additional NVMe slots as well as adding more NVMe caching devices.
Once we had this pod operational in a stretched cluster configuration (4+4) we started by emptying one of the existing HCI hybrid pods onto this new pod temporarily. Once emptied we could start by replacing the old 3.2 TB SAS SSD caching device and replace it with 2 1.6 TB NVMe devices instead. We could have reused the 3.2 TB SAS SSD and purchased an additional one but it was cheaper to replace it with the 2 NVMe drives instead. The hybrid pod had 12 8TB spinning disks in front so we needed a minimum of 2 disk groups to handle all the disks and with 2 NVMe slots in the back of the server the choice was easy.
We did performance testing on the new vSAN hybrid pod and my god it was fast compared to running the old HCI software. During the performance testing I managed to make several disk groups exit the cluster by running our performance workload for too long. I had a very good talk with VMware GSS about this and was recommended some changes to our test workload, primarily around duration, that would show a better picture. Our testing methodology is basically throw the worst kinds of workload we can at the pod and if performance is good enough we will have no issue running the workload we needed to put on the pod.
After migrating back the hybrid workload (and enjoying extra available capacity change to vSAN provided) we started migrating our most critical stretched workload to the new vSAN All-Flash pod. This process took forever. The primary thing was a thing I had not noticed before because it is usually not a problem. Our new vSAN All-Flash pod had been put into Skylake EVC mode because it was running 6200 series Xeon’s and would be supplemented with some 6100 series at a later point. Skylake being highest common denominator. However the old pod that we were migrating from was running on 6100 series Xeon’s without EVC mode enabled. One would think that Skylake native and Skylake EVC would be the same – but no, not the case as shown in KB76155.
This meant that about half of the 400 machines that needed to be moved would need to either be moved powered off (tough sale with the customers) or have a short maintenance window to update hardware version to 14 or 15 and then enable Per-VM EVC mode. Most of our customers were a breeze with minor service impacts to do this but one customer in particular was a bit rough which dragged the process on across the fall of this year.
But we finally managed to empty the old pod and power it of. Our next step was to reconfigure the released hardware to a vSAN certified configuration. We then proceeded to install it as a new vSAN pod and it became ready for production just 2 weeks ago. We’ll utilize this new pod to empty the next of our old HCI platforms so we can liberate the hardware from that pod for even more conversions. The process is simple but it does take time.
I have one outstanding issue that I need to solve in the new year. Some the older systems are Cisco C240 M4SX nodes. These only have internal SD boot as well as 24 drive slots in the front hooked up to a single RAID controller via 2 SAS expanders. With VMware deprecating SD/USB boot in the close future (KB85685) and vSAN not allowing non-vSAN disks on the same controller as vSAN disks we need to figure out how to boot these servers – if anyone has a solution I’m all ears! I could do some sort of iSCSI boot but I’d prefer not to!
On top of these conversions we also needed to manage all our normal operations as well as another major project that was started up in the late spring early summer. We needed to replace our vRA 7.6 install with VMware Cloud Director.
With vCD not really dying as was foretold years ago and vRA carrying a cost that vCD isn’t in our Cloud Provider licensing coupled with some usability issues with vRA from our customers we set out to test vCD in the summer and look through all the pain points of vRA to see how that compared in vCD.
Result was that we decided to roll out vCD in the fall and started the process of setting up a 10.3 production environment. We had done tests on 10.2.2 and upgraded the test to 10.3 before rolling the production environment out but yet we found good surprises!
First many machines were very easy to get imported but suddenly I had an issue where I could not import and move VMs into a vApp. I did some testing and found that if I created a new vApp I could move into that vApp. After a lot of debugging with our vTAM and GSS we found that one of our clients had deleted 2 VMs via vRA AFTER they had been imported into vCD and into that vApp. That stuck those two VMs in Partially Powered Off and blocked additional imports into the vApp.
We figured out with the help of GSS that we could run the following commands to be allowed to delete the VMs (you cannot delete a Partially Powered Off VM):
$vm = Get-CIVM <VMNAME>
This allowed us to continue only to find the next bug. We found that some VMs would not be allowed to be moved into a vApp after Auto-import. They failed with an error about not being allowed to change bus or unit numbers while powered on – but why would it need to change those?
Turns out a bug was introduced in 10.3 (we didn’t see it in 10.2.2 at least) where VMs that had disks that weren’t in sequential unit numbers on the controllers would be forced to try to “correct” that. A unneeded operation. We opened a GSS case on it and managed to get a response that 10.3.1 fixed the issue – which it fortunately did, but it was an undocumented fix.
We have by December 1st powered down our old vRA platform and replacement with vCD has been completed. A few special machines still remain to be imported but we are 99% there which is a great feeling to end the year with.
Next year will be more vSAN conversions (we have a few Citrix pods and some disaster recovery pods to convert) as well as more vCD. We might have some NSX-T in the future as well which will likely challenge my networking skills a lot. We have been doing ACI networking for the last 4 years and I am finally at a point where I feel comfortable with the basic configurations of that platform but NSX-T just looks to have features that are easier to use by the looks of it.
This year was also the year I got my first VMware certification – VCP-DCV2021 in January. I also managed to get the vSAN Specialist badge in July making it a very good certification year for me.
Now that was a very long blog post and I hope you bear with me along it all. I have really had a lot of VMware under my nails this year but also mountains of networking and server operations. Hope I can have more time to dive into solutions in the new year.
Happy Christmas everyone and a good new year to you all!
Back in may of last year I was tripping to get my hands on WSL2 with the new backend and improved performance. I wrote a few blogposts about it and even wrote my, to date, most viewed and commented post about it (WSL2 issues – and how to fix some of them).
Now the issue that hurt me the most at first was Workstation 15.5 was not able to run with WSL2 installed as this enabled the Hyper-V features of Windows 10 which collide with Workstation.
The day after WSL2 released VMware pushed 15.5.5 which allowed Workstation to run even with Hyper-V enabled but at greatly reduced performance – just Google it and be amazed.
It does not really come as a surprise as having Workstation (A virtualization engine) run on top of Hyper-V (also a virtualization engine) on top of hardware is not a recipe for performance!
As a result I have not been using my Windows 10 VM that much the last many months – until now!
I got my hands on a Workstation 16 Pro license and went in for an upgrade to see if any of the improvements in 16.1 would alleviate some of my performance issues. And after completing the install which prompted me to enable the Windows Hypervisor Platform I spun up my Windows 10 machine from suspend. I quickly got a popup noting me that I had “side channel mitigations” enabled as show below here:
Now from working with vSphere I realize that many of the side channel mitigations can have heavy impact on performance so I updated my Windows 10 OS and shut it down and followed KB79832 as linked in the popup to disable the mitigations.
I powered on my VM again and could immediately feel the difference. I may not have the exact same performance I had with 15.5 on an non-Hyper-V enabled host but it is a LOT better than it was. Major problem now seems to be that fact that my tiny i7-7600U dual core CPU can’t keep up! Dear Dell when are you rolling out some Latitude’s with Ryzen 7 5800U’s??
As part of our ongoing engagement with VMware we are required to operate vCloud Usage Meter to measure rental license usage for reporting back to VMware. We have been running an older build for a long time now waiting for the 4.3 release to come out because this new release could correctly measure vRealize Automation usage based on the Flex bundle Addon model rather than per OSI.
I got the appliance deployed just before the holidays but ran into several issues that I’d like to share with you.
First issue I ran into actually prompted me to redeploy because the migration of configuration from the old appliance ended in a bad state. It was caused by two things 1) I was missing a Conditional Forwarder for a domain on the DNS servers on the new appliance was using and 2) systemd-resolved is a nightmare to work with!
It like to focus in on the systemd-resolved. I really don’t like this piece of software as it is insanely frustrating to troubleshoot on. What it basically does is set the /etc/resolv.conf server to a local address on the server (127.0.0.53) and on that IP a daemon is listening for requests. If it can answer the request it does otherwise it passes the request onwards as normal.
But – and this is the crucial part – it handles “.local” domains a bit different. What it actually does I cannot answer completely but .local is being used by some services like Bonjour and mDNS. This is crucial as if you do not explicitly state that a .local domain needs to be resolved via actual DNS systemd-resolved won’t do it.
To jump a bit – the new Usage Meter 4.3 appliance runs on Photon OS which uses systemd. The older appliances use SLES which doesn’t and thus don’t have the issues. I had to do a lot of tinkering to get this working but managed by following this article: https://github.com/vmware/photon/issues/987 and making sure that both my required .local domains were present in the search path parameter and that the DNS servers were explicitly inserted into the 10-eth0.network config file.
I had to do both things otherwise it did not work. Search path can be configured correctly on deploy if you remember it. The DNS settings must be done after deployment but before running the migration script. Double check DNS resolution before attempting migration – it’ll save you headaches!
The appliance has been deployed and config migrated which prompted me with to errors – that old 5.5 vCenter that hadn’t been fixed yet and a currently unknown bug in registering a vRealize Automation 7.6 install – VMware support are investigating!
Ohh it has been a while again since the last time I got to writing. Being busy with maintenance work is not really something that makes for great blog articles.
But last week I got to attend VMworld 2020! This year due to the situation world wide it was a virtual setting so for me it was two days in the home office watching a lot of great content on Kubernetes, NSX, vSAN and much more.
So many great things we announced. But the thing that struck me first was the acquisition of SaltStack. This is a major move to actually incorporate a configuration management system into the VMware portfolio and will certainly strengthen vRealize Automation in the future and hopefully also other parts of the ecosystem!
Another very huge announcement was Project Monterey. Although I’m still trying to wrap my head around the use cases and oppertunities this presents I do like the idea very much! Being able to offload vSAN and NFV workloads to the a SmartNIC is a great idea and I hope to see it evolve in the future.
This week also saw some the GA release of several new versions of the core products from VMware. These were announced previously but I was not aware that they would be releasing so soon – but that is just the cherry on top!
First up is the release of vSphere 7 U1! Biggest new feature has got to be the ability to run vSphere with Tanzu as well as new scalability maximums for VMs.
Along with vSphere 7 U1 there is of course also a vSAN 7 U1 release! Here features like HCI mesh allowing you to share the vsanDatastore natively between vSAN pods is one of my top features. Improvements to the fileservices of vSAN also landed as well as the option to only run compression on vSAN and not both compression and deduplication. Great features! For those running 2-node clusters or stretched clusters requiring witness a huge improvement has also landed allowing a witness server to be shared by up to 64 clusters! Very nice!
Another feature also seems to have crept in as detailed by John Nicholson. It is the option to run the iSCSI feature on stretched clusters. Again a very nice feature to have included for those needing it.
Last bit of GA material that I wanted to comment on aswell is the release of vRealize Automation 8.2. There are much needed improvements to the multi-tenancy of vRA as well as improvements to Infrastructure-as-code workflows and Kubernetes.
It can be a daunting task to keep up with all the releases from VMware but their ability to push new releases and features never ceases to amaze me!
As I work for a Cisco Partner at the moment I have been looking to get access to the Cisco PSS APIs specifically to get coverage status on a Cisco device serial number.
If you have a Cisco account you can access the Device Coverage Checker online and check up to 20 serial numbers at a time. I have used this extensively. The same information can also be viewed if you have access to Intersight.
But I am looking to integrate with our DCIM tool Netbox to allow for easy check of coverage via API calls. Those API calls are for us available via the PSS API call to the endpoint SN2INFOv2.
Now of course this requires some sort of authentication and Cisco has an intricate process for getting access which boils down to creating a TAC case an request access.
Once you have access you need to create an application and grant that application access to the SN2INFOv2 APIs with “Client Credential” privileges. This generates a Key and a Client Secret unique to the application which is needed to get access.
Now here’s the problem. The Cisco API Developer has great documentation on the SN2INFOv2 API and how to format the request – but those need a Token to be accessed. The token needs to be generated first which was not immediately clear how to do.
I deciphered that I needed to do a OAUTH2 login agains cloudsso.cisco.com but could not find the documentation on how to format the request. I searched around to figure out how and found reference to a different API that showed an example on how to do this.
Problem was it refenced a “Client ID” which I did not seem to have. So I guessed a bit and assumed that “Client ID” must be the “Key” I had as the login required “Client ID” and “Client Secret” and I had “Key” and “Client Secret”.
So formatted the GET request but got a 405 Method not allowed. Now I was a bit lost. But searching a bit more I fell upon a dodgy PHP developer forum which I will not link to. But here was an example of a cURL request that showed me an approach. The request looked like this: