It’s been too long since I could get around to blogging something relevant again. This year so far has just been so busy with continued migrations to vSAN from old HCI platforms, implementing network solutions, onboarding customers to our platform in general, and lately making sure that we got the platforms onto vSphere 7 before 6.7 went EOL. I know, late to join the game, but given the many, many issues with the earlier releases of 7.0 we opted to wait for 7.0 U3g for our most critical pods, which meant late summer upgrades.
Now with that sorted we started having a bunch of fun problems, even as late adopters! I have had more VMware cases with GSS in the last 4 weeks than in almost the entire rest of the year, primarily regarding vSAN itself and vSAN/Usage Meter problems.
Today I’m going to do a little write-up mostly for myself as I spent way too much time getting the correct info out of GSS regarding the calculation.
So the short story: we, as a VCPP partner, are required to report our vSAN usage every month (the data is uploaded every hour) to calculate how much we need to pay. Pretty standard solution for Cloud or Managed Service Providers. The gathering and upload of data is handled by an on-prem Usage Meter (UM) that collects data and uploads it to vCloud Usage Insight (VUI). At the start of each month the data is processed and sent to VMware Cloud Provider Commerce Portal (VCP) for us to validate or adjust and then submit.
This month I was doing the validation part when I realized a lot of our usage had shifted around between the available license levels. I was confused, because with UM 22.214.171.124 and vSphere 7.0 U3h we were supported, so the data should be okay. My first assumption was that the data moved from VUI to VCP was wrong, but upon checking VUI I could see that the data was wrong there as well. So now either our UM uploaded data incorrectly or VUI was processing it incorrectly. My assumption was that VUI was at fault, so I opened a GSS case.
I will spare you the details of the case and the fact that it took over a week to get to the bottom of it, but it was confirmed that there is a bug in 126.96.36.199 that is fixed in 4.6 but not listed in the release notes: if UM detects that a cluster is using a Shared Witness, the uploaded data forgets to include the stretch cluster option, causing the usage to be placed on the wrong license level. We aren’t using a shared witness, but inspection of the cluster-history.tsv file that can be downloaded from VUI confirmed that UM thought we were, and we could make a direct connection between the time our vCenter was upgraded and the error starting to occur.
So that is a VMware error right? Their product is reporting incorrectly and thus data is processed incorrectly. Should be easy for them to fix? No. I was instructed to do the calculation manually and adjust numbers on the MBO in VCP.
I was linked the Product Detection Guide which states that the calculation should be:
average GB = (Sum of consumed storage capacity in GB per-hourly collections) / (hours in a month)
Okay, should be easy. And given the problem was feature detection and not actual consumption, I could validate the calculation against the Monthly Usage Report by summing that usage up across all license types. The numbers should be the same, just split differently across license levels (Standard, Advanced or Enterprise).
So I imported the data into Excel and made a Pivot table that summed all collections of usage in MB per cluster, divided that number by 1024 to get GB and then again by 744, which is the number of hours in the month. Easy. Well, no. That gave me a difference of 56 TB of usage, or close to 10%.
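For reference, here is a rough Python sketch of what that Pivot table calculation amounts to. The sample values are made up for illustration and the function name is my own, not anything from Usage Meter.

```python
HOURS_IN_MONTH = 744  # a 31-day month
MB_PER_GB = 1024


def naive_average_gb(samples_mb, hours_in_month=HOURS_IN_MONTH):
    """Sum every collected sample (in MB), convert to GB, divide by hours in the month."""
    return sum(samples_mb) / MB_PER_GB / hours_in_month


# Made-up cluster: a constant 2,048,000 MB consumed, one collection per hour all month.
samples = [2_048_000] * HOURS_IN_MONTH
print(naive_average_gb(samples))  # 2000.0
```

Note that this only counts rows. It silently assumes that every row in the file covers exactly one hour.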
Something was wrong with the calculation or the numbers in the report. GSS was vague for a while and at one point stated that the difference was caused by the calculation happening on bytes and not MB, which could not really account for that amount of difference.
Finally I got the details from GSS, or rather from the backend team supporting GSS. The calculation in the Product Detection Guide is an oversimplification of the actual calculation. It works because usually each measurement interval is 1 hour, but one of our pods had intervals of 2, 3 or even up to 6 hours. The tsv file shows this.
So what is VMware actually doing? Well, as licensing is based on features used and hourly collections, it is possible to change your license level up and down by the hour, so the calculation of usage is actually done for each collection interval and not across the entire month.
What is actually done is that each collection interval is handled on its own, first by calculating a coefficient based on how long the interval is: take the field in the tsv called “interval (Hours)” and divide it by the hours of the month times 1024, like this:
coefficient = "Interval (hours)" / (hours of month * 1024)
The 1024 is to convert the consumed storage from MB to GB, and the number of hours is of course not the same every month. Next the collected usage is measured against the vsanFInt field, which defines which features are used. How to calculate that is detailed in the Product Detection Guide. This will place the usage in MB into either Enterprise, Advanced or Standard usage. The usage is then multiplied by the coefficient, giving a GB usage per license for the collection interval regardless of its length.
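To see why the interval length matters, here is a small Python sketch of the coefficient next to the simple row-counting version. The rows are invented (a constant 1,024,000 MB with one 6-hour gap in collection), and the function names are mine.

```python
MB_PER_GB = 1024


def coefficient(interval_hours, hours_in_month):
    """coefficient = "Interval (hours)" / (hours of month * 1024)"""
    return interval_hours / (hours_in_month * MB_PER_GB)


def weighted_gb(rows, hours_in_month):
    """rows: (interval_hours, consumed_mb) pairs; each row is weighted by its length."""
    return sum(mb * coefficient(hours, hours_in_month) for hours, mb in rows)


def naive_gb(rows, hours_in_month):
    """The simplified formula: every row counts as one hour."""
    return sum(mb for _, mb in rows) / MB_PER_GB / hours_in_month


# Invented data: 738 one-hour intervals plus a single 6-hour interval = 744 hours.
rows = [(1, 1_024_000)] * 738 + [(6, 1_024_000)]
print(round(weighted_gb(rows, 744), 2))  # about 1000 GB
print(round(naive_gb(rows, 744), 2))     # about 993.28 GB, the long interval is undercounted
```

When every interval is exactly 1 hour the two functions agree, which is why the simplified formula normally works.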
Finally you can just sum the usage after multiplying with the coefficient per license level to figure out how your usage is split for reporting.
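Put together, the whole thing could look something like the Python sketch below. Treat it as an illustration under assumptions and not as VMware's implementation: the column name "usedMb" is a placeholder I made up for the consumed storage column, the "Interval (hours)" header is taken from the formula above, and the vsanFInt to license mapping here is a toy with invented values. The real mapping has to come from the Product Detection Guide.

```python
import csv
import io
from collections import defaultdict

MB_PER_GB = 1024


def usage_per_license(tsv_file, hours_in_month, classify,
                      interval_col="Interval (hours)",
                      used_col="usedMb",        # ASSUMED column name, check your tsv
                      feature_col="vsanFInt"):
    """Sum GB usage per license level, weighting each row by its interval length."""
    totals = defaultdict(float)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        coeff = float(row[interval_col]) / (hours_in_month * MB_PER_GB)
        level = classify(int(row[feature_col]))
        totals[level] += float(row[used_col]) * coeff
    return dict(totals)


# Toy mapping with INVENTED vsanFInt values, only here to make the example run.
TOY_LEVELS = {0: "Standard", 1: "Advanced", 2: "Enterprise"}

# Tiny fake file; the "month" is shortened to 4 hours so the numbers are easy to check.
sample = (
    "Interval (hours)\tusedMb\tvsanFInt\n"
    "1\t1024000\t0\n"
    "2\t1024000\t2\n"
    "1\t1024000\t2\n"
)
result = usage_per_license(io.StringIO(sample), hours_in_month=4,
                           classify=lambda f: TOY_LEVELS[f])
print(result)  # {'Standard': 250.0, 'Enterprise': 750.0}
```

For a real file you would open the downloaded tsv instead of the `io.StringIO` sample, pass the actual hours of the month (744 for a 31-day month) and swap in a `classify` function built from the guide.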
Now that may be a mouthful to explain, so I hope that if you need to do this you understand me. Otherwise please reach out and I’ll be happy to help. And all of this was simply a problem because of a bug in the vsanFInt that UM was calculating.