Updating ESXi 6.0 with depot fails with Errno 32 – broken pipe

This fall I had a task to upgrade some old ESXi 6.0 hosts in a tightly controlled environment without internet access for vCenter and other conveniences. So I resorted to doing the old classic:

esxcli software vib install -d /vmfs/volumes/datastore/esxi-depot.zip

This worked at the first site with no issues and the updates completed quickly. However, at the second site the first of the three hosts updated without problems, but the second failed with the error:

[Errno 32] Broken pipe vibs = VMware_locker_tools-light_6.0.0-3.76.6856897.vib

This got me a bit confused, so I started looking into the filesystem of the ESXi host and discovered that the symlink /productLocker was pointing to a folder shown in red, which usually means a broken link.

The productLocker contains the VMware Tools files.

/productLocker is a symlink to a folder inside /locker, which in turn is a symlink to /store, which points to a partition on the boot volume. I tried changing into /store and doing an ls, which corrupted my terminal output.

Turns out the filesystem on the /store partition was corrupted. I checked the two bootbank partitions, which were okay, and then realized that this host and its partner were both booting from SD cards – which I hate working with!

As these were old 6.0 hosts, support was out of the question, so I started looking around the internet for a possible fix that didn't involve reinstalling, and to my luck I found a blog post from VirtualHackey from 2020 which detailed almost exactly the same situation I was seeing.

The fix was simple – copy the actual content off the corrupted filesystem, format the partition and copy the content back. He even links to another article describing how to locate the content. Unfortunately I could not find the files inside the filesystem.

So I had two options: format the partition and hope I could upgrade without the packages present, or reinstall. That is, until I realized that I had another site with identical hosts on the same ESXi 6.0 build. I checked those and found their /store partitions in perfect shape.

So I reached out to my TAM to ask if the method of formatting the partition and copying the content from a working host was viable. This was of course best-effort support, but I wanted to see if the procedure had perhaps been used in an old support case.

I got a response that it ought to work – otherwise a reinstall would be necessary.

So – I tried it and it worked like a charm. I completed the upgrades without further hitches.

So shout out to VirtualHackey for providing the method to fix this problem – much appreciated!

The Future of Cloud Management

Today VMware is releasing the next installment in their Multi-Cloud Briefing series, named “The Future of Cloud Management”, which covers new additions to VMware’s vision of multi-cloud management and new features added to VMware Aria.

If you haven’t seen any of the previous Multi-Cloud Briefings I recommend checking out the YouTube channel where all of them can be watched.

VMware Aria

I have some experience with much of the VMware Aria portfolio from the older on-prem solutions, but you may, like I did, not know what VMware Aria actually covers.

VMware Aria is VMware’s portfolio of tools for multi-cloud management (formerly known as vRealize Cloud Management) which was announced in August – check out this blog post for more information.

Many of the products and tools in Aria are rebrandings of existing products, with new features built on top of that foundation.

First off, you may have used products like vRealize Automation, Operations, Log Insight or Network Insight before. All of these products are now part of Aria and have been rebranded to Aria Automation, Aria Operations, Aria Operations for Logs and Aria Operations for Networks respectively. Aria also includes Skyline, CloudHealth and CloudHealth Secure State, the latter two rebranded to Aria Cost powered by CloudHealth and Aria Automation for Secure Clouds.

These tools have been integrated into a new product named Aria Hub (formerly Project Ensemble). Underneath Aria Hub sits Aria Graph which is the data source that powers the new features of the Aria portfolio.

Data from the above tools and solutions, along with data pulled from the cloud providers you choose, like Azure, AWS, VMC on AWS etc., is then collected into an inventory in Aria Graph (a cool detail is that the data is not duplicated but rather referenced via pointers).

From Aria Graph, Aria Hub is then able to show you your cost, usage, problems, performance and possible security compliance issues based on the data across all cloud endpoints.

Via the Aria Hub UI you can look at different perspectives based on whether you are a business manager, application owner or operations engineer. You can customize the home page for your different groups of people so that they get the info they need, be it cost, performance or security.

You can select your SDDCs or your applications and drill down into the elements that make them up, like VM instances, Kubernetes pods, networks and storage, and look at consumption, performance, cost etc. for the entire SDDC or application, all the way down the stack to the individual components it is made of.

Aria Guardrails and Business Insights

VMware Aria is not just a rebranding of existing products but also introduces new features built on top of those products, like Guardrails and Business Insights.

Guardrails allows you to set up automatable policies for things like security, cost and performance that can be enforced on the different applications and SDDCs attached to your Aria Hub. Being powered by an “everything-as-code” approach, Guardrails includes a library of policy templates that can be imported and customized to your environments and allows for automatic remediation – for example making sure all your Kubernetes clusters are attached to Tanzu Mission Control so that security policies inside TMC can be enforced and monitored.

Business Insights integrates with Guardrails and the other products inside Aria Graph to allow for AI/ML powered analytics to inform you of compliance issues and optimizations that can be useful – all available via Aria Hub and tailored for the specific class of user logged in.

App Migrations

One of the very exciting new features coming is Aria Migrations, which will assist you in analyzing and planning migrations of applications running in your on-prem infrastructure to VMC. Currently this is the only migration type supported, but more types will be added in the future.

Via Aria Hub you can plan a migration by selecting the subset of resources you want to migrate, and Migrations will then assist you in identifying any dependencies outside your scope that might impact performance or security once migrated. You can then add these dependencies to the scope if needed.

You can then compare the expected TCO of keeping the application on-prem versus in the cloud and make decisions based on this – all powered by Aria Cost.

If you want to perform the migration, you can continue to the planning stage, where App Migrations will assist you in defining bundles of workloads to be moved and in what order.

After splitting up the migration each bundle of workloads can be planned across multiple migration steps.

All of the migration is powered by HCX and CodeStream, allowing for testing of the steps, rerunning failed steps and monitoring the process.

But how does Aria know which entities are linked to each other, you may ask? Well, there are multiple ways, like using flow information from Aria Operations for Networks to see who is communicating with whom, or looking at what deployment the entity is part of in Aria Automation.

When a link is detected, users will be presented with the discovered entity and its link to an application and can then either confirm the link manually or let it be accepted automatically. Very neat!

All about the APIs

All of this sounds excellent, but how does it fit into your existing business? Well, here's the cool part. With Aria Graph at the base of everything you get a full GraphQL-based API. Everything you can do in the UI can be done via GraphQL against Aria Graph, so if you can write a piece of code that can read or write to a GraphQL-based data source you can integrate it with your existing tooling like ServiceNow.
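Just to illustrate how little that takes, here is a minimal PowerShell sketch of posting a GraphQL query to an endpoint – the URL, token and query fields below are made-up placeholders for illustration, not the actual Aria Graph endpoint or schema:

$token = '<api-token>'                        # placeholder credential
$query = 'query { entities { name type } }'   # hypothetical query, not the real Aria Graph schema
$body  = @{ query = $query } | ConvertTo-Json

Invoke-RestMethod -Method Post -Uri 'https://aria.example.com/graphql' `
    -Headers @{ Authorization = "Bearer $token" } `
    -ContentType 'application/json' -Body $body

The same pattern works from anything that can issue an HTTP POST, which is the whole point of the API-first design.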

The API-first approach of having the API power the UI is not new, but it is very nice to see VMware take it so seriously with Aria Hub and Aria Graph.

Final words and the freebie

If this sounds interesting to you I highly recommend heading over to the Aria Hub landing page, signing up for the free trial and trying out the product.

Personally I am very excited to see where this is going – being an on-prem data center operations engineer, this is a big world for me to step into and a lot of information to digest.

One note for the security-minded: VMware Aria Hub is currently a SaaS offering, which means your data will be located in the cloud. This might prohibit you from using the offering. That is of course a shame, but VMware has indicated that they might, on a one-year-plus timeframe, look into an on-prem version of the product.

Manually calculating vSAN Usage for Cloud Providers

It’s been too long since I could get around to blogging something relevant again. This year so far has just been sooo busy: continued migrations to vSAN from old HCI platforms, implementing network solutions, onboarding customers to our platform in general and, lately, making sure that we got the platforms onto vSphere 7 before 6.7 went EOL. I know, late to join the game, but given the many, many issues with the earlier releases of 7.0 we opted to wait for 7.0 U3g for our most critical pods, which meant late-summer upgrades.

Now with that sorted we started having a bunch of fun problems – even as late adopters! I have had more VMware cases with GSS in the last 4 weeks than in almost the entire year, primarily regarding vSAN itself and vSAN/Usage Meter problems.

Today I’m going to do a little write-up mostly for myself as I spent way too much time getting the correct info out of GSS regarding the calculation.

So, the short story: we, as a VCPP partner, are required to report our vSAN usage every month (the data is uploaded every hour) to calculate how much we need to pay. A pretty standard setup for cloud or managed service providers. The gathering and uploading of data is handled by an on-prem Usage Meter (UM) that collects data and uploads it to vCloud Usage Insight (VUI). At the start of each month the data is processed and sent to the VMware Cloud Provider Commerce Portal (VCP) for us to validate or adjust and then submit.

This month I was doing the validation part when I realized a lot of our usage had shifted around between the available license levels. I was confused – because with UM 4.5.0.1 and vSphere 7.0 U3h we were supported, so the data should have been okay. My assumption was that the data moved from VUI to VCP was wrong, but upon checking VUI I could see that the data was wrong there as well. So either our UM had uploaded incorrect data or VUI was processing it incorrectly. My assumption was that VUI was at fault, so I opened a GSS case.

I will spare you the details of the case, and of it taking over a week to get to the bottom of, but it was confirmed that there is a bug in 4.5.0.1 that is fixed in 4.6 – though not listed in the release notes: if UM detects that a cluster is using a shared witness, the uploaded data omits the stretched cluster option, causing usage to be classified incorrectly. We aren't using a shared witness, but inspection of the cluster-history.tsv file that can be downloaded from VUI confirmed that UM thought we were, and we could draw a direct connection between the time our vCenter was upgraded and the error starting to occur.

So that is a VMware error right? Their product is reporting incorrectly and thus data is processed incorrectly. Should be easy for them to fix? No. I was instructed to do the calculation manually and adjust numbers on the MBO in VCP.

I was linked the Product Detection Guide which states that the calculation should be:

average GB = (Sum of consumed storage capacity in GB per-hourly collections) / (hours in a month)

Okay – should be easy. And given the problem was feature detection and not actual consumption, I could validate the calculation against the Monthly Usage Report by summing the usage across all license types. The numbers should be the same – just split differently across license levels (Standard, Advanced or Enterprise).

So I imported the data into Excel and made a pivot table that summed all collections of usage in MB per cluster, divided that number by 1024 to get GB and then again by 744, the hours in the month. Easy. Well, no. That gave me a difference of 56 TB of usage, or close to 10%.

Something was wrong with either the calculation or the numbers in the report. GSS was vague for a while and at one point stated that the difference was caused by the calculation happening on bytes and not MB, which could not really account for that amount of difference.

Finally I got the details from GSS, or rather from the backend team supporting GSS. The calculation in the Product Detection Guide is an oversimplification of the actual calculation – it works because each measurement interval is usually 1 hour, but one of our pods had intervals of 2, 3 or even up to 6 hours. The tsv file shows this.

So what is VMware actually doing? Well, as licensing is based on features used and hourly collections it is possible to change your license level up and down by the hour so calculation of usage is actually done for each collection interval and not across the entire month.

What actually happens is that each collection interval is processed on its own, starting with a coefficient based on how long the interval is: the field in the tsv called “Interval (hours)” divided by the hours of the month times 1024, like so:

coefficient = "Interval (hours)" / (hours of month * 1024)

The 1024 converts the consumed storage from MB to GB, and the hours are of course not the same every month. Next, the collected usage is measured against the vsanFInt field, which defines which features are used – how to decode that is detailed in the Product Detection Guide. This places the usage in MB into either Enterprise, Advanced or Standard usage. The usage is then multiplied by the coefficient, giving a GB usage per license level for the collection interval regardless of its length.

Finally you sum the coefficient-multiplied usage per license level to figure out how your usage is split for reporting.
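If it helps, here is a rough PowerShell sketch of that per-interval calculation against the cluster-history.tsv. The column names used below ("Interval (hours)", "Consumed Storage (MB)", "vsanFInt") are assumptions – check the headers in your own export – and Get-VsanLicenseLevel is a hypothetical helper standing in for the vsanFInt decoding described in the Product Detection Guide:

$hoursInMonth = 744                                    # e.g. a 31-day month
$rows = Import-Csv -Path .\cluster-history.tsv -Delimiter "`t"

$usage = foreach ($row in $rows) {
    # the coefficient turns this interval's MB into its share of the monthly GB average
    $coefficient = [double]$row.'Interval (hours)' / ($hoursInMonth * 1024)
    [pscustomobject]@{
        # map vsanFInt to Standard/Advanced/Enterprise per the Product Detection Guide
        License = Get-VsanLicenseLevel $row.vsanFInt   # hypothetical helper, not a real cmdlet
        GB      = [double]$row.'Consumed Storage (MB)' * $coefficient
    }
}

$usage | Group-Object License | ForEach-Object {
    '{0}: {1:N1} GB' -f $_.Name, ($_.Group | Measure-Object GB -Sum).Sum
}

Doing it per interval like this is what makes the multi-hour collection intervals come out right.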

Now, that may be a mouthful to explain, so if you ever need to do this and my explanation doesn't make sense, please reach out and I'll be happy to help. And all of this was only a problem because of a bug in the vsanFInt that UM was calculating.

2021 in Retrospect

Let’s start off with the easy stuff – my blogging has not been up to par this year. I have had way too little time to actually push any new content. This bugs me a bit too much, but the positive side is that it means I’ve been busy doing other stuff.

So what has happened in 2021? Coming into this year we had a major plan at work. Having been fed up with the subpar performance of our existing HCI platform, we had decided to purchase hardware to start converting all our old HCI platforms to vSAN. This would become one of the major tasks of 2021.

I’d like to dive a bit further into this because of the magnitude (at least for me) of this task. Internally we have been running with 6 pods from one HCI vendor complemented by a few clusters using Netapp storage and some standalone nodes.

On top of this we implemented a simple 8-node stretched cluster on Cisco B200 M4 blades to run vSAN on. This was our first vSAN pod, built based on the specs from the vSAN Ready Node configuration for B200 M4s, but with some of the disk types swapped for other supported models, more performant CPUs and more memory. This pod came to be because of a licensing optimization and would run only non-Windows workloads.

We had an amazing experience with this pod, which fueled our desire to switch the old HCI platform to vSAN as well. At the start of the year we had 8 2U nodes that were capable of being retrofitted for vSAN All-Flash. They were on the HCL and all their components were as well. We actually only had to change a riser card to get additional NVMe slots and add more NVMe caching devices.

Once we had this pod operational in a stretched cluster configuration (4+4), we started by emptying one of the existing HCI hybrid pods onto the new pod temporarily. Once emptied, we could replace the old 3.2 TB SAS SSD caching device with 2 x 1.6 TB NVMe devices instead. We could have reused the 3.2 TB SAS SSD and purchased an additional one, but it was cheaper to replace it with the 2 NVMe drives. The hybrid pod had 12 x 8 TB spinning disks in front, so we needed a minimum of 2 disk groups to handle all the disks, and with 2 NVMe slots in the back of the server the choice was easy.

We did performance testing on the new vSAN hybrid pod, and my god it was fast compared to running the old HCI software. During the performance testing I managed to make several disk groups exit the cluster by running our performance workload for too long. I had a very good talk with VMware GSS about this and was recommended some changes to our test workload, primarily around duration, that would give a better picture. Our testing methodology is basically to throw the worst kinds of workload we can at the pod; if performance is good enough, we will have no issue running the workload we actually need to put on the pod.

After migrating the hybrid workload back (and enjoying the extra available capacity the change to vSAN provided), we started migrating our most critical stretched workload to the new vSAN All-Flash pod. This process took forever. The primary culprit was something I had not noticed before because it is usually not a problem. Our new vSAN All-Flash pod had been put into Skylake EVC mode because it was running 6200-series Xeons and would be supplemented with some 6100-series at a later point, Skylake being the highest common denominator. However, the old pod that we were migrating from was running on 6100-series Xeons without EVC mode enabled. One would think that native Skylake and Skylake EVC would be the same – but no, that is not the case, as shown in KB76155.

This meant that about half of the 400 machines that needed to be moved would either have to be moved powered off (a tough sell with the customers) or get a short maintenance window to update the hardware version to 14 or 15 and then enable per-VM EVC. Most of our customers were a breeze, with only minor service impact, but one customer in particular was a bit rough, which dragged the process out across the fall of this year.
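For reference, per-VM EVC can be set through the vSphere API once a VM is at hardware version 14+ and powered off. The following is just a PowerCLI sketch of that, assuming the Skylake baseline – the VM name is a placeholder and you should test it on something non-critical first:

# grab the Skylake EVC mode definition from the vCenter capabilities
$evcMode = (Get-View ServiceInstance).Capability.SupportedEVCMode |
    Where-Object { $_.Key -eq 'intel-skylake' }

# the VM must be powered off and at hardware version 14 or later
$vm = Get-VM 'example-vm'
$vm.ExtensionData.ApplyEvcModeVM_Task($evcMode.FeatureMask, $true) | Out-Null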

But we finally managed to empty the old pod and power it off. Our next step was to reconfigure the released hardware to a vSAN-certified configuration. We then proceeded to install it as a new vSAN pod, and it became ready for production just 2 weeks ago. We'll utilize this new pod to empty the next of our old HCI platforms so we can liberate the hardware from that pod for even more conversions. The process is simple but it does take time.

I have one outstanding issue that I need to solve in the new year. Some of the older systems are Cisco C240 M4SX nodes. These only have internal SD boot as well as 24 drive slots in the front hooked up to a single RAID controller via 2 SAS expanders. With VMware deprecating SD/USB boot in the near future (KB85685) and vSAN not allowing non-vSAN disks on the same controller as vSAN disks, we need to figure out how to boot these servers – if anyone has a solution I'm all ears! I could do some sort of iSCSI boot but I'd prefer not to!

On top of these conversions we also needed to manage all our normal operations as well as another major project that was started up in late spring / early summer: we needed to replace our vRA 7.6 install with VMware Cloud Director.

With vCD not dying as was foretold years ago, vRA carrying a cost in our Cloud Provider licensing that vCD doesn't, and our customers having some usability issues with vRA, we set out to test vCD over the summer and went through all the pain points of vRA to see how they compared in vCD.

The result was that we decided to roll out vCD in the fall and started setting up a 10.3 production environment. We had done our tests on 10.2.2 and upgraded the test environment to 10.3 before rolling out production, but we still found some surprises!

First, many machines were very easy to import, but suddenly I had an issue where I could not import and move VMs into a vApp. I did some testing and found that if I created a new vApp I could move VMs into that one. After a lot of debugging with our vTAM and GSS we found that one of our clients had deleted 2 VMs via vRA AFTER they had been imported into vCD and into that vApp. That left those two VMs stuck in Partially Powered Off and blocked additional imports into the vApp.

We figured out with the help of GSS that we could run the following commands to be allowed to delete the VMs (you cannot delete a Partially Powered Off VM):

$vm = Get-CIVM <VMNAME>
$vm.Undeploy("force")
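(This uses the vCloud Director PowerCLI module, VMware.VimAutomation.Cloud, against an active Connect-CIServer session – the force undeploy takes the VM out of Partially Powered Off so it can then be deleted.)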

This allowed us to continue, only to find the next bug. We found that some VMs were not allowed to be moved into a vApp after auto-import. They failed with an error about not being allowed to change bus or unit numbers while powered on – but why would it need to change those?

Turns out a bug was introduced in 10.3 (we didn't see it in 10.2.2 at least) where VMs with disks that weren't on sequential unit numbers on their controllers would be forced to try to “correct” that – an unneeded operation. We opened a GSS case on it and got a response that 10.3.1 fixed the issue – which it fortunately did, though it was an undocumented fix.

By December 1st we had powered down our old vRA platform, and the replacement with vCD is complete. A few special machines still remain to be imported, but we are 99% there, which is a great feeling to end the year on.

Next year will bring more vSAN conversions (we have a few Citrix pods and some disaster recovery pods to convert) as well as more vCD. We might have some NSX-T in the future as well, which will likely challenge my networking skills a lot. We have been doing ACI networking for the last 4 years and I am finally at a point where I feel comfortable with the basic configuration of that platform, but NSX-T looks to have features that are easier to use.

This year was also the year I got my first VMware certification – VCP-DCV2021 in January. I also managed to get the vSAN Specialist badge in July making it a very good certification year for me.

Now, that was a very long blog post and I hope you bore with me through it all. I have really had a lot of VMware under my nails this year, but also mountains of networking and server operations. I hope I can find more time to dive into solutions in the new year.

Happy Christmas everyone and a good new year to you all!

vExpert 2018

Hi All

Just a quick update today to announce that I was accepted as a vExpert 2018 in the second-half intake, as announced by VMware here: https://blogs.vmware.com/vexpert/2018/08/03/vexpert-2018-second-half-award-announcement/

After a long break in 2017, when I was at home with my twins, I did not manage to get much content out to the community, but after coming back in full force it is nice to be allowed back into the program for my 4th consecutive year! I am very humbled to be let into this group of highly talented people who share so much great information with the community and the world.

As I mentioned in previous posts, my role switched a bit when I changed jobs, so I have a lot more areas to cover and less focus on VMware. I try to get as much time as I can doing stuff with vSphere, and hopefully vRO/vRA again soon, but for now my focus is mainly on developing our cloud platform along with my colleagues.

That is it for now – back into the machine room!

VMware License Checkup

Today I had to check up on some license keys for a customer. There was no complete overview of keys, enterprise accounts and support contracts, so I started looking into how to collect the data.

This may be common knowledge to many but VMware has this nifty tool on my.vmware.com:

https://www.vmware.com/support/serialNumberTrack.portal

Here you plop in your key and it returns support contract, type and which EA number it is connected to.

To get all the keys of a vCenter the following piece of PowerCLI can be used:

$licman = Get-View (Get-View ServiceInstance).Content.LicenseManager
$licman.Licenses | Select LicenseKey, EditionKey
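(This assumes a PowerCLI session already connected to the vCenter with Connect-VIServer.)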

This returns the list of keys on the vCenter, easy to copy-paste into the tracking tool. Nifty!

IO Issues on Cisco C220 M4

Hi All

As promised more content is finally available! Unfortunately I cannot share any screenshots from this actual issue I worked on – you will have to take my word for it.

At my new job I was tasked with solving some issues that had been observed at a remote site on two Cisco C220 M4s with local disks. The hosts run a set of redundant services, but nothing is shared between them except network equipment. The issues were 1) sometimes around 04:00 AM the software running in the virtual machines would briefly throw an alarm, and 2) powering on/off or snapshotting a VM would cause the software to throw alarms as well, but on other VMs than the one being powered on/off or snapshotted. The event log on both hosts showed intermittent I/O latency warnings not connected to any of the above issues, but nothing alarming.

Now, one note though. The software running on these VMs is very latency sensitive, so something like snapshots could potentially be a problem in any case, but powering on VM A should not affect VM B unless the host is hurting for resources, which is not the case.

Before diving in I asked out in the vExpert Slack if anyone had seen issues like this before or had any ideas of what to look for. James Kilby and Bilal Ahmed were quick to throw some ideas on the table. James suggested that vswp file being created on power on might cause the problem and Bilal suggested looking at the network. With those things in mind I started debugging.

First off, I had already decided to update the ESXi version to the latest 5.5 U3 + patches – it was a little outdated. I also decided to firmware-upgrade the servers with the latest 3.0.3c release for standalone C220 M4s, and I had found the latest supported drivers for the Cisco RAID controller (lsi-mr3) and the Intel NIC (igb) to rule out any compatibility issues. It was also my hope that an update would remove some of the I/O latency warnings.

Before upgrading anything I tested to see if the problem was still there. It was. Powering on a test VM with no OS and just 1 core and 4 GB RAM – instant alarms kicking off in the application. Powering off, however, did not cause any noticeable problems. I proceeded with the firmware update, hoping this would solve the issues. Firmware upgrading at a remote site through a small connection is painful! It took a while. But once the first host was updated I proceeded to test if the issues were still there. They were. Damnit. Time to dig deeper. I tried out James' idea of vswp being the problem, and setting a 100% memory reservation seemed to solve it. However, this was not a viable solution, as it would only help if the VM being powered on has a reservation. If anyone powered on a VM without one, it would still affect all other VMs, regardless of their reservations.

I booted up our favorite debugging tool ESXTOP, went into HBA mode and set the delay down to 2 seconds. I then observed the Cisco RAID controller's behavior during power-on operations, and that freaked me out. It would happily do anything between 100 and 1000 IOPS at 5 to 150 ms while nothing was powering on. The latency would spike high, but nothing that scared me on a small set of local 10k disks. However, when powering on a VM without a reservation, the HBA would stop doing any operations for upwards of 4 refreshes in ESXTOP (at 2-second intervals!). All indicators showing 0. No IO was passed, so no latency was observed. This scared me a bit. Latest firmware and supported drivers. Damn. We weren't seeing the same issue at another site with Dell servers, but they also had SSDs instead of 10k disks. Was this the 10k disks not performing well enough?

We had a short talk internally about what to do. My boss suggested that we look at this Cisco bug:

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCut37134/?referring_site=bugquickviewredir

This bug references an issue for some C220 M4s that were installed with a specific version of the 5.5 ESXi Cisco Custom ISO. It was not the ISO these hosts were installed with, but the solution was to use a different driver than the one ESXi selects by default for the Cisco RAID controller HBA – the lsi-mr3 driver. Instead it instructs you to make sure that the megaraid-sas driver is installed and to remove the lsi-mr3 and lsi-msgpt3 drivers and reboot, which will make megaraid-sas the active driver. We decided to try this. Downloaded the latest supported megaraid-sas driver for the server and removed the lsi-mr3 and lsi-msgpt3 drivers. Reboot and wait.
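If you prefer PowerCLI over SSH for that kind of change, something along these lines should do the same – a sketch only, with a placeholder host name; verify the VIB names against your own host first:

$esxcli = Get-EsxCli -VMHost (Get-VMHost 'esx01.example.local') -V2

# check which storage drivers are currently installed (megaraid-sas should already be there)
$esxcli.software.vib.list.Invoke() |
    Where-Object { $_.Name -match 'lsi-mr3|lsi-msgpt3|megaraid-sas' } |
    Select-Object Name, Version

# remove the lsi drivers so megaraid-sas becomes the active driver after a reboot
$esxcli.software.vib.remove.Invoke(@{vibname = 'lsi-mr3'})
$esxcli.software.vib.remove.Invoke(@{vibname = 'lsi-msgpt3'})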

After getting the first host online again we tested. Powered on the test VM – and nothing. No alarms. What a difference. Tried it again while watching ESXTOP. No drops in IO. It was now doing 8,000 IOPS at 15 ms, no problem. Major difference. Mass powering on VMs 15 at a time had previously taken minutes before the actual power-on tasks completed and the machines started booting; now it took seconds.

So what is the moral here? Apparently it could benefit you as a Cisco UCS server user to use the megaraid-sas driver instead of the lsi-mr3 driver. Both are included on the Cisco Custom ISO but it defaults to the lsi-mr3 so you actively have to do something to change that.

VMworld Europe 2016 – Day 2

The general session on day 2 started with the story of how everything today is becoming digital in the digital transformation. Education, x-rays and even flamingos at a zoo are digital.

Users want simple consumption and IT wants enterprise security. Users want any app on any cloud available on any device. This is where Workspace One comes in, delivering access to all apps from anywhere on any device. We saw a short demo of Skype for Business running inside a Horizon virtual desktop.

Workspace One even has several apps to increase productivity, from the Boxer email client to an expense report assistance app. You can even show 3D renderings on a Samsung Android tablet powered by Horizon and NVIDIA GRID.

The SDDC

More info on vSphere 6.5 was shown, like the ability to run vCenter HA at the application level with a 5 min RTO, 6x the speed of operations compared to 5.5 yielding faster power-ons, and a max of 20,000 VMs per vCenter. And again the new HTML5 client, which will get updates outside of the normal vCenter patch cycle for faster updates and new features.

Encryption of VMs without guest agents, based on storage policies, allows for more security.

And the monster VM can now go to 6 TB RAM to support SAP HANA and other in-memory databases.

vSphere Integrated Containers 

Allows for running containers on your existing vSphere infrastructure with a Docker-compatible interface. There is a container registry as well as a new management portal out in beta. vRA 7.2 will even allow for deploying containers from the service catalog as you would any other service.

VSAN 6.5

A new release tightly integrated into the vSphere stack. New management options and a new option to directly connect two nodes with an off-site witness for ROBO and SOHO deployments. iSCSI to physical or virtual machines is now also possible, allowing for those old MSCS clusters with shared disks as well as running physical workloads off of VSAN.

5,000 users are running it now and 60% have business-critical apps like SQL servers running on it.

Danish supermarket chain Coop is using VSAN to run 1300 VMs. Everything that can run on VSAN does.

You can even use vRA 7 and policy-based storage to let users request a change of storage and have the policy engine do the necessary migrations.

Vendors 

I got around to a few vendors as well yesterday to talk about products. 

Mellanox

They showed me a few of their new features, such as adapters running 10/25/40/50 and 100 Gb networks and supporting all sorts of protocols, from RoCE to NVMeoF, which allows for RDMA-like access to remote NVMe-based storage.

Mangstor

This led me to Mangstor, who along with Mellanox provide a solution that lets you actually use the NVMeoF protocol against their box and get insane performance, either standalone or as a caching layer for existing storage clusters like Lustre, for example.

Intel 

Had a chat with Intel about their whitebox servers supporting VSAN, which contain hot-pluggable PCIe NVMe storage, in both standard and hyper-converged solutions.

Nexenta 

Gave me a good demo and talk about the product and what it does for file services, with support for mixed-access NFS and CIFS (which I'm not quite sure works as smoothly as presented) as well as replication and snapshot-based data protection. Overall an interesting product with a lot of potential.

The Party 

What everyone might be waiting for now is the party Wednesday night. Overall a bit lackluster, with not much going on except drinks and food. However, the band this year was a surprise for me. I was happy to see Empire of the Sun had been hired to give the night its musical touch. Very nice!

After the party I went straight to bed and slept like a rock. 

And now on to the last day!

VMworld Europe 2016 – Day 1

Early morning day 2 of my VMworld 2016 trip seems like the time to do a short recap of yesterday.

Yesterday started with the General Session keynote where Pat Gelsinger and several others presented the view from VMware. Amongst his points I found the following things most interesting:

  • THE buzzword is Digital Transformation
  • Everyone is looking at Traditional vs Digital business
  • However only about 20% of companies are actively looking at doing this. 80% are stuck behind in traditional IT and spend time optimizing predictable processes.
  • Digital Business is the new Industrial Revolution

AWS was launched 10 years ago, in 2006. Back then there were about 29 million workloads running in IT. 2% of that was in the cloud, mostly due to Salesforce; 98% was in traditional IT. Skip 5 years ahead and we had 80 million workloads, with 7% in public cloud and 6% in private – the remaining 87% still in traditional, perhaps virtualized, IT. This year we are talking 15% public and 12% private cloud and 73% traditional out of 160 million workloads. Pat's research team has set a specific time and date for when cloud (both public and private) will reach 50%. That date is June 29th 2021 at 15:57 CEST. We will have about 255 million workloads by then. In 2030, 50% of all workloads will be in public clouds. The hosting market is going to keep growing.

Also the devices we are connecting will keep growing. By 2021 we will have 8.7 billion laptops, phones, tablets etc connected. But looking at IoT by Q1 2019 there will be more IoT devices connected than laptops and phones etc and by 2021 18 billion IoT devices will be online.

In 2011, at VMworld in Copenhagen (please come back soon 🙂 ), the SDDC was introduced by Raghu Raghuram. Today we have it and keep expanding on it. So today vSphere 6.5 and Virtual SAN 6.5 were announced for release, as well as VMware Cloud Foundation as a single SDDC package and VMware Cross-Cloud Services for managing your multiple clouds.

vSphere 6.5 brings a lot of interesting new additions and updates – look here at the announcement. Some of the most interesting features from my view:

  • Native VC HA features with an Active, Passive, Witness setup
  • HTML5 web client for most deployments.
  • Better Appliance management
  • Encryption of VM data
  • And the VCSA is moving from SLES to Photon.

Updates on vCenter and hosts can be found here and here.

I got to stop by a few vendors at the Solutions Exchange as well and talk about new products:

Cohesity:

I talked to Frank Brix at the Cohesity booth, who gave me a quick demo and look at their backup product. A very interesting hyper-converged backup system that includes backup software for almost all needed use cases, and it scales linearly. Built-in deduplication and the possibility of presenting NFS/CIFS out of the deduped storage. Definitely worth a look if you are reviewing your backup infrastructure.

HDS:

Got a quick demo on VVols and how to use them on our VSP G200, including how to move from the old VMFS to VVols instead. A very easy and smooth process. I also got an update on the UCP platform, which now allows for integration with an existing vCenter infrastructure. Very nice feature, guys!

Cisco:

I went by the Cisco booth and got a great talk with Darren Williams about the Hyperflex platform and how it can be used in practice. Again a very interesting hyper-converged product with great potential.

OpenNebula:

I stopped by OpenNebula to look at their vOneCloud product as an alternative to vRealize Automation, now that VMware has removed it from vCloud Suite Standard. It looks like a nice product – I saw OpenNebula during my education back in 2011, I think, while it was still version 1 or 2. They have a lot of great features but are not totally on par with vRealize Automation – at least not yet.

Veeam:

Got a quick walkthrough of the Veeam 9.5 features as well as some talk about Veeam Agent for Windows and Linux. Very nice to see them move to physical servers, but there is still some way to go before they can take over all backup jobs.

 

Now for Day 2’s General Session!

vROPS: the peculiar side

vROPS is running again in a crippled state with AD login issues, licensing issues and alert issues but at least it is showing me alerts and emailing me again.

While digging through vROPS today in a Webex with VMware Technical Support I stumbled upon an efficiency alert that I simply had to share.

In summary the image below shows me that if I somehow manage to reclaim this snapshot space I don’t think I will have any storage capacity problems for a considerable amount of time!

Ridiculous! Read again – that is almost 2.8 billion TB (or 2.8 zettabytes) of disk space, on a 400 GB VM! How many snapshots would that even take to fill? By my estimate, around 7 billion full snapshots that were fully written. I'm not sure that is within the vSphere 5.5 configuration maximums for snapshots per VM.