VMworld Europe 2016 – Day 2

The general session on day 2 started with the story of how everything today is becoming digital as part of the digital transformation. Education, x-rays and even flamingos at a zoo are going digital. 

Users want simple consumption and IT wants enterprise security. Users want any app on any cloud available on any device. This is where Workspace ONE comes in, delivering access to all apps from anywhere on any device. We saw a short demo of Skype for Business running inside a Horizon virtual desktop. 

Workspace ONE even has several apps to increase productivity, from the Boxer email client to an expense report assistant app. You can even show 3D renderings on a Samsung Android tablet powered by Horizon and NVIDIA GRID. 

The SDDC

More info on vSphere 6.5 was shown, like the ability to run vCenter HA at the application level with a 5-minute RTO, 6x the speed of operations compared to 5.5 (yielding faster power-ons), and a maximum of 20,000 VMs per vCenter. And again the new HTML5 client, which will get updates outside of the normal vCenter patch cycle for faster fixes and new features. 

Encryption of VMs without guest agents, driven by storage policies, allows for more security.

And the monster VM can now go to 6 TB of RAM to support SAP HANA and other in-memory databases. 

vSphere Integrated Containers 

Allows for running containers on your existing vSphere infrastructure with a Docker-compatible interface. A container registry as well as a new management portal are out in beta. vRA 7.2 will even allow deploying containers from the service catalog as you would any other service. 
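
Since the interface is Docker-compatible, a standard Docker client should be able to talk to a deployed virtual container host directly. A minimal sketch, assuming a VCH endpoint at vch.example.com:2376 (hostname, port and TLS options are made up here and depend on how the VCH was deployed):

# Point the regular Docker CLI at the virtual container host endpoint
docker -H vch.example.com:2376 --tls info
# Run a container on the vSphere infrastructure through the same endpoint
docker -H vch.example.com:2376 --tls run -d --name web nginx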

VSAN 6.5

A new release tightly integrated into the vSphere stack. New management options and a new option to directly connect two nodes with a witness off-site for ROBO and SOHO deployments. iSCSI to physical or virtual machines is now also possible, allowing you to build those old MSCS clusters with shared disks as well as run physical workloads off of VSAN. 

5,000 users are running it now and 60% have business-critical apps like SQL Server running on it. 

Danish supermarket chain Coop is using VSAN to run 1,300 VMs. Everything that can run on VSAN does. 

You can even use vRA 7 and policy-based storage to let users request a change of storage and have the policy engine do the necessary migrations. 

Vendors 

I also got around to a few vendors yesterday to talk about their products. 

Mellanox

Showed me a few of their new features, such as adapters running 10/25/40/50 and 100 Gb networks and supporting all sorts of protocols from RoCE to NVMe-oF, which allows RDMA-like access to remote NVMe-based storage. 

Mangstor

This led me to Mangstor, who along with Mellanox provide a solution that lets you actually use the NVMe-oF protocol against their box and get insane performance, either standalone or as a caching layer in front of existing storage clusters such as Lustre. 

Intel 

Had a chat with Intel about their whitebox servers supporting VSAN, which contain hot-pluggable PCIe NVMe storage in both standard and hyper-converged solutions. 

Nexenta 

Gave me a good demo and talk about their product and what it does for file services, with support for mixed NFS and CIFS access (which I’m not quite sure works as smoothly as presented) as well as replication and snapshot-based data protection. Overall an interesting product with a lot of potential. 

The Party 

What everyone might have been waiting for was the party on Wednesday night. Overall a bit lackluster, with not much going on except drinks and food. However, the band this year was a surprise for me. I was happy to see Empire of the Sun had been hired to give the night its musical touch. Very nice! 

After the party I went straight to bed and slept like a rock. 

And now on to the last day! 

VMworld Europe 2016 – Day 1

Early morning on day 2 of my VMworld 2016 trip seems like the right time to do a short recap of yesterday.

Yesterday started with the General Session keynote where Pat Gelsinger and several others presented the view from VMware. Amongst his points I found the following things most interesting:

  • THE buzzword is Digital Transformation
  • Everyone is looking at Traditional vs Digital business
  • However, only about 20% of companies are actively looking at doing this. 80% are stuck in traditional IT and spend their time optimizing predictable processes.
  • Digital Business is the new Industrial Revolution

AWS was launched 10 years ago, in 2006. Back then there were about 29 million workloads running in IT; 2% of those were in the cloud, mostly due to Salesforce, and 98% were in traditional IT. Skip 5 years ahead and we had 80 million workloads, with 7% in public cloud and 6% in private cloud; the remaining 87% were still in traditional, perhaps virtualized, IT. This year we are talking 15% public cloud, 12% private cloud and 73% traditional out of 160 million workloads. Pat’s research team has set a specific time and date for when cloud (both public and private) will reach 50%: June 29th 2021 at 15:57 CEST, by which point we will have about 255 million workloads. In 2030, 50% of all workloads will be in public clouds. The hosting market is going to keep growing.

The number of devices we are connecting will also keep growing. By 2021 we will have 8.7 billion laptops, phones, tablets, etc. connected. But looking at IoT, by Q1 2019 there will be more IoT devices connected than laptops, phones and the like, and by 2021 18 billion IoT devices will be online.

In 2011 at VMworld in Copenhagen (please come back soon 🙂) the SDDC was introduced by Raghu Raghuram. Today we have it and keep expanding on it. So today vSphere 6.5 and Virtual SAN 6.5 were announced for release, as well as VMware Cloud Foundation as a single SDDC package and VMware Cross-Cloud Services for managing your multiple clouds.

vSphere 6.5 brings a lot of interesting new additions and updates – look here at the announcement. Some of the most interesting features from my point of view:

  • Native vCenter HA with an Active/Passive/Witness setup
  • HTML5 web client for most deployments.
  • Better Appliance management
  • Encryption of VM data
  • And the VCSA is moving from SLES to Photon.

Updates on vCenter and hosts can be found here and here.

I got to stop by a few vendors at the Solutions Exchange as well and talk about new products:

Cohesity:

I talked to Frank Brix at the Cohesity booth who gave me a quick demo and look at their backup product. A very interesting hyper-converged backup system that includes backup software for almost all needed use cases and scales linearly. Built-in deduplication and the possibility of presenting NFS/CIFS out of the deduped storage. Definitely worth a look if you are reviewing your backup infrastructure.

HDS:

Got a quick demo of VVols and how to use them on our VSP G200, including how to move from the old VMFS datastores to VVols. A very easy and smooth process. I also got an update on the UCP platform that now allows for integration with an existing vCenter infrastructure. Very nice feature, guys!

Cisco:

I went by the Cisco booth and had a great talk with Darren Williams about the HyperFlex platform and how it can be used in practice. Again a very interesting hyper-converged product with great potential.

OpenNebula:

I stopped by OpenNebula to look at their vOneCloud product as an alternative to vRealize Automation now that VMware has removed it from vCloud Suite Standard. It looks like a nice product – I saw OpenNebula during my education back in 2011, I think, while it was still version 1 or 2. They have a lot of great features but are not totally on par with vRealize Automation – at least not yet.

Veeam:

Got a quick walkthrough of the Veeam 9.5 features as well as some talk about Veeam Agent for Windows and Linux. Very nice to see them move to physical servers, but there is still some way to go before they can take over all backup jobs.

 

Now for Day 2’s General Session!

vROPS: the peculiar side

vROPS is running again in a crippled state with AD login issues, licensing issues and alert issues but at least it is showing me alerts and emailing me again.

While digging through vROPS today in a Webex with VMware Technical Support I stumbled upon an efficiency alert that I simply had to share.

In summary the image below shows me that if I somehow manage to reclaim this snapshot space I don’t think I will have any storage capacity problems for a considerable amount of time!

[Screenshot: the ridiculous reclaimable snapshot space recommendation]

Read again – that is almost 2.8 billion TB (or 2.8 zettabytes) of reclaimable disk space on a 400 GB VM! How many snapshots would that even take to fill? By my estimate, around 7 billion snapshots that were each fully written. I’m not sure that is within the vSphere 5.5 configuration maximums for snapshots per VM.
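
For anyone who wants to sanity-check that estimate, the back-of-the-envelope math fits in a bc one-liner (decimal units throughout, so it is only a rough figure):

# 2.8 zettabytes in bytes divided by a fully written 400 GB snapshot
echo "(2.8 * 10^21) / (400 * 10^9)" | bc
# prints 7000000000, i.e. roughly 7 billion snapshots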

vRops down for the count

While I try to hold my frustration at bay and wait for VMware support to get back to me, I am trying to figure out what the h*** happened yesterday that has sent my vROPS 6.0.1 cluster down for the count for what is now close to 24 hours.

Here is a recap of what happened up to the point where I realized that the cluster was in what I would call an inconsistent state. I spent most of the day yesterday cleaning up by removing a number of old unused VMs. Amongst those were a couple of powered-off VMs that I did not think much about before deleting them.

About 1½ hours after deleting the last VMs I got an error in vROPS about one adapter instance not being able to collect information about the aforementioned powered-off VMs. I looked in the environment information tab to see if they were still listed along with some of the others I had deleted. But no – they weren’t there. Hmm.

Then I thought they might still be listed in the license group I had defined. I went over to look and, to my horror, this was the first sign something was wrong – none of my licenses were in use?! Looking in the license groups view, all my hosts were suddenly shown as unlicensed and my license group that normally has around 1800 members was empty. What? Editing the license group showed that the 1800 members, including the hosts now shown as unlicensed, were listed as “Always include”, so how come they weren’t licensed?

At this point I began suspecting that the cluster was entering a meta state of existence. Looking at the Cluster Management page I missed a critical piece of info at first, but more on that later. Everything was up and running, so I went to the Solutions menu with the intent of testing the connection to each vCenter server. But doing so caused an error that the currently selected collector node was not available? But the cluster just told me everything was up? So I tried every one of the 4 nodes but none worked. Okay, what do I do? I tried removing an adapter instance and adding it again. Big mistake. You can’t re-add it with the same name, so I had to make up a new name for the same old vCenter...

That still did not work. Then I went back to Cluster Management and decided to take one of the data nodes offline and then online again to see if that fixed it. While waiting at “Loading” after initiating the power off I suddenly got an error saying it was unable to communicate with the data node. Then the page reloaded and the node was still online. Unsure what to do, I stared at the screen only to suddenly see a message saying “Your session has expired” before being booted back to the login screen?

When logging back in I only saw half of the environment, because the old adapter that I had removed and re-added under another name was not collecting. It just stated “Failed”.

I decided to take the car home from the office. I was not sure what to do and needed a few hours to get some distance from it. Back home I connected to the cluster again and looked at Cluster Management once more. Then I spotted the (or at least “a”) problem.

Below is a screen print of what it normally looks like:

[Screenshot: the normal cluster state]

And here is what it looked like now:

[Screenshot: the broken cluster state]

Notice the slight problem that both HA nodes reported as being Master? That cannot be good. What to do, other than power off the entire cluster and bring it online again?

About 30 minutes later the cluster was back online and I started to get alerts again. A lot of alerts. Even alerts that it had previously cancelled back in Easter week. But okay – monitoring was running again. So I decided to leave it at that and pick it up again this morning.

Well, still no dice – things were still not licensed. Damnit. So I opened a ticket with VMware. While uploading log bundles and waiting I tried different things to get it to work, but nothing. Then suddenly my colleague said he couldn’t log into vRops with his vCenter credentials. What? I had been logged in as admin while trying to fix this, so I hadn’t tested my vCenter account. But it did not work. Or at least not when using user@doma.in notation. Using DOMA\user it worked – at least I could log in and see everything from the adapter that I re-added yesterday. Not the other one. What?

By this time a notification event popped up in vRops; clicking it gave me “1. Error getting Alert resource”. What? Now pretty desperate, I powered off the cluster again and then back on. This fixed the new error of not showing alerts. At least for 30 minutes. Then suddenly some alerts showed the error again.

Trying to log in with vCenter credentials did not work at all now. This was escalating! I tried setting the login source to a single vCenter instead of all vCenters. Previously I had only been able to see the contents of the re-added vCenter adapter, so I tried the one I could not see anything from. DOMA\user worked and I could see the info from it. Success – I thought. Logging back out and trying it against the re-added vCenter did not work with DOMA\user, but user@doma.in worked? But when inspecting the environment I was seeing the data from the other vCenter? What?

Right now I am uploading even more logs to VMware. I will update this when I figure out what the h*** went wrong here.

 

ffmpeg oneliner(s)

Hello there. I expect this to be one of the first posts that I will continue to update, mostly for my own reference. I have been in the process of converting some old video files for better Chromecast/DLNA support and generally for my own streaming purposes.

One of the first problems I ran into was combining old files without re-encoding them. So I turned to old trusty ffmpeg to do the job. Below I will, over time, add a list of ffmpeg one-liners:

Combine two .avi files and copy codecs:

ffmpeg -i "concat:part1.avi|part2.avi" -c copy complete.avi

 

Microsoft NLB and the consequences

Hello All

I am not usually one to bash certain pieces of technology over others, at least not in public. I know which things I prefer and which I avoid. But after having spent the better part of a work day cleaning up after Microsoft Network Load Balancing (NLB) I have to say that I am not amused!

We are currently working on deprecating an old HP switched network and moving all the involved VMs and hosts to our new Nexus infrastructure. This is a long process, at least when you want to minimize downtime, and the two switching infrastructures are very different. Now, I am a virtualization administrator with responsibility for a lot of physical hardware as well, so for the last month or two I have been planning this week’s and next week’s work of moving from the old infrastructure to the new.

Everything was ready: a layer 2 connection was established between the infrastructures, allowing seamless migration between them, only interrupted by the path change and, for the physical machines, the actual unplugging of a cable to be reconnected to a new one. No IP address changes, no gateway changes. Just changing uplinks. And it worked: a VM would resume its connections when it moved to a host with the new uplink. Perfect!

Then disaster struck. Our Exchange setup started creaking and within 20 minutes ground to a halt. Something was wrong, but only at the client access layer. We quickly realized that the problem was that one of the 4 nodes in the NLB cluster running the CAS service had been moved to the new infrastructure. I hadn’t noticed it because they all still responded to ping and RDP, but the NLB cluster was broken.

The reason: we use NLB in multicast mode. That means that on our old Catalyst switches/routers we had a special configuration that mapped the unicast IP to a multicast MAC and sent it in the direction of the old infrastructure. This is a static configuration, so when we started changing the locations of the CAS servers it broke. Hard! Within an hour we had stabilized things by moving two of the 4 nodes onto the same ESXi host on the new network and changing the static configuration on the Catalyst switch. But that left two nodes on the old HP network unable to help take the load.
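
For context, the “special configuration” on the Catalyst side is the usual pair of static entries that multicast NLB needs: a static ARP entry mapping the cluster IP to the 03bf multicast MAC, plus a static MAC entry pointing that MAC at the interface towards the CAS servers. The address, VLAN and interface below are made up for illustration, not our actual config:

! Map the NLB cluster IP to its multicast cluster MAC
arp 192.0.2.10 03bf.c000.020a ARPA
! Pin that MAC to the interface leading towards the old infrastructure
mac address-table static 03bf.c000.020a vlan 100 interface GigabitEthernet1/0/1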

We have been spending the entire morning planning what to do and how to test it. None of us had thought of NLB as a problem, but had we remembered this static multicast MAC configuration we might have avoided this.

My takeaway from this: avoid special configurations. Keep things as standard as possible. If you need to configure something custom, stop and reconsider whether you are doing it correctly.

Veeam NFC stream error and missing symlinks

Today my colleague, who handles our Veeam installation, was diagnosing an error we were seeing sporadically. The error was this (names removed):

Error: Client error: NFC storage connection is unavailable. Storage: [stg:datastore-xxxxx,nfchost:host-xxxx,conn:vcenter.server.fqdn]. Storage display name: [DatastoreName].

Failed to create NFC download stream. NFC path: [nfc://conn:vcenter.server.fqdn,nfchost:host-xxx,stg:datastore-xxxxx@VMNAME/VMNAME.vmx].

Now this error indicates that it failed to get an NFC stream connection to the host (port 902). Or so I thought. We have seen sporadic problems with vCenter heartbeats over the same port, so that was what we expected. It turns out that some of the hosts in the cluster were missing the “datastore symlink” in /vmfs/volumes.

When running “ls -1a /vmfs/volumes” the result was not the same on each host. 4 of 8 hosts were missing a symlink and two others had a wrongly named symlink. I recalled that when I created the datastores I used PowerCLI to rename them several times in rapid succession, as my script had slight errors when constructing the correct datastore names. It seems this left some of the datastores on some hosts with either no symlink or a wrongly named one.
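
If you want to spot this without comparing full directory listings, listing only the symlinks makes a missing or misnamed datastore link stand out. Run in an SSH session on each ESXi host:

# Show only the datastore-name -> UUID symlinks in /vmfs/volumes
ls -la /vmfs/volumes | grep "^l"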

Fortunately the fix is easy:

  1. Enter Maintenance Mode
  2. Reboot host
  3. ?????
  4. Profit

That is it! 🙂

ESXi disconnecting right after connecting

This morning was just one of those mornings! Got up, checked my phone, 22 alerts from vCOps. Damnit 🙁

So I got into work and could see that the problem was centered around two server rooms located close to each other, using the same uplinks to our core network. I suspected a network error at first. I talked to a guy in networking: “Oh, didn’t you know? There was a planned power outage in that area.” Oh, that explains it. Further debugging showed that only the network equipment was affected by the power interruption. Servers and storage continued to run.

So I suspected that the reason the hosts were still not responding was that they had been without network for 4-6 hours. I chose to reconnect them, which worked... at first. Immediately after connecting, the hosts disconnected again. This happened with all of them. Strange.

I then remembered a KB article I saw a while back: ESXi/ESX host disconnects from vCenter Server 60 seconds after connecting
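
The KB points at port 902 between vCenter and the host, and the TCP part is quick to test from a Linux box on the management network (hostname made up for illustration):

# Check whether TCP 902 on the host is reachable
nc -z -w 2 esxi01.example.com 902 && echo open || echo blocked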

Aahh, so port 902 might be blocked. Checked – nope, open on both TCP and UDP. Hmm. Aahh, perhaps I need to restart the management agents, but SSH was disabled. So I connected the old C# client to one of the hosts directly and enabled SSH. Still no luck, the network was filtering port 22. Beginning to panic a bit, PowerCLI came to mind. Perhaps there is a way to restart the management agents from PowerCLI.

There is! Here. Though not all of them, as far as I could tell. So I tried restarting just the vpxa service, which luckily worked.

So a lot of configuration clean-up is in order now. Personal todo: 1) allow SSH from the management networks to the hosts; 2) fix/get access to iLO/DRAC/IPMI on the hosts; 3) get answers as to why I was not informed about the power work being done. And a bonus item: figure out why all 8 hosts, spread across 3 clusters, have access to a 20 MB LUN that no one knows about, and why, while the vpxa service was broken on 5 hosts, two hosts complained that they had lost access to that specific LUN.

Work work work.

Update on Updating WordPress and News

A while back I wrote a short article about how to update WordPress when your FTP server does not accept passive FTP connections. Since then I had not tried to update anything until now. And the fix didn’t work. Or rather it did, but there was still a permission problem on the files. Even chmod 777 on the files didn’t fix it for some reason.

So I started looking for a solution and found this instead:

http://www.hongkiat.com/blog/update-wordpress-without-ftp/

It describes how to switch the update method to direct instead of FTP (the FS_METHOD setting in wp-config.php). This fix worked better from the start and gave me more direct error messages. Now I knew that a chmod 777 would fix my problem while updating. So: chmod -R 777 on the WordPress root dir while updating, then set the permissions back, and voilà, my installation was updated. Perfect!
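
For my own reference, the whole dance boils down to something like this (the path is just an example, and 755 for directories / 644 for files are the usual defaults to restore afterwards):

cd /var/www/wordpress                    # example path to the WordPress root
chmod -R 777 .                           # loosen permissions while updating
# ... run the update from the WordPress admin ...
find . -type d -exec chmod 755 {} \;     # restore directory permissions
find . -type f -exec chmod 644 {} \;     # restore file permissions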

And now to the news part. I haven’t been very active on this blog lately. My company has been dealing with budget cuts and organizational changes in this newly formed IT unit. Not the best of starts within a year of its formation, and it has put a hold on many new projects. I have been working on consolidating older vSphere setups onto new hardware and software, which isn’t very interesting work, but it has to be done. Sometime next week I will hopefully write a post about how to handle CBT errors after migrating from one storage array to another. We had one of the old setups where 2 machines were failing in odd ways. More on that later.

In other news I finally settled on a better name for the blog and will be implementing it as soon as possible. Be prepared, I think it is quite awesome 🙂

Follow-up: Upgrading SSO 5.5

I wanted to do a follow-up on my previous post about upgrading SSO 5.5 and cover the resolution of the undesirable situation we ended up in, but first I’d like to cover the process leading up to the problematic upgrade.

Preparation phase:

We started preparing shortly after getting wind that vSphere 5.5 was going to be released and that SSO had been reworked to work better with AD. We had submitted 3-4 bugs related to SSO and AD (we run a single forest with multiple subdomains in a parent-child trust, imagine the problems we were having!) and the resolution to most of them was soon “upgrade to vSphere 5.5”. So we started researching the upgrade. Having an HA setup with a load balancer in front, and the goal of doing this to our production environment, meant that we were reading a lot of documentation and KB articles as well as talking to VMware professionals.

At VMworld I talked personally with three different VMware employees about the process and was told every time that it was possible to upgrade the SSO and Web Client Server components and leave the Inventory Service and vCenter services for later.

So after having read a lot of this we found some issues in the documentation that made us contact VMware directly for a clarification. First off, reading through the upgrade documentation we stumbled upon this phrase:

[Screenshot: excerpt from the vSphere 5.5 upgrade documentation]

As we read it, we had to enter the load-balanced hostname when upgrading the second node in the HA cluster, which seemed illogical. This was also different from KB2058239, which states the following:

[Screenshot: excerpt from KB2058239]

So we contacted VMware support and got a clarification. The following response was given by email and later confirmed by phone:

In my scenario I was not able to use Load balancer’s DNS name and gave me a certificate error and the installation, however the installation went through by providing the DNS name of the primary node. This is being validated by my senior engineer and I will contact you tomorrow with any further update from my end.

We were still a bit unsure of the process and decided to revive the old test environment for the vSphere 5.1 configuration that was running at the time.

Testing phase:

It took a while to get it running again, as it was being redeployed from an OVA that could not be deployed directly on our test cluster (the virtual hardware version was too new), so it first had to be converted to a VMX-based VM and then run through VMware Converter Standalone to downgrade the hardware version. This worked fine and the machines booted. However, the SSO was not running. So back to figuring that out. As it turns out, when the hardware of the machine changed, its machine ID changed as well. So we had to run the following command on both nodes and restart the services as described in this KB:

rsautil manage-secrets -a recover -m <masterPassword>

And presto! It was running again. We then performed the upgrade of the SSO and Web Client. But as we had no indication that the vCenter services were going to be affected by this, no testing of them was done (we were, after all, told that this was no problem by several different people at this point). The upgrade went mostly smoothly, with only two warnings: 1) Warning 25000 – an SSL error which should be ignorable, and 2) that the OS had pending updates that needed to be installed before proceeding.

But we ran into the AD problem that I described in the previous post linked at the start. The SSO was contacting a domain controller, presumably on port 3268, to get a Global Catalog connection, and this was failing as the server it was communicating with was not running that service. We got a solution after a week, but it seems that the problem had temporarily solved itself before the fix was given by VMware. So over the phone I agreed with VMware technical support that it was safe to proceed with our production environment and that if we got the same error again we should open a severity 1 support request as soon as possible.

Launch in 3. 2. 1…..:

Following the above phases we proceeded with the upgrade. We scheduled a maintenance window of about 3 hours (it took 1 hour in the test environment). On the day we prepared by snapshotting the SSO and Web Client servers (5 in total). We took backups of the RSA database just in case and then proceeded. Everything seemed to work. The first node was successfully upgraded and then the second. After upgrading we reconfigured the load balancer as described in KB2058838 and it seemed to just work out of the box. Until we had to update the service endpoints. The first went well. As did the second. But the third was giving a very weird error: a name was missing. But everything was there, the file was identical to the others except for the changed text identifying the groupcheck service. Then we saw it: copypasta had struck again, as we had missed a single “[” at the beginning of the file, rendering the first block unparsable. Quick fix!

The first Web Client server was then updated to check everything, and then we saw it: the vCenters were not connecting to the lookup service, so the Web Client server was not seeing them! Darn it! We hadn’t tested this. We figured that restarting the vCenter service would solve the problem. It didn’t. In fact it turned out to make things worse. As the service was starting it failed to connect to the SSO and decided to wipe all permissions! That is a really bad default behaviour! And the service didn’t even start! It just shut down again. Digging a bit, we found that the vCenter server had hard-coded the URL for the “ims” service in the vpxd.cfg file, and as this had changed from “ims” to “sts” it was of course not working. The KB said to update the URL, so we did. This helped a bit: the service was now starting but still not connecting correctly. We could, however, now log in with admin@System-Domain via the old C# client (thank god that still works!). It was here we called VMware technical support, but too late. After getting a severity 1 case opened we were informed that we would get a callback within 4 business hours (business hours being 7am to 7pm GMT). At this point it was 5:45pm GMT. So we waited until 7pm, but unfortunately no call came. So we went home and came back early the next morning. At about 7:45am the callback came; I grabbed it and was then on the phone for 3 hours straight.

The problem was that it is not possible to upgrade the SSO and Web Client server and leave vCenter and the Inventory Service on the old version. The last two needed to be upgraded as well, and the way to recover the situation was to upgrade the vCenter servers and their Inventory Services. This was a bit problematic for the vCenter that had been “corrupted” by the restarts and attempts at fixing the config files; we ended up uninstalling the vCenter service and reinstalling it on that server. The other vCenter, which had not been touched, we just upgraded without problems.

Aftermath:

When I hung up the phone with VMware technical support we were back online. Only the last Web Client server needed upgrading, which was easily done. But the one vCenter had lost all its permissions; we are still recovering from that, but as this was just before the Christmas holidays only critical permissions were restored.

After a thing like this one needs to lean back and evaluate how to avoid it in the future, so here are my takeaways:

  1. Test, test, test, test! We should have tested the upgrade with the vCenter servers as well to have caught this in the testing phase!
  2. Upgrade the entire management layer. Instead of trying to keep the vCenters on 5.1 we should have just planned on upgrading the entire management layer.
  3. Maintenance in business hours. We set our maintenance window outside normal business hours, causing us to have to wait 12 hours for support from VMware.