While consolidating several VMware installations into a single platform, I have had to manage permissions. Most of the installations were maintained by a single person or a small group of people, so permissions were not complex. They were mostly the same everywhere: assign the Administrator role to everyone at the vCenter level.
On the new platform this is not really possible. I, as part of the virtualization group, of course have administrative privileges on the two vCenter servers and everything below them. But the old administrators still need their permissions because they continue to maintain legacy systems. We decided to grant this access at the cluster level. This also required us to create a folder for each old installation in the VMs and Templates view, the Datastores and Datastore Clusters view, and the Networking view. On the clusters and these folders we granted Administrator rights.
Now comes the funny part. We had not noticed the problem of a missing popup until a persistent administrator was testing out his new rights. The cluster he was on was set to DRS Manual mode. This of course requires you to choose which host to start a VM on when powering it on. But he wasn't getting the popup in the web client. At first I thought it was a web client problem, but it turned out to be something else.
He would choose the action as below, Power On:
In my C# client I could see the following happening:
But the VM wasn’t powering on:
I tried powering it on myself, saw the popup in my client, and wondered why he didn't get it. That is when I realized what the problem might be and tested it. Powering on a VM and selecting which host it should run on in DRS Manual mode requires a privilege at the Datacenter level, and since the user only had permissions on folders and clusters below the Datacenter, he was not seeing the popup. Once I added his permissions on the Datacenter (without propagating them), he got the popup. This only happened in DRS Manual mode.
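To make the reasoning concrete, here is a toy model in Python of how a permission set only on a cluster leaves the user with nothing on the Datacenter above it. This is not the vSphere API; the `Entity` class, the `effective_role` function, and the user name `olduser` are all made up for illustration, and the model is deliberately simplified.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:
    """A node in a hypothetical inventory tree (not a vSphere object)."""
    name: str
    parent: Optional["Entity"] = None
    # Maps user -> (role, propagate). Hypothetical structure for this sketch.
    permissions: dict = field(default_factory=dict)


def effective_role(entity, user):
    """Return the role the user holds on this entity, or None.

    A permission set directly on the entity applies. Otherwise only
    permissions on ancestors that are marked as propagating are inherited.
    """
    if user in entity.permissions:
        return entity.permissions[user][0]
    node = entity.parent
    while node is not None:
        if user in node.permissions:
            role, propagate = node.permissions[user]
            if propagate:
                return role
        node = node.parent
    return None


datacenter = Entity("Datacenter")
cluster = Entity("Cluster", parent=datacenter)

# The old administrator only has rights on the cluster and below.
cluster.permissions["olduser"] = ("Administrator", True)
print(effective_role(datacenter, "olduser"))

# Adding a non-propagating permission on the Datacenter itself.
datacenter.permissions["olduser"] = ("Administrator", False)
print(effective_role(datacenter, "olduser"))
```

Permissions flow downwards only, so rights on a cluster never reach the Datacenter. The non-propagating permission gives the user a role on the Datacenter object itself without widening what he can do on anything below it.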
A while back I wrote a post about the process of consolidation, describing the primary aspects of the process for us and the solutions we chose. We are a little way down the road and a lot of consolidation has already happened.
We have shut down (or emptied and left running) 5 vCenter servers and consolidated 8 clusters and 7 single hosts into our new vCenter setup. The process has been pain-free for the most part. In all of these migrations (6 different maintenance windows) we have only encountered a single problem, which I will describe a little later. This has given our little virtualization team quite the track record for performing well during maintenance windows 🙂
The problem we encountered was not in the actual process of disconnecting from one vCenter and connecting to another, but in the preparation phase. While we were migrating from VDS switches to VSS switches we needed to change port groups for a lot of VMs; PowerCLI to the rescue. This was automated using a translation table in the form of a hash table with the old VDS port group names as keys and the new VSS port group names as values. For each VM, and for each network adapter on that VM, look at the port group name and find the new value in the hash table. Simple!
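The lookup logic can be sketched roughly as follows. The original automation was written in PowerCLI; this is a plain Python illustration of the same idea only, and the port group names, VM names, and the `plan_port_group_changes` function are all hypothetical. It only plans the changes and does not talk to vCenter.

```python
# Hypothetical translation table: old VDS port group -> new VSS port group.
PORT_GROUP_MAP = {
    "dvPG-Web": "PG-Web",
    "dvPG-DB": "PG-DB",
}


def plan_port_group_changes(vms, translation):
    """Work out which adapters need a new port group.

    vms maps a VM name to a dict of {adapter name: current port group}.
    Returns (changes, unmapped), where unmapped lists adapters whose
    port group has no entry in the translation table.
    """
    changes = []
    unmapped = []
    for vm_name, adapters in vms.items():
        for adapter_name, old_pg in adapters.items():
            new_pg = translation.get(old_pg)
            if new_pg is None:
                unmapped.append((vm_name, adapter_name, old_pg))
            else:
                changes.append((vm_name, adapter_name, old_pg, new_pg))
    return changes, unmapped


# Made-up inventory for illustration.
inventory = {
    "web01": {"Network adapter 1": "dvPG-Web"},
    "sql01": {"Network adapter 1": "dvPG-DB", "Network adapter 2": "dvPG-Backup"},
}

changes, unmapped = plan_port_group_changes(inventory, PORT_GROUP_MAP)
for vm_name, adapter_name, old_pg, new_pg in changes:
    print(f"{vm_name} / {adapter_name}: {old_pg} -> {new_pg}")
for vm_name, adapter_name, old_pg in unmapped:
    print(f"{vm_name} / {adapter_name}: no mapping for {old_pg}")
```

Collecting the unmapped adapters separately, instead of failing on the first missing key, makes it easy to review the whole plan before touching any VM.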
But. There is always a but. After changing some 60-70 port groups, the port group of one specific VM was changed and it seemed to be working fine. About 15 minutes later I got a support call that a website was down on one of the VMs. I started looking and could not see anything wrong (I'm not that fond of IIS web servers and their way of logging!). The network of that VM had not been changed yet, so what was causing this 503 Service Unavailable error? Even more odd, only one out of 17-20 websites was not working.
I googled some things with no luck and grabbed a colleague to help look, but got no answer. Then Google paid off with the mention of the words "Application pool". I quickly looked up the application pool of the problem website and, sure enough, it was in a stopped state. I started it again and the site was running again. But why had it stopped? The logs show nothing™. So I got to thinking, and the only explanation that made sense to me was that 15 minutes before the call I had changed the port group of the aforementioned VM, a small MSSQL machine that was running the database for the website. The only logical thing I can see having happened is that changing the port group cut the connection to the database and the application running in the pool handled it badly.
Looking back, this is a minor problem: one website among 20 running on one VM among hundreds. All in all the consolidation process is running well. We still have a few more vCenters and single hosts to gather under the new setup. After that comes the process of phasing out old hardware. We recently got 6 new blade servers to replace some older hardware, some of the most powerful we have had yet: 2×8 core Intel CPUs with 256 GB RAM and, if all goes well, a new storage controller dedicated to these machines and to virtualization in general. But moving to new hardware, shutting down old servers, and giving new IPs to production machines is not an easy process, and it requires coordination across the entire IT organization.