Since the last time

So it’s been a while since my last post, and a lot has been going on. I have been through my first “employee performance interview” (Medarbejderudviklingssamtale or MUS in Danish). It went well, and a lot of things were discussed in regard to this new organization. Some steps to increase my skills were also planned, and I will get back to that later.

Since last time I attended VMworld Europe 2013 in Barcelona! It was an awesome conference as always and I brought a lot of new things home with me. One of the things I did differently this year compared to the previous two years was to spend a lot more time on the Solutions Exchange. I focused primarily on storage vendors, as I have taken a fancy to the new flash-accelerated and all-flash storage systems. So I think I visited every booth with even the slightest connection to storage.

I also had the chance to discuss some of the new technologies coming out of VMware, as well as the upgrade procedure for vSphere 5.5 when running with an SSO behind a load balancer. That was really useful and insightful, and it provided me with most of the information I need to upgrade the SSO and Web Client in our environment to vSphere 5.5 and relieve all the AD problems we have had. I will post a blog article on this later, as there are still some hiccups in the documentation and procedure that I need to test and get confirmed by VMware support.

Our consolidation process has not been moving that much. Shortly after returning from Barcelona I took part in a live migration of VMs between our data center and a remote server room across a distance of about a kilometer. Without going into details about how everything was connected, suffice it to say that we had a single 10Gbit Ethernet connection between one of our data center routers and one of the server room’s routers. We also had a single FC connection between a storage array in the data center and the blade chassis in the server room. This allowed us to evacuate a single blade in the server room and move it to an identical blade chassis in the data center. After this we used another blade in the server room as the “transport host”. We vMotioned VMs onto it, as it could see both the data center and server room storage arrays. Then we Storage vMotioned the VMs to the data center array and finally vMotioned them onto the host in the data center. Then, one by one, we evacuated all blades in the server room and moved them to the data center. The process took about 2 days, including the move of a few physical hosts, and was all in all very successful.

We had a single error during the move, which caused an unexplained HA restart. The largest of the VMs (1 TB of storage spread over 4 different VMDKs) was set to change format to thin provisioned during the Storage vMotion. But at some point during the migration we got an “unexpected error” (this was the actual message from the vSphere client). 30 seconds later HA spontaneously rebooted the VM, even though Virtual Machine monitoring was disabled and the host didn’t crash. Luckily the VM handled the reboot well, and it occurred close to midnight with no users online.

Right now my colleagues are planning the consolidation of two other VMware installations, which will most likely be done with cold migrations. The number of VMs is small, and the fiber connections and licenses of these installations will not allow us to do a live migration. They are also planning a move similar to the one I worked on, which we hope to complete some time in December. I am working on a cold migration of a VMware installation as well, where most of the VMs will be reinstalled on a new cluster rather than migrated.

That was a status on what we are working on. Now back to the “I will get back to this later”. During the next month I will be working on a test installation of vCloud Automation Center to experiment with it and find out whether this is something we can use in our organization. The initial tests will be confined to the infrastructure department, but if it works out it might be scaled up.

Consolidation continued

A while back I wrote a post about the process of consolidation, where I described the primary aspects of the process for us and the solutions we chose. We are a little ways down the road and a lot of consolidation has already happened.

We have shut down (or emptied and left running) 5 vCenter servers and consolidated 8 clusters and 7 single hosts into our new vCenter setup. The process has been pain free for the most part. In all of these migrations (6 different maintenance windows) we have only encountered a single problem, which I will describe a little later. This has given us quite the track record for performing well under maintenance windows in our little virtualization team 🙂

So the problem we encountered was not in the actual process of disconnecting from one vCenter and connecting to another, but instead in the preparation phase. What happened was that while we were migrating from VDS switches to VSS switches we needed to change port groups for a lot of VMs; PowerCLI to the rescue. This was automated using a translation table in the form of a hash table with the old VDS port group names as keys and the new VSS port group names as values. For each VM, and for each of the network adapters on that VM, look at the port group name and find the new value in the hash table. Simple!
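The lookup logic itself is small enough to sketch. The actual work was done in PowerCLI against a live vCenter; what follows is only a minimal Python illustration of the translation-table idea, and every name in it (`Vm`, `Nic`, `remap_port_groups`, the port group names) is made up for the example and not taken from our scripts.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Nic:
    """Hypothetical stand-in for a VM network adapter."""
    name: str
    port_group: str


@dataclass
class Vm:
    """Hypothetical stand-in for a VM with its network adapters."""
    name: str
    nics: List[Nic] = field(default_factory=list)


def remap_port_groups(
    vms: List[Vm], translation: Dict[str, str]
) -> List[Tuple[str, str, str, str]]:
    """Move every adapter whose port group is a key in `translation`
    to the mapped port group. Adapters on unmapped port groups are
    left untouched. Returns (vm, nic, old, new) for each change made."""
    changes = []
    for vm in vms:
        for nic in vm.nics:
            new_pg = translation.get(nic.port_group)
            if new_pg is None:
                continue  # not part of the migration, leave it alone
            changes.append((vm.name, nic.name, nic.port_group, new_pg))
            nic.port_group = new_pg
    return changes


if __name__ == "__main__":
    # Old VDS port group names as keys, new VSS names as values (invented).
    table = {"dvPG-VLAN10": "PG-VLAN10", "dvPG-VLAN20": "PG-VLAN20"}
    inventory = [
        Vm("web01", [Nic("nic1", "dvPG-VLAN10")]),
        Vm("sql01", [Nic("nic1", "dvPG-VLAN20"), Nic("nic2", "dvPG-Backup")]),
    ]
    for change in remap_port_groups(inventory, table):
        print("%s/%s: %s -> %s" % change)
```

Returning the list of changes rather than just applying them makes it easy to log exactly which adapter was moved and when, which is the kind of record that helps when a support call comes in 15 minutes later.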

But. There is always a but. After changing some 60-70 port groups, the port group of a specific VM was changed, and it seemed to be working fine. About 15 minutes later I got the support call that a website was down on one of the VMs. I started looking and could not see anything wrong (I’m not that fond of IIS web servers and their way of logging!). The network of the web server VM had not been changed yet, so what was causing this 503 Service Unavailable error? And even more odd, it was only one out of 17-20 websites that was not working.

I googled some things, no luck. Grabbed a colleague to help look, no answer. Then Google paid off: the mention of the words “Application pool”. I quickly looked up the application pool of the problem web site and sure enough it was in a stopped state. I started it again and the site was running again. But why had it stopped? The logs show nothing™. So I got to thinking, and the only explanation that made sense to me was that 15 minutes before the call I had changed the port group of the aforementioned VM, which was running the database for the website. A small MSSQL machine. The only logical thing I can see having happened is that changing the port group cut the connection to the database, and the application running in the pool handled it badly.

Now, looking back at it, this is a minor problem. One web site out of 20 running on one VM among hundreds. All in all the consolidation process is running well. We still have a few more vCenters and single hosts to gather under the new setup. And after that comes the process of phasing out old hardware. We have recently got 6 new blade servers to replace some older hardware, some of the most powerful we have had yet: 2×8-core Intel CPUs with 256 GB RAM and, if all goes well, a new storage controller dedicated to these machines and virtualization in general. But moving to new hardware, shutting down old servers, giving new IPs to production machines, none of this is an easy process, and it requires coordination across the entire IT organization.