Upgrading SSO 5.5

*EDIT* After 3 hours on the phone with VMware Technical Support we are finally running again. Permissions on one vCenter were wiped and had to be recreated, and a bug with Win2k12 domain controllers hit us, meaning we had to create each domain as a separate identity source. But we are running again! *EDIT*

 

This is going to be a short post about my current experiences with upgrading the SSO component of vCenter. Short because it is still not working and I am waiting for VMware support to contact me.

So after VMworld I was pushing to upgrade to vSphere 5.5, at least the Web Client and SSO, as this would solve a lot of our AD problems. Planning began, and in November I revived my old test environment from the vSphere 5.1 upgrade. I dusted it off, got it running (more on that in a later post) and started the upgrade. The upgrade itself went fine. After coming online again, though, I could not enumerate users in SSO groups. I could search the AD via the new Integrated Windows Authentication identity source but not enumerate members of SSO groups. The web client would be “loading” for a LOOONG time and the following could be seen over and over again in vmware-sts-idmd.log:

07:07:04,885 WARN   [LdapErrorChecker] Error received by LDAP client: com.vmware.identity.interop.ldap.WinLdapClientLibrary, error code: 81
07:07:04,885 WARN   [ServerUtils] cannot bind connection: [ldap://ADSERVERNAME, null]
07:07:04,885 ERROR  [ServerUtils] cannot establish connection with uri: [ldap://ADSERVERNAME]
07:07:04,885 ERROR  [ActiveDirectoryProvider] Failed to get GC connection to domain aau.dkLdap_sasl_bind failedServer Down

I purged dates and server names from the log. In essence it looked like SSO was attempting to make a Global Catalog (GC) connection to an AD server, but it was selecting a domain controller that was not running the GC service; when the connection attempt failed it decided that the server was down, yet it kept trying to contact that same server. Our AD forest consists of 1 root domain and about 20 subdomains, with about 55 domain controllers in total, of which a bit over half run the GC service. There is no setting to tell SSO which server in the AD forest to talk to, so there was nothing for us to change.
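
For reference, this is roughly how you can check which domain controllers in the forest actually advertise the GC service, using plain PowerShell on a domain-joined machine (a sketch only; the SRV query assumes the forest root is aau.dk as in the log above):

    # List all Global Catalog servers in the current AD forest
    $forest = [System.DirectoryServices.ActiveDirectory.Forest]::GetCurrentForest()
    $forest.GlobalCatalogs | Select-Object Name, SiteName

    # Or ask DNS directly for the GC SRV records of the forest root
    Resolve-DnsName -Name "_gc._tcp.aau.dk" -Type SRV | Select-Object NameTarget, Port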

I got the following from VMware technical support a few days later:

Procedure

1 Log in to the vSphere Web Client as administrator@vsphere.local or as another user with vCenter Single Sign-On administrator privileges.

2 Browse to Administration > Single Sign-On > Configuration.

3 On the Identity Sources tab, select an identity source and click the Set as Default Domain icon.

In the domain display, the default domain shows (default) in the Domain column.

Please restart the services and login to the web client/ vi client.

When I tried that it seemed to work. But when I reverted the change it still worked, leading me to think that it was not the above that fixed it but simply time, as SSO would at some point select another server on its own. I was assured over the phone by VMware technical support that customers who had seen the above error had fixed it with this procedure.

But now, to tell you the truth, the above log snippet is from today. I went ahead and upgraded SSO and the Web Client yesterday and got stuck with this error after spending about 2 hours reconnecting/repointing a vCenter 5.1 to the new lookup service. The repointing requires you to change the vpxd.cfg file, as the path to the STS service is hard-coded in that file and it points to the old /ims URL and not the new /sts URL. It also caused a total wipe of permissions on the one vCenter I repointed, causing some inconsistent behavior in the web client: looking under vCenter I cannot see it, but if I browse down through Administration -> Licensing and find it there, I can see all objects with the admin@System-Domain account. That is the only account left with permissions on the vCenter.
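
For what it’s worth, the URL swap itself in vpxd.cfg boils down to a text substitution. Here is a minimal sketch, assuming the default Windows install path for vpxd.cfg; the old and new STS URLs below are placeholders and should be taken from your own Lookup Service or from VMware support, and you should back up the file first:

    # Sketch only: substitute the old STS URL with the new one in vpxd.cfg, then restart vCenter.
    $cfg    = 'C:\ProgramData\VMware\VMware VirtualCenter\vpxd.cfg'   # default path on a Windows vCenter 5.x server
    $oldUrl = 'https://sso.example.local:7444/ims/STSService'         # placeholder old 5.1-style URL
    $newUrl = 'https://sso.example.local:7444/sts/STSService'         # placeholder new 5.5-style URL
    Copy-Item $cfg "$cfg.bak"
    (Get-Content $cfg) -replace [regex]::Escape($oldUrl), $newUrl | Set-Content $cfg
    Restart-Service vpxd   # the vCenter Server service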

I’m back at the office early this morning, slept like crap due to this. This is going to be a long day!

Since the last time

So it’s been a while since my last post; a lot has been going on. I have been through my first “employee performance interview” (Medarbejderudviklingssamtale, or MUS, in Danish). It was good and a lot of things were discussed regarding the new organization. Some steps to increase my skills were also planned, and I will get back to that later.

Since last time I attended VMworld Europe 2013 in Barcelona! It was an awesome conference as always and I brought a lot of new ideas home with me. One of the things I did differently this year compared to the previous two was to spend a lot more time on the Solutions Exchange. I focused primarily on storage vendors, as I have taken a fancy to the new flash-accelerated and all-flash storage systems. I think I visited every booth with even the slightest connection to storage.

I also had the chance to discuss some of the new technologies coming out of VMware, as well as the upgrade procedure for vSphere 5.5 when running SSO behind a load balancer. That was really useful and insightful and provided me with most of the information I need to upgrade the SSO and Web Client in our environment to vSphere 5.5 and relieve all the AD problems we have had. I will post a blog article on this later, as there are still some hiccups in the documentation and procedure that I need to test out and get confirmation on from VMware support.

Our consolidation process has not been moving that much. Shortly after returning from Barcelona I took part in a live migration of VMs between our data center and a remote server room about a kilometer away. Without going into details about how everything was connected, suffice it to say that we had a single 10Gbit Ethernet connection between one of our data center routers and one of the server room’s routers. We also had a single FC connection between a storage array in the data center and the blade chassis in the server room. This allowed us to evacuate a single blade in the server room and move it to an identical blade chassis in the data center. After this we used another blade in the server room as the “transport host”: we vMotioned VMs onto it, as it could see both the data center and server room storage arrays, then Storage vMotioned the VMs to the data center array and finally vMotioned them onto the host in the data center. Then, one by one, we evacuated all the blades in the server room and moved them to the data center. The process took about 2 days, including the move of a few physical hosts as well, and was all in all very successful.
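
To illustrate the three hops for a single VM, here is a minimal PowerCLI sketch (the host, datastore and VM names are made up for the example):

    # Sketch of the three-step live move for one VM; all names are hypothetical.
    $vm = Get-VM -Name "someVM"
    # 1. vMotion to the transport host, which can see both storage arrays
    Move-VM -VM $vm -Destination (Get-VMHost "transport-blade.example.local")
    # 2. Storage vMotion the VM to the data center array
    Move-VM -VM $vm -Datastore (Get-Datastore "dc-datastore01")
    # 3. vMotion to the final host in the data center
    Move-VM -VM $vm -Destination (Get-VMHost "dc-blade01.example.local")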

We had a single error during the move, which caused an unexplained HA restart. The largest of the VMs (1 TB of storage spread across 4 different VMDKs) was set to change format to thin provisioned during the Storage vMotion. At some point during the migration we got “an unexpected error” (that was the actual message from the vSphere client). 30 seconds later HA spontaneously rebooted the VM, even though Virtual Machine monitoring was disabled and the host didn’t crash. Luckily the VM handled the reboot well, and it occurred close to midnight with no users online.

Right now my colleagues are planning the consolidation of two other VMware installations, which will most likely be done with cold migrations. The number of VMs is small, and the fiber connections and licenses of these installations will not allow us to do a live migration. They are also planning a move similar to the one I worked on, which we hope to complete some time in December. I am working on a cold migration of a VMware installation as well, where most of the VMs will be reinstalled on a new cluster rather than migrated.

That was a status on what we are working on. Now back to the “I will get back to this later”: during the next month I will be working on a test installation of vCloud Automation Center to experiment with it and research whether this is something we can use in our organization. The initial tests will be confined to the infrastructure department, but if it works out it might be scaled up.

The Missing Pop Up

During the work of consolidating several VMware installations into a single platform I have to manage permissions. Most of the installations were maintained by a single person or a small group of people, so permissions were not that complex. They were also mostly the same: assign the Administrator role at the vCenter level to everyone involved.

When moving to the new platform this is not really possible. I, as part of the virtualization group, of course have administrative privileges on the two vCenter servers and everything below. But the old administrators still need their permissions, since they continue to maintain legacy systems. We decided to grant this access at the cluster level. This also required us to create a folder for each old installation in the VMs and Templates view, the Datastores and Datastore Clusters view and the Networking view. On the clusters and these folders we granted Administrator rights, roughly as sketched below.
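
In PowerCLI terms the per-installation grant looks something like this (a sketch only; the cluster, folder and group names are made-up examples):

    # Sketch: grant the old administrators the Administrator role on their cluster and their VM folder.
    # All entity and principal names are hypothetical.
    $principal = "AAU\legacy-installation-admins"
    New-VIPermission -Entity (Get-Cluster "LegacyClusterA") -Principal $principal -Role "Admin" -Propagate:$true
    New-VIPermission -Entity (Get-Folder "LegacyInstallationA" -Type VM) -Principal $principal -Role "Admin" -Propagate:$true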

Now comes the funny part. We had not noticed the problem of a missing popup until a persistent administrator was testing his new rights. The cluster he was on was set to DRS Manual mode, which of course requires you to choose which host to start a VM on when powering it on. But he wasn’t getting the popup in the web client. At first I thought it was a web client problem, but it turned out to be something else.

He would choose the action as below, Power On:

[screenshot 01]

In my C# client I could see the following happening:

[screenshot 02]

But the VM wasn’t powering on:

[screenshot 03]

I tried powering it on manually, saw the popup in my client, and wondered why he didn’t get it. That is when I realized what might be the problem and tested it out. The problem was that powering on a VM and selecting which host to run it on when in DRS Manual mode is a Datacenter privilege, and as the user only had permissions on folders and clusters below the Datacenter, he was not seeing the popup. If I added his permissions on the Datacenter (without propagating) he would see the popup as below. This only happened in DRS Manual mode:

[screenshot 04]
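
The fix in PowerCLI would look roughly like this (a sketch; the datacenter name and account are hypothetical):

    # Sketch: add the user's role at the Datacenter level without propagation,
    # so the host-selection popup appears when powering on VMs in a DRS Manual cluster.
    New-VIPermission -Entity (Get-Datacenter "DC-AAU") -Principal "AAU\legacyadmin" -Role "Admin" -Propagate:$false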

Consolidation continued

A while back I wrote a post about the process of consolidation, describing the primary aspects of the process for us and the solutions we chose. We are a little way down the road now and a lot of consolidation has already happened.

We have shut down (or emptied and left running) 5 vCenter servers and consolidated 8 clusters and 7 single hosts into our new vCenter setup. The process has been pain-free for the most part. In all of these migrations (6 different maintenance windows) we have only encountered a single problem, which I will describe a little later. This has given our little virtualization team quite the track record for performing well during maintenance windows 🙂

So the problem we encountered was not in the actual process of disconnecting from one vCenter and connecting to another, but in the preparation phase. While we were migrating from VDS switches to VSS switches we needed to change port groups for a lot of VMs; PowerCLI to the rescue. This was automated using a translation table in the form of a hash table with the old VDS port group names as keys and the new VSS port group names as values. For each VM, and for each of the network adapters on the VM, look up the current port group name and find the new name in the hash table. Simple!
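
A minimal sketch of that script (the cluster and port group names are made-up examples):

    # Sketch: translate old VDS port group names to new VSS port group names.
    $portGroupMap = @{
        "dvPG-Servers-VLAN10" = "vss-Servers-VLAN10"
        "dvPG-Clients-VLAN20" = "vss-Clients-VLAN20"
    }

    foreach ($vm in Get-Cluster "LegacyClusterA" | Get-VM) {
        foreach ($nic in Get-NetworkAdapter -VM $vm) {
            $newName = $portGroupMap[$nic.NetworkName]
            if ($newName) {
                # Reconnect the adapter to the corresponding standard switch port group
                Set-NetworkAdapter -NetworkAdapter $nic -NetworkName $newName -Confirm:$false
            }
        }
    }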

But. There is always a but. After changing some 60-70 port groups, the port group of a specific VM was changed, and it seemed to be working fine. About 15 minutes later I got a support call that a website was down on one of the VMs. I started looking and could not see anything wrong (I’m not that fond of IIS web servers and their way of logging!). The network of that VM had not been changed yet, so what was causing this 503 Service Unavailable error? Even more odd, it was only one out of the 17-20 websites on the server that was not working.

I googled some things, no luck; grabbed a colleague to help look, no answer. Then Google paid off with a mention of the words “application pool”. I quickly looked up the application pool of the problem website, and sure enough it was in a stopped state. I started it again and the site was running again. But why had it stopped? The logs showed nothing™. So I got to thinking, and the only explanation that made sense to me was that 15 minutes before the call I had changed the port group of the VM running the database for the website, a small MSSQL machine. The only logical thing I can see having happened is that changing the port group cut the connection to the database, and the application running in the pool handled it badly.

Looking back at it, this is a minor problem: one website out of 20 running on one VM out of hundreds. All in all the consolidation process is running well. We still have a few more vCenters and single hosts to gather under the new setup, and after that comes the process of phasing out old hardware. We have recently got 6 new blade servers to replace some older hardware, some of the most powerful we have had yet: 2×8-core Intel CPUs with 256 GB RAM and, if all goes well, a new storage controller dedicated to these machines and virtualization in general. But moving to new hardware, shutting down old servers and giving new IPs to production machines is not an easy process and requires coordination across the entire IT organization.

Migrating VMkernels

Today I’m going to show you a little trick I found in the vSphere C# client that I have not seen anyone mention before; I didn’t even find it myself until last week. The reason I found it is simple: I needed to live-migrate a VMkernel.

Scenario:

You have one or more Distributed Switches (VDS) which have two or more uplinks. You need to move e.g. the VMkernel NIC handling Management Traffic from the VDS to a normal Virtual Switch (VS) on the host. And you want to do this without losing connectivity from vCenter to the ESXi host.

Reason:

Why would you need this, you may ask. For us the reason was simple: we needed to move ESXi hosts with live VMs from one vCenter to another. If all traffic, including management and vMotion, is handled by VDSs, disconnecting from the vCenter would remove connectivity, as VDSs are a vCenter/cluster construct.

I searched a lot for how to do it and always ended up with the solution “use the DCUI, restore a management switch and VMkernel from there and reattach the uplinks”. But that would briefly disconnect the ESXi host from vCenter, and I would have to log in to each host’s remote console (DRAC, iLO etc.). Tedious work.

Method:

This nifty little trick:

[screenshot vmk1]

When selecting a VMkernel and clicking Migrate you get a dialog like the one below, where you select the VS you want to migrate to:

[screenshot vmk2]

And in the next step set a name and a VLAN ID for the port group and you are done:

[screenshot vmk3]

Easy. Just remember the important step: to avoid downtime and losing connectivity you need to first move one of the uplinks to the VS you want to migrate to, so that there is an available physical connection. Once the VMkernel has been migrated you can move the other uplink(s) to the VS and you are done! Easy, and all done from the vSphere client! If you are a real pro most of it could probably be scripted as well, but since we only have a few hosts I felt more comfortable doing it by hand, one host at a time.
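
For the curious, a scripted version might look roughly like this in PowerCLI (a sketch only; host, NIC, switch and port group names are assumptions, and Add-VirtualSwitchPhysicalNetworkAdapter can move an uplink and a VMkernel NIC in a single operation):

    # Sketch: migrate vmk0 (management) from a VDS to a standard switch on one host.
    $vmhost = Get-VMHost "esx01.example.local"
    $vss    = New-VirtualSwitch -VMHost $vmhost -Name "vSwitch-Mgmt"
    $pg     = New-VirtualPortGroup -VirtualSwitch $vss -Name "Management" -VLanId 10
    $pnic   = Get-VMHostNetworkAdapter -VMHost $vmhost -Physical -Name "vmnic1"
    $vmk    = Get-VMHostNetworkAdapter -VMHost $vmhost -VMKernel -Name "vmk0"

    # Move the uplink and the VMkernel NIC to the standard switch in one call,
    # so the host never loses its physical management path.
    Add-VirtualSwitchPhysicalNetworkAdapter -VirtualSwitch $vss -VMHostPhysicalNic $pnic `
        -VMHostVirtualNic $vmk -VirtualNicPortgroup $pg -Confirm:$false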

The process of consolidation

Hello readers.

This will be the first real post on this blog. Today I’m going to describe the process of consolidation of IT units with a specific focus on the virtualization part of this.

I’ll start with some basic info. I work at Aalborg University (AAU) and have done so for almost 5 years, 4 of which as a part-time student employee. In September 2012, shortly after I was hired full-time, the process of consolidating all IT units at Aalborg University started with the hiring of a CIO to manage the new organization. Her work over the following months resulted in the hiring of 4 new IT chiefs with responsibility for 4 different departments in the new organization: Process and Strategy, Applied IT and Development, Infrastructure Services and Support Services.

All employees of AAU’s different IT departments were then moved to this new organization this March. I was moved to the Infrastructure Services department and, within it, the Datacenter and Networking group. Two colleagues and I will be maintaining and developing AAU’s virtualization efforts going forward. Primarily this is, as mentioned earlier, VMware, but in the future we might look at other solutions as well.

Now to the more technical part. As I mentioned earlier, AAU has 7 different vCenter servers and 16 distinct clusters of 1 or more hosts (I know a single host cannot really be called a cluster 🙂 ), consisting of a total of a little over 50 hosts. On these 50 hosts somewhere around 1000 VMs are running in some form. My old IT department had 2 of these clusters, each with its own vCenter server. We had a total of 13 hosts (8 and 5 respectively) running just under 300 of the virtual machines. We had by far the largest, newest and most developed environment.

So what is the golden solution for us then? Across the Infrastructure department we are working on consolidating all servers, storage, services and some networking into two data centers and a backup location. We were lucky to get a completely new data center last year with 160 kW of cooling and power capacity. This will be our primary data center, and one of the older server rooms will be adapted to work as the secondary. We are required to have a backup location separate from the two data centers; this is where all backups from the two production locations will be stored.

In the new vSphere setup that we recently started using and migrating to, we have maintained a design where each data center has a vCenter and, when we are done, 3 clusters: one cluster for infrastructure machines (AD, DNS, Exchange etc.), one for customer machines (people outside the IT department needing virtualization) and one for testing purposes. I would draw you a diagram of the setup but I’m currently not on a machine with anything proper to draw it with :)

We settled (despite some warnings) on a vCenter setup consisting of 6 machines: 2 SSO nodes in an HA configuration (one in each data center, for now behind an Apache load balancer) with their database on an MSSQL 2012 cluster, 2 vCenter servers each with their respective Inventory Service, and 2 Web Client servers that will in the future be placed behind a physical load balancer to provide some load distribution and availability of access. We are aware that database clusters are not supported for either SSO or vCenter itself, but having run vCenter on an MSSQL 2008 cluster for a little over 3 years we have seen very few problems, most of them fixed by restarting the vCenter service.

The setup was tested on a small scale and shown to work, but of course things change when deploying to production. We deployed the servers to a new data center network setup, meaning we needed to modify filters on a lot of networks to give the new vCenter servers access to the ESXi hosts on old networks. We also needed to deploy on the new SQL cluster, which was having networking problems. We then ran into some collation problems (we concluded they were caused by failing over the empty databases in our SQL cluster), but once the databases were initialized the problem disappeared.

So after spending roughly 1½ weeks installing and prepping the new environment with my colleagues we were ready to deploy. We then proceeded to test with one of the clusters by first moving one host from the old vCenter to the new and back. Success. A week later a maintenance window was planned, and the entire cluster of 5 hosts, including VMs, was moved live from the old vCenter to the new with only a single IIS service having a hiccup in the process.

The details of the moving process will be described in a later post, as there were some more or less complex steps that needed to be taken for the process to run smoothly, including moving VMs and VMkernels from distributed switches to standard virtual switches, copying resource pools and folders etc. Thank god for PowerCLI 🙂

That was it for the first post, I hope you enjoyed reading it :)

And we have a go for launch!

So this blog is live! Very nice. I have been wanting to get into writing about the things we are doing at my workplace with respect to virtualization. Some things just as notes to myself, and other things to give others a tip on how we decided to solve certain problems/assignments.

Not much to say yet, but over the next couple of days I will try to write some things about the current process of consolidating 7 distinct vCenter installations and 16 separate clusters, consisting of a total of about 56 hosts, into one joint setup with 2 vCenters and, at the end of the tunnel, 6 clusters. It will contain some tips on preparing for this, some of the steps that we have taken, and also some of the design choices we made with at least some degree of argumentation for why we chose them 🙂

Until then, have fun and virtualize!