Current problems with vSphere Single Sign On

Hello again

Today I’m going to highlight some of the bugs/problems we have run into with vSphere Single Sign On (SSO), introduced in vSphere 5.1.

Using Active Directory groups from a Multi-Domain Forest with Parent-Child trust:

We are using an Active Directory (AD) forest that consists of a single root domain and around 20 sub domains, all a single level down. That is, we have a root AD domain like example.com and, one level below it, all the other domains, mostly of the form department.example.com. All users exist in one of the sub domains; none are placed in the root domain, which contains only service accounts, groups and servers.

SSO as of version 5.1 U1b only supports External Two-way trusts (see KB:2037410), so at first you would think it won’t work. Yet when you try adding a group from AD, you are able to do so. Authenticating as a user in that group doesn’t work, though, while adding the user itself instead of the group does. VMware has stated to us that this is a problem with the transitivity of the trust in Parent-Child trusts, which does not exist in the External Two-way trust. I suspect that this is also caused by another “bug” that we have noticed, which I will explain below.

Workaround: use local SSO groups with AD users. This works just fine.

Users with same sAMAccountName in different domains in same forest cannot be used:

We discovered last week that we were unable to add users to permissions or SSO groups if SSO already knew a user with the same sAMAccountName in a different domain. VMware confirmed this as a bug which will be fixed in an upcoming release. The bug only occurs in child domains in an AD forest.
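To get an idea of which accounts would be hit by this, the collisions can be found up front from a list of user principal names. Below is a minimal Python sketch, assuming the sAMAccountName equals the part before the @ in the UPN (which is not always true in AD). The helper `find_samaccountname_collisions` and the example users are made up for illustration and are not part of any VMware tool.

```python
from collections import defaultdict


def find_samaccountname_collisions(upns):
    """Return account names that appear in more than one domain.

    Assumption: the sAMAccountName is the part before '@' in the UPN.
    """
    domains_by_name = defaultdict(set)
    for upn in upns:
        name, _, domain = upn.partition("@")
        domains_by_name[name.lower()].add(domain.lower())
    return {
        name: sorted(domains)
        for name, domains in domains_by_name.items()
        if len(domains) > 1
    }


# Hypothetical users in two child domains of example.com
users = [
    "jdoe@math.example.com",
    "jdoe@physics.example.com",
    "asmith@math.example.com",
]
collisions = find_samaccountname_collisions(users)
print(collisions)
```

Any name returned here is a candidate for the problem described above if both accounts need permissions in the same SSO.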

Users are presented in the SSO with wrong domain when added:

This is a small bug, but when adding a user (or group) for the first time, the user is added as username@example.com rather than the correct username@department.example.com, even though the correct name is shown when adding and the correct domain is selected in the “Add user” dialog. The result is that the first time the user logs in nothing is shown, despite the user having permissions. Having the user log out and then removing and re-adding the user solves the problem, as if SSO figures out on first login that the user is in the sub domain. I suspect that this has something to do with the first problem as well.
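If you keep a list of the users you intended to add, a quick comparison against what SSO actually stored will point out the affected entries. This is only a sketch in Python: `stored_with_root_domain` is a hypothetical helper, the pairs are invented example data, and how you export the stored principals from SSO is left out entirely.

```python
def stored_with_root_domain(expected_upn, stored_principal, root_domain="example.com"):
    """True if the same user name was stored under the forest root
    instead of the child domain it was selected from."""
    exp_name, _, exp_domain = expected_upn.lower().partition("@")
    got_name, _, got_domain = stored_principal.lower().partition("@")
    return (
        exp_name == got_name
        and exp_domain != got_domain
        and got_domain == root_domain.lower()
    )


# (what was selected in the dialog, what ended up in SSO) -- invented data
pairs = [
    ("jdoe@department.example.com", "jdoe@example.com"),
    ("asmith@department.example.com", "asmith@department.example.com"),
]
affected = [expected for expected, stored in pairs
            if stored_with_root_domain(expected, stored)]
print(affected)
```

The users in `affected` are the ones that would need the log out, remove and re-add treatment described above.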

Cannot remove user from SSO group:

Despite KB:2037102 stating that the bug was solved in vSphere 5.1 U1a, we are experiencing it on this version as well. We have not reported this yet but will do so tomorrow morning. For some users no error is shown but the user is not removed; for others an error is shown saying that the principal could not be removed. In the imsTrace log a Java exception shows that the principal cannot be removed because it does not exist in the group, even though the web client shows the user.

 

This post may be updated in the future:)

Migrating VMkernels

Today I’m going to show you a little trick in the vSphere C# client that I have not previously seen anyone mention; I didn’t even find it myself until last week. The reason I found it is simple: I needed to live migrate a VMkernel.

Scenario:

You have one or more Distributed Switches (VDS) which have two or more uplinks. You need to move e.g. the VMkernel NIC handling Management Traffic from the VDS to a normal Virtual Switch (VS) on the host. And you want to do this without losing connectivity from vCenter to the ESXi host.

Reason:

Why would you need this, you may ask? For us the reason was simple: we needed to move ESXi hosts with live VMs from one vCenter to another. If all traffic, including management and vMotion, is handled by VDSs, disconnecting from the vCenter would remove connectivity, as VDSs are a vCenter/cluster construct.

I searched a lot to find out how to do it and always ended up with the solution “Use the DCUI and restore management switch and vmkernel from there and reattach uplinks”. But that would briefly disconnect the ESXi host from the vCenter, and I would have to log in to each host’s remote console software (DRAC, iLO etc). Tedious work.

Method:

This little nifty trick:

vmk1

When marking a VMkernel and clicking migrate you get a dialog like below where you select the VS you want to migrate to:

vmk2

And in the next step set a name and a VLAN ID for the port group and you are done:

vmk3

Easy. Just remember, and this is the important step: to avoid downtime and losing connectivity you need to first move one of the uplinks to the VS you want to migrate to, so that there is an available physical connection. Once you have migrated you can move the other uplink(s) to the VS and you are done! Easy, and all done from the vSphere client! If you are a real pro most of it could probably be scripted as well, but since I only had a few hosts I felt more comfortable just doing it by hand, one at a time.
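To make the ordering explicit, here is a small Python model of the three steps. It does not talk to vCenter or ESXi at all; it is just a toy simulation, and the switch, NIC and VMkernel names (`dvSwitch0`, `vSwitch0`, `vmnic0`, `vmnic1`, `vmk0`) are assumed examples. The only rule it encodes is that the VMkernel stays reachable as long as the switch it lives on has at least one uplink.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    name: str
    uplinks: set = field(default_factory=set)
    vmkernels: set = field(default_factory=set)


def management_reachable(switches, vmk):
    """The VMkernel is reachable if the switch holding it has an uplink."""
    return any(vmk in s.vmkernels and bool(s.uplinks) for s in switches)


def move_uplink(src, dst, nic):
    src.uplinks.remove(nic)
    dst.uplinks.add(nic)


def migrate_vmkernel(src, dst, vmk):
    src.vmkernels.remove(vmk)
    dst.vmkernels.add(vmk)


# Assumed starting point: everything on the distributed switch
vds = Switch("dvSwitch0", uplinks={"vmnic0", "vmnic1"}, vmkernels={"vmk0"})
vss = Switch("vSwitch0")
switches = [vds, vss]

steps = [
    ("move first uplink to the VS", lambda: move_uplink(vds, vss, "vmnic0")),
    ("migrate vmk0 to the VS", lambda: migrate_vmkernel(vds, vss, "vmk0")),
    ("move remaining uplink", lambda: move_uplink(vds, vss, "vmnic1")),
]

log = []
for label, step in steps:
    step()
    ok = management_reachable(switches, "vmk0")
    log.append((label, ok))
    print(f"{label}: management reachable = {ok}")
```

In this model, swapping the first two steps would put vmk0 on a switch without any uplink, which is exactly the situation where you lose the host.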

The process of consolidation

Hello readers.

This will be the first real post on this blog. Today I’m going to describe the process of consolidation of IT units with a specific focus on the virtualization part of this.

I’ll start with some basic info. I work at Aalborg University (AAU) and have done so for almost 5 years, 4 of them as a part-time student employee. In September 2012, shortly after I was hired full-time, the process of consolidating all IT units at Aalborg University started with the hiring of a CIO to manage the new organization. Her work over the following months resulted in the hiring of 4 new IT chiefs responsible for 4 different departments in the new organization: Process and Strategy, Applied IT and Development, Infrastructure Services and Support Services.

All employees of AAU’s different IT departments were then moved to this new organization this March. I was moved to the Infrastructure Services department and, within this department, the Datacenter and Networking group. Two colleagues and I will in the future be maintaining and developing AAU’s virtualization efforts. Primarily this is, as mentioned earlier, VMware, but in the future we might look at other solutions as well.

Now to the more technical part. As I mentioned earlier, AAU has 7 different vCenter servers and 16 distinct clusters of 1 or more hosts (I know a single host cannot really be called a cluster 🙂 ), with a total of a little over 50 hosts. On these 50 hosts somewhere around 1000 VMs are running in some form. My old IT department had 2 of these clusters, each with its own vCenter server. We had a total of 13 hosts (8 and 5 respectively) running just under 300 of the virtual machines. We had by far the largest, newest and most developed environment.

So what is the golden solution for us then? All over the Infrastructure department we are working on consolidating all servers, storage, services and some network into two data centers and a backup location. Last year we were lucky to get a completely new data center with 160 kW of cooling and power capacity. This will be our primary data center, and one of the older server rooms will be adapted to work as the secondary. We are required to have a backup location that is separate from the two data centers; this is where all backups from the two production locations will be stored.

In the new vSphere setup that we recently started using and migrating to, we have maintained a design where each data center has a vCenter and, when we are done, 3 clusters: one cluster for infrastructure machines (AD, DNS, Exchange etc), one for customer machines (people external to the IT department needing virtualization) and one for testing purposes. I would draw you a diagram showing the setup, but currently I’m not using a machine with anything proper to draw it with:)

We settled (despite some warnings) on a vCenter setup consisting of 6 machines: 2 SSO nodes in HA (one in each data center, for now with an Apache load balancer) with a database on an MSSQL 2012 cluster, 2 vCenter servers with their respective Inventory Service, and 2 Web Client servers that in the future will be placed behind a physical load balancer to provide some load distribution and availability of access. We are aware that database clusters are not supported for either SSO or vCenter itself, but having run vCenter on an MSSQL 2008 cluster for a little over 3 years we have seen very few problems, most of them fixed by restarting the vCenter service.

The setup was tested at small scale and shown to work, but of course things change when deploying to production. We deployed the servers to a new data center networking setup, meaning we needed to modify filters on a lot of networks to establish access from the new vCenter servers to the ESXi hosts on old networks. We needed to deploy on the new SQL cluster, which was having networking problems. We then ran into some collation problems (we came to the conclusion that they were caused by failing over the empty databases in our SQL cluster), but with the databases initialized the problem disappeared.

So after spending roughly 1½ weeks on installing and prepping the new environment with my colleagues, we were ready to deploy. We then proceeded to test with one of the clusters by first moving one host from the old vCenter to the new and back. Success. A week later a maintenance window was planned and the entire cluster of 5 hosts, including VMs, was moved live from the old vCenter to the new, with only a single IIS service having a hiccup in the process.

The details of the moving process will be described in a later post as there were some more or less complex steps needed to be taken for the process to run smoothly including moving VMs and VMkernels from distributed switches to virtual switches, copying resource pools and folders etc. Thank god for PowerCLI 🙂

That was it for the first post, hope you enjoyed reading it:)