Microsoft NLB and the consequences

Hello All

I am not usually one to bash certain pieces of technology over others, at least not in public. I know which things I prefer and which I avoid. But after having spent the better part of a work day cleaning up after Microsoft Network Load Balancing (NLB), I have to say that I am not amused!

We are currently working on decommissioning an old HP switched network and moving all the involved VMs and hosts to our new Nexus infrastructure. This is a long process, at least when you want to minimize downtime, and the two switching infrastructures are very different. I am a virtualization administrator with responsibility for a lot of physical hardware as well, so for the last month or two I have been planning this migration and the coming weeks' work of moving from the old infrastructure to the new.

Everything was ready. A layer 2 connection had been established between the two infrastructures, allowing seamless migration between them, interrupted only by the path change and, for the physical machines, by unplugging one cable and connecting a new one. No IP address changes, no gateway changes. Just changing uplinks. And it worked: a VM would resume its connections as soon as it moved to a host with the new uplink. Perfect!

Then disaster struck. Our Exchange setup started creaking and within 20 minutes ground to a halt. Something was wrong, but only on the client access layer. We quickly realized that the problem was that one of the 4 nodes in the NLB cluster running the CAS service had been moved to the new infrastructure. I hadn’t noticed it because the nodes all still responded to ping and RDP, but the NLB cluster was broken.

The reason: we use NLB in multicast mode. That means that on our old Catalyst switch/routers we had a special configuration that mapped the cluster's unicast IP to a multicast MAC address and sent that traffic in the direction of the old infrastructure. This is a static configuration, so when we started changing the locations of the CAS servers it broke. Hard! Within an hour we had stabilized things by moving two of the 4 nodes together onto the same ESXi host on the new network and changing the static configuration on the Catalyst switch. But that left two nodes on the old HP network unable to help take the load.
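
For anyone who has not met this kind of configuration: in plain multicast mode, NLB derives the cluster MAC address from the cluster IP by putting 03-BF in front of the four IP octets, and that derived MAC is what a static mapping on the router has to point at. Here is a minimal Python sketch of that derivation. The cluster IP 192.0.2.10 is a placeholder documentation address, not the one from this setup, and the function names are made up for illustration.

```python
import ipaddress


def nlb_multicast_mac(cluster_ip: str) -> str:
    """Return the MAC that NLB uses in (non-IGMP) multicast mode.

    The MAC is 03-BF followed by the four octets of the cluster IP.
    IGMP multicast mode uses a different scheme and is not covered here.
    """
    octets = ipaddress.IPv4Address(cluster_ip).packed
    return "-".join(f"{b:02X}" for b in (0x03, 0xBF, *octets))


def dotted_mac(mac: str) -> str:
    """Reformat AA-BB-CC-DD-EE-FF as aabb.ccdd.eeff (the style Cisco gear displays)."""
    digits = mac.replace("-", "").lower()
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))


if __name__ == "__main__":
    mac = nlb_multicast_mac("192.0.2.10")  # placeholder cluster IP
    print(mac)
    print(dotted_mac(mac))
```

The point is that this MAC never shows up through normal ARP learning the way a regular host MAC does, so the IP-to-MAC mapping and the direction the traffic is sent in have to be maintained by hand and updated whenever the cluster nodes move.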

We have spent the entire morning planning what to do and how to test it. None of us had thought of NLB as a problem, but had we remembered this static multicast MAC configuration we might have avoided this.

My takeaway from this: avoid special configurations. Keep it as standard as possible. If you find you need a custom configuration, you should stop and reconsider whether you are doing it correctly.