Failed to get size of IP buffer error

Hello everyone

Just a brief post today. Back at the start of January we saw an older Server 2008 32-bit server showing the error in the title. It would spam the alert in the event log until the server became inaccessible. Not much could be found about the error, but I did find this post from Alex575, who also saw it in January.

As the post had no answers, I decided to follow it and try to work out a solution. We haven’t updated ESXi and Tools beyond build 9359 since ESXi 5.5 U3, so I started thinking that maybe the new VMware Tools 10 package could solve the issue, since the event log entries came from the Tools service (vmsvc).

We upgraded the server’s Tools version to build 10245 (version 10.0.5), and where it used to crash every 10 days it has not crashed since (14 days and counting).
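
As a side note, a quick PowerCLI one-liner (my own sketch, not part of the original troubleshooting) makes it easy to spot other guests still running older Tools builds:

# My own sketch: list each VM's Tools build number and upgrade status
Get-VM | Select-Object Name, @{N="ToolsVersion";E={$_.ExtensionData.Guest.ToolsVersion}}, @{N="ToolsStatus";E={$_.ExtensionData.Guest.ToolsVersionStatus}}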

From version 10, VMware Tools ships outside of vSphere releases, as blogged by Brian Graf here: https://blogs.vmware.com/vsphere/2015/09/vmware-tools-10-0-0-released.html

The 10.0.5 release can be downloaded here: https://my.vmware.com/group/vmware/details?downloadGroup=VMTOOLS1005&productId=491

PowerCLI: Datastore Cluster and Tags

I was trying to help out a colleague yesterday when I realized that a quick fix to his problem would be to tag the datastore clusters in our environment and retrieve them based on those tags, instead of trying to work out which datastore cluster to choose when deploying a VM from PowerCLI.

So I decided to do this quickly and will show what I did (the code snippets are from my vSphere 6.0 lab, but it is the same on our 5.5 production environment).

New-TagCategory -Name "CDC" -Cardinality Single -EntityType DatastoreCluster
New-Tag -Name "DC2" -Category CDC
Get-DatastoreCluster DatastoreCluster | New-TagAssignment -Tag "DC2"

I hope we can agree that I have created a new tag category that applies to datastore clusters and allows one tag per object, that we have created a tag in this category called “DC2”, and that we have assigned the tag to the datastore cluster “DatastoreCluster”. Now if I run the following, I get what I would expect:

C:\> Get-DatastoreCluster DatastoreCluster | Get-TagAssignment

Tag                                      Entity
---                                      ------
CDC/DC2                                  DatastoreCluster
C:\>

But if I run this, I get something that I did not expect:

C:\> Get-DatastoreCluster -Tag "DC2"
C:\>

This means it is not working the same way as it does for virtual machines with the “Get-VM” cmdlet:

C:\> New-TagCategory -Name "VMTest" -Cardinality Single -EntityType VirtualMachine
Name                                     Cardinality Description
----                                     ----------- -----------
VMTest                                   Single
C:\> New-Tag -Name "Test" -Category "VMTest"
Name                           Category                       Description
----                           --------                       -----------
Test                           VMTest
C:\> Get-VM testvm01 | New-TagAssignment Test
Tag                                      Entity
---                                      ------
VMTest/Test                              testvm01
C:\> get-vm | Get-TagAssignment
Tag                                      Entity
---                                      ------
VMTest/Test                              testvm01
C:\> get-vm -Tag "Test"
Name                 PowerState Num CPUs MemoryGB
----                 ---------- -------- --------
testvm01             PoweredOff 1        4,000

So I do not know if this is the way it was meant to work, but it is definitely not what I expected!
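
Until that is sorted out, a workaround (my own sketch, not an official fix) is to filter on the tag assignments instead of using the -Tag parameter:

# Workaround sketch: find datastore clusters via their tag assignments instead of -Tag
Get-DatastoreCluster | Get-TagAssignment | Where-Object { $_.Tag.Name -eq "DC2" } | Select-Object -ExpandProperty Entity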

vRealize Operations 6.1 is out!

As of midnight Danish local time, vRealize Operations 6.1 is out! This is great, as we have been waiting for this release to fix some issues we have been having with our environment running 6.0.2. The last communication from VMware Technical Support, a month ago, was that our two remaining problems would be fixed in this release.

I’ve looked through the list of fixes but did not see them mentioned directly, so I’m hoping they still made it 🙂

Release notes can be found here.

UPDATE: Upgrading with the VA-OS PAK file worked, but applying the VA PAK file failed to complete. The logs showed that it was the conversion from xDB to Cassandra that failed. VMware tech support were fast today and recommended rolling back and applying 6.0.3 instead until further diagnostics could be done on 6.1 -> apparently we were the first to submit a case on a 6.1 install 🙂

Disabling “One or more ports are experiencing network contention” alert

From day one of deploying vRealize Operations Manager 6.0, I had a bunch of alerts in our environment on distributed virtual port group ports. They were listed with the error:

One or more ports are experiencing network contention

Digging into the exact ports that were showing dropped packets resulted in nothing. The VMs connected to these ports were not registering any packet drops. Odd.

It took a while before any info came out, but it was apparently a bug in the 6.0 code. I started following this thread on the VMware community boards and found that I was not alone in seeing the error. In our environment the error was also only present when logging in as the admin user; vCenter admin users were not seeing it, which pointed towards a cosmetic bug.

A KB article was released stating that it is a bug and that the alert can be disabled, but it does not describe exactly how to disable it. The alert is disabled by default in the 6.0.1 release, but if you installed 6.0, upgraded to 6.0.1, and did not reset all settings (as I did not), the error is still there.

To remove the error, log in to the vROps interface and navigate to Administration, then Policy, and lastly Policy Library, as marked in the image below:

Policy

Once in the Policy Library view, select the active policy that is triggering the alert. For me it was Default Policy. Once it is selected, click the pencil icon to edit the policy, as shown below:

Edit

In the Policy Editor, click step 5 – Override Alert / Symptom Definitions. Under Alert Definitions, click the drop-down next to Object Type, fold out vCenter Adapter, and click vSphere Distributed Port Group. Two alerts will now show. Next to the “One or more ports are experiencing…” alert, click the arrow by State and select Local with the red circle with a cross, as shown below.

Local Disabled

I had a few issues with clicking Save after this. I do not know exactly what fixed it, but I had just logged in as admin when it worked. This disables the alert! Easy.

Default Host Application error

Last week I was called up by one of our Windows admins. He had some issues with a VM running Windows and IIS. As we were talking, he also casually mentioned another error he was seeing that was “caused by VMware”. I was a bit sceptical, as you might imagine 🙂

He was seeing this error when he attempted to browse the IIS web page by clicking the link available in the IIS Manager:

Default Host Application Error

Notice the VMware icon at the bottom. This is an error from VMware Tools! What? As any sane person would do, I consulted Google and got a hit here – https://communities.vmware.com/message/2349884

The third reply gave me the answer. It seems that when VMware Tools is installed, it might associate itself with HTTP and HTTPS links. This then causes a click on the link in IIS Manager to call VMware Tools, which is unable to service the request. The fix is pretty straightforward.

Go to Control Panel, then Default Programs and Set Associations. Scroll down to the Protocols section and locate HTTP and HTTPS. Make sure these are set to your browser of choice – in the image below I set them back to Internet Explorer (he was a Windows sysadmin after all 🙂 ). If the association is wrong, it will be set to Default Host Application, as shown for TELNET.

Fix
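
If you prefer to check this from a prompt instead of clicking through Control Panel, the machine-wide protocol handlers can be read from the registry. This is my own sketch and the registry paths are an assumption on my part (per-user overrides may also apply); the Default Programs GUI remains the supported way:

# My own sketch: show which command is registered for the http/https protocols
foreach ($proto in "http", "https") {
    $cmd = (Get-ItemProperty -Path "Registry::HKEY_CLASSES_ROOT\$proto\shell\open\command" -ErrorAction SilentlyContinue)."(default)"
    "{0,-5} -> {1}" -f $proto, $cmd
}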

Working with Tags

For the last couple of days I have been working with PowerCLI and vCenter tags to see if I could automate my way out of some work around tracking which sysadmins are responsible for a given VM.

Tagging and creating tags manually is not really my cup of tea (we have 1000+ VMs, 40+ sysadmins, and even more people beyond that who could be tagged), so some automation would be required.

Pre-creating all the tags was not something I would enjoy either, as maintaining the list would suck, in my opinion. Also, all tags are local to a vCenter, so if you, like us, have more than one vCenter, propagating tags to the other vCenters is something to keep in mind as well.

I added a bunch of small functions to my script collection to fix some things. The first thing I ran into was “how do I find which vCenter a given VM object came from?”. Luckily, the “-Server” parameter on most cmdlets accepts the vCenter server name as a string and not just the connection object, so the following will get the vCenter of a given object by splitting its Uid attribute:

$object.Uid.Split("@")[1].Split(":")[0]

Splitting at “@” and taking the second part removes the initial part of the string, so it now starts with the FQDN followed by more information. Then splitting at the “:” just before the port number and taking the first part results in the FQDN of the vCenter. This may not work in all cases, but it works for our purpose.
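
Wrapped up as a small function it looks something like this (a sketch of my own; the function name is not a PowerCLI cmdlet):

# My own helper sketch: return the vCenter FQDN a given vSphere object came from
function Get-ObjectVCenterName {
    param(
        [Parameter(Mandatory=$true, ValueFromPipeline=$true)]
        $Object
    )
    process {
        # Uid looks something like "/VIServer=domain\user@vcenter.fqdn:443/VirtualMachine=..."
        $Object.Uid.Split("@")[1].Split(":")[0]
    }
}

So Get-VM testvm01 | Get-ObjectVCenterName returns the FQDN of the vCenter the VM lives on.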

I needed this in my script because I was running into the problem of finding the correct Tag object to use with a given VM object in the “New-TagAssignment” cmdlet. However, it dawned on me that if I just make sure the tag is present on all vCenter servers before I call “New-TagAssignment”, I don’t need the Tag object, just the name, and PowerCLI/vCenter will do its magic. Thus the following works perfectly:

$VM | New-TagAssignment "<TAGNAME>"
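
And to make sure the tag actually exists on every connected vCenter before assigning it, something along these lines does the trick (my own sketch; $TagName and $CategoryName are placeholders, and the tag category is assumed to already exist on each vCenter):

# My own sketch: create the tag on any connected vCenter where it is missing
foreach ($vc in $global:DefaultVIServers) {
    if (-not (Get-Tag -Name $TagName -Server $vc -ErrorAction SilentlyContinue)) {
        New-Tag -Name $TagName -Category $CategoryName -Server $vc | Out-Null
    }
}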

But in any case I now have a way of finding the vCenter name of a given vSphere object in PowerCLI 🙂

 

vCenter Orchestrator 5.5.2.1 and SSO behind a load balancer

If you, like us in my organization, are crazy about HA solutions, you have probably looked at putting SSO behind a load balancer. This may look like a daunting task, and troubleshooting may not be easy. But hey, we have an SSO server in each of our two sites maintaining the SSO service across the entire platform 🙂 Hurray!

Now, this is not something I just recently configured. Reconfiguring an existing environment to point at a new SSO server seems like something I would avoid; it is easier to just move the ESXi hosts to new vCenter servers in a new setup. No, we were among the first movers on SSO in HA. We installed vSphere 5.1 and configured SSO for HA as described in KB2034157. vSphere 5.1 SSO had a host of other problems, though, so only 4 months after installing vSphere 5.1 and moving production to this setup we upgraded to vSphere 5.5. VMware were spot on with new documentation, as there were major changes in SSO and some URLs changed, such as the /sts URL. For people like us, the reconfiguration of the load balancer was described in KB2058838. Easy!

We have now been running this setup for about 12 months and it has been working well for us. Upgrading has been a bit tricky, but having only applied vCenter patches twice in that period, this was okay. We are running vCenter 5.5 U2b today, so we have access to the new VMRC client for when Chrome stops supporting NPAPI.

But here is where the title of the post comes in. I recently (well, it is almost two months ago now) upgraded our vCenter Orchestrator with the latest security patches, pushing us to 5.5.2.1. Following this, a problem occurred: I could no longer log in via the client! This is a pretty serious problem, so I started debugging. Tried re-registering with the SSO: no problem. Test login in the configuration interface -> works. Login via the client still fails. What the hell?

I then started browsing the vCO server.log file, looking at what happened when logins failed. Here is what I found – three of these on every login:

2014-11-27 09:52:33.716+0100 [http-bio-0.0.0.0-8281-exec-5] WARN {} [RestTemplate] GET request for "https://<sso-lb-fqdn>:7444//websso/SAML2/Metadata/vsphere.local" resulted in 404 (Not Found); invoking error handler
2014-11-27 09:52:33.717+0100 [http-bio-0.0.0.0-8281-exec-5] WARN {} [RetriableOperation] Exception handled during retry operation with message: 404 Not Found
2014-11-27 09:52:33.717+0100 [http-bio-0.0.0.0-8281-exec-5] INFO {} [RetriableOperation] Retries left: [2]. Sleeping for [3] seconds before the next retry attempt.

Now, these indicate that vCO cannot talk to the SSO. But I had just re-registered it and tested the login? How could this be? At this point I started a support case with VMware. After over a month of back and forth, support started looking into why there was a double slash “//” after the port number, thinking that the SSO registration was somehow wrong. At the same point I realized something: looking at the URL, the vCO server was using a different URL than the one configured in the configuration interface. What? Thinking back to the load balancer configuration, I quickly realized the problem was as simple as this: the /websso URL that vCO uses when logging in via the client was not allowed through the load balancer configuration VMware provided in the KB above. At some point between the vSphere 5.5 release and now, some products (including vCAC/vRA) started using /websso instead of /sts.

From here I spent about two weeks asking VMware how I should configure this, without getting real answers. Finally, last week I got a paper describing how to configure an F5 load balancer for SSO when using vCAC. This would have been good if I could reverse engineer the approach the F5 load balancer was using and configure the same in the Apache load balancer. But no, those two configurations are completely different. So I decided to test something very simple: copy the configuration block for /sts and rename everything to /websso. And guess what, so far it works! Here is how it looks:

###################################################################################
# Configure the websso for clustering

ProxyPass /websso/ balancer://webssocluster/ nofailover=On
ProxyPassReverse /websso/ balancer://webssocluster/

Header add Set-Cookie "ROUTEID=.%{BALANCER_WORKER_ROUTE}e; path=/websso" env=BALANCER_ROUTE_CHANGED
<Proxy balancer://webssocluster>
 BalancerMember https://<sso-node-1-fqdn>:7444/websso route=node1 loadfactor=100
 BalancerMember https://<sso-node-2-fqdn>:7444/websso route=node2 loadfactor=1
 ProxySet lbmethod=byrequests stickysession=ROUTEID
</Proxy>
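
A quick way to verify the change (my own sketch, not from the VMware paper) is to request the metadata URL from the 404 log entries above through the load balancer and check that it now answers; certificate validation is disabled here purely for the test:

# My own sketch: the URL that returned 404 in the vCO log should now respond via the LB
[System.Net.ServicePointManager]::ServerCertificateValidationCallback = { $true }
(Invoke-WebRequest -Uri "https://<sso-lb-fqdn>:7444/websso/SAML2/Metadata/vsphere.local" -UseBasicParsing).StatusCode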

vCAC: Playing Around

I have wanted for a while to write a bit about vCloud Automation Center (vCAC) or, in case you haven’t heard, vRealize Automation (vRA) as it is being rebranded. My organization upgraded to vCloud Suite Standard licenses two years ago during the promo, so we have access to the Standard edition of vCAC.

 

So what could it do for us? We have been looking at providing some kind of private cloud solution to our users, but the form and shape of this has yet to be defined. For the moment I am just testing out what I can do, how it is done, how to configure access, and the like. So far I am impressed, but I am still a bit confused about tenants, fabric and business groups, reservations, entitlements, catalogs, etc.

 

What I have learned so far from fooling around is:

  1. Native AD support is only available on the default “vsphere.local” tenant for some reason. This means that if you want to use a different tenant with your AD, you need to define each domain in your AD forest individually to be able to use users from all domains. A bit impractical.
  2. Renaming a vCenter cluster that has been added as a compute resource in vCAC causes odd problems. Suddenly my tests were failing with: Failed “CloneVM: Object reference is not set to an instance of an object”. I was unable to find anything else. Thinking my template had moved, I started data collection on the compute resource again. This failed with no error message to be found. Then I remembered that I had renamed the cluster. The solution was to remove the reservations and then the cluster from the Fabric Group, add it back under the new name, recreate the reservations, and then I was able to go again 🙂
  3. Adding users to a business group as users does not give them access to the actual resources. You need to entitle them as well, which to me seems a bit redundant. Perhaps I have not yet seen the light on why this is smart.

 

I will hopefully write more about this at a later time. My next goal is to get vCO running as an endpoint, but due to some odd login problems with vCO 5.5.2.1 I can’t use my vCO at the moment!

Upgrading SSO 5.5

*EDIT* After 3 hours on the phone with VMware Technical Support we are finally running again. Permissions on one vCenter were wiped and have to be recreated, and a bug with Win2k12 domain controllers has hit us, meaning that we had to create each domain as an identity source. But we are running again! *EDIT*

 

This is going to be a short post about my current experiences with upgrading the SSO component of vCenter. Short because it is still not working and I am waiting for VMware support to contact me.

After VMworld I was pushing to upgrade to vSphere 5.5, at least the Web Client and SSO, as this would solve a lot of our AD problems. So planning began, and in November I revived my old test environment from the vSphere 5.1 upgrade. I dusted it off, got it running (more on that in a later post) and started the upgrade. The upgrade itself went fine – as such. After coming online again, I could not enumerate users in SSO groups. I could search the AD via the new Integrated Windows Authentication identity source but not enumerate members of SSO groups. The web client would be “loading” for a LOOONG time, and the following could be seen over and over again in vmware-sts-idmd.log:

07:07:04,885 WARN   [LdapErrorChecker] Error received by LDAP client: com.vmware.identity.interop.ldap.WinLdapClientLibrary, error code: 81
07:07:04,885 WARN   [ServerUtils] cannot bind connection: [ldap://ADSERVERNAME, null]
07:07:04,885 ERROR  [ServerUtils] cannot establish connection with uri: [ldap://ADSERVERNAME]
07:07:04,885 ERROR  [ActiveDirectoryProvider] Failed to get GC connection to domain aau.dkLdap_sasl_bind failedServer Down

I purged dates and server names from the log. In essence, it looked like SSO was attempting to make a Global Catalog (GC) connection to an AD server, but it was selecting an AD server that was not running the GC service; when the connection attempt fails, it decides the server is down yet keeps trying to contact it. Our AD forest consists of 1 root domain and about 20 subdomains, with about 55 domain controllers in total, of which a bit over half run the GC service. There is no setting to tell SSO which server in the AD forest to talk to. So no luck there.
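
To see which domain controllers in the forest actually offer the GC service, a quick check can be run from a domain-joined Windows machine with .NET from PowerShell (my own sketch, not something suggested by VMware support):

# My own sketch: list every Global Catalog server in the current AD forest
[System.DirectoryServices.ActiveDirectory.Forest]::GetCurrentForest().GlobalCatalogs |
    Select-Object Name, Domain, IPAddress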

I got the following from VMware technical support a few days later:

Procedure

  1. Log in to the vSphere Web Client as administrator@vsphere.local or as another user with vCenter Single Sign-On administrator privileges.
  2. Browse to Administration > Single Sign-On > Configuration.
  3. On the Identity Sources tab, select an identity source and click the Set as Default Domain icon.

In the domain display, the default domain shows (default) in the Domain column.

Please restart the services and log in to the web client / VI client.

When I tried that, it seemed to work. I then reverted it and it still worked, leading me to think that it was not the above that fixed it but simply time, as SSO would at some point select another server. I was assured over the phone by VMware technical support that customers who had seen the above error had fixed it with this procedure.

But now, to tell you the truth, the above log snippet is from today. I went ahead and upgraded SSO and the Web Client yesterday and got stuck with this error after spending about 2 hours reconnecting/repointing a vCenter 5.1 to the new lookup service. That requires you to change the vpxd.cfg file, as the path to the STS service is hard-coded in that file and points to the old /ims URL and not the new /sts URL. It also caused a total wipe of permissions on the one vCenter I repointed, causing some inconsistent behavior in the web client: looking under vCenter I cannot see it, but if I browse down through Administration -> Licensing and find it there, I can see all objects with the admin@System-Domain account – the only account left with permissions on the vCenter.

I’m back at the office early this morning, slept like crap due to this. This is going to be a long day!

Since the last time

It’s been a while since my last post; a lot has been going on. I have been through my first “employee performance interview” (medarbejderudviklingssamtale, or MUS, in Danish). It was good, and a lot of things were discussed in regard to the new organization. Some steps to increase my skills were also planned, and I will get back to that later.

Since last time, I attended VMworld Europe 2013 in Barcelona! It was an awesome conference as always, and I took a lot of new things home with me. One thing I did differently this year compared to the previous two years was to spend a lot more time on the Solutions Exchange. I focused primarily on storage vendors, as I have taken a fancy to the new flash-accelerated and all-flash storage systems. I think I visited every booth with even the slightest connection to storage.

I also had the chance to discuss some of the new technologies coming out of VMware, as well as the upgrade procedure for vSphere 5.5 when running with SSO behind a load balancer. That was really useful and insightful, and it provided me with most of the information I need to upgrade the SSO and Web Client in our environment to vSphere 5.5 and relieve all the AD problems we have had. I will post a blog article on this later, as there are still some hiccups in the documentation and procedure that I need to test and get confirmation on from VMware support.

Our consolidation process has not been moving that much. Shortly after returning from Barcelona, I took part in a live migration of VMs between our data center and a remote server room across a distance of about a kilometer. Without going into details about how everything was connected, suffice to say that we had a single 10 Gbit Ethernet connection between one of our data center routers and one of the server room’s routers. We also had a single FC connection between a storage array in the data center and the blade chassis in the server room. This allowed us to evacuate a single blade in the server room and move it to an identical blade chassis in the data center. After this, we used another blade in the server room as the “transport host”: we vMotioned VMs onto it, as it could see both the data center and server room storage arrays, then Storage vMotioned the VMs to the data center array, and finally vMotioned them onto a host in the data center. Then, one by one, we evacuated all the blades in the server room and moved them to the data center. The process took about two days, including the move of a few physical hosts as well, and was all in all very successful.

We had a single error during the move, which caused an unexplained HA restart. The largest of the VMs (1 TB of storage spread over 4 different VMDKs) was set to change format to thin provisioned during the Storage vMotion. At some point during the migration we got “an unexpected error” (that was the actual message from the vSphere client). 30 seconds later, HA spontaneously rebooted the VM, even though virtual machine monitoring was disabled and the host didn’t crash. Luckily the VM handled the reboot well, and it occurred close to midnight with no users online.

Right now my colleagues are planning the consolidation of two other VMware installations, which will most likely be done with cold migrations. The number of VMs is small, and the fiber connections and licenses of these installations will not allow us to do a live migration. They are also planning a move similar to the one I worked on, which we hope to complete some time in December. I am also working on a cold migration of a VMware installation, where most of the VMs will be reinstalled on a new cluster rather than migrated.

That was a status on what we are working on. And back to the “I will get back to that later”: during the next month I will be working on a test installation of vCloud Automation Center to experiment with it and research whether it is something we can use in our organization. The initial tests will be confined to the infrastructure department, but if it works out it might be scaled up.