Production Cluster Upgrade

During the spring of this year me and a few of my colleagues spent several months of meetings with storage solution providers and server hardware manufacturers to figure out if we should try out something new for our VMware production clusters. We had a budget for setting up a new cluster so we wanted to look at our options for trying something other than our traditional blade solutions we a spinning disk FC array which we have been using for years.

Some of the considerations we made regarding storage were that we wanted to start to leverage flash in some way or form to boost intense workloads. So the storage solution would need to use flash to accelerate IO. We also wanted to look at if server side flash could accelerate our IO as well. This lead us to the conclusion that we would like to avoid blades this time around. We would have more flexibility using rack servers with respect to more disk slots, PCIe expansions etc. Going with e.g. 1U server we would be sacrificing 6 additional rack units compared to 16 blades in a 10U blade chassis. Not huge in our infrastructure.

So we a bunch of different storage vendors, some new ones like Nimble Storage, Tintri, Pure Storage and some of the old guys like Hitachi and EMC. On the server side we talk to the regulars like Dell and HP but also Hitachi and Cisco.

All in all it was a great technically interesting spring and by summer we were ready to make our decision. In the end we decided to go with a known storage vendor but a new product. We chose a Hitachi VSP G200 as it in controller strength was on par with our existing HUS130 controllers but with smarter software and more cache. The configuration we went with was a tiered storage pool with a tier 1 layer consisting of 4 FMD 1.6TB in RAID10. This gives us 3.2TB Tier 1 storage and from the tests we have run – this tier is REALLY fast! The second and last tier is a large pool of 10K 1.2 TB disks for capacity. Totally we have just shy of 100TB of disk space on the array. It is setup so all new pages are written to the 10k layer but if data is hot it is migrated to the FMD layer within 30 seconds utilising Hitachi’s Active Flash technology. This feature takes some CPU cycles from the controller but from what we see right now this is a good trade off. We can grow to twice the size in capacity and performance as the configuration is at the moment so we should be safe for the duration of the arrays life.

On the server side we chose something new to us. We went with a rack server based Cisco UCS solution. A cluster consisting of 4x C220 M4 with 2x E5-2650V3 CPU’s and 384GB memory. We use a set of 10k disks in RAID1 for ESXi OS (yes we are very traditional and not very “Cisco UCS” like). The servers are equipped with 4x 10G in the form of a Cisco VIC 1227 MLOM and a Cisco VIC 1225 PCIe. As we were not really that hooked on setting up a SSD read cache (looking at vFlash for now) in production with out trying it we actually got a set of additional Cisco servers for some test environments. These are identical to the above but as some of my colleagues needed to test additional PCIe cards we went with C240 M4 instead for the additional PCIe slots. Two of these servers got a pair of 400GB SSD’s to test out vFlash. If it works we are moving those SSD’s to the production servers for use.

As I said we got the servers late summer and put the into production about 2½ months ago and boy we are not disappointed. Some of our workloads have experienced 20-50% improvements in performance. We ended up installing ESXi5.5 U3a and joining our existing 5.5 infrastructure due to time constraints. We are still working on getting vSphere 6.0 ready so hopefully that will happen in early spring next year.

We have made some interesting configurations on the Cisco UCS solution regarding the network adapters and vNic placement so I will throw up something later on how this was done. We also configured AD login using UserPrincipalName instead of sAMAccountName which was not in the documentation – stay tuned for that as well. And finally – have a nice Christmas all!

vRops 6.1 – follow up

Backup in September I wrote a piece when vRealize Operations Manager 6.1 was released. We were pretty excited about it because we were having a few issues with the 6.0.2 version we were running on. Among the problems we were having was vCenter SSO users suddenly not being able to login via the “All vCenters” option on the frontpage and selecting the individual vCenters to login to gave unpredictable results (logging in to vCenter A showed vCenter B’s inventory?!). We also had issues with alerts that we could not cancel – they would just keep piling up and about once a week I would shut the cluster down and start it again as it allowed me to cancel the alerts if I did it at the right time within 10-15 minutes after starting the cluster again.

However as you could also read we ran into an issue with 6.1 update and were forced to roll back and update to 6.0.3 that solved all issues but the login problem. But as we were the first to try an upgrade in production it took a while before a KB came out on the issue. I have had a to do item to write this up for a while so I can’t remember when the KB actually came out however it has not been updated for a month. The KB is 2133563 and notes that there is currently no resolution to the issue.

I recently spoke to a VMware employee who told me that the issue is in the xdb database and that the upgrade process is encountering something that either should not be in the xdb or that is missing. This causes the conversion from xdb to Cassandra to fail and the upgrade process to fail. I’m looking forward to seeing when a proper fix will come out.

We are closing in on the end of the year so I hope to be able to finish up a few blog articles before entering the new year – on the to do are a few items about vRA 7 and Cisco UCS with ESXi 5.5 and 6.

PowerCLI: Datastore Cluster and Tags

I was trying to help out a colleague yesterday when I realized that a quick fix to the problem would be to tag the datastore clusters in our environment and get them based on these tags instead of trying to determine which datastore cluster to choose when deploying a VM from PowerCLI.

So I decided to do this quickly and will show what I did (code snippets are from my vSphere 6.0 lab but the it is the same on our 5.5 production).

New-TagCategory -Name "CDC" -Cardinality Single -EntityType DatastoreCluster
New-Tag -Name "DC2" -Category CDC
Get-DatastoreCluster DatastoreCluster | New-TagAssignment -Tag "DC2"

Now I hope we can agree that I have created a new TagCategory that applies to Datastore Clusters and allows for one tag per object. We have also created a tag in this category called “DC2”. Lastly we have added the tag to the datastore cluster “DatastoreCluster”. Now if I run the following I get what I would expect:

C:\> Get-DatastoreCluster DatastoreCluster | Get-TagAssignment

Tag                                      Entity
---                                      ------
CDC/DC2                                  DatastoreCluster
C:\>

But if I run this I get something that I did not expect

C:\> Get-DatastoreCluster -Tag "DC2"
C:\>

This means that it is not working the same as for Virtual Machines with the “get-vm” cmdlet:

C:\> New-TagCategory -Name "VMTest" -Cardinality Single -EntityType VirtualMachine
Name                                     Cardinality Description
----                                     ----------- -----------
VMTest                                   Single
C:\> New-Tag -Name "Test" -Category "VMTest"
Name                           Category                       Description
----                           --------                       -----------
Test                           VMTest
C:\> Get-VM testvm01 | New-TagAssignment Test
Tag                                      Entity
---                                      ------
VMTest/Test                              testvm01
C:\> get-vm | Get-TagAssignment
Tag                                      Entity
---                                      ------
VMTest/Test                              testvm01
C:\> get-vm -Tag "Test"
Name                 PowerState Num CPUs MemoryGB
----                 ---------- -------- --------
testvm01             PoweredOff 1        4,000

So I do not know if this is the way it was meant to work but I is definitely not what I expected!

vRealize Operations 6.1 is out!

As of midnight danish local time vRealize Operations 6.1 is out! This is great as we have been waiting for this release to fix some issues we have been having with our environment running on 6.0.2. Last communication from VMware Technical Support a month ago was that our two remaining problems would be fixed in this release.

I’ve look through the list of fixes but did not see it directly so hoping they still made it 🙂

Release notes can be found here.

UPDATE: Upgrading the VA-OS pak file worked but applying the VA pak file failed to complete. The logs showed that it was the conversion from xDB to cassandra that failed. VMware tech support were fast today and recommended rollback and applying 6.0.3 instead until further diagnostics could be made on 6.1 -> apparently we were the first to submit a case on 6.1 install 🙂

vExpert 2015!

YAY!! Can’t really get my arms down yet. I was not sure I would make the cut this year but I did! So happy.

I was on vacation last week when the announcement of vExpert 2015 second half went out. A bit scared when I open the page and started scrolling only to realize that searching would probably be easier 🙂 So I did expecting not to find myself. But I did! So proud and humble that it happened and to me.

Now this announcement motivated to me to try and take my contributions a bit further. I will attempt to put out more content via this blog as often as possible and attending VMUGDK and trying to come up with more sessions to present. This is not my strongest side but it is a side that I believe I need to improve.

Thank you VMware for granting me this title! and thank you VMUGDK for the great danish VMware community!

Disabling “One or more ports are experiencing network contention” alert

From day one of deploying vRealize Operations Manager 6.0 I had a bunch of errors in our environment on distributed virtual port group ports. They listed with the error:

One or more ports are experiencing network contention

Digging into the exact ports that were showing dropped packets resulted in nothing. The VMs connected to these ports were not registering any packet drops. Odd.

It took a while before any info came out but it apparently was a bug in the 6.0 code. I started following this thread on the VMware community boards and found that I was not alone in seeing this error. In our environment the error was also only present when logging in as the admin user. vCenter admin users were not seeing it so this pointed towards a cosmetic bug.

A KB article was released about the bug and that the alert can be disabled but it does not described exactly how to disable the alert. The alert is disabled by default in the 6.0.1 release but if you installed 6.0 and upgraded to 6.0.1 and did not reset all settings (as I did not do) there error is still there.

To remove the error login to the vROPS interface and navigate to Administration then Policy and lastly Policy Library as marked in the image below:

PolicyOnce in the Policy Library view select the active policy that is triggering the alert. For me it was Default Policy. Once selected click the pencil icon to Edit the profile as show below:

EditOn the Policy Editor click the 5. step – Override Alert / Symptom Definitions. In Alert Definitions click the drop-down next to Object Type, fold out vCenter Adapter and click vSphere Distributed Port Group. To alerts will now show. Next to the “One or more ports are experiencing…” alert click the error by State and select Local with the red circle with a cross as show below.

Local DisabledI had a few issues with clicking Save after this. Do not know exactly what fixed it but I had just logged in as admin when it worked. This disables the alert! Easy.

 

 

 

Default Host Application error

Last week I was called up by one of our Windows Admins. He had some issues with a VM running Windows and IIS. As we were talking he also casually mentioned another error he was seeing that was “caused by VMware”. I was a bit sceptic as you might imagine 🙂

He was seeing this error when he attempted to browse the IIS web page by clicking the link available in the IIS Manager:

Default Host Application ErrorNotice the VMware icon in the bottom. This is an error from VMware tools! What? As any sane person would do I consulted Google. And got a hit here – https://communities.vmware.com/message/2349884

The third reply post gave me the answer. Seems that when installing VMware Tools it might associate itself with HTTP and HTTPS links. This would then cause a click on the link in IIS Manager to call VMware Tools which is unable to service the request. The fix is pretty straight forward.

Go to Control Panel, then Default Programs and Set Associations. Scroll down to the Protocols section and locate HTTP and HTTPS. Make sure these are set to your browser of choice – in the image below I set them back to Internet Explorer (he was a Windows Sysadm after all 🙂 ). If the association is wrong it would be set to Default Host Application as shown for TELNET.

Fix

Working with Tags

The last couple of days I have been working with PowerCLI and vCenter Tags to see if I could automate my way out of some things regarding tracking which sys admins are responsible for a given VM.

Tagging and creating tags manually is not really my cup of tea (we have 1000+ vms and 40+ sys admins and even more people beyond that who could be tagged. So some automation would be required.

Next pre-creating all tags was not something I would enjoy either as maintaining the list would suck in my opinion. Also all tags are vCenter local so if you like us have more than 1 vCenter then propagating Tags to other vCenters is also something to keep in mind.

I added a bunch of small functions to my script collection to fix somethings. The first thing I ran into was “How to find which vCenter a given VM object came from?”. Luckily the “-Server” option on most commands accepts the vCenter server name as a string and not just the connection object so the following will get the vCenter of a given object by splitting the Uid attribute:

$object.Uid.Split("@")[1].Split(":")[0]

Splitting at “@” and taking the second part will remove the initial part of the string so it now starts with the FQDN followed by more information. Then splitting at the “:” just before port number and taking the first part will result in the FQDN of the vCenter. This may not work in all cases but it works for our purpose.

Now I needed this in my script because I was running into the problem of finding the correct Tag object to use with a given VM object in the “New-TagAssignment” Cmdlet. However it dawned on me that if I just make sure that the tag is present on all vCenter servers when I call “New-TagAssignment” I don’t need the Tag object just the name and PowerCLI/vCenter will do it’s magic. Thus the following works perfectly:

$VM | New-TagAssignment "<TAGNAME>"

But in any case I now have a way of finding the vCenter name of a given vSphere object in PowerCLI 🙂

 

vCenter Orchestrator 5.5.2.1 and SSO behind a load balancer

If you are like us in my organization and are crazy about HA solutions you have probably looked at putting SSO behind a load balancer. This may look like a daunting task and troubleshooting may not be easy. But hey we have an SSO server in our two sites that maintain the SSO service across the entire platform 🙂 Hurray!

Now this is not something I just recently configured. Alone reconfiguring an existing environment to a new SSO server seems like something I would avoid. Easier to just move ESXi hosts to new vCenter servers in a new setup. No, we were among first movers on SSO in HA. We installed vSphere 5.1 and configured the SSO for HA as described in KB2034157. Now vSphere 5.1 SSO had a host of other problems so only 4 months after installing vSphere 5.1 and moving production to this setup we upgraded to vSphere 5.5 and VMware were spot on with new documentation as there were major changes in SSO some URL’s changed, such as the /sts URL. For people like us the reconfiguration of the load balancer was described in KB2058838. Easy!

We have now been running this setup for about 12 months and it has been working well for us. Upgrading has been a bit tricky but having only applied vCenter patches twice in the period this was okay. We are running vCenter 5.5 U2b today so we have access to the new VMRC client for when Chrome stop supporting NPAPI.

But here is where the title of the post comes in. I recently (well it is almost two months ago now) upgraded our vCenter Orchestrator with the lates security patches pushing us to 5.5.2.1. Following this a problem occurred. I could no longer login via the client! This is a pretty serious problem. So I started debugging. Tried re-registering with the SSO. No problem. Test login in the configuration interface -> works. Login via client still fails. What the hell?

I the started browsing the vCO server.log file and looking at what happened when logins failed. There is what I found – 3 of these every login:

2014-11-27 09:52:33.716+0100 [http-bio-0.0.0.0-8281-exec-5] WARN {} [RestTemplate] GET request for "https://<sso-lb-fqdn>:7444//websso/SAML2/Metadata/vsphere.local" resulted in 404 (Not Found); invoking error handler
2014-11-27 09:52:33.717+0100 [http-bio-0.0.0.0-8281-exec-5] WARN {} [RetriableOperation] Exception handled during retry operation with message: 404 Not Found
2014-11-27 09:52:33.717+0100 [http-bio-0.0.0.0-8281-exec-5] INFO {} [RetriableOperation] Retries left: [2]. Sleeping for [3] seconds before the next retry attempt.

Now these indicate that the vCO cannot talk to the SSO. Well I just re-registered it and tested login? How could this be? At this point I started a support case with VMware. And following over a months back and forth the support started looking into why there was a double slash “//” after the port number thinking that the SSO registration was somehow wrong. I at the same point realized something. Looking at the URL the vCO server was using a different URL that was configured in the configuration interface. What? And remembering back to the load balancer configuration I quickly realized that the problem was as simple as the /websso URL that vCO was using when logging via the client was not allowed through the load balancer that VMware provided above. At some point between vSphere 5.5 release and now some products (including vCAC/vRA) started using /websso instead of /sts.

From here I have spend about two weeks ask VMware how I should configure this with out getting real answers. Finally last week I got a paper describing how to configure an F5 load balancer for SSO when using vCAC. Now this would have been good if I could reverse engineer the approach that the F5 load balancer was using and configure that in the Apache load balancer. But no, those two configurations are completely different. So I decided to test something very simple. Copy the configuration block for /sts and rename everything to /websso. And guess what. So far it works! Here is how it looks:

###################################################################################
# Configure the websso for clustering

ProxyPass /websso/ balancer://webssocluster/ nofailover=On
ProxyPassReverse /websso/ balancer://webssocluster/

Header add Set-Cookie "ROUTEID=.%{BALANCER_WORKER_ROUTE}e; path=/websso" env=BALANCER_ROUTE_CHANGED
<Proxy balancer://webssocluster>
 BalancerMember https://<sso-node-1-fqdn>:7444/websso route=node1 loadfactor=100
 BalancerMember https://<sso-node-2-fqdn>:7444/websso route=node2 loadfactor=1
 ProxySet lbmethod=byrequests stickysession=ROUTEID
</Proxy>

vCAC: Playing Around

I have for a while wanted to write a bit about vCloud Automation Center (vCAC) or, if you haven’t heard, as it will be rebranded vRealize Automation (vRA). My organization upgraded to vCloud Suite Standard licenses 2 years ago during the promo so we have access to the Standard edition of vCAC.

 

So what would it be able to do for us? We have been looking at providing some kind of private cloud solution to users but the form and shape of this has yet to be defined. So for the moment I am just testing out what I can do, how it is done, how to configure access and the likes. So far I am impressed but still a bit confused between tenants, fabric and business groups, reservations, entitlements, catalogs etc etc etc.

 

What I have learned so far from fooling around are:

  1. Native AD support is only available on the default “vsphere.local” tenant for some reason. Meaning if you want to use a different tenant for your AD you need to define each domain in your AD forest to be able to use users from all domains. A bit impractical.
  2. Renaming a cluster under vCenter that has been added as a compute resource in vCAC causes odd problems. Suddenly my tests were giving: Failed “CloneVM: Object reference is not set to an instance of an object”. I was unable to find anything else. Thinking my template had moved I started data collection on the compute resouce again. This failed with no error message to be found. Then I remembered that I had renamed the cluster. Solution was to remove reservations and then the cluster from the Fabric Group and then add the new name, recreate the reservation and then I was able to go again 🙂
  3. Adding users to a business group as users don’t give access to the actual resources. You need to entitle them which to me seems a bit redundant. Perhaps I have not yet seen the light on why this is smart.

 

I wil hopefully write more at a later time about this. My next goal is to get vCO running as an endpoint but due to some odd login problems with vCO 5.5.2.1 I can’t use my vCO at the moment!