Production Cluster Upgrade

During the spring of this year a few of my colleagues and I spent several months in meetings with storage solution providers and server hardware manufacturers to figure out whether we should try something new for our VMware production clusters. We had a budget for setting up a new cluster, so we wanted to look at our options beyond the traditional blade solution with a spinning disk FC array that we have been using for years.

One of the considerations we made regarding storage was that we wanted to start leveraging flash in some form to boost intense workloads, so the storage solution would need to use flash to accelerate IO. We also wanted to look at whether server-side flash could accelerate our IO as well. This led us to the conclusion that we would like to avoid blades this time around. Rack servers would give us more flexibility with respect to disk slots, PCIe expansion and so on. Going with, for example, 1U servers we would only be sacrificing 6 additional rack units (16U for 16 servers) compared to 16 blades in a 10U blade chassis. Not huge in our infrastructure.

So we met with a bunch of different storage vendors, some new ones like Nimble Storage, Tintri and Pure Storage, and some of the old guys like Hitachi and EMC. On the server side we talked to the regulars like Dell and HP, but also Hitachi and Cisco.

All in all it was a great, technically interesting spring, and by summer we were ready to make our decision. In the end we decided to go with a known storage vendor but a new product. We chose a Hitachi VSP G200, as its controller strength was on par with our existing HUS130 controllers but with smarter software and more cache. The configuration we went with was a tiered storage pool with a tier 1 layer consisting of 4x 1.6TB FMDs in RAID10. This gives us 3.2TB of tier 1 storage, and from the tests we have run – this tier is REALLY fast! The second and last tier is a large pool of 10K 1.2TB disks for capacity. In total we have just shy of 100TB of disk space on the array. It is set up so all new pages are written to the 10K layer, but if data is hot it is migrated to the FMD layer within 30 seconds utilising Hitachi's Active Flash technology. This feature takes some CPU cycles from the controller, but from what we see right now this is a good trade-off. We can grow to twice the current size in both capacity and performance, so we should be safe for the duration of the array's life.
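
If you want to sanity-check a "REALLY fast" claim from the vSphere side, one quick way is to pull the worst-case storage latency counter per host. A minimal PowerCLI sketch – note that "Production" is just a placeholder cluster name:

# A rough check of worst-case storage latency per host (values in ms).
# "Production" is a placeholder cluster name.
Get-Cluster "Production" | Get-VMHost | ForEach-Object {
    Get-Stat -Entity $_ -Stat "disk.maxTotalLatency.latest" -Realtime -MaxSamples 6 |
        Select-Object Entity, Timestamp, Value, Unit
}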

On the server side we chose something new to us. We went with a rack server based Cisco UCS solution: a cluster consisting of 4x C220 M4 with 2x E5-2650 v3 CPUs and 384GB memory. We use a set of 10K disks in RAID1 for the ESXi OS (yes, we are very traditional and not very "Cisco UCS" like). The servers are equipped with 4x 10G in the form of a Cisco VIC 1227 MLOM and a Cisco VIC 1225 PCIe. As we were not really that hooked on setting up an SSD read cache (looking at vFlash for now) in production without trying it first, we actually got a set of additional Cisco servers for some test environments. These are identical to the above, but as some of my colleagues needed to test additional PCIe cards we went with C240 M4 instead for the extra PCIe slots. Two of these servers got a pair of 400GB SSDs to test out vFlash. If it works we are moving those SSDs to the production servers for use.

As I said, we got the servers in late summer and put them into production about 2½ months ago, and boy, we are not disappointed. Some of our workloads have seen 20-50% improvements in performance. Due to time constraints we ended up installing ESXi 5.5 U3a and joining our existing 5.5 infrastructure. We are still working on getting vSphere 6.0 ready, so hopefully that will happen in early spring next year.
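
For reference, confirming that all the new hosts ended up on the same version and build is a one-liner in PowerCLI – again, "Production" is a placeholder cluster name:

# Confirm ESXi version and build across the new hosts.
# "Production" is a placeholder cluster name.
Get-Cluster "Production" | Get-VMHost | Select-Object Name, Version, Build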

We have made some interesting configurations on the Cisco UCS solution regarding the network adapters and vNIC placement, so I will post something later on how this was done. We also configured AD login using UserPrincipalName instead of sAMAccountName, which was not in the documentation – stay tuned for that as well. And finally – have a nice Christmas all!

vROps 6.1 – follow up

Back in September I wrote a piece when vRealize Operations Manager 6.1 was released. We were pretty excited about it because we were having a few issues with the 6.0.2 version we were running. Among the problems: vCenter SSO users suddenly not being able to log in via the "All vCenters" option on the front page, and selecting the individual vCenters to log in to gave unpredictable results (logging in to vCenter A showed vCenter B's inventory?!). We also had issues with alerts that we could not cancel – they would just keep piling up, and about once a week I would shut the cluster down and start it again, as that allowed me to cancel the alerts if I did it at the right time, within 10-15 minutes after starting the cluster again.

However, as you could also read, we ran into an issue with the 6.1 update and were forced to roll back and update to 6.0.3 instead, which solved all issues but the login problem. As we were the first to try the upgrade in production, it took a while before a KB came out on the issue. I have had a to-do item to write this up for a while, so I can't remember exactly when the KB actually came out; however, it has not been updated for a month. The KB is 2133563, and it notes that there is currently no resolution to the issue.

I recently spoke to a VMware employee who told me that the issue is in the xDB database: the upgrade process is encountering something that either should not be in the xDB or is missing from it. This causes the conversion from xDB to Cassandra to fail, and with it the upgrade. I'm looking forward to seeing when a proper fix comes out.

We are closing in on the end of the year, so I hope to be able to finish up a few blog articles before entering the new year – on the to-do list are a few items about vRA 7 and Cisco UCS with ESXi 5.5 and 6.

PowerCLI: Datastore Cluster and Tags

I was trying to help out a colleague yesterday when I realized that a quick fix to his problem would be to tag the datastore clusters in our environment and fetch them based on these tags, instead of trying to determine which datastore cluster to choose when deploying a VM from PowerCLI.

So I decided to do this quickly and will show what I did (the code snippets are from my vSphere 6.0 lab, but it is the same on our 5.5 production).

# Create a tag category for datastore clusters, allowing one tag per object
New-TagCategory -Name "CDC" -Cardinality Single -EntityType DatastoreCluster
# Create a tag in that category
New-Tag -Name "DC2" -Category CDC
# Assign the tag to the datastore cluster named "DatastoreCluster"
Get-DatastoreCluster DatastoreCluster | New-TagAssignment -Tag "DC2"

Now I hope we can agree that I have created a new tag category that applies to datastore clusters and allows for one tag per object, that we have created a tag in this category called "DC2", and that we have assigned the tag to the datastore cluster "DatastoreCluster". Now if I run the following I get what I would expect:

C:\> Get-DatastoreCluster DatastoreCluster | Get-TagAssignment

Tag                                      Entity
---                                      ------
CDC/DC2                                  DatastoreCluster
C:\>

But if I run this I get something that I did not expect:

C:\> Get-DatastoreCluster -Tag "DC2"
C:\>

This means that it is not working the same way as for virtual machines with the Get-VM cmdlet:

C:\> New-TagCategory -Name "VMTest" -Cardinality Single -EntityType VirtualMachine
Name                                     Cardinality Description
----                                     ----------- -----------
VMTest                                   Single
C:\> New-Tag -Name "Test" -Category "VMTest"
Name                           Category                       Description
----                           --------                       -----------
Test                           VMTest
C:\> Get-VM testvm01 | New-TagAssignment Test
Tag                                      Entity
---                                      ------
VMTest/Test                              testvm01
C:\> get-vm | Get-TagAssignment
Tag                                      Entity
---                                      ------
VMTest/Test                              testvm01
C:\> get-vm -Tag "Test"
Name                 PowerState Num CPUs MemoryGB
----                 ---------- -------- --------
testvm01             PoweredOff 1        4,000

So I do not know if this is the way it was meant to work, but it is definitely not what I expected!
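
Until this behaviour is fixed (or explained), a workaround that works in my lab is to filter on the tag assignments instead of relying on the -Tag parameter. A quick sketch:

# Workaround sketch: filter datastore clusters on their tag assignments
# instead of relying on the -Tag parameter.
Get-DatastoreCluster | Where-Object {
    (Get-TagAssignment -Entity $_).Tag.Name -contains "DC2"
}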

vRealize Operations 6.1 is out!

As of midnight Danish local time, vRealize Operations 6.1 is out! This is great, as we have been waiting for this release to fix some issues we have been having with our environment running 6.0.2. The last communication from VMware Technical Support a month ago was that our two remaining problems would be fixed in this release.

I've looked through the list of fixes but did not see them mentioned directly, so I'm hoping they still made it 🙂

Release notes can be found here.

UPDATE: Upgrading with the VA-OS pak file worked, but applying the VA pak file failed to complete. The logs showed that it was the conversion from xDB to Cassandra that failed. VMware tech support were fast today and recommended rolling back and applying 6.0.3 instead until further diagnostics could be made on 6.1 -> apparently we were the first to submit a case on a 6.1 install 🙂

vExpert 2015!

YAY!! Can't really get my arms down yet, as we say in Denmark. I was not sure I would make the cut this year, but I did! So happy.

I was on vacation last week when the announcement of the vExpert 2015 second half went out. I was a bit scared when I opened the page and started scrolling, only to realize that searching would probably be easier 🙂 So I searched, expecting not to find myself. But I did! So proud and humbled that it happened to me.

This announcement has motivated me to try to take my contributions a bit further. I will attempt to put out more content via this blog as often as possible, attend VMUGDK and try to come up with more sessions to present. Presenting is not my strongest side, but it is a side that I believe I need to improve.

Thank you VMware for granting me this title! And thank you VMUGDK for the great Danish VMware community!

Racktables: Datacenter Management

Hello All,

Today I will be doing a little write-up about a piece of software that is not related to VMware. Shock! But it is related to the infrastructure you need to have your VMs running.

For a while now we have had our cable management and rack space management in several Excel worksheets. This is far from optimal! We ended up using Excel worksheets because, coming from a decentralized IT environment to a central one, none of the locally used tools scaled to the size or use case we needed. So, in lack of anything better, a few Excel worksheets were thrown together "until a solution was found".

These few Excel worksheets became more worksheets, and those became even more worksheets. At some point you just get sick of worksheets! Trying to manage thousands of cables and ports in a worksheet, and managing rack space the same way, was not even remotely entertaining. We needed something more!

Racktables to the rescue!

A colleague of mine was setting up some new equipment and started wondering if there weren't any simple free tools for handling this instead of the rack space worksheets. He stumbled upon Racktables and showed it to me. It looked promising, but as no resources were allocated to finding a replacement for the worksheets, getting traction on a new tool was close to impossible.

So, like all great tools in an IT department – this started under the radar! I installed Racktables 0.20.8 and a few plugins (Link Management is a MUST) and started playing with it in the fall of 2014, and after adding a few devices and racks I showed it to my team lead. He was impressed, but still no resources.

Later on my colleague had to do some documentation for a user outside of IT and decided to add all the user's servers and network connections, using Link Management's ObjectMap as part of the documentation. This was the first real use.

Months went by, and at a few team meetings discussing documentation of our server rooms I mentioned the software and that it was ripe for use. My team lead was slowly getting convinced.

And then suddenly, a few months ago, he put one of our trainees on the job of moving everything from our rack space worksheets into Racktables, with the goal of eliminating the worksheets. Yay! Finally some traction.

Our trainee entered everything from the worksheets, and then the summer holidays hit. Being the young guy in the office, I have been working through the first part of the main vacation weeks, double-checking the entered data in Racktables and making updates where needed.

During this time I have had a lot of talks with the architect on my team who has been working on eliminating our cable worksheets, and I showed him what Racktables could do in this regard. Within a day of playing with it he was pretty much hooked, and he has since documented large parts of our fibre infrastructure.

Last week and this week we have been showing it to the team responsible for mounting and connecting devices in our infrastructure, and they are also pretty impressed with what this simple tool can do.

The juicy stuff – what does it do?

Now I have been talking about how we started using this tool, but what can it actually do? From the front page of their site:

Racktables is a nifty and robust solution for datacenter and server room asset management. It helps document hardware assets, network addresses, space in racks, networks configuration and much much more! – http://racktables.org/

It has functionality for managing racks in different locations, rows and sizes; objects that can be mounted in racks and connected to other objects; IPv4 and IPv6 address management; and IP SLB and 802.1Q (VLAN) management. It can even track your virtual infrastructure, and it has a built-in patch cable database for inventory.

From a programming perspective it is highly modular, yet implemented in quite a simple way, building on a MySQL database. It's about ~35 PHP files and ~75 tables in the database.

Almost all of the logic comes from the Dictionary, in which you can define object types (server, switch, router, software, etc.) and models, and even add your own types and sub-types. Attributes can be attached to object types to expand the info on an object. It is possible to define parent-child relationships between object types, and also to define port types, connector types and how they are compatible with each other.

Racktables comes with a lot of things defined in the Dictionary by default, and those are a great basis for starting out. You will probably soon realize that a lot of the objects you have don't exist yet, but with the simple setup it is VERY easy to add them.

Racktables can also be extended with plugins, and there are built-in integrations with Cacti and Munin if you use those tools.

There is also support for using CDP/LLDP and SNMP against switches to auto-populate objects with ports and connections – we have not used this feature.

I can only recommend this tool. Its interface looks like something that should have been long gone, but it just works – even on tablets, and on any platform and browser, because no fancy Flash/Silverlight or JavaScript pops in here and there. It's simple, it's easy and it just works.

And if you are asking about scalability? We have added over 600 objects in almost 70 racks and made almost 1000 links between ports, and no slowdown has been noticed so far – and as it is simple, just add more resources to your web server to handle the load 🙂

vROps: the peculiar side

vROps is running again, in a crippled state with AD login issues, licensing issues and alert issues, but at least it is showing me alerts and emailing me again.

While digging through vROps today in a WebEx with VMware Technical Support, I stumbled upon an efficiency alert that I simply had to share.

In summary, the image below shows me that if I somehow manage to reclaim this snapshot space, I don't think I will have any storage capacity problems for a considerable amount of time!

[Image: Ridiculous]

Read that again – it is almost 2.8 billion TB (or 2.8 zettabytes) of disk space! On a 400GB VM. How many snapshots would it even take to fill that? By my estimate it would take around 7 billion snapshots, each fully written. I'm not sure that is within the vSphere 5.5 configuration maximums for snapshots per VM.
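
For the curious, the back-of-the-envelope math in plain PowerShell, using round figures from the alert:

# Sanity check: ~2.8 zettabytes of reported snapshot space vs. a 400GB VM.
$snapshotSpaceGB = 2.8e12       # 2.8 ZB expressed in GB (round figure)
$vmSizeGB        = 400          # the VM's size
$snapshotSpaceGB / $vmSizeGB    # ~7e9, i.e. about 7 billion full snapshots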

vROps down for the count

While I try to hold my frustration at bay and wait for VMware support to get back to me, I will try to figure out what the h*** happened yesterday that has sent my vROps 6.0.1 cluster down for the count for what is now close to 24 hours.

A recap of what happened up to the point of realizing that the cluster was in what I would call an inconsistent state. I spent most of the day yesterday cleaning up, removing a number of old unused VMs. Amongst those were a couple of powered-off VMs that I did not think much about before deleting them.

About 1½ hours after deleting the last VMs, I got an error in vROps about one adapter instance not being able to collect information about the aforementioned powered-off VMs. I looked in the environment information tab to see if they were still listed along with some of the others I had deleted. But no – they weren't there. Hmm.

Then I thought they might still be listed in the license group I had defined. I went over to look, and to my horror this was the first sign something was wrong – none of my licenses were in use?! Looking at the license groups view, all my hosts were suddenly shown as unlicensed, and my license group that normally has around 1800 members was empty. What? Editing the license group showed that the 1800 members, including the hosts listed under unlicensed, were set to "Always include" – so how come they weren't licensed?

At this point I began suspecting that the cluster was entering a meta state of existence. Looking at the Cluster Management page I missed a critical piece of info at first, but more on that later. Everything was up and running, so I went to the Solutions menu with the intent of testing the connection to each vCenter server. But doing so caused an error that the currently selected collector node was not available? But the cluster just told me everything was up? So I tried every one of the 4 nodes, but none worked. Okay, what do I do? I tried removing an adapter instance and adding it again. Big mistake. You can't re-add it with the same name, so I had to make up a new name for the same old vCenter…

That still did not work. Then I went back to Cluster Management and decided to take one of the data nodes offline and then online again to see if that fixed it. While waiting at "Loading" after initiating the power off, I suddenly got an error saying it was unable to communicate with the data node. Then the page reloaded and the node was still online. Unsure what to do, I stared at the screen, only to suddenly see a message "Your session has expired" and then be booted back to the login page?

When logging back in I now only saw half of the environment, because the old adapter that I had removed and re-added under another name was not collecting. It just stated Failed.

I decided to take the car home from the office at this point. I was not sure what to do and needed a few hours to get some distance from it. Back home I connected to the cluster again and looked at Cluster Management once more. Then I spotted the (or at least "a") problem.

Below is a screen print of what it normally looks like:

[Image: Correct]

And here is what it looked like now:

[Image: Wrong]

Notice the slight problem that both HA nodes reported as being Master? That cannot be good. What to do, other than power off the entire cluster and bring it online again?

About 30 minutes later the cluster was back online and I started to get alerts again. A lot of alerts. Even alerts that it had previously cancelled, back in the Easter week. But okay – monitoring is running again. So I decided to leave it at that and pick it up again this morning.

Well, still no dice – things were still not licensed. Damnit. So I opened a ticket with VMware. While uploading log bundles and waiting, I tried different things to get it to work, but nothing. Then suddenly my colleague said he couldn't log into vROps with his vCenter credentials. What? I had been logged in as admin while trying to fix this, so I hadn't tested my vCenter account. It did not work either. Or at least not when using user@doma.in notation. Using DOMA\user it worked – at least I could log in and see everything from the adapter that I re-added yesterday. Not the other one. What?

By this time a notification event popped up in vROps; clicking it gave me "1. Error getting Alert resource". What? Now pretty desperate, I powered off the cluster again and then back on. This fixed the new error of not showing alerts. At least for 30 minutes. Then suddenly some alerts showed this again.

Trying to log in with vCenter credentials did not work at all now. This was escalating! I tried setting the login to a single vCenter instead of all vCenters. Previously I had only been able to see the contents of the re-added vCenter adapter, so I tried the one I could not see anything from. DOMA\user worked and I could see its info. Success – I thought. Logging back out and trying it against the re-added vCenter did not work with DOMA\user, but user@doma.in worked? But when inspecting the environment I was seeing the data from the other vCenter? What?

Right now I am uploading even more logs to VMware. I will update this post when I figure out what the h*** went wrong here.


Disabling “One or more ports are experiencing network contention” alert

From day one of deploying vRealize Operations Manager 6.0, I had a bunch of errors in our environment on distributed virtual port group ports. They were listed with the error:

One or more ports are experiencing network contention

Digging into the exact ports that were showing dropped packets resulted in nothing. The VMs connected to these ports were not registering any packet drops. Odd.

It took a while before any info came out, but it was apparently a bug in the 6.0 code. I started following this thread on the VMware community boards and found that I was not alone in seeing this error. In our environment the error was also only present when logging in as the admin user; vCenter admin users were not seeing it, which pointed towards a cosmetic bug.

A KB article was released stating that this is a bug and that the alert can be disabled, but it does not describe exactly how to disable it. The alert is disabled by default in the 6.0.1 release, but if you installed 6.0 and upgraded to 6.0.1 without resetting all settings (as I did not do), the error is still there.

To remove the error, log in to the vROps interface and navigate to Administration, then Policy, and lastly Policy Library, as marked in the image below:

[Image: Policy]

Once in the Policy Library view, select the active policy that is triggering the alert. For me it was Default Policy. Once selected, click the pencil icon to edit the policy, as shown below:

EditOn the Policy Editor click the 5. step – Override Alert / Symptom Definitions. In Alert Definitions click the drop-down next to Object Type, fold out vCenter Adapter and click vSphere Distributed Port Group. To alerts will now show. Next to the “One or more ports are experiencing…” alert click the error by State and select Local with the red circle with a cross as show below.

[Image: Local Disabled]

I had a few issues with clicking Save after this. I do not know exactly what fixed it, but I had just logged in as admin when it worked. This disables the alert! Easy.


Default Host Application error

Last week I was called up by one of our Windows admins. He had some issues with a VM running Windows and IIS. As we were talking, he also casually mentioned another error he was seeing that was "caused by VMware". I was a bit sceptical, as you might imagine 🙂

He was seeing this error when he attempted to browse the IIS web page by clicking the link available in the IIS Manager:

[Image: Default Host Application Error]

Notice the VMware icon at the bottom. This is an error from VMware Tools! What? As any sane person would do, I consulted Google. And got a hit here – https://communities.vmware.com/message/2349884

The third reply post gave me the answer. It seems that when installing VMware Tools, it might associate itself with HTTP and HTTPS links. This would then cause a click on the link in IIS Manager to call VMware Tools, which is unable to service the request. The fix is pretty straightforward.
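
If you want to check the association from a shell before touching anything, the classic machine-wide protocol handler is stored in the registry. A small PowerShell sketch (assuming the handler is registered in the usual HKEY_CLASSES_ROOT location):

# Inspect which command is registered to handle the http:// protocol.
# If VMware Tools grabbed it, the Default Host Application path shows up here.
Get-ItemProperty "Registry::HKEY_CLASSES_ROOT\http\shell\open\command" |
    Select-Object -ExpandProperty "(default)"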

Go to Control Panel, then Default Programs and Set Associations. Scroll down to the Protocols section and locate HTTP and HTTPS. Make sure these are set to your browser of choice – in the image below I set them back to Internet Explorer (he was a Windows sysadmin after all 🙂). If the association is wrong, it will be set to Default Host Application, as shown for TELNET.

[Image: Fix]