vRealize Orchestrator 6.0.2.1 -> 7.0

Oh such end of the year content!

Today I set about updating our vRealize Orchestrator (vRO) appliance from 6.0.2.1 to 7.0 to address the recently announced security issues (VMware Security Advisory ID: VMSA-2015-0008.1).

It should have been an easy update with the VAMI available, but I quickly ran into this issue:

[Image: FailedUpgrade]

Not very informative, so I looked at the updatecli.log file in the given location, and it only told me that the pre and post installs had failed. Again, not very informative. I looked into the vami.log file and saw that it had downloaded all the files and had created a file to mark that a reboot was required. So I thought I had better try a reboot before starting the install again. At first this looked like it worked! But alas, the update later threw this error:

[Image: FailedUpgrade2]

I will update this post when I find a solution!

Production Cluster Upgrade

During the spring of this year, a few of my colleagues and I spent several months in meetings with storage solution providers and server hardware manufacturers to figure out whether we should try something new for our VMware production clusters. We had a budget for setting up a new cluster, so we wanted to look at our options for trying something other than our traditional blade solution with a spinning disk FC array, which we have been using for years.

One of our considerations regarding storage was that we wanted to start leveraging flash in some way or form to boost intense workloads. So the storage solution would need to use flash to accelerate IO. We also wanted to look at whether server side flash could accelerate our IO as well. This led us to the conclusion that we would like to avoid blades this time around. We would have more flexibility using rack servers with respect to more disk slots, PCIe expansions etc. Going with e.g. 1U servers, 16 of them would take up 16U, so we would be sacrificing 6 additional rack units compared to 16 blades in a 10U blade chassis. Not huge in our infrastructure.
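The rack space trade-off is simple arithmetic, and a small sketch makes it easy to redo for other server sizes. The function and constant names here are just illustrative; the only numbers taken from our case are 16 blades in a 10U chassis versus 1U rack servers.

```python
def rack_units(server_count: int, units_per_server: int) -> int:
    """Total rack units occupied by a number of equally sized rack servers."""
    return server_count * units_per_server


# Figures from the comparison above: a 10U chassis holding 16 blades.
BLADES_PER_CHASSIS = 16
CHASSIS_HEIGHT_U = 10

# Replace every blade with one 1U rack server.
rack_servers_u = rack_units(BLADES_PER_CHASSIS, 1)
extra_u = rack_servers_u - CHASSIS_HEIGHT_U

print(f"Rack servers: {rack_servers_u}U, chassis: {CHASSIS_HEIGHT_U}U, extra: {extra_u}U")
```

Swapping the `1` for `2` shows what the same comparison would look like with 2U servers.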

So we talked to a bunch of different storage vendors, some new ones like Nimble Storage, Tintri and Pure Storage, and some of the old guys like Hitachi and EMC. On the server side we talked to the regulars like Dell and HP, but also Hitachi and Cisco.

All in all it was a great, technically interesting spring, and by summer we were ready to make our decision. In the end we decided to go with a known storage vendor but a new product. We chose a Hitachi VSP G200, as its controller strength was on par with our existing HUS130 controllers but with smarter software and more cache. The configuration we went with was a tiered storage pool with a tier 1 layer consisting of 4 FMD 1.6TB in RAID10. This gives us 3.2TB of Tier 1 storage, and from the tests we have run, this tier is REALLY fast! The second and last tier is a large pool of 10K 1.2TB disks for capacity. In total we have just shy of 100TB of disk space on the array. It is set up so all new pages are written to the 10K layer, but if data is hot it is migrated to the FMD layer within 30 seconds utilising Hitachi’s Active Flash technology. This feature takes some CPU cycles from the controller, but from what we see right now this is a good trade-off. We can grow to twice the current configuration in both capacity and performance, so we should be safe for the duration of the array's life.
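For anyone wondering where the 3.2TB figure comes from: RAID10 mirrors the drives in pairs, so only half of the raw capacity is usable. A minimal sketch of that calculation follows; the function name is made up for illustration, and it ignores formatting overhead and any spare drives.

```python
def raid10_usable_tb(drive_count: int, drive_size_tb: float) -> float:
    """Usable capacity of a RAID10 group: half the drives hold mirror copies."""
    if drive_count < 2 or drive_count % 2 != 0:
        raise ValueError("RAID10 needs an even number of drives (at least 2)")
    return (drive_count // 2) * drive_size_tb


# Tier 1 from the configuration above: 4 FMDs of 1.6TB each.
tier1_tb = raid10_usable_tb(4, 1.6)
print(f"Tier 1 usable capacity: {tier1_tb:.1f}TB")
```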

On the server side we chose something new to us. We went with a rack server based Cisco UCS solution: a cluster consisting of 4x C220 M4 with 2x E5-2650V3 CPUs and 384GB memory. We use a set of 10K disks in RAID1 for the ESXi OS (yes, we are very traditional and not very “Cisco UCS” like). The servers are equipped with 4x 10G in the form of a Cisco VIC 1227 MLOM and a Cisco VIC 1225 PCIe. As we were not really keen on setting up an SSD read cache (looking at vFlash for now) in production without trying it first, we also got a set of additional Cisco servers for some test environments. These are identical to the above, but as some of my colleagues needed to test additional PCIe cards we went with C240 M4 instead for the extra PCIe slots. Two of these servers got a pair of 400GB SSDs to test out vFlash. If it works, we are moving those SSDs to the production servers.

As I said, we got the servers in late summer and put them into production about 2½ months ago, and boy, we are not disappointed. Some of our workloads have experienced 20-50% improvements in performance. We ended up installing ESXi 5.5 U3a and joining our existing 5.5 infrastructure due to time constraints. We are still working on getting vSphere 6.0 ready, so hopefully that will happen in early spring next year.

We have made some interesting configurations on the Cisco UCS solution regarding the network adapters and vNIC placement, so I will throw up something later on how this was done. We also configured AD login using UserPrincipalName instead of sAMAccountName, which was not in the documentation – stay tuned for that as well. And finally – have a nice Christmas all!

vROps 6.1 – follow up

Back in September I wrote a piece when vRealize Operations Manager 6.1 was released. We were pretty excited about it because we were having a few issues with the 6.0.2 version we were running. Among the problems was vCenter SSO users suddenly not being able to log in via the “All vCenters” option on the front page, and selecting the individual vCenters to log in to gave unpredictable results (logging in to vCenter A showed vCenter B’s inventory?!). We also had issues with alerts that we could not cancel. They would just keep piling up, and about once a week I would shut the cluster down and start it again, as that allowed me to cancel the alerts if I did it at the right time, within 10-15 minutes after starting the cluster again.

However, as you could also read, we ran into an issue with the 6.1 update and were forced to roll back and update to 6.0.3, which solved all issues but the login problem. As we were the first to try an upgrade in production, it took a while before a KB came out on the issue. I have had a to-do item to write this up for a while, so I can’t remember when the KB actually came out, but it has not been updated for a month. The KB is 2133563 and notes that there is currently no resolution to the issue.

I recently spoke to a VMware employee who told me that the issue is in the xdb database and that the upgrade process is encountering something that either should not be in the xdb or that is missing. This causes the conversion from xdb to Cassandra to fail and the upgrade process to fail. I’m looking forward to seeing when a proper fix will come out.

We are closing in on the end of the year, so I hope to be able to finish up a few blog articles before entering the new year – on the to-do list are a few items about vRA 7 and Cisco UCS with ESXi 5.5 and 6.