VMware vSAN on Cisco UCS Part 2 – UCS Profile

As mentioned in part 1, this is part of a larger post that I never got around to finishing because it grew to an unmanageable size. This is part 2, where I will touch on the configurations of the UCS profiles that I use to run vSAN. I'm writing this part primarily because, despite Cisco having vSAN Ready Nodes, they lack a validated design for how to set it up – if you know of one, please let me know!

Preamble

Cisco UCS managed servers have the great advantage of being easy to configure consistently while still offering easy options to update configurations across multiple servers. I have been working with UCS Manager for close to 8 years now, have had a lot of problems occur, and have learned a lot about how the product works in the process. These “recommendations” are based on my personal preferences and borrow from different best practices and configurations that I have encountered over the years.

As I primarily work with M5 configurations today I will focus on those and inject points for some M4 issues I have encountered and their fixes/workarounds.

One thing that I do, although it is not necessary, is to separate my clusters into separate Sub-Organizations inside of UCS Manager. This gives a nice clean layout where I can make generic policies in the Root organization and specific policies under each sub-organization.

Boot Drive configuration

On M5 (and M6) I use a Storage Profile to define the OS LUN for ESXi. As all other disks are in JBOD mode nothing needs to be done other than confirming JBOD mode (which is the default if you select a SAS HBA for the server). The Storage Profile consists of a Disk Group policy and a Local LUN definition.
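As a side note, once ESXi is installed you can quickly confirm from the ESXi shell that the pass-through disks are seen the way vSAN expects them – a minimal check, assuming the vdq utility is present on the host:

vdq -q

This lists each local device with a State field (for example whether it is eligible for use by vSAN) and a reason if it is not, which is handy for catching a controller accidentally left in RAID mode.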

In the Disk Group policy I usually set the RAID Level to RAID 1 Mirrored and flip Disk Group Configuration to manual. I then define disks 253 and 254 as the constituents of the Disk Group, as these are always the two disks on the M.2 HW RAID controller. Everything else I leave at the default.

With this Disk Group Policy in hand I create a Storage Profile and, under the Local LUNs section, create a LUN. I normally call the LUN OS and set a Size of 32 GB. Auto Deploy is set, Expand To Available is checked, and finally the Disk Group Policy is selected.

I could set the size larger than 32 GB, but since Expand To Available is enabled it will automatically fill the 240 GB RAID 1 volume (or 960 GB if choosing the large boot drives).

For M4 I use a different method which will be mentioned in the Boot Policy section.

Network Policies

Next up to configure is networking. Here I have borrowed a bit from what Cisco HyperFlex does. HyperFlex is Cisco's answer to vSAN and it works, to some extent, in a similar manner.

The first thing to do is to set up the QoS System Classes with the correct MTU settings so that I can utilize CoS Preserve on the upstream switches if need be. The table below shows the settings I use in my environments.

Priority      CoS  Packet Drop  Weight       MTU
Platinum      5    No           4            9216
Gold          4    Yes          4            1600
Silver        2    Yes          best-effort  1600
Bronze        1    Yes          best-effort  9216
Best Effort   Any  Yes          best-effort  9216
QoS System Class

I use Platinum for vSAN storage traffic, Gold for VM guest traffic, Silver for ESXi management traffic and Bronze for vMotion interfaces. Note that both Bronze and Platinum allow MTU 9000 jumbo frames to be used inside ESXi for optimum performance. Make sure the switches upstream of your Fabric Interconnects support MTU 9216.
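Once the vSAN and vMotion vmkernel interfaces exist it is worth verifying that jumbo frames actually pass end to end. A quick check from the ESXi shell, where vmk2 and the target address are placeholders for your vSAN vmkernel port and a peer host:

vmkping -I vmk2 -d -s 8972 192.168.50.12

The -d flag sets the don't-fragment bit, and 8972 bytes is the largest ICMP payload that fits in a 9000 byte MTU once the IP and ICMP headers are accounted for, so a failure here means something in the chain (vSwitch, vNIC and QoS class, FI uplinks or upstream switches) is not consistently set for jumbo frames.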

I take these classes and create matching QoS Policies from them. I simply use the same name, select the priority and leave all other settings at the default. I need these policies when configuring vNICs.

I usually also create a Network Control Policy that enables both receive and transmit for CDP and LLDP, allows forged MAC addresses and sets the action on uplink failure to Link Down. More on that later.

Before we start defining vNICs and LAN Connectivity Policies we need MAC addresses for the vNICs. UCS Manager lets you define your own MAC addresses inside the 00:25:B5 prefix, defining as much of the remainder as you want. You could easily just create a single pool and have UCS Manager assign MAC addresses from it, but we borrow an idea from how HyperFlex designs its MAC pools.

What HyperFlex does is select the 4th octet of the MAC as a prefix for a cluster, e.g. A1, so that each MAC starts with 00:25:B5:A1. That means you can identify a cluster in your network based on the 4th octet alone. Neat!

Next, HyperFlex uses the 5th octet to encode the vNIC number and attached fabric. This means that vNIC1 will have A1 and vNIC2 will have B2, so when setting it up you can match the 5th octet to a function. I use A1 and B2 for ESXi management, A3 and B4 for vSAN, A5 and B6 for guest traffic, A7 and B8 for vMotion, and any additional required vNICs continue from there.

I create MAC pools to match a minimum of 8 vNICs (2 mgmt, 2 vSAN, 2 guest and 2 vMotion), then add 2 for NFS and 2 for virtual networking if needed.
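A nice side effect of this scheme is that you can tell which vmnic serves which function directly from inside ESXi, because the assigned MAC addresses are visible there:

esxcli network nic list

Matching the 5th octet of each vmnic's MAC (A1/B2 = management, A3/B4 = vSAN, A5/B6 = guest, A7/B8 = vMotion in my layout) tells you which vmnic belongs to which function and fabric before you start building vSwitches.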

With the MAC pools in hand and the policies from above I create a set of vNICs for ESXi. I prefer to have two for each function, one on fabric A and one on fabric B, without fabric failover. ESXi can easily handle the failover itself, and set up like this ESXi can use both links from the server if need be. In case of a failure on one of the links I would rather see the vNICs go down and have ESXi handle the failover than have it be transparent to ESXi.
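As a rough illustration of what that failover looks like on the ESXi side, here is a minimal sketch using standard vSwitch commands from the ESXi shell – the vSwitch, port group and vmnic names are placeholders, and the same idea applies to a distributed switch's teaming policy:

esxcli network vswitch standard policy failover set -v vSwitch1 -a vmnic2,vmnic3
esxcli network vswitch standard portgroup policy failover set -p vsan -a vmnic2 -s vmnic3
esxcli network vswitch standard portgroup policy failover set -p vmotion -a vmnic3 -s vmnic2

Because the Network Control Policy sets the action on uplink failure to Link Down, a failed fabric or uplink simply shows up as a dead vmnic and ESXi fails the affected port groups over to the other uplink.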

Each vNIC name is suffixed with the expected fabric, e.g. esxi-mgmt-a and esxi-mgmt-b. I set the “-a” vNIC as the primary template in a Peer Redundancy setup and “-b” as the secondary. This allows me to update VLANs and configuration only on the “-a” vNIC and have the configuration stay in sync with the “-b” vNIC. The template type is set to Updating to allow things like additional VLANs to be added to all servers using this vNIC without having to go through every profile. MTU needs to match the QoS policy selected and defined above. Select the matching MAC pool, set the Network Control Policy and you are done. Then repeat for each required vNIC.

I use the created vNICs to create a LAN Connectivity Policy which contains all the vNICs and sets the adapter policy to VMWare (yes, Cisco capitalizes it wrong 🙁 ). And that is it for networking for now. We will use the LAN Connectivity Policy when defining the Server Profile Template.

Server Profile Policies

I need a couple of Server Policies before I can create the Server Profile Template. The first one I create is a Scrub Policy. This policy I generally make in the Root scope, as I globally want scrub to be disabled for all types: Disk, BIOS, FlexFlash and Persistent Memory. I generally don't want UCS to wipe settings unless specifically instructed to do so.

Next up is a Boot Policy. For M5 I define a policy that uses Boot Mode UEFI with Secure Boot enabled. Then I add a single boot option of type Local LUN using the LUN name OS, which we defined in the Storage Profile previously.

If attempting to boot from an internal drive in an M4 as described in Part 1, some special options need to be set. Instead of using Local LUN, select Embedded Disk and then modify the Uefi Boot Parameters to set Boot Loader Name to “BOOTX64.EFI” and Boot Loader Path to “\EFI\BOOT\”. This is the only way I have found to do UEFI Secure Boot on those drives.

I set up a Maintenance Policy for the Server Profile as well, setting every action that might require a reboot to “User Ack”, which means that I need to manually approve any reboots of the host resulting from profile changes. I also set the “On Next Boot” option to allow for easy firmware updating while updating ESXi: On Next Boot will apply any pending changes when the host reboots, such as when applying ESXi updates. Convenient!

Lastly I create a Host Firmware Package policy which sets the firmware version to use in that cluster. As firmware packages can contain firmware for the SAS HBAs, I want tight control over which firmware is used. This also allows me to change the firmware level of the cluster in one step and then have pending changes ready on each host for when I'm ready to do the reboot to update firmware.

Server Profile Template

With all those profiles and things ready I can now create the template that each server will be instantiated from. This will be an updating template to allow for changes to be done consistently on all hosts and avoid configuration drift.

I usually just run through the wizard and select the policies created above where applicable. As we don't have any FC in our setup I usually don't set up any vHBAs. These can be added later thanks to the Updating setting.

The only thing I do manually is select the LAN Connectivity Policy to get the required vNICs for ESXi attached. Once added, I complete the wizard and go back into the Network tab of the template to click “Modify vNIC/vHBA Placement”. I do this because the edit view is easier to manage when accessed from there instead of from the wizard. I then manually place the vNICs in the order I want to force.

Conclusion

With all that in place there is now a profile template that can be used to produce identical ESXi hosts for vSAN usage. The profile even works on “compute only” nodes that don't provide any storage to the cluster, as long as they still use the M.2 HWRAID boot module. Very nice in my opinion.

Next up in part 3 I will go over some of my ESXi configurations that I prefer in the vSAN pods I run.

VMware vSAN on Cisco UCS Part 1 – Hardware

I have had parts of this post saved as a draft for months without getting it finished because it was turning into a monster of a post when I tried covering it all. I finally found the drive to finish it when I realized that it would probably be better to split it into multiple posts instead of trying to include hardware considerations, UCS Manager / standalone profile configurations and ESXi configurations in one single post.

So without further ado, let's dive into the hardware part of VMware vSAN on Cisco UCS.

Please do note that these are my personal opinions and may or may not align with what you need in your datacenter solution. Most designs are individual to a specific use case and as such cannot be taken directly from here.

Base models

Cisco has a bunch of certified vSAN Ready Nodes based on the M4, M5 and M6 generations of servers. M3 isn't supported as the hardware is both EOL and most of the controllers available for M3 models weren't powerful enough for running vSAN workloads. The most common to use are Cisco's C240 M5SX 2U models, which allow for 24-26 drive bays in total. For smaller deployments the C220 M5SX is also an excellent option with up to 10 drives in 1U.

It is technically possible to run vSAN on other types of servers like the S3260 and B200 blades but they limit your options in terms of storage to compute ratio (S3260 being able to provide massive amounts of storage but little in compute and B200 being the opposite due to only having 2 disk slots).

One thing to note is if you plan on using NVMe storage options you need to focus on M5 and M6. M5 allows for up to 4 NVMe devices in U.2 format while M6 can support up to 24 NVMe devices. M4 only supports PCIe NVMe devices.

Boot options

Cisco has traditionally been a network boot company, and as such the primary local boot option on M3 and M4 is SD cards if you don't want to waste disk slots on boot devices. On the B200 M4, with only 2 disk slots, SD card is currently the only option as the disk slots are needed for a caching and a capacity disk. On all M5 and M6 models (B200 included) there is a new dedicated slot for a UCS-M2-HWRAID controller which can fit 2 M.2 drives (either 240 or 960 GB) and can do actual RAID that ESXi supports. Do not use the UCS-MSTOR-M2 controller, which fits the same slot and also takes 2 M.2 drives, but only supports the onboard LSI SW RAID from the Intel chipset, which is supported by Windows and Linux but not ESXi. It is not that expensive – just buy the HWRAID controller 🙂

Specifically on the C240 M4 if you choose a UCSC-PCI-1C-240M4 you can insert up to two drives internally in the server that are managed by the onboard controller. You won’t have RAID functionality but it beats SD card booting by miles!

NIC

My go to here is using M5 servers with a UCSC-MLOM-C40Q-03 (VIC 1387) in combination with 6300 series Fabric Interconnects. That provides 2x40G per server which pairs nicely if your upstream network is 40 or 100G. On M6 that would be UCSC-M-V100-04 (VIC 1477) that provides the same.

If you are using 6400 series Fabric Interconnects and a 25G infrastructure you might want to go with UCSC-MLOM-C25Q-04 (VIC 1457) on M5 and UCSC-M-V25-04 (VIC 1467) on M6 to give 4×10/25G connections instead. Depends on your infrastructure.

On M4 it is technically possible to use the UCSC-MLOM-C40Q-03 (VIC 1387), although the UCSC-MLOM-CSC-02 (VIC 1227) adapter is far more common but only provides 2x10G connections. If you run a pure 10G infrastructure and will continue to do so, I recommend adding an additional UCSC-PCIE-CSC-02 (VIC 1225) to provide another 2x10G. I see this combination primarily used with 6200 series Fabric Interconnects.

For blades the standard is UCSB-MLOM-40G-03 (VIC 1340) for M4 and UCSB-MLOM-40G-04 (VIC 1440) for M5 and M6. Both cards are 2x40G. These need to be paired with IOMs in the blade chassis, which can limit the speed of the vNICs presented. Usually you get 2x20G on IOM 2304 and 2208. Consult your Cisco vendor to confirm how to get optimum speeds for your setup.

Controllers

Now for probably the most crucial part of any vSAN deployment – the controller. Albeit less important if you go for all-NVMe, or even the new ESA option in vSAN 8, you still need a SAS/SATA controller to handle your disks.

On the C240 M4 this is usually the UCSC-SAS12GHBA or the UCSC-MRAID12G with a UCSC-MRAID12G-1GB cache module. Both are on the HCL but the SAS HBA is preferable over the RAID controller.

On C220 and C240 M5 the only real options for vSAN are UCSC-SAS-M5 and UCSC-SAS-M5HD respectively. Primary difference is how many drives the controller is capable of utilizing which of course needs to be higher for the C240.

On the C240 M6 the option is the CSC-SAS-M6T (UCSC-SAS-240M6), which allows for up to 16 disks, but to be honest – if you are going for M6 nodes you should probably go for an M6N or M6SN all-NVMe configuration instead.

Disks

I won't touch too much on this as various use cases and requirements need different numbers of disk groups and capacity devices. Your use case may vary. We primarily use 3.8 TB Enterprise Value SATA SSDs for capacity simply because they are fast enough and readily available to us. We aim to use NVMe caching devices if at all possible, and if not we select a high endurance and performance SAS SSD for caching.

One note to keep in mind: M4 only supports PCIe NVMe devices. On the C220 M5SX two front slots can be used for NVMe and on the C220 M5SN all 10 slots can be NVMe. On the C240 M5SX slots 1 and 2 as well as 25 and 26 (on the rear) can be used for NVMe, and on the C240 M5SN bays 1-8 can be used for NVMe.

If you are retrofitting NVMe drives into existing C2x0 M5s, note that on the C220 M5 you need a CBL-NVME-220F cable to be able to use the front facing NVMe drives if it is not already present.

On the C240 M5 I recommend going for a UCSC-RIS-2C-240M5 riser, which supports both 2x front and 2x rear mounted NVMe drives, provided you remember to order a CBL-NVME-240SFF and a UCSC-RNVME-240M5 to connect the front and rear slots respectively to the riser. This configuration allows up to 4 NVMe caching devices with SAS/SATA capacity drives, up to 5 capacity drives per disk group, which can be a lot of disk and performance.

Conclusion

So those are the notes on hardware I have. I have not touched on CPU types and memory configurations at all, as this is something that needs to match your workload. Some workloads might need a 3.0 GHz base clock and little memory, others loads of cores and memory. Pick something that matches the workload, but I would recommend sticking to Xeon Gold CPUs to get a good balance of performance and cores, and selecting a configuration of 12 DIMMs for M5s to get maximum memory bandwidth.

In the next article I’ll touch on the UCS Manager configurations that I use for vSAN.

System logging not Configured on host

A few weeks ago I noticed a warning on some of our hosts in our HyperFlex clusters and wondered what was going on. It was only hitting Compute Only nodes in the clusters.

The warning indicates that Syslog.global.logDir is not set, as per KB2006834. But when I looked via SSH on the host, it was logging data and the config option was set, so it was working – so why the warning?

Well it turns out to be something not that complicated to fix. The admin who set up the nodes set the option to:

[] /vmfs/volumes/<UUID>/logs/hostname

That is giving it an absolute path on the host like you would do with the ScratchConfig.ConfiguredScratchLocation option. This works but triggers the warning as if it was not set.

The fix is simple: change it to use the DatastoreName notation, like this:

[DatastoreName] logs/hostname

This immediately removed the warning and everything continued as it had before.
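For reference, the same change can also be made from the ESXi shell – a hedged example where the datastore and folder names are of course placeholders for your own:

esxcli system syslog config set --logdir="[DatastoreName] logs/hostname"
esxcli system syslog reload
esxcli system syslog config get

The last command lets you confirm the new log directory took effect, and the warning should clear without the host having to be rebooted.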

IO Issues on Cisco C220 M4

Hi All

As promised more content is finally available! Unfortunately I cannot share any screenshots from this actual issue I worked on – you will have to take my word for it.

At my new job I was tasked with solving some issues that had been observed at a remote site on two Cisco C220 M4s with local disks. The hosts run a set of redundant services but nothing is shared between them except network equipment. The issues were 1) sometimes around 04.00 AM the software running in the virtual machines would briefly throw an alarm and 2) powering a VM on/off or snapshotting it would cause the software to throw alarms as well, but on other VMs than the one being powered on/off or snapshotted. The event log on both hosts showed intermittent I/O latency warnings not connected to any of the above issues, but nothing alarming.

Now one note though. The software running on these VMs is very latency sensitive so something like snapshots could potentially be a problem in any case but powering on VM A should not affect VM B unless the host is hurting for resources which is not the case.

Before diving in I asked out in the vExpert Slack if anyone had seen issues like this before or had any ideas of what to look for. James Kilby and Bilal Ahmed were quick to throw some ideas on the table. James suggested that vswp file being created on power on might cause the problem and Bilal suggested looking at the network. With those things in mind I started debugging.

First off, I had already decided to update the ESXi version to the latest 5.5 U3 + patches – it was a little outdated. I also decided to firmware upgrade the servers with the latest 3.0.3c release for standalone C220 M4s. I had also found the latest supported drivers for the Cisco RAID controller (lsi-mr3) and the Intel NIC (igb) to rule out any compatibility issues. It was also my hope that an update would remove some of the I/O latency warnings.

Now before upgrading anything I tested to see if the problem was still there. It was. Powering on a test VM with no OS and just 1 core and 4 GB RAM – instant alarms kicking off in the application. However, powering off did not cause any noticeable problems. I proceeded with the firmware update, hoping this would solve the issues. Firmware upgrading on a remote site through a small connection is painful! It took a while. But once the first host was updated I proceeded to test if the issues were still there. They were. Damnit. Time to dig deeper. I tried out James' idea of the vswp file being the problem, and setting a 100% memory reservation did seem to solve it. However, this was not a viable solution as it would only help if the VM being powered on has a reservation. If anyone powered on a VM without one, it would still affect all other VMs, regardless of reservations on those.

I booted up our favorite debugging tool ESXTOP, went into HBA mode and set the delay down to 2 seconds. I then observed the Cisco RAID controller's behavior during power on operations, and that freaked me out. It would happily do anything between 100 and 1000 IOPS at 5 to 150 ms while not powering anything on. The latency would spike high but nothing I was that scared of on a small set of local 10k disks. However, when powering on a VM without a reservation the HBA would stop doing any operations for upwards of 4 refreshes in ESXTOP (at 2 second intervals!). All indicators showed 0. No IO was passed so no latency was observed. This scared me a bit. Latest firmware and supported drivers. Damn. We weren't seeing the same issue on another site with Dell servers, but they also had SSDs instead of 10k disks. Was this the 10k disks not performing well enough?
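If you want to reproduce that kind of observation yourself, the esxtop side of it is just a few keystrokes (nothing host specific here):

esxtop
# press 'd' for the disk adapter (HBA) view
# press 's' and then 2 to set the refresh interval to 2 seconds
# watch CMDS/s plus the DAVG/KAVG/GAVG latency columns per vmhba while powering the VM on

Seeing CMDS/s sit at 0 for several refresh intervals while a trivial VM powers on is what pointed at the controller/driver rather than the disks themselves.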

We had a short talk internally about what to do. My boss suggested that we look at this Cisco bug:

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCut37134/?referring_site=bugquickviewredir

This bug references an issue on some C220 M4s that were installed with a specific version of the 5.5 ESXi Cisco Custom ISO. It was not the ISO these hosts were installed with, but the workaround was to use a different driver than the one ESXi selects by default for the Cisco RAID controller – the lsi-mr3 driver. Instead it instructs you to make sure that the megaraid-sas driver is installed and to remove the lsi-mr3 and lsi-msgpt3 drivers and reboot, which will make megaraid-sas the active driver. We decided to try this. Downloaded the latest supported megaraid-sas driver for the server and removed the lsi-mr3 and lsi-msgpt3 drivers. Reboot and wait.
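If you need to do the same swap, the general shape of it from the ESXi shell looks like this – a sketch only, where the offline bundle path and file name are placeholders and you should use whatever megaraid-sas version is listed as supported for your controller and ESXi build:

esxcli software vib list | grep -E 'lsi-mr3|lsi-msgpt3|megaraid'
esxcli software vib install -d /vmfs/volumes/datastore1/megaraid-sas-offline-bundle.zip
esxcli software vib remove -n lsi-mr3
esxcli software vib remove -n lsi-msgpt3
reboot

Installing the replacement driver before removing the others just makes sure the controller still has a driver to claim it on the next boot.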

After getting online again with one host we tested. Powered on the test VM – and nothing. No alarms. What a difference. Tried it again looking at ESXTOP. No drops in IO. It was now doing 8,000 IOPS at 15 ms, no problem. Major difference. Mass powering on VMs, 15 at a time, had previously taken minutes before the actual power on task was done and the machines would start booting. It took seconds now.

So what is the moral here? Apparently it could benefit you as a Cisco UCS server user to use the megaraid-sas driver instead of the lsi-mr3 driver. Both are included on the Cisco Custom ISO but it defaults to the lsi-mr3 so you actively have to do something to change that.

VMworld Europe 2016 – Day 1

Early morning day 2 of my VMworld 2016 trip seems like the time to do a short recap of yesterday.

Yesterday started with the General Session keynote where Pat Gelsinger and several others presented the view from VMware. Amongst his points I found the following things most interesting:

  • THE buzzword is Digital Transformation
  • Everyone is looking at Traditional vs Digital business
  • However, only about 20% of companies are actively pursuing this. 80% are stuck in traditional IT and spend their time optimizing predictable processes.
  • Digital Business is the new Industrial Revolution

AWS launched 10 years ago, in 2006. Back then there were about 29 million workloads running in IT. 2% of that was in the cloud, mostly due to Salesforce; 98% was in traditional IT. Skip 5 years ahead and we had 80 million workloads, with 7% in public cloud and 6% in private – the remaining 87% still in traditional, perhaps virtualized, IT. This year we are talking 15% public and 12% private cloud and 73% traditional out of 160 million workloads. Pat's research team have set a specific time and date for when cloud will be 50% (both public and private): June 29th 2021 at 15:57 CEST. We will have about 255 million workloads by then. In 2030, 50% of all workloads will be in public clouds. The hosting market is going to keep growing.

Also, the number of devices we are connecting will keep growing. By 2021 we will have 8.7 billion laptops, phones, tablets etc. connected. But looking at IoT, by Q1 2019 there will be more IoT devices connected than laptops and phones, and by 2021 18 billion IoT devices will be online.

In 2011, at VMworld in Copenhagen (please come back soon 🙂 ), the SDDC was introduced by Raghu Raghuram. Today we have it and keep expanding on it. So today vSphere 6.5 and Virtual SAN 6.5 were announced for release, as well as VMware Cloud Foundation as a single SDDC package and VMware Cross Cloud Services for managing your multiple clouds.

vSphere 6.5 brings a lot of interesting new additions and updates – look here at the announcement. Some of the most interesting features from my view:

  • Native VC HA feature with an Active, Passive, Witness setup
  • HTML 5 web client for most deployments.
  • Better Appliance management
  • Encryption of VM data
  • And the VCSA is moving from SLES to Photon.

Updates on vCenter and hosts can be found here and here.

I also got to stop by a few vendors at the Solutions Exchange and talk about new products:

Cohesity:

I talked to Frank Brix at the Cohesity booth, who gave me a quick demo and a look at their backup product. A very interesting hyper-converged backup system that includes backup software for almost all use cases and scales linearly. Built-in deduplication and the possibility of presenting NFS/CIFS out of the deduped storage. Definitely worth a look if you are reviewing your backup infrastructure.

HDS:

Got a quick demo on Vvols and how to use it on our VSP G200 including how to move from the old VMFS to Vvols instead. Very easy and smooth process. I also got an update on the UCP platform that now allows for integration with an existing vCenter infrastructure. Very nice feature guys!

Cisco:

I went by the Cisco booth and got a great talk with Darren Williams about the Hyperflex platform and how it can be used in practice. Again a very interesting hyper-converged product with great potential.

Open Nebula:

I stopped by OpenNebula to look at their vOneCloud product as an alternative to vRealize Automation, now that VMware removed it from vCloud Suite Standard. It looks like a nice product – I saw OpenNebula during my education back in 2011, I think, while it was still version 1 or 2. They have a lot of great features but are not totally on par with vRealize Automation – at least not yet.

Veeam:

Got a quick walkthrough of the Veeam 9.5 features as well as some talk about Veeam Agent for Windows and Linux. Very nice to see them move to physical servers, but there is still some way to go before they can take over all backup jobs.

 

Now for Day 2’s General Session!

Cisco UCS Manager AD authentication

Hello all

I have had this on my to-do list for a while and finally got around to finishing it – using AD authentication on Cisco UCS Manager (UCSM). Now, this is not necessarily complicated, but the official guides expect you to use a single AD domain and use sAMAccountName as the userid attribute. We have a large forest with a single root domain and a lot of child domains, all with parent-child trusts to the root domain. We do not have sAMAccountName uniqueness across domains, so instead we use userPrincipalName as the unique identifier for users. Users can also come from any of the child domains, so to avoid having to add every domain we usually add a connection to the Global Catalog instead. A note – the images below are from the UCSM 3.1 HTML5 interface but it is the same in the older 2.2 Java interface.

Now let's get into it. First things first, we need to add domain controllers. I suggest you add two for redundancy purposes. Go to the Admin pane, down to User Management, and unfold LDAP. Right click LDAP Providers and click Add.

In the image below I input some mock info, but the important part is to set the full DN of the user UCS should use to bind to AD. If you are using a multi domain forest, set Base DN to the root domain's Base DN and set the port to 3268 for GC LDAP or 3269 for GC LDAPS (remember to check the SSL box). Set the Filter to userPrincipalName=$userid. Input the password for the bind account and select MS AD.

LDAP provider

Click Next and you will be taken to the LDAP Group Rule page. Set Group Authorization to Enabled and Group Recursion to Recursive. The rest should be left at the defaults, and it should now look like this:

LDAP group rule

We now need to make an LDAP Provider Group. Right click LDAP Provider Groups and click Add. Give the group a name and add your domain controllers:

LDAP Provider Group

Click OK to finish creating the provider group. Now we need to add some group mappings for use with Group Authorization. Right click LDAP Group Maps and click Add. In the GUI below, input the group DN and select which roles its members should have. You can use the built-in roles or create your own. Click OK to save.

LDAP Group Maps

Finally, and this is where the magic happens, add a new Authentication Domain. Unfold Authentication under User Management, right click Authentication Domains and click Add. Give your domain a name (you will see this when you log in) and select LDAP. Once LDAP is selected, a drop-down will be shown where you can select the LDAP Provider Group you created earlier. Once done, click OK to save.

Authentication Domain

From now on, when you access the login page you will see a drop-down in which you can select either Native or the name of your Authentication Domain. Select your Authentication Domain, input your userPrincipalName and password in the fields, and enjoy using AD login!
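One closing tip: before putting the details into UCSM (or when troubleshooting a login that does not work) it can save some head scratching to verify the bind DN, the Global Catalog port and the filter from any machine with the OpenLDAP tools installed – the server name, DNs and UPN below are made-up example data:

ldapsearch -H ldaps://dc01.example.com:3269 \
  -D "CN=svc-ucs-bind,OU=Service Accounts,DC=example,DC=com" -W \
  -b "DC=example,DC=com" \
  "(userPrincipalName=jdoe@child.example.com)" dn memberOf

If that returns the user's DN and group memberships, the same Base DN, bind account and userPrincipalName=$userid filter will work in the UCSM LDAP provider.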

Production Cluster Upgrade

During the spring of this year, a few of my colleagues and I spent several months in meetings with storage solution providers and server hardware manufacturers to figure out if we should try out something new for our VMware production clusters. We had a budget for setting up a new cluster, so we wanted to look at our options for trying something other than the traditional blade solutions with a spinning disk FC array which we have been using for years.

Some of the considerations we made regarding storage were that we wanted to start leveraging flash in some way or form to boost intense workloads, so the storage solution would need to use flash to accelerate IO. We also wanted to look at whether server side flash could accelerate our IO as well. This led us to the conclusion that we would like to avoid blades this time around. We would have more flexibility using rack servers with respect to more disk slots, PCIe expansions etc. Going with e.g. 1U servers we would be sacrificing 6 additional rack units compared to 16 blades in a 10U blade chassis. Not huge in our infrastructure.

So we met with a bunch of different storage vendors, some new ones like Nimble Storage, Tintri and Pure Storage, and some of the old guys like Hitachi and EMC. On the server side we talked to the regulars like Dell and HP, but also Hitachi and Cisco.

All in all it was a great, technically interesting spring and by summer we were ready to make our decision. In the end we decided to go with a known storage vendor but a new product. We chose a Hitachi VSP G200 as its controller strength was on par with our existing HUS130 controllers but with smarter software and more cache. The configuration we went with was a tiered storage pool with a tier 1 layer consisting of 4x FMD 1.6 TB in RAID10. This gives us 3.2 TB of Tier 1 storage and, from the tests we have run, this tier is REALLY fast! The second and last tier is a large pool of 10k 1.2 TB disks for capacity. In total we have just shy of 100 TB of disk space on the array. It is set up so all new pages are written to the 10k layer, but if data is hot it is migrated to the FMD layer within 30 seconds utilising Hitachi's Active Flash technology. This feature takes some CPU cycles from the controller, but from what we see right now this is a good trade off. We can grow to twice the current configuration in capacity and performance, so we should be safe for the duration of the array's life.

On the server side we chose something new to us. We went with a rack server based Cisco UCS solution: a cluster consisting of 4x C220 M4 with 2x E5-2650 v3 CPUs and 384 GB memory. We use a set of 10k disks in RAID1 for the ESXi OS (yes, we are very traditional and not very “Cisco UCS” like). The servers are equipped with 4x 10G in the form of a Cisco VIC 1227 MLOM and a Cisco VIC 1225 PCIe. As we were not really keen on setting up an SSD read cache (looking at vFlash for now) in production without trying it first, we actually got a set of additional Cisco servers for some test environments. These are identical to the above, but as some of my colleagues needed to test additional PCIe cards we went with C240 M4 instead for the additional PCIe slots. Two of these servers got a pair of 400 GB SSDs to test out vFlash. If it works we will move those SSDs to the production servers for use.

As I said, we got the servers in late summer and put them into production about 2½ months ago, and boy, we are not disappointed. Some of our workloads have seen 20-50% improvements in performance. We ended up installing ESXi 5.5 U3a and joining our existing 5.5 infrastructure due to time constraints. We are still working on getting vSphere 6.0 ready, so hopefully that will happen in early spring next year.

We have made some interesting configurations on the Cisco UCS solution regarding the network adapters and vNIC placement, so I will post something later on how this was done. We also configured AD login using userPrincipalName instead of sAMAccountName, which was not in the documentation – stay tuned for that as well. And finally – have a nice Christmas all!