Friday, April 4, 2025

SLAs using Zabbix in a VMware Environment

Zabbix 7 introduced better support for SLAs. It also improved its support for VMware.

VMware, of course now owned by Broadcom, has prioritized its Aria Operations (vROps) monitoring suite over the alternative monitoring solutions (of which there is no shortage). Usually open source solutions have a limited life cycle as developers leave the project and move on to the next zen thing, but Zabbix is still widely popular after many years. They got it mostly right the first time, and it absolutely excels at monitoring Linux.

To monitor VMware, Zabbix relies on its VMware templates. It used to present "objects" such as datastores as hosts; in version 7 it no longer does this, and instead ties datastores to the true hosts - hypervisors, virtual machines, etc. - as attributes. This makes it a bit harder to monitor a datastore in and of itself (free space, used space, and so on) if you want to do that. But version 7 also exposes all kinds of hardware sensors that were not available in version 5, along with more metrics (items), more triggers that fire out of the box, and so on.

One big adjustment in v7 is the support for SLAs. I decided to give it a shot.

The documentation only deals with a simple example, such as a 3-node back-end cluster. That is not what I wanted.

What I wanted, was to monitor a cluster of hypervisors in each of multiple datacenters.

To do this, I started with SLAs:

  • Rollup SLA - Quarterly
  • Rollup SLA - Weekly
  • Rollup SLA - Daily
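
As an aside, these SLAs can also be created through the Zabbix API rather than the UI. A minimal sketch for the quarterly one is below, assuming an API token in ZABBIX_API_TOKEN; the URL, the 99.9 SLO, and the service_tags filter are illustrative placeholders rather than the exact values used here (period 3 is the quarterly reporting period in sla.create):

curl -s -X POST https://zabbix.example.com/api_jsonrpc.php \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ZABBIX_API_TOKEN" \
  -d '{
    "jsonrpc": "2.0",
    "method": "sla.create",
    "id": 1,
    "params": {
      "name": "Rollup SLA - Quarterly",
      "period": 3,
      "slo": "99.9",
      "timezone": "UTC",
      "service_tags": [ { "tag": "platform", "value": "accelerated" } ]
    }
  }'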

Then I created a Service:

  • Rollup - Compute Platform

Underneath this, I created a Service for each data center. I used two tags on each of these: one for datacenter and the other for platform (to future-proof in the event we use multiple platforms in a datacenter). Using an example of two datacenters, it looked like this:

  • Datacenter Alpha Compute
    • datacenter=alpha
    • platform=accelerated
  • Datacenter Beta Compute
    • datacenter=beta
    • platform=accelerated 

These services have nothing defined in them except the tags, and I assigned a weight of 1 to each of them (equal weight - we assume all datacenters are equally important).
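
For illustration, creating one of these datacenter services through the API looks roughly like the call below. The parent serviceid, the sortorder, and the status calculation algorithm are placeholders I made up for the sketch; the two tags and the weight of 1 are the parts that reflect the setup described above:

curl -s -X POST https://zabbix.example.com/api_jsonrpc.php \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ZABBIX_API_TOKEN" \
  -d '{
    "jsonrpc": "2.0",
    "method": "service.create",
    "id": 1,
    "params": {
      "name": "Datacenter Alpha Compute",
      "algorithm": 2,
      "sortorder": 1,
      "weight": 1,
      "parents": [ { "serviceid": "101" } ],
      "tags": [
        { "tag": "datacenter", "value": "alpha" },
        { "tag": "platform", "value": "accelerated" }
      ]
    }
  }'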

Underneath these datacenter services, we defined some sub-services.

  • Datacenter Alpha Compute
    • Health
      • Yellow
      • Red
    • Memory Usage
    • CPU
      • CPU Utilization
      • CPU Usage
      • CPU Ready
    • Datastore
      • Read Latency
      • Write Latency
      • Multipath Links
    • Restart 

We went into the trigger prototypes and made sure that the datacenter, cluster, and platform were set as tags, so that every problem would carry these tags - necessary for the Problem Tag filter. We also had to add some additional tags to differentiate between warning and higher severities (we used level=warning for warnings, and level=high for anything higher than a warning).

On the problem tags filter, we wanted to catch only problems for our datacenters and this specific platform, so we used those two filters on every service. In the example below, we have a CPU utilization service - a sub-service of CPU, which in turn is a sub-service of a datacenter.
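
Since the screenshot doesn't come across here, the problem tags filter on that CPU utilization sub-service boils down to a fragment like this in service.create/service.update terms (operator 0 means "equals"; the values are from the example datacenter above):

"problem_tags": [
  { "tag": "datacenter", "operator": 0, "value": "alpha" },
  { "tag": "platform", "operator": 0, "value": "accelerated" }
]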

This worked fairly well, until we started doing some creative things.

First, we found that all of the warnings were impacting the SLAs. What we were attempting to do was put in some creative rules, such as:

  • If 3 hypervisors or more in a cluster have a health = yellow, dock the SLA
  • Use a weight of 9 on a health red, and a weight of 3 on a situation where 3 hypervisors in a cluster have a health of yellow.

THIS DID NOT WORK. Why? Because the rules all apply to child services, so unless every single hypervisor was a sub-service, there was no way to make it work. We couldn't have all of the hypervisors be sub-services - too difficult to maintain, too many of them, and we were using Discovery, which meant that they could appear or disappear at any time. We needed to do SLAs at the cluster level or the datacenter level, not for individual servers (we do monitor individual servers, but they are defined through discovery).

So, we had to remove all warnings from the SLAs:

    • They were affecting the SLA too drastically (many hypervisors hit health=yellow for a while and then recover). We had to revert to just the red ones and assume that a health=red affects availability (it doesn't necessarily affect true availability, but it does in certain cases).

    • We could not make the rules work without adding every single hypervisor in every single datacenter as a sub-service which simply wasn't feasible.

The problem we now face is that, in the way the SLAs roll up, the rollup SLA value is essentially the lowest of the SLA values underneath it.

  • Platform SLA (weight 0) = 79.7 - huh?
    • Datacenter A (weight 1) = 79.7
    • Datacenter B (weight 1) = 99.2
    • Datacenter C (weight 1) = 100

The platform SLA should, I think, be an average of the 3 datacenters if they are all equally weighted. But that is not what we are observing.
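
For reference, a straight equal-weight average of those three values would be (79.7 + 99.2 + 100) / 3 ≈ 93.0, which is clearly not what the platform SLA reports. The 79.7 looks like the parent simply inheriting its worst child, which would be consistent with the default "most critical of child services" status propagation: the parent service is in a problem state whenever any child is, so its SLI tracks the worst-performing datacenter rather than a weighted average.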

The good news, though, is that if Datacenter A has a problem with health=red, the length of time that problem exists does seem to count against the SLA properly. That is a good thing and a decent tactic for examining an SLA.

The next thing we plan to implement, is a separation between two types of SLAs:

  • Availability (maybe we rename this health)
  • Performance

So a degradation in CPU Ready, for example, would impact the performance SLA but not the availability SLA. Similarly for read/write latency on a datastore.

I think in a clustered hypervisor environment, it is much more about performance than availability. Availability might consider the network, the ability to access storage, and whether the hypervisor is up or down. The problem is that we are monitoring individual hypervisors, and not the VMware clusters themselves, which are no longer presented as distinct monitorable objects in Zabbix 7.
 
But I think for next steps, we will concentrate more on resource usage, congestion, and performance than availability.

Optimizing for NUMA in a Virtualized Environment

Back in January, we had a platform tenant come to us requesting that the numa.vcpu.preferHT setting be set on all of their VMs. As we weren't completely familiar with this setting, it naturally concerned us, since this is a shared on-prem cloud platform that runs VNFs. This particular tenant had some extremely latency-sensitive workloads, so they were interested in reducing the latency and jitter that one of their customers had apparently complained about.

Diving into NUMA in a virtualized environment is not for lightweights. It requires you to understand SMP and the latest advances in NUMA architecture, and on top of that, how they map and apply to VMs that run operating systems like Linux on top of what is already a proprietary Unix-like hypervisor OS (VMware ESXi).

I found that this diagram has been extremely helpful in showing how NUMA works to platform tenants who don't understand NUMA or don't understand it well.


In this diagram, we have a theoretical dual-socket CPU system with 24 cores on each socket - 48 physical cores in total. Each socket is a physical NUMA node, so there is a NUMA node 0 and a NUMA node 1. It is not a "law" that a socket is equivalent to a physical NUMA node, but in this scenario that is indeed the case (and on many low- to mid-grade servers, depending on the chipset and architecture, you will usually see sockets equal to NUMA nodes).

Each NUMA node usually gets its own "memory real estate". To use a kitchen analogy, the idea is that processes (VMs) that access memory should be able to hit their local cache (i.e. kitchen table), but if they get a cache-miss, they can pull from their local memory allotment (i.e. run to the basement pantry). If this is not possible, there is an expensive "trip to the store" - remote memory - which takes longer to retrieve from. So you want, ideally, 100% of the memory requests to be local in a high performing environment.

If one were to go into the BIOS and enable hyperthreading, contrary to what people seem to think, this does NOT double the number of processor cores you have! What happens on each socket is that each physical core is split into two hardware threads, presenting 48 "logical" cores instead of 24 physical cores.
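
On an ESXi host you can sanity-check what the scheduler actually sees - packages, cores, threads, and whether hyperthreading is on - with the command below. The sample output is what you would expect for the dual-socket, 24-cores-per-socket box in this example, not output captured from a real system:

esxcli hardware cpu global get
#   CPU Packages: 2
#   CPU Cores: 48
#   CPU Threads: 96
#   Hyperthreading Active: true
#   Hyperthreading Supported: true
#   Hyperthreading Enabled: true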

Now, hyperthreading does give some advantages - especially for certain kinds and types of workloads. There is a high (and always growing) amount of non-blocking parallelism in SMT processing, so having 48 logical cores can certainly boost performance - although that is an in-depth topic not covered here. The important takeaway is that increasing from 24 physical cores to 48 logical cores is not a pure linear increase. There is considerably more overhead in managing 48 cores than 24, just for starters; then there is the context switching and everything else that factors in.

A virtual machine (that is NOT NUMA aware) provisioned with 52 vCPUs does NOT fit onto a single NUMA node (which has 24 cores, or 48 if HT is enabled). It may be provisioned by default with 1 core per socket, resulting in 52 sockets, which allows the hypervisor to "sprinkle them around" to available slots on all of the NUMA nodes at the expense of managing this. Or perhaps the provisioner steps in, overrides the default behavior, and specifies 52 cores on a single socket. Neither situation will contain the VM on a single NUMA node, for the simple reason that it does not fit: there are only 24 physical cores - 48 logical if hyperthreading is enabled, as it is in this particular example.

Now to elaborate on this example, there really aren't even 24 cores available, because the hypervisor OS is going to reserve cores for itself - perhaps two physical cores on each socket. So in reality, one may think they can fit a 48-core VM (people like to provision in powers of 2 in a binary world) onto a single NUMA node (in this example), only to discover after deployment that this isn't the case, because they in fact needed to provision 44 cores in order for the NUMA placement algorithm to "home" the VM on one NUMA home node (0) versus another adjacent one (1).
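
Spelled out for this example: 24 physical cores per socket, minus 2 reserved for the hypervisor, leaves 22 physical cores - or 44 logical cores with hyperthreading - which is why 44, not 48, is the largest vCPU count that still fits on a single NUMA node here.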

So "rightsizing" a VM, using 44 cores on a single socket will give some performance benefit in most cases, because the VM will be scheduled on one NUMA node or another in this example. If you provisioned 2 VMs, one might NUMA node 0 while the next one gets assigned NUMA node 1. But, they should stay put on that home node. UNLESS - more VMs are provisioned, and contention begins. In this case, the NUMA scheduler may decide to shift them from one NUMA node to another. When this happens, it is known as a "NUMA Migration".

As stated earlier, when a VM is on a specific NUMA node, it is best to have its memory localized. But it IS possible to provision a 44-core VM that sits on a NUMA home node (i.e. 0) while its memory is only 90% localized instead of fully localized. In this case, the VM might have more memory provisioned than it should to be properly optimized. This is why it is super important to ensure that the memory is actually being utilized - and not just reserved! And frankly, the same goes for the vCPU resources. In general, making a VM "NUMA aware" by sizing it to fit on a NUMA node will cut down on migrations and, in most cases (but not all), improve performance.

VMware has a "Rightsizing" report in Aria Operations, that figures this out - and exposes VMs that were over-provisioned with memory and cores that are not being used.  This is a big problem for a shared environment because the net effect of having scores of VMs on a hypervisor that are underutilized, is that the CPU Ready percentage starts to go through the roof. You will look at the stats, and see low CPU utilization and low memory utilization, and think "everything looks under control". But then you will look at the latency on datastores (read and write - write especially), or the CPU Ready percentage (which is showing how much time a process waits for CPU runtime - or, time on the die), and see that they are unsatisfactory to downright abysmal. This is why the Rightsizing report exists - because instead of buying more hardware, there does exist the option to downsize (rightsize) the virtual machines without making additional capital investments in hardware servers.

The last topic to discuss is NUMA node affinity. VMs can set a NUMA node affinity in Advanced Properties. This is a very dangerous setting, because with it set, a VM literally stays put - potentially at the expense of everything else in a shared infrastructure - since the NUMA load balancer can't touch it or move it.
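
For reference, the advanced option being described is numa.nodeAffinity, which takes a comma-separated list of node numbers. In a .vmx file it looks like the line below - shown only so you know what to look for, not as a recommendation, and worth verifying against the VMware documentation for your ESXi version:

numa.nodeAffinity = "0"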

As I understand it, the vCPU "Hot Add" feature also nullifies what the NUMA load balancer is trying to do, because the elasticity of vCPUs creates an "all bets are off" situation that it simply can't manage effectively. In this day and age of automation and auto-scaling, all of these auto-scaling engines that want to monitor CPU and add it "hot" (on the fly - without shutting a VM down and restarting it) look super cool. But they can result in severe performance degradation.

Lastly, the numa.vcpu.preferHT setting: what this does is align the virtual machine's vCPU topology with the underlying physical topology. This gets into the topic of VPD (Virtual Proximity Domain) and PPD (Physical Proximity Domain). With preferHT, the NUMA scheduler counts hyperthreads when placing the VM, preferring to keep it within a single NUMA node (prioritizing memory locality) rather than spreading it across more physical cores on multiple nodes.

If a VM is cache/memory intensive, memory locality is important - say you are running a VM with a TimesTen in-memory database. But if it is a packet-pumping, CPU-intensive VM that doesn't need to do a lot of memory access (reads and writes), the computational boost of hyperthreading might give it more of an advantage.

Friday, February 7, 2025

Pinephone Pro (with Tow-Boot) - Installing a new OS on the eMMC

In my previous Pinephone Pro post, I described how I was coming up to speed on the different storage mechanisms on the Pinephone Pro: SPI vs eMMC vs microSD.

Contextually, we are talking about block storage, and there is a well-known command one can run to see block storage on a Linux device: lsblk. Running this command on your Pinephone Pro - in a terminal - can help you understand "what is what and where", and it's important to understand this.
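
A slightly more readable invocation than plain lsblk is the one below; the column list is just a convenience, and the device names you see will differ depending on the phone revision and the OS image:

lsblk -o NAME,SIZE,TYPE,MOUNTPOINT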

One concern I had was installing a new OS to the eMMC and blowing away the boot process. I had installed Tow-Boot on the phone, but had to make sure it was in its own spot (it was - on the SPI), away from where a new OS was going to go - especially if you plan to clean or format the eMMC before installing a new OS. My previous post discusses how I figured all of this out and learned that Tow-Boot was installed on the SPI, making it safe to install a new OS.

Here was my process for installing this new OS, with comments (a consolidated command sketch follows the list):

  1. Download the image
    • Figure out what display manager you want. 
      • Phosh? Plasma? Gnome? Xfce? There is no shortage of choices.
      • I went with Plasma - because it worked well when I ran the OS on the microSD
    •  I went with 20250206
      • Check the md5 hash - it is always wise to verify the integrity of the image.
      • Unpack/Uncompress the "xz" file.
        • NOTE: On the Mac Pro, xz did not work, and neither did tar xvfz. Using the GUI and clicking the file in the file manager invoked the Archive Utility to unpack the file. In Linux, xz or tar should work fine.
  2. Make sure your Pinephone Pro is sufficiently charged. 
    • At least 50%, preferably 75%, and optimally 90% or greater.
    • I should note that with Manjaro, the power charge percentage was not always accurate.
  3. Power off your Pinephone Pro.
  4. Connect the Pinephone Pro using USB-C connector, into a USB-C connector of your laptop.
  5. Power the phone up, and after first vibration, hit the Volume-Up button
    • You are looking for a blue light on your Pinephone Pro, signifying that you are in USB Mode.
  6. Make sure the laptop/computer sees the Pinephone Pro as a device.
    • In my case, on a Mac Pro, I used File Manager.
    • Examine the /dev devices - and this is IMPORTANT! - because if you install your OS onto your laptop's partition, you have a big, big problem.
      • I quickly noticed that /dev/disk4 had the BOOT_MJRO volume name on it, confirming that disk4 was the disk I wanted to install the new OS to.
  7. Unmount the disk
    • because you cannot format or do an image copy on a disk that's already mounted. 
    • on a Mac Pro, diskutil was used for this: sudo diskutil unmountDisk /dev/disk4 
  8. Clean the partition
    • sudo dd if=/dev/zero of=/dev/disk4 bs=1M count=100
  9. Copy the image to the eMMC 
    • Tools like Balena Etcher can be used for this.
    • The "dd" tool is a Linux tried-true way to do this and this is what I chose:
      • sudo dd if=20250206-0046-postmarketOS-v24.12-plasma-mobile-5-pine64-pinephonepro.img of=/dev/disk4 bs=1M status=progress
  10. Watch the progress, and once finished, eject the phone
    • sudo diskutil eject /dev/disk4
  11. Power the Pinephone Pro down
  12. Unplug the USB-C connector that is connected between Pinephone Pro and the laptop/computer.
  13. Power the Pinephone Pro back up.
    • You will see a terminal show up on the screen - don't mess with it - just wait.
    • Eventually the Plasma Display Manager will (or should) light up.
    • The OS will take some minutes to initialize, and to be responsive to user input.
  14. Log into the phone
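
Pulling the disk-related steps above together, the terminal side of it on the Mac looked roughly like the following. The disk number is the one from this walk-through, so verify yours with diskutil list before running anything destructive:

# Identify the target disk (look for the BOOT_MJRO volume)
diskutil list
# Unmount every volume on the target disk
sudo diskutil unmountDisk /dev/disk4
# Wipe the first 100 MB to clear the old partition table
sudo dd if=/dev/zero of=/dev/disk4 bs=1M count=100
# Write the unpacked image to the eMMC
sudo dd if=20250206-0046-postmarketOS-v24.12-plasma-mobile-5-pine64-pinephonepro.img of=/dev/disk4 bs=1M status=progress
# Eject before unplugging the phone
sudo diskutil eject /dev/disk4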

Wednesday, January 29, 2025

Pinephone Pro - Booting an OS off SPI vs eMMC

I finally got a chance to pick the Pinephone Pro back up and play with it some more.

I was able to charge up the battery, boot the phone, and verify that Tow-Boot was installed on it properly. That was my first step. I believe I verified this by holding the volume-down button and waiting for the light to turn aqua (note: it may have been volume up - I should check this for correctness).

Next, I rebooted the phone,  and it booted into the Manjaro OS which is installed on the eMMC drive of the phone.

Next, I put the PostMarketOS microSD card into the card slot and booted the phone. Apparently Tow-Boot uses the following boot order:

  1. SPI - more on this in a bit, I had to learn what this is
  2. microSD Card
  3. eMMC (which has Manjaro on it)

I didn't get a Boot Menu - but maybe a key sequence (volume up?) would give me such a menu. It booted straight into the PostMarket OS. 

I proceeded to experiment with PostMarket OS, and did a complete update of all of the packages on it.

Next, I wondered how I could "replace" the default Manjaro with PostMarket OS (which was newer), such that it would boot PostMarket OS from the eMMC, allowing me to recycle the microSD card for perhaps another OS distribution I could take a look at later.

It turns out, that there is a PostMarketOS "on-disk installer".  It is called pmbootstrap.

THIS is where I had to learn about SPI, because there is a warning about overwriting your Tow-Boot installation if Tow-Boot was not installed on SPI.

so...what is SPI? (more search required)

SPI Flash is a type of non-volatile memory that uses the Serial Peripheral Interface (SPI) protocol for communication. It is commonly used in embedded systems for data storage and transfer, allowing devices to retain information even when powered off. 

Apparently it is a newer (or improved, perhaps) concept, found on phones with System-On-A-Chip (SOC) architectures. 

so...how do you know if you even have SPI?

Answer: I had to figure out which version of Pinephone Pro I have. 

I finally learned that there is a Developer Edition of the Pinephone Pro, and there is an Explorer Edition. The Explorer Edition supposedly has the SPI.

But what confused me is that it said the phone supporting SPI had the Rockchip RK3399S SoC - and when I went into the terminal on the phone and ran "lscpu", it said I had an ARM Cortex-A53 chip.

so...now I am thoroughly confused.

Well, I finally learned, that the Rockchip RK3399S SoC combines four Cortex-A53 cores with two Cortex-A72 cores.

hmmm, I did not see the 72 in the lscpu command I ran - but, it does look like I have the SPI.

but, how do I know that Tow-Boot was installed on the SPI, versus the eMMC? Because if I have this wrong, I can't boot an OS as there would be no bootloader partition.

I think the SPI is the mmcblk1 device, and /boot is on the mmcblk1p1 partition of that device.

The previous Manjaro installation is definitely on the eMMC, which is the mmcblk2 device; it has two partitions on it, one of them being the root filesystem.

Sunday, January 19, 2025

NUMA PreferHT VM setting on a Hyperthread-Enabled ESXi Hypervisor

This could be a long post, because things like NUMA can get complicated.

For background, we are running servers - hypervisors - that have 24 cores. There are two chips - wafers as I like to refer to them - each with 12 cores, giving a total of 24 physical cores.

When you enable hyperthreading, you get 48 cores, and this is what is presented to the operating system and cpu scheduler (somewhat - more on this later).  But - you don't get an effective doubling of cores when you enable hyperthreading. What is really happening, is that the 24 cores are "cut in half" so that another 24 cores can be "fit in", giving you 48 logical cores.  

Worth mentioning also is that each (now halved) core has a "sibling" - and this matters from a scheduling perspective when you see things like CPU pinning used, because if you pin something to a specific core, then that "sibling" cannot be used for something else. For example, with hyperthreading enabled, the cores would look like:

0 | 1

2 | 3

4 | 5

... and so on. So if someone pinned to core 4, core 5 is also "off the table" now from a scheduling perspective because pinning is a physical core concept, not a logical core concept.

So with this background, we had a tenant who wanted to enable a "preferHT" setting. This setting can be applied to an entire hypervisor by setting numa.PreferHT=1, affecting all VMs deployed on it.

Or, one can selectively add this setting to a particular or specific virtual machine by going into the Advanced Settings and configuring numa.vcpu.preferHT=TRUE.  

In our case, it was the VM setting being requested - not the hypervisor setting. Now, this tenant is the "anchor tenant" on the platform, and their workloads are very latency-sensitive, so it was important to jump through this hoop when it was requested. First, we tested the setting by powering a VM off, adding the setting, and powering the VM back on. No problems with this. We then migrated the VM to another hypervisor and had no issues with that either. Aside from that, though, how do you know that the VM setting "took" - meaning that it was picked up and recognized?

It turns out, that there are a couple of ways to do this:

1. esxtop

When you load esxtop, it shows CPU by default. If you hit the "m" key, it goes into the memory view; if you then hit the "f" key, a list of fields shows up. One of them is NUMA Statistics. By selecting this, you get a ton of interesting information about NUMA. The fields you are most interested in are going to be:

NHN - Current home node for the virtual machine or resource pool - in our case, this was 0 or 1 (we had two numa nodes, as there is usually one per physical cpu socket).

NMIG - Number of NUMA migrations between two snapshot samples

NRMEM - (NUMA Remote Memory): Amount of remote memory allocated to the virtual machine, in MB

NLMEM (NUMA Local Memory) - Amount of local memory allocated to the virtual machine, in MB

N%L - this shows the percentage of the VM's memory that is local. You want this number to be 100%, but seeing it in the 90s is probably okay also, because it is showing that most memory access is not traversing a NUMA bus, which adds latency.

GST_NDx (Guest Node x): Guest memory being allocated for the VM on NUMA node x, where x is the node number

MEMSZ (Memory Size): Total amount of physical memory allocated to a virtual machine
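
If you want to capture these NUMA counters over time rather than watching them interactively, esxtop also has a batch mode; a sketch (the delay and iteration count are arbitrary):

# 10 samples, 5 seconds apart, written as CSV for offline analysis
esxtop -b -d 5 -n 10 > esxtop-numa.csv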

2. vmdumper command

I found this command on a blog post - which I will list in my sources at the end of this post. This useful command can show you a lot of interesting information about how NUMA is working "under the hood" (in practice). It can show you a logical processor to NUMA node map, how many home nodes are utilized for a given VM, and the assignment of NUMA clients to their respective NUMA nodes.

One of the examples covered in that blog post refers to the situation where a VM has 12 vCPUs on a 10-core system, and then shows what it would look like if the VM had 10 vCPUs instead.
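
I won't reproduce the full one-liner from that post here, but the gist is to use vmdumper -l to enumerate the running VMs and then pull the NUMA-related lines out of each VM's vmware.log. Something along these lines - the datastore path is a placeholder and the exact log strings vary by ESXi version:

# List running VMs (world id, name, vmx path)
vmdumper -l
# For a given VM, see what the NUMA scheduler decided at power-on
grep -i numa /vmfs/volumes/datastore1/myvm/vmware.log | less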


Sources:

http://www.staroceans.org/ESXi_VMkernel_NUMA_Constructs.htm

https://frankdenneman.nl/2010/02/03/sizing-vms-and-numa-nodes/

https://frankdenneman.nl/2010/10/07/numa-hyperthreading-and-numa-preferht/

https://docs.pexip.com/server_design/vmware_numa_affinity.htm

https://docs.pexip.com/server_design/numa_best_practices.htm#hyperthreading

https://knowledge.broadcom.com/external/article?legacyId=2003582


 

Wednesday, January 8, 2025

MySQL Max Allowed Packet

I recently conducted an upgrade, and for the life of me I couldn't figure out why the application wouldn't initialize.

I checked MySQL - it seemed to be running fine. I logged into the database, checked the Percona cluster status, it looked fine.

I checked RabbitMQ, and it also seemed to be running fine.

In checking the application logs, I saw an exception about a query and the packet size being too big, and I thought this was strange - mainly because of the huge size of the packet.

Sure enough, after calling support, I was informed that I needed to change the MySQL configuration in my.cnf and add a directive in the [mysqld] section.

max_allowed_packet=128M

In terms of what this value should 'really' be, I was told that this is a normal setting on most installations.

Who knew? It's unusual to be adding new parameters on the fly like this to a clustered database. 
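
For what it's worth, the value can also be checked - and raised for new connections - at runtime without editing my.cnf; just remember the runtime change does not survive a restart, so the config file change is still needed:

# Show the current value, in bytes
mysql -e "SHOW VARIABLES LIKE 'max_allowed_packet';"
# 128M expressed in bytes (128 * 1024 * 1024); applies to new connections only
mysql -e "SET GLOBAL max_allowed_packet = 134217728;"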

But, sure enough, after restarting the database (well, the whole VM actually because I had done updates), it came up just fine.

Monday, November 18, 2024

Cisco UCS M5 Server Monitoring with Zabbix

I got a request from my manager recently, about using Zabbix to monitor Cisco servers.  

Specifically, someone had asked about whether it was possible to monitor the CRC errors on an adaptor.

Right now, the monitoring we are doing is coming from the operating systems and not at the hardware level. But we do use Zabbix to monitor vCenter resources (hypervisors) using the VMware templates, and we use Zabbix to "target monitor" certain virtual machines at the Linux OS level (Linux template) and at Layer 7 (app-specific templates).

Up to this point, our Zabbix monitoring has been, essentially, "load and forget" where we load the template, point Zabbix to a media webhook (i.e. Slack) and just monitor what comes in. We haven't really even done much extension of the templates, using everything "out of the box". Recently, we did add some new triggers on VMware monitoring, for CPU and Memory usage thresholds. We were considering adding some for CPU Ready as well.

But...this ask was to monitor Cisco servers, with our Zabbix monitoring system.

The first thing I did, was to check and see what templates for Cisco came "out of the box". I found two:

  1. Cisco UCS by SNMP
  2. Cisco UCS Manager by SNMP

I - incorrectly - assumed that #2, the Cisco UCS Manager by SNMP, was a template to interface with a Cisco UCS Manager. I learned a bit later, that it is actually a template to let Zabbix "be" or "emulate" a Cisco UCS Manager (as an alternative or replacement). 

First, I loaded the Cisco UCS by SNMP template. The template worked fine from what I could tell, but it didn't have any "network" related items (i.e. network adaptors).

As mentioned, after reading that Cisco UCS Manager was an extension or superset of Cisco UCS by SNMP, I went ahead and loaded that template on some selected hosts. We were pleased to start getting data flowing in from those hosts, and this time the template's items did include adaptor metrics - but only very basic ones, such as those shown below.

Adaptor/Ethernet metrics in Cisco UCS Manager Template

This was great. But we needed some esoteric statistics, such as crc errors on an adaptor. How do we find these? Are they available?

Well, it turns out that they indeed are available... in a MIB called CISCO-UNIFIED-COMPUTING-ADAPTOR-MIB.

Unfortunately, this MIB is not included in the CISCO-UCS-Manager template. So what to do now? Well, there are a couple of strategies...

  1. Add a new Discovery Rule to the (cloned) Cisco UCS Manager template. 
  2. Create a new template for the adaptor MIB using a tool called mib2zabbix.

I tried #1 first, but ran into issues because the discovery rule needed an LLD macro and I wasn't sure how, syntactically, to create the Discovery Rule properly. My attempts failed to produce any results when I tested the rule.
 
I then pursued #2, which led me down an interesting road. First, the mib2zabbix tool requires the net-snmp package to be installed. And on CentOS, this package alone will not suffice - you also have to install the net-snmp-utils package to get the utilities, like snmptranslate, that you need.

The first time I ran mib2zabbix, it produced a template that I "knew" was not correct - I didn't see any of the crc objects in the template at all. I did some additional research and found that for mib2zabbix to work correctly, there has to be a correct "MIB search path".

To create the search path, you create a ".snmp" folder in your home directory, and in that directory you create an snmp.conf file. This file looked as follows for me to be able to run snmptranslate and mib2zabbix "properly":
 
mibdirs +/usr/share/snmp/mibs/cisco/v2
mibdirs +/usr/share/snmp/mibs/cisco/ucs-C-Series-mibs
mibdirs +/usr/share/snmp/mibs/cisco/ucs-mibs
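
With the search path in place, it is worth confirming that the adaptor MIB actually resolves before generating a template. The snmptranslate call below should dump the MIB's object tree; the mib2zabbix options and the .1.3.6.1.4.1.9.9.719 root OID (the Cisco Unified Computing tree) are from memory, so verify them against the tool's --help and the snmptranslate output:

# Should print the adaptor MIB object tree if the search path is correct
snmptranslate -m CISCO-UNIFIED-COMPUTING-ADAPTOR-MIB -Tp | less
# Then generate a draft template rooted at the UCS enterprise OID (option names assumed - check --help)
mib2zabbix -o .1.3.6.1.4.1.9.9.719 -f cisco-ucs-adaptor-template.xml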

