Monday, November 18, 2024

Cisco UCS M5 Server Monitoring with Zabbix

I got a request from my manager recently about using Zabbix to monitor Cisco servers.

Specifically, someone had asked whether it was possible to monitor the CRC errors on an adaptor.

Right now, the monitoring we do comes from the operating systems and not from the hardware level. But we do use Zabbix to monitor vCenter resources (hypervisors) using the VMware templates, and we use Zabbix to "target monitor" certain virtual machines at the Linux OS level (Linux template) and at Layer 7 (app-specific templates).

Up to this point, our Zabbix monitoring has been, essentially, "load and forget": we load the template, point Zabbix to a media webhook (e.g. Slack), and just monitor what comes in. We haven't really done much extension of the templates, using everything "out of the box". Recently, we did add some new triggers on VMware monitoring for CPU and memory usage thresholds, and we were considering adding some for CPU Ready as well.

But...this ask was to monitor Cisco servers with our Zabbix monitoring system.

The first thing I did was to check which templates for Cisco came "out of the box". I found two:

  1. Cisco UCS by SNMP
  2. Cisco UCS Manager by SNMP

I - incorrectly - assumed that #2, Cisco UCS Manager by SNMP, was a template to interface with a Cisco UCS Manager. I learned a bit later that it is actually a template that lets Zabbix "be" or "emulate" a Cisco UCS Manager (as an alternative or replacement).

First, I loaded the Cisco UCS by SNMP template. The template worked fine from what I could tell, but it didn't have any "network" related items (i.e. network adaptors).

After reading that Cisco UCS Manager by SNMP was an extension (a superset) of Cisco UCS by SNMP, I went ahead and loaded that template on some selected hosts. We were pleased to see data flowing in from those hosts, and this time the template's items did include adaptor metrics - but only very basic ones, such as those shown below.

Adaptor/Ethernet metrics in Cisco UCS Manager Template

This was great. But we needed some more esoteric statistics, such as CRC errors on an adaptor. How do we find these? Are they available?

Well, it turns out that they indeed are available... in a MIB called CISCO-UNIFIED-COMPUTING-ADAPTOR-MIB.

Unfortunately, this MIB is not included in the Cisco UCS Manager template. So what to do now? Well, there are a couple of strategies...

  1. Add a new Discovery Rule to the (cloned) Cisco UCS Manager template.
  2. Create a new template for the adaptor MIB, using a tool called mib2zabbix.

I tried #1 first, but ran into issues: the Discovery Rule needed an LLD macro, and I wasn't sure how, syntactically, to create the rule properly. My attempts failed to produce any results when I tested the rule.
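For reference, a Zabbix SNMP Discovery Rule key takes this general shape - the {#SNMPVALUE} macro name is standard Zabbix syntax, but the OID here is just a placeholder for one of the adaptor table columns:

discovery[{#SNMPVALUE},1.3.6.1.4.1.9.9.719.x.x.x]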
 
I then pursued #2, which led me down an interesting road. First, the mib2zabbix tool requires the net-snmp package to be installed. And on CentOS, this package alone will not work - you also have to install the net-snmp-utils package to get the utilities, like snmptranslate, that you need.
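On CentOS, that means installing both packages (stock package names assumed):

sudo yum install -y net-snmp net-snmp-utils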

The first time I ran mib2zabbix, it produced a template that I "knew" was not correct - I didn't see any of the CRC objects in the template at all. I did some additional research and found that for mib2zabbix to work correctly, there has to be a correct "MIB search path".

To create the search path, you create a ".snmp" folder in your home directory, and in that folder you create an snmp.conf file. Mine looked as follows, which let me run snmptranslate and mib2zabbix "properly":
 
mibdirs +/usr/share/snmp/mibs/cisco/v2
mibdirs +/usr/share/snmp/mibs/cisco/ucs-C-Series-mibs
mibdirs +/usr/share/snmp/mibs/cisco/ucs-mibs
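With the search path in place, you can sanity-check that the adaptor MIB actually loads, and then generate a template from it. A rough sketch - the root OID 1.3.6.1.4.1.9.9.719 is the Cisco Unified Computing tree, and the mib2zabbix flags may vary by version, so check them against your copy of the tool:

# the printed tree should now include the CRC counter objects
snmptranslate -m CISCO-UNIFIED-COMPUTING-ADAPTOR-MIB -Tp | grep -i crc

# generate a Zabbix template rooted at the Cisco UCS tree
perl mib2zabbix.pl -o .1.3.6.1.4.1.9.9.719 -N 'Cisco UCS Adaptor' -f cisco-ucs-adaptor.xml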


Thursday, November 7, 2024

Zabbix to BigPanda Webhook Integration

Background
BigPanda has made its way into the organization. I wasn't sure at first why, given that there's no shortage of Network Monitoring OSS / EMS systems in play. 

Many vendors use their own EMS. VMware, for example, uses VROPS (vRealize Operations - now known as Aria Operations). So there is, and has been, a use case for consolidating the information from these disparate monitoring systems into a "Northbound" system.

So that's what BigPanda is, I guess. It was pitched as a Northbound system. It does not seem to be very mature, but it is simpler to use than most of them (based on limited inspection and reading). The business case pitch is that it has an Artificial Intelligence rules engine that provides superior correlation; if this is true, it could certainly make BigPanda a northbound system worthy of consideration.

So - that is why we stepped in to integrate Zabbix with BigPanda. We already have VROPS as our "authoritative" monitoring system for all things VMware. Our team uses VROPS but does not own or manage that platform (another group does). I believe they use it to monitor the vCenters, the hypervisors, and datastores. I don't think they're using it to monitor tenant workloads (virtual machines running on the hypervisors).

Our Zabbix platform, which we manage ourselves, is a "second layer of monitoring" behind VROPS. It monitors only the VMware hypervisors, along with some specific virtual machines we run (load balancers, cloud management platform VMs, et al). The BigPanda team wanted to showcase the ability to correlate information from Zabbix and VROPS, so we volunteered to integrate the two systems.

Note: It is critical when integrating that these integration steps be done in precisely this order!!!

Integration Steps

Setting up the Media Type

First, you need to "create" a Media Type - and this means importing one, not creating one. There are two buttons when you click Media Type: "Create" and "Import". Because the Media Type has already been crafted, we will use "Import". The BigPanda Media Type, which is classified as a Webhook media type, is available for download as a json file at the following link: https://docs.bigpanda.io/docs/zabbix

When you import this webhook media type, you have the option to "Update Existing" or "Create New". The first time, of course, requires "Create New"; any subsequent updates to the webhook would use the "Update Existing" button.

After the media type has been imported, everything will auto-populate. The Media Type tab will have a name (BigPanda in this case), a Type (Webhook), and a set of parameters. Most of these can be left alone, but four of them will need to be changed from macros to literal values (literal values are recommended for initial testing): BP_app_key, BP_endpoint, BP_token - and the Zabbix URL (which is at the bottom, out of view in the screenshot example below).
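For initial testing, those four parameters end up looking something like this (placeholders - take the real values from your BigPanda integration page; the Zabbix URL here is hypothetical):

BP_app_key  = <app key from BigPanda>
BP_endpoint = <alerts endpoint URL from BigPanda>
BP_token    = <API token from BigPanda>
zabbix_url  = https://zabbix.example.com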

Big Panda Media Type Screenshot Example


Setting Up the User Group

Next, you will create a User Group. The main reason for creating a (new) BigPanda user group is that you can restrict which Hosts BigPanda has access to. If you wanted to allow BigPanda free roam over all monitored hosts, you could probably use one of the existing user groups. We wanted BigPanda to receive alerts only for specific hosts (hypervisors, test VMs, etc), so this was the justification for creating a new and separate BigPanda group. In the Host Permissions, we give this new user group Read access to those host groups.

Below is an example of what this group looks like.

Now, one thing worth looking at in this example is that the newly created User Group has Debug disabled. There is a separate Debug Enabled group which does have Debug enabled, and any users we want to debug can simply be slipped into that group. There will be more on debugging later. Another thing worth mentioning is that we did NOT enable Frontend access for this user group. This is an outbound integration, and we don't expect a BigPanda user / group to be logging into the UI.

Setting Up the User

Next, we create the User. Users need to have a Media Type and are placed in User Groups, which is why the Media Type and User Group were created BEFORE the user. Below is an example of how the user is defined:

Notice that the user is mapped into the bigpandaservice User Group that we created in the previous step.

Now, after we establish the user fields, it is critically important to attach the User to the Media Type. Without this mapping, alerts from Zabbix WILL NOT be sent!!!


After the Update button is hit, it is wise to verify and double-verify that the Media Type sticks - in our case it did not, and we had to remove the user and re-create it for some reason.

The final step in configuration is to create a Trigger Action on the Media Type. This is how that looks:


Next, you can click on Media Type and select the "Test" button next to BigPanda. If you leave the umpteen fields as macros and fill in just the four we configured in the Media Type (endpoint, app key, token and Zabbix URL), the Test button "should" produce a 201 result - though you may get a json parse error because no actual data was sent. This is okay.

If you get the 201, BigPanda should receive the test alert. But this does not mean that the trigger is firing!!! The step to take after the Media Type "Test" button is to generate an alert condition on one of the hosts that the BigPanda host group has access to, and make sure that BigPanda receives it!

Debugging & Troubleshooting

Troubleshooting requires making sure that all of these configuration steps were done properly. This Webhook integration is all about mappings: users to user groups, users to media types, trigger definitions, correct host groups, etc.

When it comes to debugging, the debugging for a Webhook occurs within the Webhook!!!

The BigPanda Webhook is the json file you imported - if you click on the Webhook, you can see this json. In the screenshot below, notice the field called "script"...


If you click the "pencil" icon to the right, it opens up the entire webhook source code, which in this case is written in JavaScript.

Now, you will notice that the BigPanda Webhook sends messages to the Zabbix log at Level 4. The problem is, most people shouldn't be running Level 4 in their Zabbix logging (the DebugLevel setting in the zabbix_server.conf file). It is too voluminous, and it makes debugging impossible if you are watching or tailing the log looking for webhook-specific messages.

What I did for testing and debugging was to use a level that lets me see the Webhook information without combing through the mountain of debug output you would normally see at Level 4 (Debug level). You will see in the screenshot below that I commented out the "level 4" calls and replaced them with "level 2" - temporarily, of course, until I could make sure that the Webhook was working properly. There are more lines in this code that I made these kinds of changes to, but the screenshot gives you an example of how it's done.
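With the webhook logging at Level 2 and the server left at the default DebugLevel of 3, you can then watch just the webhook traffic (log path assumed to be the stock location):

tail -f /var/log/zabbix/zabbix_server.log | grep -i bigpanda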

So hopefully that helps anyone wanting to get the BigPanda Webhook working in Zabbix - and for that matter, these steps should be helpful for any Webhook integration (e.g. Slack, Discord, et al).

Wednesday, September 18, 2024

Fixing Clustering and Disk Issues on an N+1 Morpheus CMP Cluster

I had performed an upgrade on Morpheus which I thought was fairly successful. I had some issues doing this upgrade on CentOS 7 because it was designated EOL and the repositories were archived, but I worked through that and it seemed everyone was using the system just fine.

Today, however, someone contacted me to say that they had provisioned a virtual machine, but it was stuck in an incomplete "Provisioning" state (a state that has a blue icon with a rocketship in it). The VM was provisioned on vCenter and working, but the state in Morpheus never changed to "Finalized".

I couldn't figure this out, so I went to the Morpheus help site - where I discovered that I myself had logged a ticket on this same issue quite a while back. It turned out that the reason the state never flipped in that case was that the clustering wasn't working properly.

So I checked RabbitMQ. It looked fine.

I checked MySQL and Percona, and suspected that perhaps the clustering wasn't working properly. In the process of restarting the VMs, one of the virtual machines wouldn't start. I had to do a bunch of advanced Percona troubleshooting to figure out that I needed to do a wsrep recover commit before I could start the system and have it properly join the cluster.

The NEXT problem was that Zabbix was screeching about these Morpheus VMs using too much disk space. It turned out that the /var file system was 100% full - because of ElasticSearch. Fortunately, I had an oversized /home directory and was able to rsync the elasticsearch directory over to /home and re-link it.
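The relocation was roughly as follows (a sketch - the elasticsearch service name and the /var/lib/elasticsearch path are assumptions, so confirm both on your own system before moving anything):

systemctl stop elasticsearch
rsync -a /var/lib/elasticsearch/ /home/elasticsearch/
mv /var/lib/elasticsearch /var/lib/elasticsearch.old
ln -s /home/elasticsearch /var/lib/elasticsearch
systemctl start elasticsearch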

But this gets to the topic of system administration with respect to disks.

First, let's start with some KEY commands you MUST know:

>df -Th 

This command (disk free = df) shows how much space is used, in human-readable format, along with the mountpoint and file system type. It tells you NOTHING about the physical disks, though!

>lsblk -f

This command (list block devices) gives you the physical disk, the mountpoint, the UUID and any labels. It is a device-specific command and doesn't show you space consumption.

>fdisk -l

I don't really like this command much because of its output formatting, but it does list disk partitions and related statistics.

Some other commands you can use are:

>sudo file -sL /dev/sda3

The -s flag enables reading of block or character files, and -L enables following of symlinks.

>blkid /dev/sda3

This is a similar command to lsblk -f above.

When a Percona Cluster Node Stops Working

I had a horrible problem where a Percona node (node 2 of 3) went down and wouldn't start.

I finally ran a command: 

> mysqld_safe --wsrep-recover --tc-heuristic-recover=ROLLBACK

This didn't work, so I had to run a journalctl -xe command to find out that the startup output for Percona actually lands in a temporary file: /var/lib/mysql/wsrep_recovery.xxxxx

From this, I could see pending transactions. Transactions either need to be committed or rolled back.

The rollback didn't work, so I tried the commit, which DID work:

> mysqld_safe --wsrep-recover --tc-heuristic-recover=COMMIT

Now, you can also edit your /etc/my.cnf file and put this option in that file, in this format (just remember to remove it once the node has recovered, so heuristic recovery doesn't run on every startup):

[mysqld]
tc-heuristic-recover = COMMIT

So after running the commit, which seemed to run fine, I went ahead and attempted to start the mysql service again: 

> systemctl start mysql

Fortunately, it came up!

Now - a quick way to check and make sure your Percona node is working properly is to log into mysql and run the following query:

mysql> show status like 'wsrep%';

Below are the variables that I tend to look for:

| wsrep_cluster_conf_id    | 56                                   |
| wsrep_cluster_size       | 3                                    |
| wsrep_cluster_state_uuid | f523290f-9336-11eb-be5b-d6f9514c9c3c |
| wsrep_cluster_status     | Primary                              |
| wsrep_connected          | ON                                   |
| wsrep_local_bf_aborts    | 0                                    |
| wsrep_local_index        | 2                                    |
| wsrep_ready              | ON                                   |

The wsrep_cluster_conf_id should be the same on all of your cluster nodes!
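A quick way to compare that value across nodes from a shell (the hostnames here are hypothetical):

for h in db1 db2 db3; do ssh $h "mysql -N -e \"show status like 'wsrep_cluster_conf_id';\""; done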

Monday, September 16, 2024

Recovering a Corrupted RPM Database

I got this scary error when trying to run an upgrade on a cloud management system.

Here is what caused it:

1. The OS was CentOS 7.

2. The repositories for CentOS  7 were removed because CentOS 7 was End of Life (EOL). 

The repos were moved to an archive; I describe how to update a CentOS 7 OS using the archived repos in a previous post.

3. The upgrade was running Chef scripts that in turn were making yum update calls.

What effectively happened was that the rpm database got corrupted, and we were getting the error: DB_RUNRECOVERY: Fatal error, run database recovery.
 

Sounds frightening. The rpm database is where all of the package information is stored on a Linux operating system. Without this database intact, you cannot update or install anything, really. And numerous things will invoke dnf, or yum, or some other package manager, which triggers it to check the integrity of this database.

As it turns out, a post I found saved the day. Apparently rebuilding the rpm database is simple.

From this link, to give credit where credit is due: rebuilding the rpm database

$ mv /var/lib/rpm/__db* /tmp/
$ rpm --rebuilddb
$ yum clean all
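A quick sanity check afterwards - a simple query should now complete without the DB_RUNRECOVERY error:

$ rpm -qa | wc -l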

Tuesday, September 10, 2024

Updating CentOS 7 After EOL

I found a site that showed how you could update CentOS 7 after Red Hat shut down all of the repositories for it when it was classified End of Life.

I thought I would post on how to do this, lest I lose track of that link or it gets taken down.

The link is at https://gcore.de/en/help/linux/centos7-new-repo-url-after-eol.php

Basically the process is as follows:

1. Backup the CentOS-* repositories.

2. Backup the existing epel.repo
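Steps 1 and 2 are just file moves, roughly like this (assuming the stock /etc/yum.repos.d location):

cd /etc/yum.repos.d
mkdir -p backup
mv CentOS-*.repo backup/
cp epel.repo backup/epel.repo.bak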

3. Make a new CentOS.repo repository file, with the following:

[base]
name=CentOS-7.9.2009 - Base
baseurl=https://vault.centos.org/7.9.2009/os/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
enabled=1
metadata_expire=never

#released updates
[updates]
name=CentOS-7.9.2009 - Updates
baseurl=https://vault.centos.org/7.9.2009/updates/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
enabled=1
metadata_expire=never

# additional packages that may be useful
[extras]
name=CentOS-7.9.2009 - Extras
baseurl=https://vault.centos.org/7.9.2009/extras/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
enabled=1
metadata_expire=never

# additional packages that extend functionality of existing packages
[centosplus]
name=CentOS-7.9.2009 - CentOSPlus
baseurl=https://vault.centos.org/7.9.2009/centosplus/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
enabled=0
metadata_expire=never

#fasttrack - packages by Centos Users
[fasttrack]
name=CentOS-7.9.2009 - Contrib
baseurl=https://vault.centos.org/7.9.2009/fasttrack/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
enabled=0
metadata_expire=never

NOTE: I had to change the repo URLs from http to https.
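A quick way to confirm the vault URLs are reachable before running yum (x86_64 assumed for $basearch):

curl -I https://vault.centos.org/7.9.2009/os/x86_64/repodata/repomd.xml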

4. Make a new epel.repo repository file, with the following:

[epel]
name=Extra Packages for Enterprise Linux 7 - $basearch
baseurl=https://archives.fedoraproject.org/pub/archive/epel/7/$basearch
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
metadata_expire=never

[epel-debuginfo]
name=Extra Packages for Enterprise Linux 7 - $basearch - Debug
baseurl=https://archives.fedoraproject.org/pub/archive/epel/7/$basearch/debug
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=1
metadata_expire=never

[epel-source]
name=Extra Packages for Enterprise Linux 7 - $basearch - Source
baseurl=https://archives.fedoraproject.org/pub/archive/epel/7/SRPMS
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=1
metadata_expire=never

NOTE: These base URLs are already https in his post, so no changes were needed here.
 

Next, remove all currently available metadata: yum clean all

Now enter yum check-update to load a new list of all available packages and to check whether your local installation has all available updates.

Afterwards, you can install packages as usual using yum install.

NOTE: I just did a yum update instead of a yum install. Hope that was correct. It seemed to work fine.
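Put together, the refresh I ran looked like this:

yum clean all
yum check-update
yum update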

 

Tuesday, August 27, 2024

Programming a Saab

I use the term "Programming" loosely here, because I am not talking about programming in the true sense of the word (writing code that is compiled and run on a chipset).

I am really referring to the use of software to tune and make settings adjustments to the car's software components.

The Saab has several control units, such as the Engine Control Unit (ECU), sometimes also referred to as an Engine Control Module (ECM). General Motors, who made the Saab 9-3 as a joint venture after taking over the auto division of Saab, uses a device called a Tech II to pull codes, run diagnostics, and adjust settings on the cars. These Tech IIs are handheld devices that interface with the OBD connector (which is under the dashboard in most car models).

The OBD connectors are fairly standard, which allows you to drive the car into just about any auto parts store (Advance Auto, O'Reilly, AutoZone, et al), where they can plug in an OBD reader, get the codes, look them up, and make recommendations (and/or sell parts, which is why they do this as a courtesy).

Since they don't make Saabs anymore, there is no US-based network of dealerships, and mechanics are disappearing fast - only a handful of Saab shops are left operating, and some of them are simply individuals who work on Saabs for various reasons (restoring them, extra cash, etc). So having an OBD reader is certainly helpful if you buy or own a Saab, because you will DEFINITELY need to learn to do some things on your own (most garages won't even let a Saab enter their engine bays).

Buying a Tech II device that has the Saab software module (a PCMCIA card) is almost necessary if you're hardcore into your Saab. But they're expensive, and hard to find, actually. When they pop up on places like eBay, they get snatched up pretty quickly by enthusiasts, restorers, mechanics, etc. Also, the Tech II devices interface with laptop software, of which there are two kinds: TIS2000, and a newer version called TISWeb. This link discusses these laptop software packages:

https://www.uksaabs.co.uk/UKS/viewtopic.php?t=123074

But ... if you cannot get a Tech II device, there is another way to skin the cat!

You see, software is software. And you don't "need" a handheld device as a host for the software. Any laptop will do, if you have the software! Fortunately, someone (Saab?) released the software for free download - not the source code, I don't think, but the compiled x86 program that will run on a Windows laptop, with an installer that sets it up. But how do you interface it with the car? There is a cable you can buy called the OBDLink SX. One side is OBD; the other side is USB and plugs into the laptop (more on this later).

Now - all this said - you DO need to know what you're doing with this software, or you can brick the car! But if you learn how to use it, you can reset faults, run diagnostics, and even swap car components and re-flash them (e.g. the ECU). Many Saab parts, believe it or not, are tied to the VIN, and you cannot just pull them off of one Saab and stick them on another without running this kind of software.

Lastly, the software. If you don't have a Tech II, or can't afford or find one, there is some software called the Trionic Can Flasher (trioniccanflasher). With this, you can flash a new ECU if the one in your Saab went bad - provided you can follow steps.

For example, the steps for cloning a Trionic 8 ecu are as follows:

1: start trioniccanflasher, select T8 and your interface (which corresponds to the serial port on the laptop)

2: read ecu content from the original ecu

3: select t8 mcp and read ecu again

4: switch to the new ecu

5: make sure legion bootloader and unlock sys partitions are checked

6: select t8 mcp and flash that

7: select t8 and flash that

Now - what if you are on a workbench, say at a Saab garage with ten cars that need ECUs, and you don't want to deal with the laptop and getting in and out of the car(s)? There is a different interface you can use, where one connector plugs into the ECU and the other end into the laptop (AEZ Flasher 2?). Honestly, I am not savvy about this yet and don't even know what interface this is (but I will update this post once I do).

NOTE: GM makes software called Tech2Win. I hear that this software does not work with the OBDLink SX cable - but I cannot verify this at the time of writing. UPDATE: Indeed, it did not work, but someone somehow went in and patched the software, and apparently now it DOES work - but only with the MDI 1 (not MDI 2) clone cable adaptor.

https://www.saabcentral.com/threads/tech2win-for-saab-fixes-i-bus-missing-on-2003-9-3.731283/
