Monday, November 18, 2024

Cisco UCS M5 Server Monitoring with Zabbix

I got a request from my manager recently about using Zabbix to monitor Cisco servers.

Specifically, someone had asked about whether it was possible to monitor the CRC errors on an adaptor.

Right now, the monitoring we are doing is coming from the operating systems and not at the hardware level. But we do use Zabbix to monitor vCenter resources (hypervisors), using VMware templates, and we use Zabbix to "target monitor" certain virtual machines at the Linux OS level (Linux template) and at Layer 7 (app-specific templates).

Up to this point, our Zabbix monitoring has been, essentially, "load and forget": we load the template, point Zabbix to a media webhook (e.g. Slack), and just monitor what comes in. We haven't really done much extension of the templates, using everything "out of the box". Recently, we did add some new triggers on VMware monitoring, for CPU and Memory usage thresholds. We were considering adding some for CPU Ready as well.

But...this ask was to monitor Cisco servers, with our Zabbix monitoring system.

The first thing I did was check which Cisco templates came "out of the box". I found two:

  1. Cisco UCS by SNMP
  2. Cisco UCS Manager by SNMP

I - incorrectly - assumed that #2, Cisco UCS Manager by SNMP, was a template to interface with a Cisco UCS Manager. I learned a bit later that it is actually a template that lets Zabbix "be" or "emulate" a Cisco UCS Manager (as an alternative or replacement).

First, I loaded the Cisco UCS by SNMP template. The template worked fine from what I could tell, but it didn't have any "network" related items (i.e. network adaptors).

After reading that the Cisco UCS Manager template is an extension, or superset, of Cisco UCS by SNMP, I went ahead and loaded it on some selected hosts. We were pleased to see data start flowing in from those hosts, and this time the template did include adaptor metrics - but only very basic ones, such as those shown below.

Adaptor/Ethernet metrics in Cisco UCS Manager Template

This was great. But we needed some more esoteric statistics, such as CRC errors on an adaptor. How do we find these? Are they available?

Well, it turns out that they indeed are available...in a MIB called CISCO-UNIFIED-COMPUTING-ADAPTOR-MIB.

Unfortunately, this MIB is not included in the Cisco UCS Manager template. So what to do now? Well, there are a couple of strategies...

  1. Add a new Discovery Rule to the (cloned) Cisco UCS Manager template.
  2. Create a new template for the adaptor MIB using a tool called mib2zabbix.

I tried #1 first, but ran into issues because the discovery rule needed an LLD macro and I wasn't sure how, syntactically, to create the Discovery Rule properly. My attempts failed to produce any results when I tested the rule.
 
I went on to pursue #2, which led me down an interesting road. First, the mib2zabbix tool requires the net-snmp package to be installed. And on CentOS, this package alone is not enough - you also have to install the net-snmp-utils package to get utilities like snmptranslate.
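On CentOS 7, that install looks something like this (the utilities, including snmptranslate, live in net-snmp-utils):

yum install -y net-snmp net-snmp-utils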

The first time I ran mib2zabbix, it produced a template that I "knew" was not correct - I didn't see any of the CRC objects in the template at all. I did some additional research and found that for mib2zabbix to work correctly, there has to be a correct "MIB search path".

To create the search path, you create a ".snmp" folder in your home directory, and in that folder you create an snmp.conf file. Mine looked as follows, and with it in place both snmptranslate and mib2zabbix ran "properly".
 
mibdirs +/usr/share/snmp/mibs/cisco/v2
mibdirs +/usr/share/snmp/mibs/cisco/ucs-C-Series-mibs
mibdirs +/usr/share/snmp/mibs/cisco/ucs-mibs
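With the search path in place, you can sanity-check that the adaptor MIB actually resolves before re-running mib2zabbix. Something like the following - the grep is just a quick way to confirm the CRC counters are visible (the exact object names will depend on your MIB bundle):

snmptranslate -m CISCO-UNIFIED-COMPUTING-ADAPTOR-MIB -Tp | grep -i crc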


Thursday, November 7, 2024

Zabbix to BigPanda Webhook Integration

Background
BigPanda has made its way into the organization. I wasn't sure at first why, given that there's no shortage of Network Monitoring OSS / EMS systems in play. 

Many vendors use their own EMS. VMware, for example, uses VROPS (vRealize Operations - now known as Aria Operations). So there is, and has been, a use case for consolidating information from these disparate monitoring systems into a "Northbound" system.

So that's what BigPanda is, I guess. It was pitched as a Northbound system. It does not seem to be very mature, but it is simpler to use than most of them (based on limited inspection and reading). The business case pitch, though, is that it has an Artificial Intelligence rules engine that provides superior correlation - and if this is true, it could certainly make it a northbound system worthy of consideration.

So - that is why we stepped in to integrate Zabbix with BigPanda. We already have VROPS as our "authoritative" monitoring system for all things VMware. Our team uses VROPS but does not own and manage that platform (another group does). I believe they use it to monitor the vCenters, the hypervisors, and datastores. I don't think they're using it to monitor tenant workloads (virtual machines running on the hypervisors).

Our Zabbix platform, which we manage ourselves, is a "second layer of monitoring" behind VROPS. It monitors only the VMware hypervisors, along with some specific targeted virtual machines we run (load balancers, cloud management platform VMs, et al). The BigPanda team wanted to showcase the ability to correlate information from Zabbix and VROPS, so we volunteered to integrate the two systems.

Note: It is critical that these integration steps be done in precisely this order!!!

Integration Steps

Setting up the Media Type

First, you need to "create" a Media Type - and this means Importing one, not creating one. There are two buttons when you click Media Type, "Create" and "Import". Because the Media Type has already been crafted, we will use "Import". The BigPanda Media Type, which is classified as a Webhook media type, is available for download, and you can find this (json) file at the following link: https://docs.bigpanda.io/docs/zabbix

When you import this webhook media type, you have the option to "Update Existing" or "Create New". The first time, of course, requires "Create New", but any subsequent updates to the webhook would use "Update Existing".

After the media type has been imported, everything will auto-populate. The Media Type tab will have a name (BigPanda in this case), a Type (Webhook), and a set of parameters. Most of these can be left alone, but four of them will need to be changed from macros to literal values (literal values are recommended for initial testing): BP_app_key, BP_endpoint, BP_token - and the Zabbix URL (which is at the bottom, out of view in the screenshot example below).

Big Panda Media Type Screenshot Example
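Incidentally, if you want to sanity-check the app key and token outside of Zabbix first, a raw POST to the BigPanda alerts endpoint works. This is a rough sketch from memory - confirm the endpoint and field names against the BigPanda documentation:

curl -s -X POST "https://api.bigpanda.io/data/v2/alerts" \
  -H "Authorization: Bearer <BP_token>" \
  -H "Content-Type: application/json" \
  -d '{"app_key": "<BP_app_key>", "status": "critical", "host": "zabbix-test-host", "check": "integration smoke test"}'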


Setting Up the User Group

Next, you will create a User Group. The main reason for creating a (new) BigPanda user group is that you can restrict which hosts BigPanda has access to. If you wanted to allow BigPanda free-roam access to all monitored hosts, you could probably use one of the other host groups available. We wanted BigPanda to only receive alerts for specific hosts (hypervisors, test VMs, etc.), so this was the justification for creating a new and separate BigPanda group. In the Host Permissions, we give this new user group Read access to those host groups.

Below is an example of what this group looks like.

Now, one thing worth looking at in this example is the fact that the newly created User Group has Debug disabled. But there is a separate Debug Enabled group which does have Debug enabled, and any user we want to debug can simply be slipped into that group. There will be more on debugging later. Another thing worth mentioning is that we did NOT enable frontend access for this user group. This is an outbound integration, and we don't expect a BigPanda user to be logging into the UI.

Setting Up the User

Next, we create the User. Users need to have a Media Type and are placed in User Groups, which is why the Media Type and User Group were created BEFORE the user. Below is an example of how the user is defined:

Notice that the user is mapped into the bigpandaservice User Group that we created in the previous step.

Now, after we establish the user fields, it is critically important to attach the User to the Media Type. Without this mapping, the alerts from Zabbix WILL NOT SEND!!!


After the Update button is hit, it is wise to verify (and double-verify) that the Media Type sticks - in our case it did not, for some reason, and we had to remove the user and re-create it.

The final step in configuration is to create a Trigger Action on the Media Type. This is how that looks:


Next, you can click on Media Type and select the "Test" button next to BigPanda. If you leave the umpteen fields as macros and fill in just the four fields we configured in the Media Type (endpoint, app key, token and Zabbix URL), the Test button "should" produce a 201 result - though you may get a json parse error because no actual data was sent. This is okay.

If you get the 201, BigPanda should receive the test alert. But this does not mean that the trigger is firing!!! The step to take after the Media Type "Test" button is to generate an alert condition on one of the hosts that the BigPanda host group has access to, and make sure that BigPanda receives it!
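One low-tech way to do this (my suggestion - it assumes the host has a standard low-free-disk-space trigger) is to temporarily fill a filesystem on one of the hosts in that group:

fallocate -l 10G /tmp/zabbix-bigpanda-test    # trip a low-disk-space trigger
# ...wait for the alert to show up in BigPanda, then clean up:
rm -f /tmp/zabbix-bigpanda-test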

Debugging & Troubleshooting

Troubleshooting requires making sure that all of these configuration steps were taken properly. This Webhook integration is all about mappings - users to user groups, users to media types, trigger definitions, host groups that are correct, etc.

When it comes to debugging, the debugging for a Webhook occurs within the Webhook!!!

The BigPanda Webhook is the json file you imported - and if you click on the Webhook, you can see this json! In the screenshot below, notice the field called "script"...


If you click the "pencil" icon to the right, it opens up the entire webhook source code, which in this case is written in JavaScript.

Now, you will notice that the BigPanda Webhook sends messages to the Zabbix log at Level 4. The problem is, most people shouldn't be running Level 4 logging on their Zabbix server (set in the zabbix_server.conf file). It is too voluminous, and it makes debugging nearly impossible if you are watching or tailing the log looking for webhook-specific messages.
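As a reminder, the server-wide level is controlled by DebugLevel in zabbix_server.conf, and recent Zabbix versions can also raise and lower it at runtime without a restart:

grep ^DebugLevel /etc/zabbix/zabbix_server.conf    # 3 is the default; 4 (debug) is very noisy
zabbix_server -R log_level_increase                # bump the level at runtime
zabbix_server -R log_level_decrease                # and drop it back down when done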

What I did for testing and debugging was to use a log level that lets me see the Webhook information without having to comb through the mountain of output you would normally see at Level 4 (Debug level). You will see in the screenshot below that I commented out the "level 4" call and replaced it with "level 2" - temporarily, of course, until I could make sure that the Webhook was working properly. This is just one example of how you can more simply debug the webhook; there are more lines in this code that I made the same kind of change to, but the screenshot shows how it's done.
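Then, while exercising the webhook, I just watch the server log for the webhook's own messages instead of everything Zabbix has to say (the log path may differ on your install):

tail -f /var/log/zabbix/zabbix_server.log | grep -iE 'bigpanda|webhook'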

So hopefully that helps anyone wanting to get the BigPanda Webhook working in Zabbix - or, for that matter, any Webhook integration (e.g. Slack, Discord, et al).

Wednesday, September 18, 2024

Fixing Clustering and Disk Issues on an N+1 Morpheus CMP Cluster

I had performed a Morpheus upgrade which I thought was fairly successful. I had some issues doing the upgrade on CentOS 7, because CentOS 7 had been designated EOL and the repositories were archived, but I worked through that and it seemed everyone was using the system just fine.

Today, however, someone contacted me to say that they had provisioned a virtual machine, but it was stuck in an incomplete "Provisioning" state (the state with a blue icon and a rocketship in it). The VM was provisioned on vCenter and working, but the state in Morpheus never flipped to "Finalized".

I couldn't figure this out, so I went to the Morpheus help site, where I discovered that I myself had logged a ticket on this very issue quite a while back. It turned out that the reason the state never flipped in that case was that the clustering wasn't working properly.

So I checked RabbitMQ. It looked fine.

I then checked MySQL (Percona), suspecting that perhaps the clustering wasn't working properly. In the process of restarting the VMs, one of the virtual machines wouldn't start. I had to do a bunch of advanced Percona troubleshooting to figure out that I needed to do a wsrep recovery with a heuristic commit before I could start the system and have it properly join the cluster.

The NEXT problem was that Zabbix was screeching about these Morpheus VMs using too much disk space. It turned out that the /var file system was 100% full - because of Elasticsearch. Fortunately, I had an oversized /home directory and was able to rsync the elasticsearch directory over to /home and re-link it.
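The move itself was essentially a stop-copy-relink. A rough sketch (the service name and paths here are illustrative - check where your install actually keeps the Elasticsearch data, and stop the service before moving anything):

> systemctl stop elasticsearch
> rsync -a /var/lib/elasticsearch/ /home/elasticsearch/
> mv /var/lib/elasticsearch /var/lib/elasticsearch.old
> ln -s /home/elasticsearch /var/lib/elasticsearch
> systemctl start elasticsearch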

But this gets to the topic of system administration with respect to disks.

First let's start with some KEY commands you MUST know:

>df -Th 

This command (df = disk free) shows how much space is used, in human-readable format, along with the mountpoint and file system type. It tells you NOTHING about the physical disks, though!

>lsblk -f

This command (list block devices) gives you the physical disk, the mountpoint, the UUID and any labels. It is a device-specific command and doesn't show you space consumption.

>fdisk -l

I don't really like this command that much because of the output formatting. But it does list disk partitions and related statistics.

Some other commands you can use are:

>sudo file -sL /dev/sda3

The -s flag enables reading of block or character special files, and -L enables following of symlinks.

>blkid /dev/sda3

Similar command to lsblk -f above.

When a Percona Cluster Node Stops Working

I had a horrible problem where a Percona node (node 2 of 3) went down and wouldn't start.

I finally ran a command: 

> mysqld_safe --wsrep-recover --tc-heuristic-recover=ROLLBACK

This didn't work, so I had to run journalctl -xe to find out that the Percona startup/recovery output is actually written to a temporary file: /var/lib/mysql/wsrep_recovery.xxxxx

From this, I could see pending transactions. Well, transactions either need to be committed, or rolled back.

The rollback didn't work, so I tried the commit, which DID work:

> mysqld_safe --wsrep-recover --tc-heuristic-recover=COMMIT

Now, you can also edit your /etc/my.cnf file and put this option in it, in this format:

[mysqld]

tc-heuristic-recover = COMMIT

So after running the commit, which seemed to run fine, I went ahead and attempted to start the mysql service again: 

> systemctl start mysql

Fortunately, it came up!

Now - a quick way to check and make sure your Percona node is working properly is to log into mysql and run the following query:

mysql> show status like 'wsrep%';

Below are the variables that I tend to look for:

| wsrep_cluster_conf_id    | 56                                   |
| wsrep_cluster_size       | 3                                    |
| wsrep_cluster_state_uuid | f523290f-9336-11eb-be5b-d6f9514c9c3c |
| wsrep_cluster_status     | Primary                              |
| wsrep_connected          | ON                                   |
| wsrep_local_bf_aborts    | 0                                    |
| wsrep_local_index        | 2                                    |
| wsrep_ready              | ON                                   |

The cluster conf id should be the same on all of your cluster nodes!
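A quick way to compare nodes is to run the same check from the shell on each one (add your credentials as needed):

> mysql -e "show status like 'wsrep_cluster%';"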

Monday, September 16, 2024

Recovering a Corrupted RPM Database

I got this scary error when trying to run an upgrade on a cloud management system.

Here is what caused it:

1. The OS was CentOS 7.

2. The repositories for CentOS  7 were removed because CentOS 7 was End of Life (EOL). 

The repos were moved to an archive; I have a previous post about how to update a CentOS 7 OS using the archived repos.

3. The upgrade was running Chef scripts that in turn were making yum update calls.

What effectively happened was that the rpm database got corrupted:

We were getting the error DB_RUNRECOVERY: Fatal error, run database recovery.
 

Sounds frightening. The rpm database is where all of the package information is stored on a Linux operating system. Without this database intact, you cannot really update or install anything. And there are numerous things that will invoke dnf, or yum, or some other package manager, triggering a check of the integrity of this database.

As it turns out, a post I found saved the day. Apparently rebuilding the rpm database is simple.

From this link, to give credit where credit is due: rebuilding the rpm database

$ mv /var/lib/rpm/__db* /tmp/    # move the Berkeley DB index files out of the way
$ rpm --rebuilddb                # rebuild the rpm database
$ yum clean all                  # clear the yum caches
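After the rebuild, a couple of quick checks to confirm the database is healthy again before re-running the upgrade:

$ rpm -qa > /dev/null && echo "rpm database reads OK"
$ yum check    # reports any remaining rpmdb problems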

Tuesday, September 10, 2024

Updating CentOS 7 After EOL

I found a site that showed how you could update CentOS 7 after Red Hat shut down all of the repositories for it when it was classified End of Life.

I thought I would post on how to do this, lest I cannot locate that link or perhaps it gets taken down.

The link is at https://gcore.de/en/help/linux/centos7-new-repo-url-after-eol.php

Basically the process is as follows:

1. Backup the CentOS-* repositories.

2. Backup the existing epel.repo (a consolidated command sequence for these backup steps is shown at the end of this post).

3. Make a new CentOS.repo repository file, with the following:

[base]
name=CentOS-7.9.2009 - Base
baseurl=https://vault.centos.org/7.9.2009/os/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
enabled=1
metadata_expire=never

#released updates
[updates]
name=CentOS-7.9.2009 - Updates
baseurl=https://vault.centos.org/7.9.2009/updates/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
enabled=1
metadata_expire=never

# additional packages that may be useful
[extras]
name=CentOS-7.9.2009 - Extras
baseurl=https://vault.centos.org/7.9.2009/extras/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
enabled=1
metadata_expire=never

# additional packages that extend functionality of existing packages
[centosplus]
name=CentOS-7.9.2009 - CentOSPlus
baseurl=https://vault.centos.org/7.9.2009/centosplus/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
enabled=0
metadata_expire=never

#fasttrack - packages by Centos Users
[fasttrack]
name=CentOS-7.9.2009 - Contrib
baseurl=https://vault.centos.org/7.9.2009/fasttrack/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
enabled=0
metadata_expire=never

NOTE: I had to change the repos from http to https.

4. Make a new epel.repo repository file with the following:

[epel]
name=Extra Packages for Enterprise Linux 7 - $basearch
baseurl=https://archives.fedoraproject.org/pub/archive/epel/7/$basearch
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
metadata_expire=never

[epel-debuginfo]
name=Extra Packages for Enterprise Linux 7 - $basearch - Debug
baseurl=https://archives.fedoraproject.org/pub/archive/epel/7/$basearch/debug
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=1
metadata_expire=never

[epel-source]
name=Extra Packages for Enterprise Linux 7 - $basearch - Source
baseurl=https://archives.fedoraproject.org/pub/archive/epel/7/SRPMS
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=1
metadata_expire=never

NOTE: These base urls are already https in his post, so no changes needed here.
 

Next, remove all currently available metadata: yum clean all

Now run yum check-update to load a new list of all available packages and to check whether your local installation has all available updates.

Afterwards you can install packages as usual using yum install.

NOTE: I just did a yum update instead of a yum install. Hope that was correct. It seemed to work fine.
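For completeness, here is roughly what the consolidated command sequence looks like (the backup directory name is just a choice; the repo file contents are as shown above):

mkdir -p /etc/yum.repos.d/backup
mv /etc/yum.repos.d/CentOS-*.repo /etc/yum.repos.d/epel*.repo /etc/yum.repos.d/backup/
# create the new CentOS.repo and epel.repo files with the contents shown above, then:
yum clean all
yum check-update
yum update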

 

Tuesday, August 27, 2024

Programming a Saab

I use the term "Programming" loosely here because I am not talking about Programming in the true sense of the word (writing code that is compiled and run on a chipset).

I am really referring to the use of software so that you can tune and make settings adjustments to the car's software components. 

The Saab has several control units, such as the Engine Control Unit (ECU) - sometimes also referred to as an Engine Control Module (ECM). General Motors, which made the Saab 9-3 as a joint venture after taking over the auto division of Saab, uses a device called a Tech II to pull codes, run diagnostics and adjust settings on the cars. These Tech IIs are handheld devices that interface with the OBD connector (which is under the dashboard in most car models).

The OBD connectors are fairly standard, which means you can drive the car into just about any auto store (Advance Auto, O'Reilly, AutoZone, et al) and they can plug an OBD reader in, get the codes, look them up and make recommendations (and/or sell you parts, which is why they do this as a courtesy).

Since they don't make Saabs anymore, there is no US-based network of dealerships, and mechanics are disappearing fast - only a handful of Saab shops are left operating, and some of them are simply individuals who work on Saabs for various reasons (restoring them, extra cash, etc.). So having an OBD reader is certainly helpful if you buy or own a Saab, because you will DEFINITELY need to learn to do some things on your own (most garages won't even let a Saab enter their engine bays).

Buying a Tech II device that has the Saab software module (PCMCIA card) is almost necessary if you're hardcore into your Saab. But they're expensive - and hard to find, actually. When they pop up on places like eBay, they get snatched up pretty quickly by enthusiasts, restorers, mechanics, etc. Also, the Tech II devices interface with laptop software, and there are two kinds: TIS2000, and a newer version called TISWeb. This link discusses these laptop software packages:

https://www.uksaabs.co.uk/UKS/viewtopic.php?t=123074

But ... if you cannot get a Tech II device, there is another way to skin the cat!

You see, software is software. And you don't "need" a handheld device as a host for the software - any laptop will do, if you have the software! Fortunately, someone (Saab?) released the software publicly. You can download and run it - not the source code, I don't think, but the compiled x86 program that will run on a Windows laptop, with an installer that sets it up. But how do you interface it with the car? There is a cable you can buy, called the OBDLink SX. One side is OBD, the other side is USB and plugs into the laptop (more on this later).

Now - all this said - you DO need to know what you're doing with this software. Or you can brick the car! But if you learn how to use this software, you can reset faults, run diagnostics, and you can even swap car components and re-flash them (i.e. the ECU). Many Saab parts, believe it or not, are tied to the VIN and you cannot just pull them off of one Saab and stick them on another without running this kind of software.

Lastly, the software. If you don't have a Tech II or can't afford one or can't find one, there is some software called the Trionic Can Flasher (trioniccanflasher). With this, you can flash a new ECU if the one on your Saab went bad - provided you can follow steps.

For example, the steps for cloning a Trionic 8 ecu are as follows:

1: start trioniccanflasher, select T8 and your interface (which corresponds to the serial port on the laptop)

2: read ecu content from the original ecu

3: select t8 mcp and read ecu again

4: switch to the new ecu

5: make sure legion bootloader and unlock sys partitions are checked

6: select t8 mcp and flash that

7: select t8 and flash that

Now - what if you are on a workbench, say at a Saab garage with ten cars that need ECUs, and you don't want to deal with getting the laptop in and out of the car(s)? There is a different interface you can use, where one connector plugs into the ECU and the other end into the laptop (AEZ Flasher 2?). Honestly, I am not savvy about this yet and don't even know what interface this is (but I will update this post once I do).

NOTE: GM makes software called Tech2Win. I hear that this software does not work with the OBDLink SX cable - but I cannot verify this at the time of writing. UPDATE: Indeed it did not work, but someone went in and patched the software, and apparently now it DOES work - but only with the MDI 1 (not MDI 2) clone cable adaptor.

https://www.saabcentral.com/threads/tech2win-for-saab-fixes-i-bus-missing-on-2003-9-3.731283/

Friday, August 16, 2024

Pinephone Pro - Unboxing and Use Part II

I picked up the Pinephone Pro, which I had left attached to a standard USB-C charger. It was indeed sitting at 100%, so it looks like charging works okay.

The OS asked me for a pin code to unlock the screen. Yikes. I wasn't prompted to set up a pin code! 

I rebooted the phone to see if I could figure out what OS was on it from the boot messages. I figured out that the phone was running the Pinephone Manjaro OS. 

https://github.com/manjaro-pinephone/phosh/releases

Since the Manjaro OS ships with a default pin code, I tried it and got lucky - it hadn't been changed, and it worked. I (re)connected to WiFi, and noticed that the OS prompts for my WiFi password every single time and doesn't seem to remember it from before. Secure? Yes. Annoying? Yes.

The form factor issue I ran into using the Firefox browser seems to be more related to Firefox than to the OS. The issue is that the browser is sized past the phone's form factor, and you need to scroll left and right, which is a major hassle. The browser doesn't auto-size itself for the screen dimensions.

I played with the Terminal app and noticed that the user it launched as was pico-xxxx (I don't remember the suffix). I tried to sudo to root, but didn't know the password for this user.

Lastly, I played a video from YouTube, and the sound was very tinny - so the speaker on this phone is not high-end. I have not yet attempted to use headphones on this device.

Since the Linux-Mobile apps are so limited, many apps you typically run from a dedicated icon app/client on a mobile phone will need to be run from a browser.

I am not sure Manjaro is the "right" OS to use on this phone, or if the version of the OS running is current or stale. I ordered the Docking Hub and a Micro SD Card and when those arrive, maybe I will try flashing a new/different OS on this phone.

Friday, August 9, 2024

Pinephone Pro - Unboxing and First Use

I ordered a Linux Pinephone that just arrived.

In the United States, trying to get off of Google, Apple, and even Samsung is nigh onto impossible. Carriers make a ton of money off of selling and promoting phones, and they have locked Linux phones out of their stores and off of their networks - because they can't make money on them, either by selling the devices (carriers) or by siphoning your data through the operating system, the default browser, etc.

There are probably numerous videos that show the unboxing of a Pinephone, so I will skip that and just make some general comments on my first experience.

When I unboxed the phone, there was no charger included. I bought this phone used on eBay, and while it came in the box, I wasn't sure if they come standard with a charger or not. The phone uses USB-C as a charger, though, and I had plenty of these. The phone had some weight to it. The screen seemed quality, but the back cover looked like a cheap piece of plastic and I could feel something pushing against the back cover (battery? dip or kill switches?). As I don't yet have a SIM for it, I have not yet opened the back.

The phone did not boot up at first. I wasn't sure of the button sequences, so I downloaded the Pinephone User Guide to get going. I decided that the phone probably needed to be charged and plugged it into my USB-C charger; immediately, I got a Linux boot sequence on the screen. Linux boot sequences are intimidating to just about anyone, and most certainly to a user who is unfamiliar with Linux.

When the boot sequence finished, the phone shut itself down again - presumably because it didn't have enough juice to boot and stay running. I left the phone on the charger, and returned to it 3-4 hours later.

When I came in and picked the phone up and powered it on, I got the boot sequence again and it booted up to the operating system. The OS was reasonably intuitive. I don't have a SIM in the phone yet, so I configured it for WiFi as a first step. Then I tried to set the clock, and I added my city but it is using UTC as the default. Next I went looking to see what apps were installed. It took me a few minutes to realize that the "Discover" app is the app for finding, updating and installing applications.  The first time I tried to run Discover, it crashed. When I re-launched it, it showed me some apps and I tried to update a couple of them, and got a repository error. I finally was able to update Firefox, though. Then I launched Firefox. 

Right away with Firefox, I had issues with screen real estate and positioning. The browser didn't fit on the screen, and I didn't see a way to shrink it down to fit properly. After closing the 2nd tab I had opened, I was able to use my finger to "grab" the browser and pull it around - but clearly the browser window fit, and the lack of a gyroscope to re-orient the browser when the phone is turned sideways, are going to make this browser a bit of a hassle, unless I can solve it.

I want to test out the sound quality. That's next.


Wednesday, June 26, 2024

Rocky Generic Cloud Image - Image Prep, Cloud-Init and VMware Tools

 

The process I have been using up to now has been to download the generic cloud images from the various Linux distro sites (CentOS, now Rocky). These images are pre-baked for clouds, meaning that they're smaller, more efficient, and they generally have cloud packages installed on them (i.e. cloud-init).

It is easier (and more efficient) to use one of these images, in my thinking, than to try and take an ISO and build an image "from scratch".

The problem, though, is that "cloud images" are generally public cloud images: AWS, Azure, GCP, et al. If you are running your own private cloud on VMware, you will run into problems using these cloud images.

Today, I am having issues with the Rocky 9.5 generic cloud image.

I am downloading the qcow2, using qemu-img convert to convert the qcow2 to a vmdk, then running ovftool against a templatized template.vmx file. Everything works fine - but when I load the image into our CMP, which initializes VMs with cloud-init, the VM boots up fine but cloud-init never runs, so you cannot log into the VM.
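For reference, the qcow2-to-vmdk conversion step looks roughly like this (the file names here are illustrative):

qemu-img convert -f qcow2 -O vmdk \
    Rocky-9-GenericCloud.latest.x86_64.qcow2 rocky9.vmdk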

Here is the template.vmx.parameterized file I am using. I use sed to replace the parameters, then the file is renamed template.vmx before running ovftool on it.

.encoding = "UTF-8"
config.version = "8"
virtualHW.version = "11"
vmci0.present = "TRUE"
floppy0.present = "FALSE"
svga.vramSize = "16777216"
tools.upgrade.policy = "manual"
sched.cpu.units = "mhz"
sched.cpu.affinity = "all"
scsi0.virtualDev = "lsilogic"
scsi0.present = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.fileName = "PARM_VMDK"
sched.scsi0:0.shares = "normal"
sched.scsi0:0.throughputCap = "off"
scsi0:0.present = "TRUE"
ide0:0.present ="true"
ide0:0.startConnected = "TRUE"
ide0:0.fileName = "/opt/images/nfvcloud/imagegen/rocky9/cloudinit.iso"
ide0:0.deviceType = "cdrom-image"
displayName = "PARM_DISPLAYNAME"
guestOS = "PARM_GUESTOS"
vcpu.hotadd = "TRUE"
mem.hotadd = "TRUE"
bios.hddOrder = "scsi0:0"
bios.bootOrder = "cdrom,hdd"
sched.cpu.latencySensitivity = "normal"
svga.present = "TRUE"
RemoteDisplay.vnc.enabled = "FALSE"
RemoteDisplay.vnc.keymap = "us"
monitor.phys_bits_used = "42"
softPowerOff = "TRUE"
sched.cpu.min = "0"
sched.cpu.shares = "normal"
sched.mem.shares = "normal"
sched.mem.minsize = "1024"
memsize = "PARM_MEMSIZE"
migrate.encryptionMode = "opportunistic"

I have tried using cdrom,hdd and just hdd on the boot order. Neither makes a difference.
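For reference, the ovftool step itself is just vmx in, OVF out:

ovftool template.vmx Rocky-9-5-GenericCloud-LVM.ovf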

When I run the ovftool program, it generates the following files, which look correct.

Rocky-9-5-GenericCloud-LVM-disk1.vmdk
Rocky-9-5-GenericCloud-LVM-file1.iso
Rocky-9-5-GenericCloud-LVM.mf
Rocky-9-5-GenericCloud-LVM.ovf

I have inspected the ovf file. It does have references to both the vmdk and the iso file in it, as it should.

I also ran a utility on the iso file, and it seems to look okay; the two directories user_data and meta_data appear to be on there.

$ isoinfo  -i Rocky-9-5-GenericCloud-LVM-file1.iso -l

Directory listing of /
d---------   0    0    0            2048 Dec 18 2024 [     28 02]  .
d---------   0    0    0            2048 Dec 18 2024 [     28 02]  ..
d---------   0    0    0            2048 Dec 18 2024 [     30 02]  META_DAT
d---------   0    0    0            2048 Dec 18 2024 [     29 02]  USER_DAT

Directory listing of /META_DAT/
d---------   0    0    0            2048 Dec 18 2024 [     30 02]  .
d---------   0    0    0            2048 Dec 18 2024 [     28 02]  ..

Directory listing of /USER_DAT/
d---------   0    0    0            2048 Dec 18 2024 [     29 02]  .
d---------   0    0    0            2048 Dec 18 2024 [     28 02]  ..

This Rocky generic cloud image does NOT have VMware Tools (the open-vm-tools package) installed on it - I checked into that. But you shouldn't need VMware Tools for cloud-init to initialize properly.

I am perplexed as to why cloud-init won't run properly, and I am about ready to drop-kick this image and consider alternative ways of generating an image for this platform. I don't understand why these images work fine on public clouds but not on VMware.

I may need to abandon this generic cloud image altogether and use another process. I am going to examine this Packer process. 

https://docs.rockylinux.org/guides/automation/templates-automation-packer-vsphere/

 

Thursday, June 20, 2024

New AI Book Arrived - Machine Learning for Algorithmic Trading

This thing is like 900 pages long.

You want to take a deep breath and make sure you're committed before you even open it.

I did check the Table of Contents and scrolled quickly through, and I see it's definitely a hands-on applied technology book using the Python programming language.

I will be blogging more about it when I get going.

 




Tuesday, June 4, 2024

What Makes an AI Chip?

I haven't been able to understand why the original chip pioneers, like Intel and AMD, have not been able to pivot in order to compete with NVidia (Stock Symbol: NVDA).

I know a few things - like the fact that when gaming became popular, NVidia made the graphics chips that had graphics acceleration and such. Graphics tend to draw polygons, and drawing polygons is geometric and trigonometric, which requires floating point arithmetic (non-integer mathematics). Floating point is difficult for a CPU to do - so much so that classical CPUs either offloaded it (historically, to a separate math coprocessor) or employed other tricks to do these kinds of computations.

Now, these graphics chips are all the rage for AI. And Nvidia stock has gone through the roof while Intel and AMD have been left behind.

So what does an AI chip have, that is different from an older CPU?

  • Graphics processing units (GPUs) - used mainly for training AI models
  • Field-programmable gate arrays (FPGAs) - used mainly for inference
  • Application-specific integrated circuits (ASICs) - used in various capacities of AI

CPUs incorporate some of these capabilities in one form or another, but an AI chip has all three of these in a highly optimized and accelerated design - things like prediction (such as branch prediction), parallelism, etc. They're simply better at running "algorithms".

This link, by the way, from NVidia, discusses the distinction between Training and Inference:
https://blogs.nvidia.com/blog/difference-deep-learning-training-inference-ai/

The CPU companies were so bent on running Microsoft for so long - extending the same instruction set, revision after revision, to run Windows (286-->386-->486-->Pentium--> and on and on) - that they just never went back and "rearchitected" or came up with new chip architectures. They sat back and collected money, along with Microsoft, giving you incremental versions of the same thing - for YEARS.

When you are doing training for an AI model, and you are running algorithmic loops millions upon millions of times, the efficiency and time start to add up - and make a huge difference in $$$ (MONEY). 

So the CPU companies, in order to "catch up" with NVidia, would need, I think, to come up with a whole bunch of chip design software. Then there are the software kits necessary to develop for the chips. You also have the foundry (which uses manufacturing equipment, much of it custom to the design), etc. Meanwhile, NVidia has its rocket off the ground, with decreasing G forces (so to speak), which accelerates its orbit. It is easy to see why an increasing gap would occur.

But - when you have everyone (China, Russia, Intel, AMD, ARM, et al) all racing to catch up, they will at some point, catch up. I think. When NVidia slows down. We shall see.

Tuesday, April 16, 2024

What is an Application Binary Interface (ABI)?

After someone mentioned Alma Linux to me, it seemed similar to Rocky Linux, and I wondered why there would be two Linux distros doing the same thing (picking up from CentOS and remaining RHEL compatible).

I read that "Rocky Linux is a 1-to-1 binary to RHEL while AlmaLinux is Application Binary Interface-compatible with RHEL".

Wow. Now, not only did I learn about a new Linux distro, but I also have to run down what an Application Binary Interface, or ABI is.

Referring to this, Stack Exchange post: https://stackoverflow.com/questions/2171177/what-is-an-application-binary-interface-abi, I liked this "oversimplified summary":

API: "Here are all the functions you may call."

ABI: "This is how to call a function."

Friday, March 1, 2024

I thought MacOS was based on Linux - and apparently I was wrong!

I came across this link, which discusses some things I found interesting to learn:

  • Linux is a Monolithic Kernel - I thought that because you can load and unload kernel modules, the Linux kernel had morphed into more of a Microkernel architecture. But apparently not?
  • The macOS kernel is officially known as XNU, which stands for "XNU is Not Unix."

According to Apple's GitHub page:

"XNU is a hybrid kernel combining the Mach kernel developed at Carnegie Mellon University with components from FreeBSD and C++ API for writing drivers".

Very interesting. I stand corrected now on MacOS being based on Linux.

Neural Network Architecture - Sizing and Dimensioning the Network

In my last blog post, I posed the question of how many hidden layers should be in a neural network, and how many hidden neurons should be in each hidden layer. This is related to the Neural Network Design, or Neural Network Architecture.

Well, I found the answer, I think, in the book An Introduction to Neural Networks for Java, authored by Jeff Heaton. I noticed, incidentally, that Jeff was doing AI and writing about it as early as 2008 - fifteen years prior to the current AI firestorm we see today - and possibly before that, using languages like Java and C# (C Sharp) and his own framework, Encog (which I am unfamiliar with).

In this book, in Table 5.1 (Chapter 5), Jeff states (quoted):

"Problems that require two hidden layers are rarely encountered. However, neural networks with two hidden layers can represent functions with any kind of shape. There is currently no theoretical reason to use neural networks with any more than two hidden layers. In fact, for many practical problems, there is no reason to use any more than one hidden layer. Table 5.1 summarizes the capabilities of neural network architectures with various hidden layers." 

Jeff then has the following table...

"There are many rule-of-thumb methods for determining the correct number of neurons to use in the hidden layers, such as the following:

  • The number of hidden neurons should be between the size of the input layer and the size of the output layer.
  • The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
  • The number of hidden neurons should be less than twice the size of the input layer."

Simple - and useful! Now, this is obviously a general rule of thumb, a starting point.
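As a quick worked example of the second rule of thumb (the numbers are mine, not Heaton's): with the classic MNIST digit setup of 784 input neurons and 10 output neurons, 2/3 of 784 is roughly 523, plus the 10 outputs gives you a starting point of around 533 hidden neurons - comfortably between the input and output layer sizes, per the first rule, and well under twice the input size, per the third.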

There is a Goldilocks aspect to choosing the right size for a Neural Network. If the number of neurons is too small, you get higher bias and underfitting. If you choose too many, you get the opposite problem of overfitting - not to mention wasting precious and expensive computational cycles on floating point processors (GPUs).

In fact, the process of calibrating a Neural Network leads to the concept of Pruning, where you examine which neurons affect the total output and prune out those that don't contribute enough to make a significant difference to the end result.

AI - Neural Networks and Deep Learning - Nielsen - Chap 5 - Vanishing and Exploding Gradient

When training a Neural Net, it is important to have what is referred to as a Key Performance Indicator - a KPI. This is an objective, often numerical, way of "scoring" the aggregate output so that you can actually tell that the model is learning - that it is trained - and that the act of training the model is improving the output. This seems innate almost, but it is important to always step back and keep this in mind.

Chapter 5 discusses the effort that goes into training a Neural Net, but from the perspective of Efficiency. How well, is the Neural Net actually learning as you run through a specified number of Epochs, with whatever batch sizes you choose, etc.?

In this chapter, Michael Nielsen discusses the Vanishing Gradient. He graphs the "speed of learning" on each Hidden Layer, and it is super interesting to notice that these Hidden Layers do not learn at the same rate! 

In fact, the Hidden Layer closest to the Output always outperforms the preceding Hidden Layer in terms of speed of learning.

So after reading this, the next questions in my mind - ones that I don't believe Michael Nielsen addresses head-on in his book - are:

  • how many Hidden Layers does one need?
  • how many Neurons are needed in a Hidden Layer?

I will go back and re-scan, but I don't think there are any rules of thumb or general guidance tossed out in this regard - in either book I have covered thus far. I believe that in the examples chosen in the books, the decisions about how to size (dimension) the Neural Network are more or less arbitrary.

So my next line of inquiry and research will be on the topic of how to "design" a Neural Network, at least from the outset, with respect to the sizing and dimensions.  That might well be my next post on this topic.

Friday, February 23, 2024

AI - Neural Networks and Deep Learning - Nielsen - Chap 3 - Learning Improvement

As if Chapter 2 wasn't heavy enough, I moved on to Chapter 3, which introduced some great concepts that I will mention here in this blog post. But I couldn't really follow the detail very well, again largely due to the heavy mathematical expressions and notation used.

But I will summarize what he covers in this chapter, and I think each one of these topics will require its own "separate study", preferably in a simpler manner.

Chapter 3 discusses more efficient ways to learn.

It starts out discussing the Cost Function. 

In previous chapters, Nielsen uses the Quadratic Cost Function. But in Chapter 3, he introduces the Cross-Entropy Cost Function, and discusses how by using this, it avoids learning slowdown. Unfortunately, I can't comment much further on this because frankly, I got completely lost in this discussion.

He spends a GREAT DEAL of text discussing Cross-Entropy, including a discussion of the fact that he uses different learning rates for the Quadratic Cost Function (.15) versus the Cross-Entropy Cost Function (.005) - and he explains that the rates being different doesn't matter, because it is more about how the speed of learning changes than the actual speed of learning.

After a while, he mentions an alternative to Cross-Entropy called Softmax. Now, this term seemed familiar. In doing a backcheck, I found that Softmax was used in the first book I read, AI Crash Course by Hadelin de Ponteves. I remembered both Softmax and Argmax being mentioned.

Softmax introduces a layer of neurons parallel to the output neurons - so if you had 4 output neurons, you would have 4 Softmax neurons feeding those 4 outputs. What Softmax does is return a Probability Distribution. All of the Softmax outputs add up to 1, and if one of them decreases, there must be a corresponding increase among the others. This could be useful, for example, in cases where the AI is guessing which animal type it is: Dog, Cat, Parrot, Snake. You might see higher probabilities for Dog and Cat, and lower ones for Parrot and Snake.
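To make that concrete, here is a quick worked example (the numbers are mine, not from the book). Suppose the four raw output scores for Dog, Cat, Parrot and Snake are 2.0, 1.0, 0.1 and 0.1. Softmax exponentiates each score (e^2.0 ≈ 7.39, e^1.0 ≈ 2.72, e^0.1 ≈ 1.11, e^0.1 ≈ 1.11) and divides each by the sum (≈ 12.32), giving probabilities of roughly 0.60, 0.22, 0.09 and 0.09 - which add up to 1.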

Nielsen then goes on to discuss Overfitting (overtraining) and Regularization, which is designed to combat overfitting. He discusses four approaches to Regularization, which I won't echo here, as I will clearly need to consult elsewhere for simpler discussions, definitions and examples of these.



Wednesday, February 21, 2024

AI - Neural Networks and Deep Learning - Nielsen - Chap 2 - Backpropagation

Backpropagation is the "secret sauce" of Neural Networks. And therefore, very important to understand.

Why? 

Because it is how Neural Networks adapt and, well, learn.  

Backpropagation is responsible, essentially, for updating the weights (and potentially the biases as well) after calculating the differences between the actual and predicted results, so that over the training iterations the weights (and biases) are optimized and the cost is minimized.

Doing so is rather difficult and tedious, and requires an understanding of mathematics at several levels (i.e. Linear Algebra, Calculus, and even Trigonometry if you truly want to understand Sigmoid functions).

In Chapter 2 of this book, I was initially tracking along, following the discussion on notation (for weights and biases in Neural Network nodes). But that was as far as I got before I became confused and stuck in the envelope of intimidating mathematical equations.

I was able to push through and read it, but found that I didn't understand it, and after several attempts to reinforce it by re-reading, I had to hit the eject button and look elsewhere for a simpler discussion of Backpropagation.

This decision to eject and look elsewhere for answers, paid huge dividends that allowed me to come back and comment on why I found this chapter so difficult.

  1. His Cost function was unnecessarily complex
  2. He did not need to consider biases, necessarily, in teaching the fundamentals of backpropagation
  3. The notational symbols introduced are head-spinning

In the end, I stopped reading this chapter, because I don't think that understanding all of his notation is necessary to get the gist and essence of Backpropagation - even from the standpoint of gaining some mathematical knowledge of how it's calculated.

To give some credit where credit is due, this video from Mikael Lane helped me get a very good initial understanding of BackPropagation: Simplest Neural Network Backpropagation Example

Now, I did have a problem trying to understand where, at about 3:30 in the video, he comes up with ∂C/∂w = 1.5 * 2(a-y) = 4.5 * w - 1.5. (In hindsight, it works out if his example uses an input of 1.5 and a target y of 0.5: with a = 1.5w and no bias, the chain rule gives ∂C/∂w = 1.5 * 2(a - y) = 3(1.5w - 0.5) = 4.5w - 1.5.)

But, aside from that, his example helped me understand, because he removed the bias from the equation! You don't really need a bias! Nobody else that I saw had dumbed things down by doing this, and it was extremely helpful. His explanation of how the Chain Rule of differentiation is applied was also timely and helpful.

NOTE: Mikael also has a 2nd follow-up video on the same topic: 

Another Simple Backpropagation Example

From there, I went and watched another video, which does use the bias, but walks you through backpropagation in a way that makes it easier to grasp and understand, even with the Calculus used.

Credit for this video goes to Bevan Smith, and the video link can be found here:

Back Propagation in training neural networks step by step 

Bevan gives a more thorough walk-through of calculations than the initial Mikael Lane video does. Both videos use the Least Squares method of Cost. 

The cost function at the final output is: Cost = (Ypredicted - Yactual)²

The derivative of this with respect to the prediction is quite simple, which helps in understanding the examples: 2(Yp - Ya)
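To put numbers on it (my own, not from either video): if the prediction Yp is 0.8 and the actual Ya is 1.0, the cost is (0.8 - 1.0)² = 0.04 and the gradient is 2(0.8 - 1.0) = -0.4. Gradient descent then nudges the weights in the direction that raises the prediction (the negative of the gradient), scaled by the learning rate.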

Nielsen, unfortunately, goes into none of this simple explanation, and chooses a Quadratic Cost Function that, for any newbie with rusty math skills, is downright scary to comprehend:

C = (1/2) ||y - a^L||^2 = (1/2) Σ_j (y_j - a^L_j)^2

Nielsen then goes on to cover 4 Equations of Backpropagation which, frankly, look PhD level to me, or at least as far as I am concerned. There is some initial fun in reading this, as though you are perhaps a codebreaker trying to reverse engineer the Enigma (the German cipher machine used in WWII). But, after a while, you throw your hands up in the air on it. He even goes into some mathematical proofs of the equations (yikes). So this stuff is for very, very heavy "Math People".

At the end, he does dump some source code in Python that you can run, which is cool, and that all looked to me like it worked fine when I ran it.
 
 


Thursday, February 8, 2024

Linux Phones Are Mature - But US Carriers Won't Allow Them

Today I looked into the status of some of the Linux phones, which are mature now.

Librem is one of the ones most people have heard about, but the price point on it is out of reach for anyone daring enough to jump in the pool and start swimming with a Linux phone.

Pinephone looks like it has a pretty darn nice Linux phone now, but after watching a few reviews, it is pretty clear that you need to go with the Pinephone Pro, and put a fast(er) Linux OS on it. 

The main issue with performance on these phones has to do with graphics rendering. If you are running the GNOME desktop, for example, the GUI is going to take up most of the cycles and resources that you want for your applications. I learned this years ago on regular Linux running on desktops and servers, and got into the habit of installing a more lightweight KDE desktop to try to get some of my resources back under my control.

Today, I found a German phone that apparently is really gaining popularity in Europe - especially Germany. It is called the Volla Phone. Super nice phone, and they have done the work of selecting the hardware components and optimizing the Linux distro for you, so that you don't have to spend hours tweaking, configuring, and putting different OS images on the phone to squeeze performance out of it.

Volla Phone - Linux Privacy Phone

 

Problem is - United States carriers don't allow these phones! They are not on the "Compatibility List". Now, I understand there might be an FCC cost to certifying devices on a cellular network (I have not verified this). The frequencies matter, of course, but the SIM cards also matter. The Volla Phone will, for instance, apparently work on T-Mobile, but only if you have an older SIM card. If you are on T-Mobile and have a new SIM card, it won't work, because of some fields that aren't exchanged (if I understand correctly).

Carriers that are in bed with Google and Apple, such as AT&T and Verizon, are going to do everything they can to prevent a Linux BYOD (Bring Your Own Device) phone from hitting their network. They make too much $$$$$$$$$$$$ off of Apple and Android. T-Mobile is German-owned, of course, so maybe they have a little more of the European mindset. These are your three network rollouts across the United States, and all of the mom-and-pop cellular plays (i.e. Spectrum Mobile, Cricket, et al) are just MVNOs riding on that infrastructure.

So if you have one of these Linux phones, you can use it in your home. On WiFi. But if you carry it outdoors, it's a brick apparently. Here we are in 2024, and that STILL seems to be the case.

SLAs using Zabbix in a VMware Environment

Zabbix 7 introduced some better support for SLAs. It also had better support for VMware. VMware, of course now owned by Broadcom, has prio...