Friday, August 18, 2023

Recovering a Corrupted NSX-T Manager

If your NSX-T Manager runs as a cluster of VMs and one node is corrupted, there is a good chance they all are, especially if the cause was a loss of storage connectivity. Or, maybe it is just one. Either way, if you don't have a backup to restore, these steps can be used to repair the file system. Mileage varies on repairing file systems, so there is no guarantee this will work, but this is the process to attempt nonetheless.

1.  Connect to the console of the appliance.

2.  Reboot the system.

3.  When the GRUB boot menu appears, press the left SHIFT or ESC key quickly. If you wait too long and the boot sequence does not pause, you must reboot the system again. Press e to edit the menu.

4.  Keep the cursor on the Ubuntu selection.

5.  Press e to edit the selected option.

6.  Enter the user name (root) and the GRUB password for root (not the same as the appliance's root user password). The default GRUB password is "VMware1" before release 3.2 and "NSX@VM!WaR10" for 3.2 and beyond.

7.  Find the line that starts with "linux"; it holds the boot command.

8.  Remove all options after root= (starting from UUID) and add "rw single init=/bin/bash".

9.  Press Ctrl-X to boot.

10.  When the log messages stop, press Enter. You will see the prompt root@(none):/#.

11.  Run the following commands to repair the file system.

  • e2fsck -y /dev/sda1
  • e2fsck -y /dev/sda2
  • e2fsck -y /dev/sda3
  • e2fsck -y /dev/mapper/nsx-config
  • e2fsck -y /dev/mapper/nsx-image
  • e2fsck -y /dev/mapper/nsx-var+log
  • e2fsck -y /dev/mapper/nsx-repository
  • e2fsck -y /dev/mapper/nsx-secondary

The Linux XFS File System - How Resilient Is It?

We are using VMware datastores over NFS version 3.x. The storage was routed, which is never a good thing to do because, let's face it, if your VMs all lose their storage simultaneously, that constitutes a disaster. Depending on a router, which can lose its routing prefixes due to a maintenance or configuration problem, is architecturally deficient (a polite way of putting it). To solve this, make sure there are no routing hops: put the storage on the same segment as the storage interface on the hypervisor.

So, after our storage routers went AWOL due to a maintenance event, I noticed some VMs came back and appeared to be fine. They had rebooted and were at a login prompt.  Other VMs, however, did not come back, and had some nasty things printing on the console (you could not log into these VMs).


What we noticed was that any Linux virtual machine running XFS on boot or root (/boot or /) was unrecoverable. VMs that were using ext3 or ext4 seemed to be able to recover and start running their services, although some were still echoing messages to the console.

There is a lesson here: the file system matters when it comes to resiliency in a virtualized environment.

I did some searching around for discussions on file system types, and of course there are many. This one in particular, I found interesting:  ext4-vs-xfs-vs-btrfs-vs-zfs-for-nas


Wednesday, August 9, 2023

Artificial Intelligence Book 1 - Crash Course in AI - Q Learning

The Crash Course in AI presents a fun project for the purpose of developing a familiarization with the principles of Q Learning.  

It presents the situation of Robots in a Maze (Warehouse), with the idea that the AI will learn the optimal path through a maze.

To do this, the following procedure is followed:

  1. Define a Location-to-State Mapping where each location (alphabetic; A, B, C, etc) is correlated to an Integer value (A=0, B=1, C=2, D=3, et al).
  2. Define the Actions. Each action is a move to a location, represented by that location's integer value, so the full set of actions is an array of integers.
  3. Define the Rewards. Each square in the maze has certain adjacent squares that constitute a valid "move".
     The set of Reward Arrays is an "array of arrays", and we know that an "array of arrays" is
     essentially a matrix! Hence, we can refer to this large "rule set" as a Rewards Matrix.
        
     # Defining the rewards
     import numpy as np

     # Row = current location, column = destination; 1 marks a valid move.
     R = np.array([
         [0,1,0,0,0,0,0,0,0,0,0,0],  # A's only valid move is to B
         [1,0,1,0,0,1,0,0,0,0,0,0],  # B's valid moves are A, C, F
         [0,1,0,0,0,0,1,0,0,0,0,0],  # C's valid moves are B, G
         [0,0,0,0,0,0,0,1,0,0,0,0],  # D's only valid move is to H
         [0,0,0,0,0,0,0,0,1,0,0,0],  # E's only valid move is to I
         [0,1,0,0,0,0,0,0,0,1,0,0],  # F's valid moves are B, J
         [0,0,1,0,0,0,1,1,0,0,0,0],  # G's valid moves are C, G, H
         [0,0,0,1,0,0,1,0,0,0,0,1],  # H's valid moves are D, G, L
         [0,0,0,0,1,0,0,0,0,1,0,0],  # I's valid moves are E, J
         [0,0,0,0,0,1,0,0,1,0,1,0],  # J's valid moves are F, I, K
         [0,0,0,0,0,0,0,0,0,1,0,1],  # K's valid moves are J, L
         [0,0,0,0,0,0,0,1,0,0,1,0],  # L's valid moves are H, K
     ])

So this array, these "ones and zeroes" govern the "rules of the road" in terms of the maze. In fact, you could draw the maze out graphically based on these rules.
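To make the "rules of the road" idea concrete, here is a small sketch that reads the valid moves back out of the matrix. It is my own illustration, not code from the book: the helper name valid_moves is made up, and I use plain Python lists instead of NumPy so it runs with no extra packages.

```python
# The same 12x12 rewards matrix as above, as plain lists (no NumPy needed).
R = [
    [0,1,0,0,0,0,0,0,0,0,0,0],
    [1,0,1,0,0,1,0,0,0,0,0,0],
    [0,1,0,0,0,0,1,0,0,0,0,0],
    [0,0,0,0,0,0,0,1,0,0,0,0],
    [0,0,0,0,0,0,0,0,1,0,0,0],
    [0,1,0,0,0,0,0,0,0,1,0,0],
    [0,0,1,0,0,0,1,1,0,0,0,0],
    [0,0,0,1,0,0,1,0,0,0,0,1],
    [0,0,0,0,1,0,0,0,0,1,0,0],
    [0,0,0,0,0,1,0,0,1,0,1,0],
    [0,0,0,0,0,0,0,0,0,1,0,1],
    [0,0,0,0,0,0,0,1,0,0,1,0],
]
LOCATIONS = "ABCDEFGHIJKL"  # index 0 = A, 1 = B, and so on


def valid_moves(R, locations=LOCATIONS):
    """Hypothetical helper: map each location letter to the letters it can move to."""
    return {
        locations[i]: [locations[j] for j, reward in enumerate(row) if reward > 0]
        for i, row in enumerate(R)
    }


moves = valid_moves(R)
for loc, targets in moves.items():
    print(loc, "->", ", ".join(targets))
```

Each printed line is one square of the maze and its neighbors, which is enough to sketch the maze on paper.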

Now - from a simple perspective, armed with this information, you can feed a starting and ending location into the "Engine", and it will compute the optimal path for you. In cases where there are two optimal paths, it may give you one or the other.

But how does it do this? How does it "know"?

This gets into two key concepts that feed an equation known as the Bellman Equation.
  • Temporal Difference - the gap between what the model just observed (the reward plus what it expects from the next state) and what it had predicted; it drives how fast the AI (model) learns
  • Q-Value - this is an indicator of which choices (state/action combinations) led to greater rewards
If we consider that models like this might have thousands or even millions of X/Y coordinate data points (remember, it is a geographic warehouse model), it is not scalable for the AI to store all of the permutations of these as it works through the model. What the Bellman Equation does is let each state/action combination carry a single running value, so that when we hit coordinate X,Y we know the best next step without storing every path that led there.

Before we start traversing the maze, all Q values are initialized to zero. As we traverse the maze, the model calculates the Temporal Difference; if it is high, the model flags it as a Reward, while if it is low, it is flagged as a "frustration". High values early on are "pleasant surprises" to the model. So, in summary, as the maze is traversed, the TD is calculated, followed by a Q value adjustment (the Q Value for the state/action combination, to be precise).
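A single step of that "calculate the TD, then adjust Q" cycle can be sketched as below. This is the standard Q-learning update as I understand it, not a copy of the book's code; the function name td_update and the tiny two-location maze are made up for illustration, and alpha and gamma are the learning rate and discount factor that come up later in this post.

```python
def td_update(Q, R, state, action, alpha, gamma):
    """Apply one temporal-difference update to Q[state][action]; return the TD."""
    next_state = action  # in this maze, taking action j means moving to location j
    # TD = reward now + discounted best value of the next state - current estimate
    td = R[state][action] + gamma * max(Q[next_state]) - Q[state][action]
    Q[state][action] += alpha * td
    return td


# Toy maze (assumption): two locations that can only move to each other.
R = [[0, 1],
     [1, 0]]
Q = [[0.0, 0.0],
     [0.0, 0.0]]  # all Q values start at zero

td1 = td_update(Q, R, state=0, action=1, alpha=0.9, gamma=0.75)
print(td1, Q[0][1])  # 1.0 0.9  (TD is just the reward; Q moves 90% of the way)

td2 = td_update(Q, R, state=1, action=0, alpha=0.9, gamma=0.75)
print(td2, Q[1][0])  # about 1.675 and 1.5075: the next state now has value too
```

The second update is larger than the first because the move leads to a state that already has a positive Q value, which is the "pleasant surprise" effect.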

Now, before I forget to mention this: the Rewards Matrix needs to be adjusted to reflect the desired ending location. For example, if the maze was to begin at point E and end at point G, the cell at row G, column G (the ending location on both axes) would need to have a huge value that would tell the AI to stop there and go no further. You can see this in the coding example of the book:

# Optimize the ending state with the ultimate reward
R_new[ending_state, ending_state] = 1000

I have to admit - I started coding, before I fully read and digested what was in the book. I got tripped up by two variables in the code: Alpha, Gamma. Alpha was coded as .9, while Gamma was coded as .75. I was very confused by these constants; what they were, why they were used. 

I had to go back into the book.
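  • Alpha - Learning Rate: how much of each Temporal Difference gets applied to the Q value
  • Gamma - Discount Factor: how much future rewards count compared to immediate ones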
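To see where Alpha and Gamma actually show up, here is a rough end-to-end sketch: train the Q values on the warehouse matrix, then follow the highest Q value from E to G. It follows the approach described above but is my own stdlib-only version, not the book's listing. The function names, the 5000 iterations and the fixed seed are my assumptions.

```python
import random

LOCATIONS = "ABCDEFGHIJKL"
R = [
    [0,1,0,0,0,0,0,0,0,0,0,0],
    [1,0,1,0,0,1,0,0,0,0,0,0],
    [0,1,0,0,0,0,1,0,0,0,0,0],
    [0,0,0,0,0,0,0,1,0,0,0,0],
    [0,0,0,0,0,0,0,0,1,0,0,0],
    [0,1,0,0,0,0,0,0,0,1,0,0],
    [0,0,1,0,0,0,1,1,0,0,0,0],
    [0,0,0,1,0,0,1,0,0,0,0,1],
    [0,0,0,0,1,0,0,0,0,1,0,0],
    [0,0,0,0,0,1,0,0,1,0,1,0],
    [0,0,0,0,0,0,0,0,0,1,0,1],
    [0,0,0,0,0,0,0,1,0,0,1,0],
]


def train(R, ending_state, alpha=0.9, gamma=0.75, iterations=5000, seed=0):
    """Learn Q values by taking random valid moves from random locations."""
    rng = random.Random(seed)
    n = len(R)
    R_new = [row[:] for row in R]
    R_new[ending_state][ending_state] = 1000  # the "ultimate reward"
    Q = [[0.0] * n for _ in range(n)]
    for _ in range(iterations):
        state = rng.randrange(n)
        playable = [j for j in range(n) if R_new[state][j] > 0]
        action = rng.choice(playable)  # taking action j moves us to location j
        td = R_new[state][action] + gamma * max(Q[action]) - Q[state][action]
        Q[state][action] += alpha * td  # alpha scales how much of the TD we apply
    return Q


def best_route(Q, start, end, locations=LOCATIONS):
    """Follow the highest Q value from start until end (with a step cap)."""
    state, goal = locations.index(start), locations.index(end)
    path = [start]
    for _ in range(2 * len(Q)):  # cap so a badly trained Q cannot loop forever
        if state == goal:
            break
        state = max(range(len(Q)), key=lambda j: Q[state][j])
        path.append(locations[state])
    return path


Q = train(R, ending_state=LOCATIONS.index("G"))
path = best_route(Q, "E", "G")
print(" -> ".join(path))
```

There are two equally short routes from E to G in this maze (through F, B, C or through K, L, H), so which one gets printed depends on the random seed.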

Hey, this AI stuff - these algorithms, they're all about Mathematics (as well as Statistics and Probability). I am not a Mathematician, and only took intermediate Calculus, so some of us really need to concentrate and put our thinking caps on if we truly want to follow the math behind these models and equations.


Artificial Intelligence Book 1 - Crash Course in AI - Thompson Sampling

I bought this book in Oct 2020. Maybe due to holidays and other distractions, I never picked the book up until 2021, at which point I decided it took mental energy, and set it back down.

Well, now that AI is all the rage in 2023, I decided to pick this book up and push the education along.

I love that this book uses Python. It wants you to use some user interface called Colab, which I initially looked at but quickly abandoned in favor of the tried-and-true vi editor.

The book starts off with the Multi-Armed-Bandit "problem".

What is that? Well, the name stems from the One-Armed-Bandit; a slot-machine, which "steals" from the players of the machine.  

The Multi-Armed-Bandit, I presume, turns this on its head, as it represents a slot machine player who is playing a single machine with N arms (or, perhaps, a bank of N single-armed machines). By using a binary system of rewards (0/1), this problem feeds into a Reinforcement Learning example where the goal is to find the sequence of handle pulls that results in the maximum reward.

This "use case" of the Multi-Armed-Bandit problem (slot machines or single slot machines with multiple arms), is solved by the use of Thompson Sampling. 

Thompson Sampling (the term Sampling should give this away) is a Statistics-based approach that solves some interesting problems. For example, take the case of the Multi-Armed Bandit problem just described. Just because a slot machine has paid out the most money historically, it does not mean that the particular slot machine will continue to be the best choice for the future gambler (or future pulls of the arms on the slot machine/s). Thompson Sampling, by continually updating a probability distribution for each arm as results come in, accommodates the idea of exploiting the results of the past while still exploring arms that may turn out better in the future.
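Here is a minimal sketch of how that plays out with 0/1 rewards. It uses the common Beta-distribution form of Thompson Sampling rather than the book's exact listing; the three payout rates, the function name and the seed are made-up values for illustration only.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Play `rounds` pulls across the arms; return pulls, wins, losses per arm."""
    rng = random.Random(seed)
    n = len(true_rates)
    wins = [0] * n    # times each arm paid out (reward 1)
    losses = [0] * n  # times each arm did not pay out (reward 0)
    pulls = [0] * n
    for _ in range(rounds):
        # Explore: draw one plausible payout rate per arm from its Beta distribution.
        samples = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(n)]
        # Exploit: pull the arm whose sampled rate is highest this round.
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < true_rates[arm] else 0
        if reward:
            wins[arm] += 1
        else:
            losses[arm] += 1
        pulls[arm] += 1
    return pulls, wins, losses


# Hypothetical machines: the third one pays out most often.
pulls, wins, losses = thompson_sampling([0.1, 0.3, 0.8], rounds=2000)
print("pulls per arm:", pulls)
```

Early on, every arm gets tried because the distributions are wide. As wins and losses pile up, the samples for the weaker arms rarely come out on top, so most pulls go to the best arm.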

The fascinating thing about Thompson Sampling, is that it was developed back in 1933, and largely ignored until more recently. The algorithm (or a rendition of it) has been applied in a number of areas recently, by a growing number of larger-sized companies, to solve interesting issues.

In this book, the problem that employs Thompson Sampling, is one in which a paid Subscription is offered, and the company needs to figure out how to optimize the revenue at the right price point.

Sources: 

Weber, Richard (1992), "On the Gittins index for multiarmed bandits", Annals of Applied Probability

Russo, Daniel J. (Columbia University); Van Roy, Benjamin (Stanford University); Kazerouni, Abbas (Stanford University); Osband, Ian (Google DeepMind); Wen, Zheng (Adobe Research), "A Tutorial on Thompson Sampling"
