Wednesday, August 9, 2023

Artificial Intelligence Book 1 - Crash Course in AI - Q Learning

The Crash Course in AI presents a fun project designed to build familiarity with the principles of Q-Learning.

It presents the situation of Robots in a Maze (Warehouse), with the idea that the AI will learn the optimal path through a maze.

To do this, the following procedure is followed:

  1. Define a Location-to-State Mapping, where each alphabetic location (A, B, C, etc.) is correlated to an integer value (A=0, B=1, C=2, D=3, etc.).
  2. Define the Actions (each location is an action, represented by its integer value), which are represented as an array of integers.
  3. Define the Rewards - here, each square in the maze has certain adjacent squares that constitute a valid "move". 
     The set of reward arrays is an "array of arrays", and an "array of arrays" is 
     essentially a matrix! Hence, we can refer to this large "rule set" as a Rewards Matrix.  
        
     # Defining the rewards
     import numpy as np

     R = np.array([
         [0,1,0,0,0,0,0,0,0,0,0,0],  # A's only valid move is to B
         [1,0,1,0,0,1,0,0,0,0,0,0],  # B's valid moves are to A, C, F
         [0,1,0,0,0,0,1,0,0,0,0,0],  # C's valid moves are to B, G
         [0,0,0,0,0,0,0,1,0,0,0,0],  # D's only valid move is to H
         [0,0,0,0,0,0,0,0,1,0,0,0],  # E's only valid move is to I
         [0,1,0,0,0,0,0,0,0,1,0,0],  # F's valid moves are to B, J
         [0,0,1,0,0,0,1,1,0,0,0,0],  # G's valid moves are to C, G, H
         [0,0,0,1,0,0,1,0,0,0,0,1],  # H's valid moves are to D, G, L
         [0,0,0,0,1,0,0,0,0,1,0,0],  # I's valid moves are to E, J
         [0,0,0,0,0,1,0,0,1,0,1,0],  # J's valid moves are to F, I, K
         [0,0,0,0,0,0,0,0,0,1,0,1],  # K's valid moves are to J, L
         [0,0,0,0,0,0,0,1,0,0,1,0]   # L's valid moves are to H, K
     ])

So this array - these "ones and zeroes" - governs the "rules of the road" for the maze. In fact, you could draw the maze out graphically based on these rules alone.

Now - from a simple perspective, armed with this information, you can feed a starting and ending location into the "Engine", and it will compute the optimal path for you. In cases where there are two optimal paths, it may give you one or the other.

But how does it do this? How does it "know"?

This gets into two key concepts that feed an equation known as the Bellman Equation:
  • Temporal Difference - a measure of how well (or how fast) the AI (model) is learning
  • Q-Value - an indicator of which choices led to greater rewards
If we consider that models like this might have thousands or even millions of X/Y coordinate data points (remember, it is a geographic warehouse model), it is not scalable for the AI to store every permutation it encounters as it works through the model. What the Bellman Equation does is allow a Calculus-like coefficient to be used, such that if we hit coordinate X,Y, we know what the optimal steps were to reach it.

Basically, before we start traversing the maze, all Q values are initialized to zero. As we traverse the maze, the model calculates the Temporal Difference: if it is high, the model flags it as a reward; if it is low, it is flagged as a "frustration". High values early on are "pleasant surprises" to the model. So, in summary, as the maze is traversed, the TD is calculated, followed by a Q value adjustment (the Q value for that state/action combination, to be precise).
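
In symbols (the standard Q-Learning form), a move from state s to state s' via action a boils down to those two steps; alpha and gamma are constants I will come back to shortly:

     TD = R(s, a) + gamma * max over a' of Q(s', a') - Q(s, a)
     Q(s, a) = Q(s, a) + alpha * TD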

Now... before I forget to mention this, the Rewards Matrix needs to be adjusted to reflect the desired ending location. For example, if the maze were to begin at point E and end at point G, the cell at row G, column G (that is, [ending_state, ending_state]) would need a huge value telling the AI to stop there and go no further. You can see this in the coding example of the book:

# Optimize the ending state with the ultimate reward
R_new[ending_state, ending_state] = 1000

I have to admit - I started coding before I fully read and digested what was in the book. I got tripped up by two variables in the code: Alpha and Gamma. Alpha was coded as 0.9, while Gamma was coded as 0.75. I was very confused by these constants - what they were, and why they were used. 

I had to go back into the book.
  • Alpha - Learning Rate
  • Gamma - Discount Factor
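
Putting all of the pieces together, the heart of the training loop looks roughly like this (a condensed sketch, not the book's verbatim code; starting at E and ending at G are just example choices):

# Condensed Q-Learning training loop (sketch)
import numpy as np

alpha = 0.9    # learning rate
gamma = 0.75   # discount factor

location_to_state = {ch: i for i, ch in enumerate("ABCDEFGHIJKL")}
ending_state = location_to_state['G']

R_new = np.copy(R)                        # R is the Rewards Matrix above
R_new[ending_state, ending_state] = 1000  # the ultimate reward

Q = np.zeros((12, 12))                    # all Q values initialized to zero
for _ in range(1000):
    current_state = np.random.randint(0, 12)
    # pick a random valid action (an adjacent square) from this state
    playable = [a for a in range(12) if R_new[current_state, a] > 0]
    next_state = int(np.random.choice(playable))
    # Temporal Difference, then the Bellman-style Q value adjustment
    TD = (R_new[current_state, next_state]
          + gamma * Q[next_state].max()
          - Q[current_state, next_state])
    Q[current_state, next_state] += alpha * TD

# Read the route back out by following the highest Q value at each square
state_to_location = {i: ch for ch, i in location_to_state.items()}
route, state = ['E'], location_to_state['E']
while state != ending_state:
    state = int(np.argmax(Q[state]))
    route.append(state_to_location[state])
print(route)   # e.g. ['E', 'I', 'J', 'F', 'B', 'C', 'G']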

Hey, this AI stuff - these algorithms, they're all about Mathematics (as well as Statistics and Probability). I am not a Mathematician, and only took intermediate Calculus, so some of us really need to concentrate and put our thinking caps on if we truly want to follow the math behind these models and equations.


Artificial Intelligence Book 1 - Crash Course in AI - Thompson Sampling

I bought this book in Oct 2020. Maybe due to holidays and other distractions, I never picked the book up until 2021, at which point I decided it took too much mental energy, and set it back down.

Well, now that AI is all the rage in 2023, I decided to pick this book up and push the education along.

I love that this book uses Python. It wants you to use a web-based notebook environment called Colab, which I initially looked at but quickly abandoned in favor of the tried-and-true vi editor.

The book starts off with the Multi-Armed Bandit "problem".

What is that? Well, the name stems from the One-Armed Bandit: a slot machine, which "steals" from the players of the machine.

The Multi-Armed Bandit, I presume, turns this on its head: it represents a player playing a single machine with some number N of arms (or, perhaps, a bank of N single-armed machines). By using a binary system of rewards (0/1), this problem feeds into a Reinforcement Learning example in which the optimal sequence of handle pulls produces the maximum rewards. 

This "use case" of the Multi-Armed-Bandit problem (slot machines or single slot machines with multiple arms), is solved by the use of Thompson Sampling. 

Thompson Sampling (the term Sampling should give this away) is a Statistics-based approach that solves some interesting problems. Take the Multi-Armed Bandit problem just described: just because a slot machine has paid out the most money historically does not mean that machine will continue to be the best choice for the future gambler (or future pulls of the arms). Thompson Sampling, through continual redistribution and retraining, balances exploiting the results of the past against exploring what might pay off in the future - the classic exploration/exploitation trade-off.
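
To make that concrete, here is a minimal Thompson Sampling sketch for the binary-reward bandit described above (not the book's code; the hidden payout rates are made up for the simulation):

# Thompson Sampling over N arms with 0/1 rewards (sketch)
import numpy as np

np.random.seed(1)
true_win_rates = [0.15, 0.25, 0.10]   # hidden payout rate per arm (made up)
n_arms = len(true_win_rates)
wins = np.zeros(n_arms)               # pulls that paid out (reward = 1)
losses = np.zeros(n_arms)             # pulls that did not (reward = 0)

for _ in range(10000):
    # draw one sample per arm from its Beta posterior; play the best draw
    samples = np.random.beta(wins + 1, losses + 1)
    arm = int(np.argmax(samples))
    reward = np.random.rand() < true_win_rates[arm]   # simulated pull
    wins[arm] += reward
    losses[arm] += 1 - reward

print("Pulls per arm:", (wins + losses).astype(int))

The Beta(wins+1, losses+1) draw is what balances the two forces: arms with good track records produce high samples (exploitation), while arms with few pulls still have wide posteriors and occasionally win the draw (exploration).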

The fascinating thing about Thompson Sampling is that it was developed back in 1933 and then largely ignored until relatively recently. The algorithm (or renditions of it) has since been applied in a number of areas, by a growing number of large companies, to solve interesting problems.

In this book, the problem that employs Thompson Sampling is one in which a paid subscription is offered, and the company needs to figure out the price point that optimizes revenue.

Sources: 

Weber, Richard (1992), "On the Gittins index for multiarmed bandits", Annals of Applied Probability

Russo, Daniel J.; Van Roy, Benjamin; Kazerouni, Abbas; Osband, Ian; Wen, Zheng (2018), "A Tutorial on Thompson Sampling", Foundations and Trends in Machine Learning

Friday, July 21, 2023

Do you need BGP with a VxLAN WAN?

I presumed that VxLAN technology would supersede/replace the need for EVPN.

This led me to do some additional research on EVPN vs VxLAN, and what I am finding is that there are some benefits to using the two together.

This link from Cisco, discusses this:

VXLAN Network with MP-BGP EVPN Control Plane Design Guide

That design guide lists some specific benefits of using MP-BGP for the control plane of a VxLAN tunneled overlay network:

  1. The MP-BGP EVPN protocol is based on industry standards, allowing multivendor interoperability.
  2. It enables control-plane learning of end-host Layer-2 and Layer-3 reachability information, enabling organizations to build more robust and scalable VXLAN overlay networks.
  3. It uses the decade-old MP-BGP VPN technology to support scalable multi-tenant VXLAN overlay networks.
  4. The EVPN address family carries both Layer-2 and Layer-3 reachability information, thus providing integrated bridging and routing in VXLAN overlay networks.
  5. It minimizes network flooding through protocol-based host MAC/IP route distribution and Address Resolution Protocol (ARP) suppression on the local VTEPs.
  6. It provides optimal forwarding for east-west and north-south traffic and supports workload mobility with the distributed anycast function.
  7. It provides VTEP peer discovery and authentication, mitigating the risk of rogue VTEPs in the VXLAN overlay network.
  8. It provides mechanisms for building active-active multihoming at Layer-2.

Wednesday, June 28, 2023

VMWare Storage - Hardware Acceleration Status

Today, we had a customer call in and tell us that they couldn't do Thick provisioning from a vCenter template. We went into vCenter (the GUI), and sure enough, we could only provision Thin virtual machines from it.

But - apparently on another vCenter cluster, they COULD provision Thick virtual machines. There seemed to be no difference between the virtual machines. Note that we are using NFS and not Block or iSCSI or Fibre Channel storage.

We went into vCenter, and lo and behold, we saw this situation...


NOTE: To get to this screen in VMWare's very cumbersome GUI, you have to click on the individual datastore, then click "Configure", and a tab called "Hardware Acceleration" appears.

So what we have here is one datastore that says "Not Supported" on a host, and another datastore in the same datastore cluster that says "Supported" on the exact same host. This sounds bad. This looks bad. Inconsistency. Looks like a problem.

So what IS hardware acceleration when it comes to Storage? To find out, I located this knowledge base article:

 Storage Hardware Acceleration

There is also a link for running storage HW acceleration on NAS devices:

Storage Hardware Acceleration on NAS Devices 

When you reference the two links above, there are some additional related links on the left-hand side as well.

For each storage device and datastore, the vSphere Client displays the hardware acceleration support status.

The status values are Unknown, Supported, and Not Supported. The initial value is Unknown.

For block devices, the status changes to Supported after the host successfully performs the offload operation. If the offload operation fails, the status changes to Not Supported. The status remains Unknown if the device provides partial hardware acceleration support.

With NAS, the status becomes Supported when the storage can perform at least one hardware offload operation.

When storage devices do not support or provide partial support for the host operations, your host reverts to its native methods to perform unsupported operations.

NFS = NAS, I am pretty darned sure. 

So this is classic VMWare confusion. They are using a "Status" field with values of Supported / Not Supported, when in fact Supported means "working" and Not Supported means "not working", based (only) on the last operation attempted.

So. Apparently, if a failure on this offload operation occurs, the flag gets set to Not Supported, and guess what? That means you cannot do *any* Thick Provisioning.

When we contacted VMWare, they wanted us to re-load the storage plugin. Yikes. Stay tuned...

VMWare also has some Best Practices for running iSCSI Storage, and the link to that is found at:

VMWare iSCSI Storage Best Practices

Wednesday, April 19, 2023

Colorizing Text in Linux

I went hunting today for a package that I had used to colorize text. There are tons of those out there, of course. But what if you want to filter the text and colorize it based on a set of rules?

There's probably a lot of stuff out there for that, too. Colord, for example, runs as a daemon in Linux (though it is aimed at device color management rather than text).

Another package is grc, found at this GitHub site: https://github.com/garabik/grc

Use Case: 

I had a log that was printing information related to exchanges with different servers. I decided to colorize these so that messages from Server A were green, Server B blue, etc. In this way, I could do really cool things like suppress messages from Server B (no colorization), or take Control Plane messages from, say, Server C, and highlight those in yellow.

This came in very handy during a Demo, where people were watching the messages display in rapid succession on a large screen.
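
As a rough illustration of the rule-based idea in plain Python (grc itself drives this from regex rules in its config files; the server names and patterns here are made up):

# colorize.py - color log lines by matching regex rules (sketch)
import re
import sys

RESET = "\033[0m"
RULES = [
    (re.compile(r"Server A"), "\033[32m"),                       # green
    (re.compile(r"Server C.*Control Plane", re.I), "\033[33m"),  # yellow
]
# Server B gets no rule, so its lines stay uncolored (de-emphasized)

for line in sys.stdin:
    for pattern, color in RULES:
        if pattern.search(line):
            line = color + line.rstrip("\n") + RESET + "\n"
            break
    sys.stdout.write(line)

Piping the log through it (tail -f app.log | python colorize.py) gives the same sort of effect.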

Monday, February 27, 2023

Hyperthreading vs Non-Hyperthreading on an ESXi Hypervisor

We started to notice that several VNF (Virtual Network Function) vendors were recommending turning off (disabling) Hyper-threading on hypervisors. But why? They claimed it helped their performance. 

Throwing that switch means the number of logical cores exposed to users is cut in half: a 24-core CPU presents 48 logical cores if Hyper-threading is enabled, and only 24 if it is disabled. 

This post isn't meant to go into the depths of Hyper-threading itself. The question we had was whether disabling or enabling it affected performance, and to what degree.

We ran a benchmark that comprised three "layers":

  • Non-Hyperthreaded (24 cores) vs Hyperthreaded (48 cores)
  • Increasing vCPU of the Benchmark VM (increments of eight: 1,8,16,24)
  • Each test ran several Sysbench tests with increasing threads (1,2,4,8,16,32) 
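
A rough Python driver for that kind of sweep might look like this (a sketch; it assumes sysbench is installed and scrapes the "events per second" line printed by the sysbench 1.0 cpu test):

# Run the sysbench cpu test with increasing thread counts (sketch)
import re
import subprocess

for threads in (1, 2, 4, 8, 16, 32):
    out = subprocess.run(
        ["sysbench", "cpu", f"--threads={threads}", "run"],
        capture_output=True, text=True, check=True,
    ).stdout
    eps = re.search(r"events per second:\s*([\d.]+)", out).group(1)
    print(f"{threads} threads: {eps} events/sec")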

The servers we ran this on were Cisco M5s (512 GB RAM, 24 cores).

We collected the results in Excel, ran a Pivot Chart graph on them, and this is what we found (below).

[Chart: VM with 1, 8, 16, 24 vCPU running Sysbench with increasing threads, on a Hyperthread-disabled system (24 cores) vs a Hyperthread-enabled system (48 cores)]

It appears to me that Hyperthreading starts to look promising when two things happen:

  1. vCPU resources on the VM increase past a threshold of about 8 vCPU.
  2. An application is multi-threaded and is launching 16 or more threads.

Notice that on an 8 vCPU virtual machine, the "magic number" is 8 threads. On a 16 vCPU virtual machine, you do not see hyperthreading become an advantage until 16 threads are launched. On a 24 vCPU system, we start to see hyperthreading become favorable at about 16 threads and higher.

BUT - if the thread count is low, between 1 and about 8, hyperthreading works against you.

Thursday, February 16, 2023

Morpheus API - pyMorpheus Python API Wrapper

I have been working on some API development in the Morpheus CMP tool.

The first thing I do when I need to use an API is to see if there is a good API wrapper for it. I found one out on GitHub, called pyMorpheus.

With this wrapper, I was up and running in absolutely no time, making calls to the API, parsing JSON responses, etc.

The Use Case I am working on is a "reconciliator" that will do two things:

  • Remove Orphaned VMs - find, and delete (upon user confirmation), those VMs that have had their "rug pulled out" from Morpheus (deleted in vCenter but still sitting in Morpheus as an Instance).
  • Convert Certain Discovered VMs to Morpheus Instances

The second part sorta kinda worked. The call to https://<applianceurl>/servers/id/make-managed did take a Discovered VM and convert it to an instance, with a "VMWare" logo on it. 

But I was unable to set advanced attributes of the VMs - Instance Type, Layout, Plan, etc. - and this made it only a partial success.
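
For reference, the conversion call itself is simple with plain requests (a sketch; the PUT verb, the /api prefix, and the empty payload are my assumptions, not verified against the API docs):

# Convert a Discovered VM (server) into a managed instance (sketch)
import requests

MORPHEUS_URL = "https://<applianceurl>"                # as above
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}   # hypothetical token

def make_managed(server_id):
    url = f"{MORPHEUS_URL}/api/servers/{server_id}/make-managed"
    resp = requests.put(url, headers=HEADERS, json={}, verify=False)
    resp.raise_for_status()
    return resp.json()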

Maybe if we can get the API fixed up a bit, we can get this to work.

One issue, is the "Cloud Sync". When we call the API, we do a cloud sync, to find Discovered VMs. We do the same cloud sync, to determine whether any of the VM's fields in Morpheus change their state, if someone deletes a VM in vCenter (such a state change gives us the indicator that the VM is, in fact, now an orphan).  The Cloud Sync is an asynchronous call. You have to wait for an indefinite amount of time, to ensure that the results you are looking for in vCenter, are reflected in Morpheus. It's basically polling, which is not an exact art. For this reason, the reconciliator tool needs to be run as an operations tool, manually, as opposed to some kind of batch scheduled job.
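
The workaround is a generic poll-with-timeout pattern (a sketch; check() would wrap whatever Morpheus query tells you the sync has settled):

# Poll until a condition holds or a timeout expires (sketch)
import time

def wait_for(check, timeout=300, interval=10):
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = check()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("cloud sync did not settle in time")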

