Grasping Technology: 2023

Monday, November 27, 2023

Quantum Computing - Is it worth an investment of my time to understand?

I recently watched a presentation on Quantum Computing by a guy who, it seems to me, follows it as a fascination or hobby. After watching the presentation, I decided to do some quick searching on Quantum Computing to see if there were some things I was specifically looking for that I didn't see covered in the presentation - especially along practical lines.

I found this site, which essentially corroborated his presentation:

https://www.explainthatstuff.com/quantum-computing.html

Absolute KUDOS to the author of this site, because he explains an advanced topic in a simple way, and discusses many of the top-of-the-head questions one might have about Quantum Computing. If you can stay patient and get down to the bottom, he shows a patent of a Quantum Computing Architecture, which is super interesting.

https://cdn4.explainthatstuff.com/monroe-kim-quantum-computer.png

He also makes this statement, which is very interesting:

"Does that mean quantum computers are better than conventional ones? Not exactly. Apart from Shor's algorithm, and a search method called Grover's algorithm, hardly any other algorithms have been discovered that would be better performed by quantum methods."

I remembered reading a blurb about Einstein some years back, and some of his comments about "Spooky Action At a Distance", where an electron "way over here" would seem to be inextricably and inexplicably linked to another electron "way over there". And, while we even today don't seem to have a full or proven explanation of why that behavior happens, we are apparently finding it reliable enough to exploit it for purposes of Quantum Computing (the key concept here is Entanglement). Combined with superposition (see Schrödinger's Cat), Entanglement is what lets a quantum computer work with many states at once. I won't even attempt to go deeper than this on this post.

Now...why do we want atomic level computing? With states that are entangled (see concept of entanglement), and the whole bit?

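The main Use Case for this level of expense and sophistication is cryptography - the ability to break ciphers. Shor's algorithm targets public-key encryption such as RSA, which relies on large numbers being hard to factor. Symmetric ciphers like 256 bit AES are less exposed: Grover's algorithm only cuts the effective key strength roughly in half.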

But I wanted to see if there were others, and found this site, which kind of shows "who is doing what" with regards to Quantum Computing.

Quantum Computing Applications

I think between these links here, you can be brought up to speed on what Quantum Computing is, why it is a Thing, and some potential uses for it. Which is what most of us essentially want at this point.

Thursday, November 16, 2023

Artificial Intelligence Book 1 - Crash Course in AI - Chapter 13 - Memory Patch

Okay, last chapter in the book!

In this chapter, you get to "create" (actually, "create" means download and run) some Github hosted code that allows you to train a model to learn how to play the video game "Snake".

Snake is an early video game, probably from the 1970s or 1980s. I don't know the details of it but I am sure there is plenty of history on it. I think you could run it on those Radio Shack (Tandy) TRS-80 computers that had only a few kilobytes of RAM and saved to a magnetic cassette tape (I remember you could play Pong, and I think Snake was one of them also).

The idea was that each time the snake ate an apple (red square) the snake's length would increase (by one square). You could move up, down, left, right constrained by coordinate boundaries, and if the snake overlapped with itself, it died and the game ended.

When I first ran the model training for this, it ran for more than a day - perhaps all weekend, and then died. The command prompt, when I returned to check on progress, had a [ Killed ] message.

I had other models in this book die this way, and decided that I was running out of memory, and my solution to the other models was to edit the source code, and decrease the number of Epochs, and reduce the loop complexity. This made the models a LOT less efficient and reliable, but I still saw beneficial results from running them with this tactic.

In this case, for some reason, I went to GitHub and looked at the Issues, and I saw a guy complaining about a Memory Leak in the TensorFlow libraries. There was a patch to fix this!

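Below is the output of a Unix/Linux "diff" command, which shows this patch. Lines marked "<" come from the patched train.py, and lines marked ">" come from the original copy (train.py.memoryleak):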

% diff train.py train.py.memoryleak
5d4
< import tensorflow as tf
12,15d10
< import gc
< import os
< import keras
<
64,67c59
<             #qvalues = model.predict(currentState)[0]
<             qvalues = model.predict(tf.convert_to_tensor(currentState))[0]
<             gc.collect()
<             keras.backend.clear_session()
---
>             qvalues = model.predict(currentState)[0]

So in summary, the patches are:

The original statement qvalues = model.predict(currentState)[0] is replaced by:

qvalues = model.predict(tf.convert_to_tensor(currentState))[0]

A garbage collection call, gc.collect(), is added after each prediction.
A Keras library call, keras.backend.clear_session(), is added as well.

Of course some imports are necessary to reference and use these new calls.

This fixes the memory problem. It does not appear that the training will ever end on its own when you run this code. You have to Ctrl-C it to get it to stop, because it just trains and trains, looking for a better score and more apples. I had to learn this the hard way after running train.py for a full weekend.

So this wraps up the book for me. I may do some review on it, and will likely move on to some new code samples and other books.

Friday, October 20, 2023

Artificial Intelligence Book 1 - Using Images in AI - Deep Convolutional Q Learning

I just finished Chapter 12 in my AI Crash Course book, from Hadelin de Ponteves.

Chapter 12 is a short chapter actually. It explains, in a refreshing and surprisingly simple way, the concept of Convolutional Q Learning, which pertains to how image recognition/translation is fed into a Deep Q Neural Network (from prior chapters in the book).

The chapter covers four steps:

1. Convolution - applying feature detectors to an image
2. Max Pooling - simplifying the data
3. Flattening - taking all of the results of #1 and #2 and putting them into a one-dimensional array
4. Full Connection - feeding the one-dimensional array as Inputs into the Deep Q Learning model

I probably don't need to go over all of these details in this blog as that would be redundant.

If you have some exposure to Computing and are familiar with Bitmaps, I think this process shares some conceptual similarity to Bitmaps.

For example, in Step 1 - Convolution - you are essentially sliding a feature detector or "filter" (e.g. 3x3 in the book) over an image - starting on Row 1, and sliding it left to right one column at a time before dropping down to Row 2 and repeating that process. On each slide interval, you are multiplying each square of the image under the filter (again a 3x3 area in the book) by the corresponding value of the filter, and adding the results up. Each individual position produces a single value in the Feature Map.

In sliding this 3x3 filter across a 7x7 image, you can only fit 5 positions across before you run out of real estate. Then you drop down - and you can only fit 5 positions in that direction too. So if I am correct in my interpretation about how this works, you would get one Feature Map of 5 x 5 = 25 values from a 7x7 image and a 3x3 filter.
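To check the 5 x 5 count, here is a minimal sketch of the sliding process in plain Python. The checkerboard image and the filter values are made up for illustration and are not taken from the book; the function name convolve2d is also my own.

```python
# Slide a 3x3 filter over a 7x7 image, one column/row at a time (no padding).
# The image and filter values are invented purely for illustration.

def convolve2d(image, kernel):
    """Return the feature map produced by sliding `kernel` over `image`."""
    k = len(kernel)
    out_size = len(image) - k + 1      # 7 - 3 + 1 = 5 positions in each direction
    feature_map = []
    for row in range(out_size):
        out_row = []
        for col in range(out_size):
            # Multiply each image square under the filter by the matching
            # filter value, then add the products into a single number.
            total = sum(
                image[row + i][col + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


# 7x7 checkerboard of 0s and 1s
image = [[(r + c) % 2 for c in range(7)] for r in range(7)]
kernel = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

feature_map = convolve2d(image, kernel)
print(len(feature_map), "x", len(feature_map[0]))  # 5 x 5
```

Each of the 25 positions contributes one number, so the result is a single 5 x 5 grid rather than 25 separate grids.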

Pooling is actually similar to the process of filtering to a feature map. The aim of it is to reduce the size and complexity of all of those feature maps. The sliding process is the main difference; instead of going one column / row on each slide, you jump by the full pool size (2x2 in the book), so the windows don't overlap, and with Max Pooling you keep only the largest value in each window.

Once you get all of the pools, these are flattened out into a single dimensional array, and fed into the Inputs of the standard Q Learning model, with the outputs pertaining to image recognition.
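A rough sketch of pooling and flattening, again in plain Python. The 4x4 feature map is invented for the example, and dropping any leftover edge rows/columns is my own simplification; the book may handle edges differently.

```python
def max_pool(feature_map, pool=2):
    """Keep the largest value in each non-overlapping pool x pool window."""
    size = len(feature_map) // pool    # leftover edge rows/columns are dropped here
    return [
        [
            max(
                feature_map[r * pool + i][c * pool + j]
                for i in range(pool)
                for j in range(pool)
            )
            for c in range(size)
        ]
        for r in range(size)
    ]


def flatten(pooled):
    """Lay the pooled grid out as a one-dimensional list, row by row."""
    return [value for row in pooled for value in row]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
]

pooled = max_pool(feature_map)   # [[4, 2], [2, 7]]
inputs = flatten(pooled)         # [4, 2, 2, 7]
print(pooled, inputs)
```

The flattened list is what would be handed to the input layer of the neural network.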

This diagram shows how all of this is done, with a nice comparison between Biological image recognition, with this AI image recognition process.

Image Recognition - Biological vs AI

Source: frontiersin.org

Now in Chapter 12 of the book, the process represents what we see above. There is just a single Convolutional Layer and Pooling Layer before the AI Neural Network (hidden layers) is engaged.

Chapter 12 does not cover the fact that Convolution followed by Sub-Sampling is often an iterative process, repeated several times before the hidden layers.

This diagram below represents this.

In the next chapter, there is a code sample, so I will be able to see whether it includes this sub-sampling process or not.

Deep Q Learning - Neural Networks - Training the Model Takes Resources

I now am starting to see why those companies with deep pockets have an unfair advantage in the not-so-level playing field of adopting AI. Resources.

It takes a LOT of energy and computing resources to train these Artificial Intelligence models.

In Chapter 11 of AI Crash Course (by Hadelin de Ponteves), I did the work. I downloaded, inspected, and ran the examples, which are based on Google's DeepMind project. The idea is to use an AI to control server temperature, and compare this with an "internal" (no AI) temperature manager.

What you would do, is to train the model (first), and it would produce a model.h5 file, that would then be used when you ran the actual model through testing.

The problem, though, is that on my rather powerful Mac Pro laptop, the training would never run. I would return HOURS later, only to see [ killed ] on the command prompt. The OS apparently was running out of resources (memory probably).

So I started tinkering with the code.

First, I reduced the number of epochs (from 25 to 10).

#number_epochs = 25

number_epochs = 10

Which looked like it helped, but ultimately didn't work.

Then, I reduced the number of times the training loops would run. When I looked at the original code, the number of iterations was enormous.

# STARTING THE LOOP OVER ALL THE TIMESTEPS (1 Timestep = 1 Minute) IN ONE EPOCH

while ((not game_over) and timestep <= 5 * 30 * 24 * 60):

This is 216,000 loop iterations in the inner loop, and of course this needs to be considered from the context of the outer loop (25, or, adjusted down to 10 as I did). So 216,000 * 25 = 5 million, 400 thousand. If we reduce to 10 the number of Epochs, we are dealing with 2 million, 160 thousand.
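The loop arithmetic is easy to check in Python. Reading 5 * 30 * 24 * 60 as "5 months of minutes" is my interpretation of the loop bound, based on the "1 Timestep = 1 Minute" comment in the code; the variable names below are my own.

```python
# Upper bound on inner-loop iterations per epoch, from the original while condition.
steps_per_epoch = 5 * 30 * 24 * 60      # read here as 5 months x 30 days x 24 hours x 60 minutes
reduced_steps_per_epoch = 5 * 30 * 24   # the reduced bound I eventually used

totals = {}
for number_epochs in (25, 10):
    totals[number_epochs] = steps_per_epoch * number_epochs

print(steps_per_epoch)                # 216000
print(totals[25])                     # 5400000
print(totals[10])                     # 2160000
print(reduced_steps_per_epoch * 10)   # 36000
```

These are upper bounds, since the loop also stops early when game_over becomes true.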

I don't know how much memory (Heap) is used over that many iterations but on a consumer machine, you are probably going to tax it pretty hard (remember it has to run the OS and whatever tasks happen to be running on it).

I was FINALLY able to get this to run by reducing the number of Epochs to 10, and reducing the steps to 5 * 30 * 24 (3600). And even with this drastic reduction, you could see the benefits of the AI over the non-AI temperature control mechanism.

Thursday, October 5, 2023

Artificial Intelligence Book 1 - Crash Course in AI - Deep Q Learning

I read about Q Learning, and was feeling somewhat proud of myself for sticking my toe into the water.

Then I read about Deep Q Learning - in this same book - and it was as if someone took an ice bath and dumped it over my head. I went into the tunnel - advancing through chapters 9,10 and 11, only to come out the other end confused ("what did I just read?").

The coding examples were interesting enough that I kept pushing forward, but a lot of what is in the code is masked by the fact that the math and formulas were hidden away in libraries like Keras. So while I thought the examples were cool and I had a grasp of the problems they were attempting to solve, I still came out at the end with confusion and question marks in my head.

Q Learning vs Deep Q Learning

In Chapter 9, which covers Deep Q-Learning, things start to get very complex very fast. So what is the difference between Q Learning (introduced in Chapter 7-8), and Deep Q Learning?

More complex problems in Deep Q Learning - with more variables
The approach to solving a more complex problem

With regards to the approach to solving problems, the book gets into a good discussion - worth mentioning here - about the difference between ArgMax and SoftMax.

Argmax vs Softmax

In Q Learning, the 'name of the game' was to find (and use) the highest Q Value. This is referred to as "Exploitive" and is known as the ArgMax method of Reinforcement Learning.

In Deep Q Learning, the Q values are instead turned into a probability distribution over the possible actions, and the next action is sampled from that distribution, so lower-scoring actions still get tried some of the time. All the while, as the predicted value is compared to the actual value, the weights of the (input) variables are updated according to the new realities (results). This process is referred to as Explorative (in nature) and is named the SoftMax method.
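Here is a small sketch of the two selection styles, using only the Python standard library. The Q values are invented, and the function names and the temperature parameter are my own additions for illustration, not something taken from the book's code.

```python
import math
import random


def argmax_action(q_values):
    """Exploit: always return the index of the highest Q value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax(q_values, temperature=1.0):
    """Turn Q values into probabilities that sum to 1."""
    top = max(q_values)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_action(q_values, rng=random):
    """Explore: sample an action in proportion to its softmax probability."""
    probs = softmax(q_values)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q_values = [1.0, 2.0, 3.0]   # made-up Q values for three actions
print(argmax_action(q_values))                    # always action 2
print([round(p, 3) for p in softmax(q_values)])   # action 2 is likeliest, but not certain
print(softmax_action(q_values))                   # usually 2, sometimes 0 or 1
```

With ArgMax the lower-scoring actions are never tried again; with SoftMax they still get picked occasionally, which is where the exploring comes from.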

Chapter 9 starts you off simple(r). With a Real Estate example of predicting home prices. Seems sensible enough, since we can all think of input variables that help drive the price of a home. The focus here is on trying to show the process, which is broken down into the following steps:

Uploading the Data Set (actual home prices)
Building the Neural Network
Training the Neural Network
Displaying the Results

From here, the book advances into Deep Learning Theory. The idea borrows from the human brain, which is connected by Synapses that send signals. This is the fundamental concept behind Deep Q Learning because it starts with a certain number of "Layers". There are a minimum of three layers that are as follows:

Input Layer - consists of Input Values, and each of these gets weights that are continually adjusted
Hidden Layer(s) - these "neurons" are also continually adjusted
Output Layer - this layer compares predicted values to actual values and computes Loss Error.

The loss error gets back-propagated through the layers, re-adjusting the weights continually, using a concept called Gradient Descent (which requires at least a basic understanding of Calculus). The book covers three types of Gradient Descent (Batch, Stochastic and Mini-Batch).
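To make Gradient Descent a bit more concrete, here is a minimal sketch of the Batch variety on a one-weight model (y = w * x). The data, learning rate, and step count are invented for the example; a real neural network applies the same idea to every weight through back-propagation.

```python
# Toy data generated from y = 2 * x, so the "right" weight is 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def loss_gradient(w):
    """Slope of the mean squared error with respect to w, over the whole batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.0                 # start with an uninformed weight
learning_rate = 0.01
for step in range(200):
    # Step against the slope: a positive gradient means w is too high.
    w -= learning_rate * loss_gradient(w)

print(round(w, 4))  # 2.0
```

Stochastic Gradient Descent would update w after each single (x, y) pair, and Mini-Batch after small groups of pairs, instead of using the whole data set on every step.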

The book mentions Activation Functions that take weighted input values, and return an output value. The book mentions three of these, which sound intimidating, but if you are familiar with Electronics and/or Trigonometry, these names actually make some sense:

Sigmoid Activation Function - an S-shaped (logistic) curve denoting a smooth move from state value (no lower than) 0 to (no higher than) 1.
Rectifier Activation Function - a linear but angular approach: flat at 0 for negative inputs, then rising in a straight line with the input.
Threshold Activation Function - An abrupt binary state transition from 0 to 1. This is much like flipping a switch into an on/off state.
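The three functions are short enough to write out in plain Python. This is a sketch using the textbook definitions; switching the threshold exactly at zero is a common convention that I am assuming here.

```python
import math


def sigmoid(x):
    """Smooth S-shaped curve, always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def rectifier(x):
    """Zero for negative inputs, then a straight line (often called ReLU)."""
    return max(0.0, x)


def threshold(x):
    """Abrupt switch: 0 below zero, 1 at or above zero."""
    return 1.0 if x >= 0 else 0.0


for x in (-2.0, 0.0, 2.0):
    print(x, round(sigmoid(x), 3), rectifier(x), threshold(x))
```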

Now from here (Chapter 9), the book goes into Chapter 10 - Self Driving Car - which is an implementation of Chapter 9 - and quite fun to do. Then it dives into Chapter 11, which uses the example taken from Google's DeepMind project that optimizes server temperature with a simulation.

Chapter 11 in particular really drives home the process by showing how you can optimize or minimize costs. It is broken into these steps:

Building the Environment
Building the Brain - using DropOut vs NoDropOut techniques
Implementation (of the Deep Learning Algorithm)
Training the AI - using either Early Stopping or No Early Stopping
Testing the AI

Seeing is believing, and when you see this code run and start to view the results, I have to admit it is pretty darn cool.

It also takes a LONG time to run. I had to shorten the epochs from 100 to 25 to keep the job from getting killed (I am not sure what exactly was killing it). Running for 100 epochs was taking my laptop HOURS to finish (2-3 hours). But at the end of each Epoch, almost always the energy savings from the AI was superior to the energy savings of not using the AI (which in this case is modeled by the server's mainboard temperature controller).

There's so much more to discuss. But I think I have hit the highlights here.

Tuesday, October 3, 2023

Examining My Blog Analytics

This Blog - I never really "truly" cared about who saw it. I think the blog has served more as a personal diary for me than a blog that I wrote for purposes of harvesting subscribers and attention.

I have a friend who is creating a "business" (I will withhold my comments about things like Business Model). Maybe it's a hobby he thinks is a business. At any rate, he was asking me questions as he was creating his website, and this did make me realize that I know little to nothing about crawlers, indexing, search engines, and how to maximize search engine results. I surprised myself with this realization.

I mean, I have a working knowledge and understanding about things like Cookies, Meta Tags, and stuff like this. But it isn't something I am fluent with.

This blog gets a few peeks here and there, and a few have made some comments that I have helped them with certain issues they were looking for answers for. But really, traffic on this blog is a drop in the ocean - and a small drop at that.

Today, I pulled up Analytics, and noticed that 2/3 of the pages on this blog were not indexed. There were about 3 reasons listed:

Alternate Page with Proper Canonical Tag (WTF is this???)
Mobile Usability
No Sitemap

I made some changes to it:
• Added a SiteMap
• Changed the Theme to accommodate Mobile Pages
• Resubmitted the Blog for Indexing - including individual pages with the Canonical error

Now here is something quite interesting.
• I put a search string in DuckDuckGo, and up comes this blog lickety split - right at the top.
• Same search string in Bing.com and my blog came up lickety split - right at the top.

Note: The search string I used was "Techgrasper Blogspot"

But - when I put the same search string into Google, there was literally NO MENTION of my blog. At all. Zero.

And this blog is hosted on a Google-owned platform!!! So what is going on there???

Is this blog on a blacklist or being censored?
Is this blog suppressed because I am not paying Google, or generating ads on it?
Some other reason Google is not showing the blog?

This sure does open the can of worms when it comes to the control of information flow, and who sees what. Clearly when different search engines show THIS LEVEL of inconsistency, we have some major major major major issues.

Ironically enough from a timing perspective, my whole desire to check my blog in search results is happening at the same time that Google is defending itself from a Sherman Act lawsuit.

Look - if this is a blog, called techgrasper, and there aren't many things out on the world wide web that have the words blog + techgrasper in them, shouldn't this blog SHOW UP???? On the prevailing dominant search engine???? Hosted on a blogging platform that THEY OWN????

Why I Am Not an Expert in User Interfaces and Web Front-Ends

Quite some years ago, I worked at a cool startup company towards the tail of the Dot Com bubble. I won't mention it here I guess. But - they were pioneers in Mobile Apps.

They promised this "write once run anywhere" concept, where you would write your app in a proprietary markup language, and they would parse this markup and render the content on a host of early devices. The field of competing devices in these days was insane, including stuff like Palm Pilots to VERY early mobile phone browsers that ran their own simplistic markup (phones in those days had no resources so the markups had to be dumbed down and simplified).

Anyway, we split the development up into "tiers" where we had "front end" developers who were super knowledgeable in how to render content, and then we had the "back end" developers who wrote the server logic (really it was 3 Tier, so business logic and database logic). Most of the stuff was written in Java, and these guys had licensed WebLogic application servers, and were using EJBs (which were a hot technology at the time). They even used Entity Beans, which practically nobody used (most shops had gone with a stateless architecture and used Session Beans). WebLogic was cool, admittedly, and introduced me to Connection Pools for databases, Threads, etc. These concepts are widely used today.

As cool as WebLogic was, it was not my first exposure to an Application Server. I was the first one in a large Telco company to bring in an Application Server (I worked in an Innovation Center back then). I contracted in for a Time Keeping system from a Silicon Valley company (Tock I think it was called), which we customized, and this system was architected for Netscape Application Server - which I believe was indeed the first "application server". Feel free to comment if anyone has a correction to this belief.

Anyway, I digress. But these technologies, if they didn't outright invent it, lent themselves well to the concept of 3 Tier application design as we know it today. Which allowed your web front ends to be developed in a compartmentalized fashion.

I never really worked much on GUIs and Web Front Ends. I did a little bit "here and there". Websphere pages (later on). The Java Spring framework. I created a website for a small consulting company we owned back at one point. But aside from brief stints with user interfaces, the focus was never really web front-ends.

I am realizing that I need to "bone up" a bit. My next post will explain why.

Monday, October 2, 2023

Service Now Integration using pysnow API client for Python

My latest technology initiative has been doing some first-hand integration to Service Now, using the Service Now API. The first thing I did was to load the API calls into Postman. Once I tested the OAuth 2.0 authentication, and made a couple of test calls, I was ready to proceed with Python.

I searched for a Service Now Python Client, and sure enough, there is one. It is called "pysnow" and it can be installed with the Python pip utility:

>pip install pysnow

Once installed, you can interact with Service Now in a very straightforward manner - although there are some client-specific things one should learn from reading the documentation. Authentication uses OAuth 2.0, and the token re-generation is done as part of the API client, which is convenient.

Once you have authenticated, you generally bind to a resource first (i.e. a Table), and once you have bound to it, you can then make a call against that resource (i.e. a query). Data from calls can be accessed using helper functions such as first() or first_or_none().

Here is a snippet (from their documentation) on how the client is used:

import pysnow

# Create client object
c = pysnow.Client(instance='myinstance', user='myusername', password='mypassword')

# Define a resource, here we'll use the incident table API
incident = c.resource(api_path='/table/incident')

# Query for incidents with state 3
response = incident.get(query={'state': 3})

# Print out the first match, or `None`
print(response.first_or_none())

Service Now, in my mind, is just a huge relational database full of tables. And the API calls are allowing you to retrieve from these tables (GET calls), update these tables (PUT calls), or delete from these tables (DELETE calls).

You can pass queries as arguments on the GET calls, and the queries are very similar to those you might use with SQL, supporting things such as wildcard with LIKE clauses, etc.

There was one case, where I had to abandon the pysnow client, and use Python Requests. It was a case where one of the API calls required a PATCH call. I had never actually even heard of a PATCH call before encountering this, but it's a valid call - just one that is a bit more rare to encounter and up to now, I had not seen it. The pysnow API did not support a PATCH request, interestingly enough, and after figuring this out, I had to (re) write the API client calls using Python Requests for the PATCH API call.
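For illustration, here is roughly what a PATCH call looks like. This sketch builds the request with Python's standard library (urllib) rather than Requests so that it runs without extra installs, and it only constructs the request without sending it. The instance name, sys_id, token, and the use of the incident Table API are all hypothetical placeholders, not the actual call I had to make.

```python
import json
import urllib.request

# Hypothetical placeholder values, for illustration only.
instance = "myinstance"
sys_id = "0123456789abcdef0123456789abcdef"
access_token = "REPLACE_WITH_OAUTH_TOKEN"

# Assumed target: one record in the incident table, addressed by its sys_id.
url = f"https://{instance}.service-now.com/api/now/table/incident/{sys_id}"

# PATCH sends only the fields being changed, not the whole record.
payload = {"short_description": "Updated via PATCH"}

request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    method="PATCH",
    headers={
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
        "Accept": "application/json",
    },
)

print(request.get_method(), request.full_url)

# Sending it would be: urllib.request.urlopen(request)
# With the Requests library, the equivalent call is:
#   requests.patch(url, json=payload, headers={...})
```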

Aside from this, the only other surprise I had was the number of fields I was getting back on many of these calls. Some of these records were incredibly large.

Friday, August 18, 2023

Recovering a Corrupted NSX-T Manager

If your NSX-T Manager is running as a cluster of VMs and one is corrupted, there is a good chance they all are, if the issue was related to storage connectivity. Or, maybe it is just one. If you are running a cluster, and don't have a backup to restore, these steps can be used to repair the file system. Mileage varies on repairing file systems, so there is no guarantee this will work, but this is the process to attempt nonetheless.

1. Connect to the console of the appliance.

2. Reboot the system.

3. When the GRUB boot menu appears, press the left SHIFT or ESC key quickly. If you wait too long and the boot sequence does not pause, you must reboot the system again. Press e to edit the menu.

4. Keep the cursor on the Ubuntu selection.

5. Press e to edit the selected option.

6. Enter the user name (root) and the GRUB password for root (not the same as the appliance's user root). The password is "VMware1" before release 3.2 and "NSX@VM!WaR10" in 3.2 and beyond.

7. Search for the line starting with "linux", which contains the boot command.

8. Remove all options after root= (Starting from UUID) and add "rw single init=/bin/bash".

9. Press Ctrl-X to boot.

10. When the log messages stop, press Enter. You will see the prompt root@(none):/#.

11. Run the following commands to repair the file system.

e2fsck -y /dev/sda1
e2fsck -y /dev/sda2
e2fsck -y /dev/sda3
e2fsck -y /dev/mapper/nsx-config
e2fsck -y /dev/mapper/nsx-image
e2fsck -y /dev/mapper/nsx-var+log
e2fsck -y /dev/mapper/nsx-repository
e2fsck -y /dev/mapper/nsx-secondary

The Linux XFS File System - How Resilient Is It?

We are using VMware Datastores, using NFS version 3.x. The storage was routed, which is never a good thing to do because let's face it, if your VMs all lose their storage simultaneously, that constitutes a disaster. Having dependencies on a router, which can lose its routing prefixes due to a maintenance or configuration problem, is architecturally deficient (polite way of putting it). To solve this, you need to make sure that you don't have routing hops (storage on the same segment as the storage interface on the hypervisor).

So, after our storage routers went AWOL due to a maintenance event, I noticed some VMs came back and appeared to be fine. They had rebooted and were at a login prompt. Other VMs, however, did not come back, and had some nasty things printing on the console (you could not log into these VMs).

What we noticed, was that any Linux virtual machine running with XFS file system type on boot or root (/boot or /) had this issue of being unrecoverable. VMs that were using ext3 or ext4 seemed to be able to recover and start running their services - although some were still echoing some messages to the console.

There is a lesson here. That the file system matters when it comes to resiliency in a virtualized environment.

I did some searching around for discussions on file system types, and of course there are many. This one in particular, I found interesting: ext4-vs-xfs-vs-btrfs-vs-zfs-for-nas

Wednesday, August 9, 2023

Artificial Intelligence Book 1 - Crash Course in AI - Q Learning

The Crash Course in AI presents a fun project for the purpose of developing a familiarization with the principles of Q Learning.

It presents the situation of Robots in a Maze (Warehouse), with the idea that the AI will learn the optimal path through a maze.

To do this, the following procedure is followed:

1. Define a Location-to-State Mapping where each location (alphabetic; A, B, C, etc) is correlated to an Integer value (A=0, B=1, C=2, D=3, et al).
2. Define the Actions (each location is an action, represented by its integer value), which are represented as an array of integers.
3. Define the Rewards - here, each square in the maze has certain squares it is adjacent to that constitute a "move".

The set of Reward Arrays is an "array of arrays", and we know that an "array of arrays" is essentially a matrix! Hence, we can refer to this large "rule set" as a Rewards Matrix.

# Defining the rewards
import numpy as np

R = np.array([
    [0,1,0,0,0,0,0,0,0,0,0,0],  # A's only valid move is to B
    [1,0,1,0,0,1,0,0,0,0,0,0],  # B's valid moves are to A, C, F
    [0,1,0,0,0,0,1,0,0,0,0,0],  # C's valid moves are to B, G
    [0,0,0,0,0,0,0,1,0,0,0,0],  # D's only valid move is to H
    [0,0,0,0,0,0,0,0,1,0,0,0],  # E's only valid move is to I
    [0,1,0,0,0,0,0,0,0,1,0,0],  # F's valid moves are to B, J
    [0,0,1,0,0,0,1,1,0,0,0,0],  # G's valid moves are to C, G (itself), H
    [0,0,0,1,0,0,1,0,0,0,0,1],  # H's valid moves are to D, G, L
    [0,0,0,0,1,0,0,0,0,1,0,0],  # I's valid moves are to E, J
    [0,0,0,0,0,1,0,0,1,0,1,0],  # J's valid moves are to F, I, K
    [0,0,0,0,0,0,0,0,0,1,0,1],  # K's valid moves are to J, L
    [0,0,0,0,0,0,0,1,0,0,1,0],  # L's valid moves are to H, K
])

So this array, these "ones and zeroes" govern the "rules of the road" in terms of the maze. In fact, you could draw the maze out graphically based on these rules.

Now - from a simple perspective, armed with this information, you can feed a starting and ending location into the "Engine", and it will compute the optimal path for you. In cases where there are two optimal paths, it may give you one or the other.

But how does it do this? How does it "know"?

This gets into two key concepts, that comprise and feed an equation, known as the Bellman Equation.

Temporal Difference - how well (or how fast) the AI (model) is learning
Q-Value - this is an indicator of which choices led to greater rewards

If we consider that models like this might have thousands or even millions of X/Y coordinate data points (remember, it is a geographic warehouse model), it is not scalable for the AI to store all of the permutations of these as it works through the model. What the Bellman Equation does is let the model keep one running value (the Q value) for each state/action combination, so that when we hit coordinate X,Y we know the best next step without storing every path that led there.

Basically, before we start, all Q values are initialized to zero. As we traverse the maze, the model calculates the Temporal Difference, and if it is high then the model flags it as a Reward, while if it is low, it is flagged as a "frustration". High values early on are "pleasant surprises" to the model. So - in summary, as the maze is traversed, the TD is calculated, followed by a Q value adjustment (Q Value for the state/action combination to be precise).

Now...before I forget to mention this, the Rewards Matrix needs to be adjusted to reflect the ideal ending location. For example, if the maze was to begin at point E, and end at point G, the matrix cell for G (row G, column G) would need to have a huge value that would tell the AI to stop there and go no further. You can see this in the coding example of the book:

# Optimize the ending state with the ultimate reward
R_new[ending_state, ending_state] = 1000

I have to admit - I started coding, before I fully read and digested what was in the book. I got tripped up by two variables in the code: Alpha, Gamma. Alpha was coded as .9, while Gamma was coded as .75. I was very confused by these constants; what they were, why they were used.

I had to go back into the book.

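Alpha - Learning Rate (how strongly each new Temporal Difference changes the Q value)
Gamma - Discount Factor (how much future rewards count compared to immediate ones)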
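To see where Alpha and Gamma fit, here is a sketch of the whole loop in plain Python (lists instead of NumPy, so it runs with no installs). It uses the standard Q Learning update: TD = reward + gamma * (best Q value of the next state) - (current Q value), then Q = Q + alpha * TD. The function name, the iteration count, and the random seed are my own choices for illustration, not the book's code.

```python
import random

locations = "ABCDEFGHIJKL"
location_to_state = {name: i for i, name in enumerate(locations)}

# Same rewards matrix as above: 1 marks a valid move.
R = [
    [0,1,0,0,0,0,0,0,0,0,0,0],
    [1,0,1,0,0,1,0,0,0,0,0,0],
    [0,1,0,0,0,0,1,0,0,0,0,0],
    [0,0,0,0,0,0,0,1,0,0,0,0],
    [0,0,0,0,0,0,0,0,1,0,0,0],
    [0,1,0,0,0,0,0,0,0,1,0,0],
    [0,0,1,0,0,0,1,1,0,0,0,0],
    [0,0,0,1,0,0,1,0,0,0,0,1],
    [0,0,0,0,1,0,0,0,0,1,0,0],
    [0,0,0,0,0,1,0,0,1,0,1,0],
    [0,0,0,0,0,0,0,0,0,1,0,1],
    [0,0,0,0,0,0,0,1,0,0,1,0],
]

gamma = 0.75   # discount factor
alpha = 0.9    # learning rate


def route(start, end, iterations=5000, seed=0):
    """Train Q values for one ending location, then follow the best moves."""
    rng = random.Random(seed)
    n = len(locations)
    end_state = location_to_state[end]

    # Copy the rewards and give the ending location the ultimate reward.
    rewards = [row[:] for row in R]
    rewards[end_state][end_state] = 1000

    Q = [[0.0] * n for _ in range(n)]   # all Q values start at zero
    for _ in range(iterations):
        state = rng.randrange(n)
        valid_moves = [a for a in range(n) if rewards[state][a] > 0]
        action = rng.choice(valid_moves)
        next_state = action             # moving to a location puts you in that state
        td = rewards[state][action] + gamma * max(Q[next_state]) - Q[state][action]
        Q[state][action] += alpha * td

    path = [start]
    state = location_to_state[start]
    for _ in range(n):                  # guard so a bad Q table cannot loop forever
        if state == end_state:
            break
        state = max(range(n), key=lambda a: Q[state][a])
        path.append(locations[state])
    return path


path = route("E", "G")
print(" -> ".join(path))
```

From E to G there are two equally short routes (through F, B, C or through K, L, H), so which one comes out can depend on the random training order.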

Hey, this AI stuff - these algorithms, they're all about Mathematics (as well as Statistics and Probability). I am not a Mathematician, and only took intermediate Calculus, so some of us really need to concentrate and put our thinking caps on if we truly want to follow the math behind these models and equations.

Artificial Intelligence Book 1 - Crash Course in AI - Thompson Sampling

I bought this book in Oct 2020. Maybe due to holidays and other distractions, I never picked the book up until 2021, at which point I decided it took mental energy, and set it back down.

Well, now that AI is the rave in 2023, I decided to pick this book up and push the education along.

I love that this book uses Python. It wants you to use some user interface called Colab, which I initially looked at but quickly abandoned in favor of the tried-and-true vi editor.

The book starts off with the Multi-Armed-Bandit "problem".

What is that? Well, the name stems from the One-Armed-Bandit; a slot-machine, which "steals" from the players of the machine.

The Multi-Armed-Bandit, I presume, turns this on its head, as it represents a slot machine player who is playing a single machine with N arms (or, perhaps, a bank of N single-armed machines). By using a binary system of rewards (0/1), this problem feeds into a Reinforcement Learning example where the optimal sequence of handle pulls results in the maximum rewards.

This "use case" of the Multi-Armed-Bandit problem (slot machines or single slot machines with multiple arms), is solved by the use of Thompson Sampling.

Thompson Sampling (the term Sampling should give this away) is a Statistics-based approach that solves some interesting problems. For example, take the case of the Multi-Armed Bandit problem just described. Just because a slot machine has paid out the most money historically, it does not mean that the particular slot machine will continue to be the best choice for the future gambler (or future pulls of the arms on the slot machine/s). Thompson Sampling, through continual redistribution and training, accommodates the idea of exploiting the results of the past, while exploring the results of the future.

The fascinating thing about Thompson Sampling, is that it was developed back in 1933, and largely ignored until more recently. The algorithm (or a rendition of it) has been applied in a number of areas recently, by a growing number of larger-sized companies, to solve interesting issues.

In this book, the problem that employs Thompson Sampling, is one in which a paid Subscription is offered, and the company needs to figure out how to optimize the revenue at the right price point.
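As a rough illustration of the exploit-versus-explore idea with 0/1 rewards, here is a minimal sketch of Thompson Sampling using a Beta distribution per arm. The payout rates and the function name `thompson_sampling` are assumptions for the example, and this is not the book's subscription code.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Play `rounds` pulls across arms with the given (hidden) payout probabilities."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    wins = [0] * n_arms    # pulls that paid out (reward 1)
    losses = [0] * n_arms  # pulls that did not (reward 0)
    for _ in range(rounds):
        # Draw one plausible payout rate per arm from what we have seen so far
        samples = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(n_arms)]
        arm = samples.index(max(samples))  # pull the arm with the best draw
        reward = 1 if rng.random() < true_rates[arm] else 0
        if reward:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses


# Hypothetical payout rates; the algorithm never sees these directly
wins, losses = thompson_sampling([0.1, 0.8, 0.3], rounds=2000)
pulls = [w + l for w, l in zip(wins, losses)]
print(pulls)
```

Arms with few pulls have wide distributions, so they still get sampled occasionally (exploring), while an arm that keeps paying out wins most of the draws (exploiting).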

Sources:

Weber, Richard (1992), "On the Gittins index for multiarmed bandits", Annals of Applied Probability

Russo, Daniel J.; Van Roy, Benjamin; Kazerouni, Abbas; Osband, Ian; Wen, Zheng, "A Tutorial on Thompson Sampling" (Columbia University, Stanford University, Google DeepMind, Adobe Research)

Friday, July 21, 2023

Do you need BGP with a VxLAN WAN?

I presumed that VxLAN technology would supersede/replace the need for E-VPNs.

This led me to do some additional research on E-VPN vs VxLAN, and what I am finding, is that there are some benefits to using both together.

This link from Cisco, discusses this:

VXLAN Network with MP-BGP EVPN Control Plane Design Guide

This post lists some specific benefits to using MP-BGP for the Control Plane of a VxLAN tunneled overlay network:

The MP-BGP EVPN protocol is based on industry standards, allowing multivendor interoperability.
It enables control-plane learning of end-host Layer-2 and Layer-3 reachability information, enabling organizations to build more robust and scalable VXLAN overlay networks.
It uses the decade-old MP-BGP VPN technology to support scalable multi-tenant VXLAN overlay networks.
The EVPN address family carries both Layer-2 and Layer-3 reachability information, thus providing integrated bridging and routing in VXLAN overlay networks.
It minimizes network flooding through protocol-based host MAC/IP route distribution and Address Resolution Protocol (ARP) suppression on the local VTEPs.
It provides optimal forwarding for east-west and north-south traffic and supports workload mobility with the distributed anycast function.
It provides VTEP peer discovery and authentication, mitigating the risk of rogue VTEPs in the VXLAN overlay network.
It provides mechanisms for building active-active multihoming at Layer-2.

Wednesday, June 28, 2023

VMWare Storage - Hardware Acceleration Status

Today, we had a customer call in and tell us that they couldn't do Thick provisioning from a vCenter template. We went into vCenter (the GUI), and sure enough, we could only provision Thin virtual machines from it.

But - apparently on another vCenter cluster, they COULD provision Thick virtual machines. There seemed to be no difference between the virtual machines. Note that we are using NFS and not Block or iSCSI or Fibre Channel storage.

We went into vCenter, and lo and behold, we saw this situation...

NOTE: To get to this screen in VMWare's very cumbersome GUI, you have to click on the individual datastore, then click "Configure", and then a tab called "Hardware Acceleration" appears.

So, what we have here, is one datastore that says "Not Supported" on a host, and another datastore in the same datastore cluster that says "Supported" on the exact same host. This sounds bad. This looks bad. Inconsistency. Looks like a problem.

So what IS hardware acceleration when it comes to Storage? To find this out, I located this KnowledgeBase:

Storage Hardware Acceleration

There is also a link for running storage HW acceleration on NAS devices:

Storage Hardware Acceleration on NAS Devices

When these two (above) links are referenced, there are (on left hand side) some additional links as well.

For each storage device and datastore, the vSphere Client displays the hardware acceleration support status.

The status values are Unknown, Supported, and Not Supported. The initial value is Unknown.

For block devices, the status changes to Supported after the host successfully performs the offload operation. If the offload operation fails, the status changes to Not Supported. The status remains Unknown if the device provides partial hardware acceleration support.

With NAS, the status becomes Supported when the storage can perform at least one hardware offload operation.

When storage devices do not support or provide partial support for the host operations, your host reverts to its native methods to perform unsupported operations.

NFS = NAS, I am pretty darned sure.

So this is classic VMWare confusion. They are using a "Status" field with values of Supported / Not Supported, when in fact Supported means "Working" and Not Supported means "Not Working", based on (only) the last operation attempted.

So. Apparently, if a failure on this offload operation occurs, this flag gets turned to Not Supported, and guess what? That means you cannot do *any* Thick Provisioning.

In contacting VMWare, they want us to re-load the storage plugin. Yikes. Stay Tuned....

VMWare also has some Best Practices for running iSCSI Storage, and the link to that is found at:

VMWare iSCSI Storage Best Practices

Wednesday, April 19, 2023

Colorizing Text in Linux

I went hunting today, for a package that I had used to colorize text. There are tons of those out there of course. But - what if you want to filter the text and colorize based on a set of rules?

There's probably a lot of stuff out there for that, too. Colord for example, runs as a daemon in Linux.

Another package, is grc, found at this GitHub site: https://github.com/garabik/grc

Use Case:

I had a log that was printing information related to exchanges with different servers. I decided to color these so that messages from Server A were green, Server B were blue, etc. In this way, I could do really cool things like suppress messages from Server B (no colorization). Or, I could take Control Plane messages from, say, Server C, and highlight those Yellow.

This came in very handy during a Demo, where people were watching the messages display in rapid succession on a large screen.
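The rule idea itself is simple enough to sketch in a few lines of Python. This is not grc's configuration format; the patterns, colors and the `colorize` function are hypothetical, just to show "first matching rule picks the color".

```python
import re

RESET = "\033[0m"
# Hypothetical rules: (pattern, ANSI color code). Order matters: first match wins.
RULES = [
    (re.compile(r"Server C.*control plane", re.I), "\033[33m"),  # yellow
    (re.compile(r"Server A"), "\033[32m"),                       # green
    (re.compile(r"Server B"), "\033[34m"),                       # blue
]


def colorize(line, rules=RULES):
    """Wrap the line in the color of the first matching rule; otherwise return it unchanged."""
    for pattern, color in rules:
        if pattern.search(line):
            return f"{color}{line}{RESET}"
    return line


for raw in ["Server A: hello", "Server B: noise", "Server C: Control Plane update", "other"]:
    print(colorize(raw))
```

Removing the Server B rule would leave those messages uncolored, which is the "suppress" case mentioned above.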

Monday, February 27, 2023

Hyperthreading vs Non-Hyperthreading on an ESXi Hypervisor

We started to notice that several VNF (Virtual Network Function) vendors were recommending to turn off (disable) Hyper-threading on hypervisors. But why? They claimed it helped their performance.

Throwing a switch and disabling this means that the number of (logical) cores exposed to users is cut in half. So a 24 core CPU presents 48 logical cores if Hyper-threading is enabled, and only 24 cores if it is disabled.

This post isn't meant to go into the depths of Hyper-threading itself. The question we had, was whether disabling it or enabling it, affected performance, and to what degree.

We ran a benchmark that was comprised of three "layers".

Non-Hyperthreaded (24 cores) vs Hyperthreaded (48 cores)
Increasing vCPU of the Benchmark VM (increments of eight: 1,8,16,24)
Each test ran several Sysbench tests with increasing threads (1,2,4,8,16,32)

The servers we are running on, include: Cisco M5 (512G RAM, 24 vCPU)

We collected the results in Excel, and ran a Pivot Chart graph on it, and this is what we found (below).

VM with 1,8,16,24 vCPU running Sysbench with increasing threads

on a Hyperthread-disabled system (24) vs Hyperthread-enabled system (48)

It appears to me, that Hyperthreading starts to look promising when two things happen:

vCPU resources on the VM increase past a threshold of about 8 vCPU.
an application is multi-threaded, and is launching 16 or more threads.

Notice that on an 8 vCPU virtual machine, the "magic number" is 8 threads. On a 16 vCPU virtual machine, you do not see hyperthreading become an advantage until 16 threads are launched. On a 24 vCPU system, we start to see hyperthreading become favorable at about 16 threads and higher.

BUT - if the threads are low, between 1 and about 8, the hyperthreading works against you.

Thursday, February 16, 2023

Morpheus API - pyMorpheus Python API Wrapper

I have been working on some API development in the Morpheus CMP tool.

The first thing I do when I need to use an API, is to see if there is a good API wrapper. I found this one API wrapper out on Github, called pyMorpheus.

With this wrapper, I was up and running in absolutely no time, making calls to the API, parsing JSON responses, etc.

The Use Case I am working on, is a "re-conciliator" that will do two things:

Remove Orphaned VMs

Find, and delete (upon user confirmation) those VMs that have had their "rug pulled out" from Morpheus (deleted in vCenter but still sitting in Morpheus as an Instance)

Convert Certain Discovered VMs to Morpheus

This part sorta kinda worked. The call to https://<applianceurl>/servers/id/make-managed did take a Discovered VM and convert it to an instance, with a "VMWare" logo on it.

But I was unable to set advanced attributes of the VMs - Instance Type, Layout, Plan, etc. and this made it only a partial success.

Maybe if we can get the API fixed up a bit, we can get this to work.

One issue, is the "Cloud Sync". When we call the API, we do a cloud sync, to find Discovered VMs. We do the same cloud sync, to determine whether any of the VM's fields in Morpheus change their state, if someone deletes a VM in vCenter (such a state change gives us the indicator that the VM is, in fact, now an orphan). The Cloud Sync is an asynchronous call. You have to wait for an indefinite amount of time, to ensure that the results you are looking for in vCenter, are reflected in Morpheus. It's basically polling, which is not an exact art. For this reason, the reconciliator tool needs to be run as an operations tool, manually, as opposed to some kind of batch scheduled job.
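The waiting part can at least be bounded with a timeout. Below is a minimal polling sketch; the `wait_for` helper and the `check` callable are hypothetical and do not come from pyMorpheus. A real `check` would query Morpheus for whatever state change you are waiting on.

```python
import time


def wait_for(check, timeout_s=300.0, interval_s=10.0):
    """Call check() until it returns a truthy value or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("cloud sync result not visible before timeout")
        time.sleep(interval_s)


# Stand-in for "has Morpheus noticed the change yet?"; answers False twice, then True
answers = iter([False, False, True])
print(wait_for(lambda: next(answers), timeout_s=5, interval_s=0.01))
```

This does not make the sync any less asynchronous; it only makes the tool give up cleanly instead of waiting forever, which is still better suited to a manually run tool than to a scheduled job.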

Tuesday, January 17, 2023

Trying to get RSS (Receive Side Scaling) to work on an Intel X710 NIC

Cisco M5 server, with 6 nics on it. The first two are 1G nics that are unused.

The last 4, are:

vmnic2 - 10G nic, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up
vmnic3 - 10G nic, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up
vmnic4 - 10G nic, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up
vmnic5 - 10G nic, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up

Worth mentioning:

vmnic 2 and 4 are uplinks, using a standard Distributed Switch (virtual switch) for those uplinks.
vmnic 3 and 5 are connected to an N-VDS virtual switch (used with NSX-T) and don't have uplinks.

In ESXi (VMWare Hypervisor, v7.0), we have set the RSS values accordingly:

UPDATED: how we set the RSS Values!

First, make sure that RSS parameters are unset, because DRSS and RSS should not be set together.

> esxcli system module parameters set -m i40en -p RSS=""

Next, make sure that DRSS parameters are set. We are setting to 4 Rx queues per relevant vmnic.

> esxcli system module parameters set -m i40en -p DRSS=4,4,4,4

now we list the parameters to ensure they took correctly

> esxcli system module parameters list -m i40en
Name           Type          Value    Description
-------------  ------------  -------  -----------
DRSS           array of int           Enable/disable the DefQueue RSS(default = 0 )
EEE            array of int           Energy Efficient Ethernet feature (EEE): 0 = disable, 1 = enable, (default = 1)
LLDP           array of int           Link Layer Discovery Protocol (LLDP) agent: 0 = disable, 1 = enable, (default = 1)
RSS            array of int  4,4,4,4  Enable/disable the NetQueue RSS( default = 1 )
RxITR          int                    Default RX interrupt interval (0..0xFFF), in microseconds (default = 50)
TxITR          int                    Default TX interrupt interval (0..0xFFF), in microseconds, (default = 100)
VMDQ           array of int           Number of Virtual Machine Device Queues: 0/1 = disable, 2-16 enable (default =8)
max_vfs        array of int           Maximum number of VFs to be enabled (0..128)
trust_all_vfs  array of int           Always set all VFs to trusted mode 0 = disable (default), other = enable

But, we are seeing this when we look at the individual adaptors in the ESXi kernel:

> vsish -e get /net/pNics/vmnic3/rxqueues/info
rx queues info {
   # queues supported:1
   # rss engines supported:0
   # filters supported:0
   # active filters:0
   # filters moved by load balancer:0
   RX filter classes: 0 -> No matching defined enum value found.
   Rx Queue features: 0 -> NONE
}

Nics 3 and 5, connected to the N-VDS virtual switch, only get one single Rx Queue supported, even though the kernel module is configured properly.

> vsish -e get /net/pNics/vmnic2/rxqueues/info
rx queues info {
   # queues supported:9
   # rss engines supported:1
   # filters supported:512
   # active filters:0
   # filters moved by load balancer:0
   RX filter classes: 0x1f -> MAC VLAN VLAN_MAC VXLAN Geneve
   Rx Queue features: 0x482 -> Pair Dynamic GenericRSS
}

But Nics 2 and 4, which are connected to the standard distributed switch, have 9 Rx Queues configured properly.

Is this related to the virtual switch we are connecting to (meaning we need to be looking at VMWare)? Or, is this somehow related to the i40en driver that is being used (in which case we need to be going to server vendor or Intel who makes the XL710 nic)?
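When comparing several nics, it helps to turn these dumps into something you can compare programmatically. Here is a small sketch that parses the `key:value` lines of the rxqueues info output shown above; the function name `parse_rxqueue_info` is made up, and the sample text is trimmed from the two dumps.

```python
def parse_rxqueue_info(text):
    """Turn 'key:value' lines from the rxqueues info dump into a dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip().lstrip("#").strip()
        if ":" not in line:
            continue  # skips the 'rx queues info {' header and the closing brace
        key, _, value = line.partition(":")
        info[key.strip()] = value.strip()
    return info


vmnic3 = parse_rxqueue_info("""rx queues info {
   # queues supported:1
   # rss engines supported:0
}""")
vmnic2 = parse_rxqueue_info("""rx queues info {
   # queues supported:9
   # rss engines supported:1
}""")
for name, info in (("vmnic2", vmnic2), ("vmnic3", vmnic3)):
    has_rss = int(info["rss engines supported"]) > 0
    print(name, "queues:", info["queues supported"], "RSS engine:", has_rss)
```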

Friday, January 13, 2023

Debugging Dropped Packets on NSX-T E-NVDS

Inside the hypervisor, the servers have physical nics as follows:

vmnic0 - 1G nic, Intel X550 - Unused
vmnic1 - 1G nic, Intel X550 - Unused
vmnic2 - 10G nic, SFP+, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up
vmnic3 - 10G nic, SFP+, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up
vmnic4 - 10G nic, SFP+, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up
vmnic5 - 10G nic, SFP+, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up

The nics connect to the upstream switches (Aristas), and they connect virtually to the virtual switches (discussed right below):

Inside Hypervisor (Host 5 in this specific case):

Distributed vSwitch

Physical Nic Side: vmnic2 and vmnic4

Virtual Side: vmk0 (VLAN 3850) and vmk1 (VLAN 3853)

NSX-T Switch (E-NVDS)

Physical NIC side: vmnic3 and vmnic5 -> these are the nics that get hit when we run the load tests

Virtual Side: 50+ individual segments that VMs connect to, and get assigned a port

Now, in my previous email, I dumped the stats for the physical NIC – meaning, from the “NIC Itself”, as seen by the ESXi operating system.

But, it is wise also, to take a look at the stats of the physical nic from the perspective of the virtual switch! Remember, vmnic5 is a port on the virtual switch!

So first, we need to figure out what port we need to look at:
net-stats -l

PortNum Type SubType SwitchName MACAddress ClientName
2214592527 4 0 DvsPortset-0 40:a6:b7:51:56:e9 vmnic3
2214592529 4 0 DvsPortset-0 40:a6:b7:51:1b:9d vmnic5 -> here we go, port 2214592529 on switch DvsPortset-0 is the port of interest
67108885 3 0 DvsPortset-0 00:50:56:65:96:e4 vmk10
67108886 3 0 DvsPortset-0 00:50:56:65:80:84 vmk11
67108887 3 0 DvsPortset-0 00:50:56:66:58:98 vmk50
67108888 0 0 DvsPortset-0 02:50:56:56:44:52 vdr-vdrPort
67108889 5 9 DvsPortset-0 00:50:56:8a:09:15 DEV-ISC1-Vanilla3a.eth0
67108890 5 9 DvsPortset-0 00:50:56:8a:aa:3f DEV-ISC1-Vanilla3a.eth1
67108891 5 9 DvsPortset-0 00:50:56:8a:9d:b1 DEV-ISC1-Vanilla3a.eth2
67108892 5 9 DvsPortset-0 00:50:56:8a:d9:65 DEV-ISC1-Vanilla3a.eth3
67108893 5 9 DvsPortset-0 00:50:56:8a:fc:75 DEV-ISC1-Vanilla3b.eth0
67108894 5 9 DvsPortset-0 00:50:56:8a:7d:cd DEV-ISC1-Vanilla3b.eth1
67108895 5 9 DvsPortset-0 00:50:56:8a:d4:d8 DEV-ISC1-Vanilla3b.eth2
67108896 5 9 DvsPortset-0 00:50:56:8a:67:6f DEV-ISC1-Vanilla3b.eth3
67108901 5 9 DvsPortset-0 00:50:56:8a:32:1c DEV-MSC1-Vanilla3b.eth0
67108902 5 9 DvsPortset-0 00:50:56:8a:e6:2b DEV-MSC1-Vanilla3b.eth1
67108903 5 9 DvsPortset-0 00:50:56:8a:cc:eb DEV-MSC1-Vanilla3b.eth2
67108904 5 9 DvsPortset-0 00:50:56:8a:7a:83 DEV-MSC1-Vanilla3b.eth3
67108905 5 9 DvsPortset-0 00:50:56:8a:63:55 DEV-MSC1-Vanilla3a.eth3
67108906 5 9 DvsPortset-0 00:50:56:8a:40:9c DEV-MSC1-Vanilla3a.eth2
67108907 5 9 DvsPortset-0 00:50:56:8a:57:8f DEV-MSC1-Vanilla3a.eth1
67108908 5 9 DvsPortset-0 00:50:56:8a:5b:6d DEV-MSC1-Vanilla3a.eth0
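On a host with many ports, picking the right row by eye gets tedious. Here is a small sketch that pulls the port number and switch name for a given ClientName out of saved `net-stats -l` output. The function name `find_port` is hypothetical, and it assumes the six whitespace-separated columns shown in the listing.

```python
def find_port(net_stats_output, client_name):
    """Return (port_number, switch_name) for the row whose ClientName matches, else None."""
    for line in net_stats_output.splitlines():
        fields = line.split()
        # Data rows have six columns and start with a numeric port number
        if len(fields) >= 6 and fields[0].isdigit() and fields[5] == client_name:
            return int(fields[0]), fields[3]
    return None


sample = """PortNum Type SubType SwitchName MACAddress ClientName
2214592527 4 0 DvsPortset-0 40:a6:b7:51:56:e9 vmnic3
2214592529 4 0 DvsPortset-0 40:a6:b7:51:1b:9d vmnic5
67108885 3 0 DvsPortset-0 00:50:56:65:96:e4 vmk10
"""
port, switch = find_port(sample, "vmnic5")
print(f"/net/portsets/{switch}/ports/{port}")
```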

/net/portsets/DvsPortset-0/ports/2214592529/> cat stats

packet stats {
   pktsTx:10109633317
   pktsTxMulticast:291909
   pktsTxBroadcast:244088
   pktsRx:10547989949 -> total packets RECEIVED on vmnic5’s port on the virtual switch
   pktsRxMulticast:243731083
   pktsRxBroadcast:141910804
   droppedTx:228
   droppedRx:439933 -> This is a lot more than the 3,717 Rx Missed errors, and probably accounts for why MetaSwitch sees more drops than we saw up to this point!
}

So – we have TWO things now to examine here.

Is the Receive Side Scaling configured properly and working?

We configured it, but…we need to make sure it is working and working properly.
We don’t see all of the queues getting packets. Each Rx Queue should be getting its own CPU.

Once packets get into the Ring Buffer and are passed through to the VM (poll mode driver picks the packets up off the Ring), they hit the virtual switch.

And the switch is dropping some packets.
Virtual switches are software. As such, they need to be tuned to stretch their capability to keep up with what legacy hardware switches can do.

The NSX-T switch is a powerful switch, but is also a newer virtual switch, more bleeding edge in terms of technology.
I wonder if we are running the latest greatest version of this switch, and if that could help us here.

Now, I looked even deeper into the E-NVDS switch. I went into vsish shell, and started examining any and all statistics that are captured by that networking stack.

Since we are concerned with receives, I looked at the InputStats specifically. I noticed there are several filters – which, I presume is tied to a VMWare Packet Filtering flow, analogous to Netfilter in Linux, or perhaps Berkeley Packet Filter. But, I have no documentation whatsoever on this, and can’t find any, so I did my best to “back into” what I was seeing.

I see the following filters that packets can traverse – traceflow might be packet capture, but I am not sure beyond that.

· ens-slowpath-input
· traceflow-Uplink-Input:0x43110ae01630
· vdl2-uplink-in:0x431e78801dd0
· UplinkDoSwLRO@vmkernel#nover
· VdrUplinkInput

If we go down into the filters and print the stats out, most of the stats seem to line up (started=passed, etc) except this one, which has drops in it:


/net/portsets/DvsPortset-0/ports/2214592529/inputFilters/vdl2-uplink-in/> cat stats
packet stats {
   pktsIn:31879020
   pktsOut:24269629
   pktsDropped:7609391

/net/portsets/DvsPortset-0/ports/2214592527/inputFilters/vdl2-uplink-in/> cat stats
packet stats {
   pktsIn:24817038
   pktsOut:17952829
   pktsDropped:6864209

That seems like a lot of dropped packets to me (a LOT more than those Rx Missed errors). If I understand these stats properly, this suggests an issue on the virtual switch more than on the adaptor itself, so this looks like something we need to work with VMWare on.
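To put numbers on "a lot", here is a quick sketch that turns the counters from the two vdl2-uplink-in dumps into drop percentages, and checks that pktsIn equals pktsOut plus pktsDropped. The `drop_rate` helper is just for illustration; the counter values are copied from the dumps above.

```python
def drop_rate(pkts_in, pkts_dropped):
    """Fraction of input packets that were dropped."""
    return pkts_dropped / pkts_in


# Counters copied from the two vdl2-uplink-in stats dumps
filters = {
    "vmnic5 (port 2214592529)": {"in": 31879020, "out": 24269629, "dropped": 7609391},
    "vmnic3 (port 2214592527)": {"in": 24817038, "out": 17952829, "dropped": 6864209},
}
for name, c in filters.items():
    # Sanity check: everything that came in was either passed on or dropped
    assert c["in"] == c["out"] + c["dropped"]
    print(f"{name}: {drop_rate(c['in'], c['dropped']):.1%} dropped at vdl2-uplink-in")

# For comparison, the port-level counters on vmnic5: droppedRx / pktsRx
print(f"vmnic5 port level: {drop_rate(10547989949, 439933):.4%} dropped")
```

On these numbers, roughly a quarter of what enters the vdl2-uplink-in filter is dropped on each uplink, while the port-level droppedRx counter is a tiny fraction of pktsRx. The two sets of counters clearly do not measure the same thing, which is another question for VMWare.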

Another thing I saw, poking around, was this interesting looking WRONG_VNIC on passthrough status on vmnic3 and vmnic5, the two nics being used in the test here. I think we should maybe ask VMWare about this and run this down also.

/net/portsets/DvsPortset-0/ports/2214592527/> cat status
port {
   port index:15
   vnic index:0xffffffff
   portCfg:
   dvPortId:4dfdff37-e435-4ba4-bbff-56f36bcc0779
   clientName:vmnic3
   clientType: 4 -> Physical NIC
   clientSubType: 0 -> NONE
   world leader:0
   flags: 0x460a3 -> IN_USE ENABLED UPLINK DVS_PORT DISPATCH_STATS_IN DISPATCH_STATS_OUT DISPATCH_STATS CONNECTED
   Impl customized blocked flags:0x00000000
   Passthru status: 0x1 -> WRONG_VNIC
   fixed Hw Id:40:a6:b7:51:56:e9:
   ethFRP:frame routing {
      requested:filter {
         flags:0x00000000
         unicastAddr:00:00:00:00:00:00:
         numMulticastAddresses:0
         multicastAddresses:
         LADRF:[0]: 0x0
         [1]: 0x0
      accepted:filter {
         flags:0x00000000
         unicastAddr:00:00:00:00:00:00:
         numMulticastAddresses:0
         multicastAddresses:
         LADRF:[0]: 0x0
         [1]: 0x0
   filter supported features: 0 -> NONE
   filter properties: 0 -> NONE
   rx mode: 0 -> INLINE
   tune mode: 2 -> invalid
   fastpath switch ID:0x00000000
   fastpath port ID:0x00000004
