Monday, October 2, 2023

ServiceNow Integration using pysnow API client for Python

My latest technology initiative has been some first-hand integration with ServiceNow, using the ServiceNow API. The first thing I did was load the API calls into Postman. Once I had tested the OAuth 2.0 authentication and made a couple of test calls, I was ready to proceed with Python.

I searched for a ServiceNow Python client, and sure enough, there is one. It is called "pysnow", and it can be installed with the Python pip utility:

>pip install pysnow

Once installed, you can interact with ServiceNow in a very straightforward manner, although there are some client-specific things one should learn from reading the documentation. Authentication uses OAuth 2.0, and token regeneration is handled by the API client, which is convenient.

Once you have authenticated, you generally bind to a resource first (e.g. a table), and once you have bound to it, you can make a call against that resource (e.g. a query). Data from calls can be accessed using helper functions such as first() or first_or_none().

Here is a snippet (from their documentation) on how the client is used:

import pysnow 
# Create client object
c = pysnow.Client(instance='myinstance', user='myusername', password='mypassword')

# Define a resource, here we'll use the incident table API
incident = c.resource(api_path='/table/incident')

# Query for incidents with state 3
response = incident.get(query={'state': 3})

# Print out the first match, or `None`
print(response.first_or_none())

ServiceNow, in my mind, is just a huge relational database full of tables, and the API calls allow you to retrieve from these tables (GET calls), update these tables (PUT calls), or delete from these tables (DELETE calls).

You can pass queries as arguments on the GET calls, and the queries are very similar to those you might use with SQL, supporting things such as wildcard with LIKE clauses, etc.
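To make the SQL comparison concrete, here is a minimal sketch of building a ServiceNow "encoded query" string, where conditions are joined with ^ (AND) and LIKE does a contains-style match. The helper name encoded_query and the field values are my own illustration, not part of pysnow; as far as I can tell from the pysnow documentation, a string like this can be passed as the query argument of get(), but treat that as an assumption to verify.

```python
def encoded_query(conditions):
    """Join (field, operator, value) tuples into a ServiceNow encoded query.

    Hypothetical helper for illustration only. Operators such as '=',
    '!=', 'LIKE' and 'STARTSWITH' are written directly between the field
    name and the value; '^' between conditions means AND.
    """
    return "^".join(f"{field}{op}{value}" for field, op, value in conditions)


# Roughly: WHERE short_description LIKE '%network%' AND state = 3
query = encoded_query([
    ("short_description", "LIKE", "network"),
    ("state", "=", 3),
])
print(query)

# Assumed usage with the client shown earlier (not executed here):
#   response = incident.get(query=query)
```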

There was one case where I had to abandon the pysnow client and use Python Requests: one of the API calls required a PATCH call. I had never even heard of PATCH before encountering this, but it is a valid HTTP method, just one that is encountered less often. The pysnow client did not support PATCH requests, interestingly enough, and after figuring this out I had to rewrite that API call using Python Requests.
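For illustration, here is a rough sketch of what a PATCH call with Python Requests can look like. It assumes the standard Table API path (/api/now/table/<table>/<sys_id>) and an OAuth bearer token you have already obtained; the function names, the instance name, and the sys_id are all hypothetical, and the call that needed PATCH in my case may have used a different endpoint.

```python
import json


def build_patch_request(instance, table, sys_id, fields, token):
    """Build keyword arguments for requests.patch() (hypothetical helper).

    Assumes the ServiceNow Table API URL layout and a bearer token.
    """
    return {
        "url": f"https://{instance}.service-now.com/api/now/table/{table}/{sys_id}",
        "headers": {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        # PATCH sends only the fields being changed, not the whole record
        "data": json.dumps(fields),
    }


def send_patch(**kwargs):
    """Send the PATCH request and return the decoded JSON response."""
    import requests  # third-party: pip install requests

    response = requests.patch(timeout=30, **kwargs)
    response.raise_for_status()
    return response.json()


kwargs = build_patch_request(
    instance="myinstance",          # placeholder instance name
    table="incident",
    sys_id="0123456789abcdef",      # placeholder record id
    fields={"state": "2"},
    token="MY_OAUTH_TOKEN",         # placeholder token
)
# result = send_patch(**kwargs)     # uncomment to actually make the call
```

Keeping the request-building separate from the sending makes it easy to inspect the URL and body before anything touches the instance.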

Aside from this, the only other surprise was the number of fields I was getting back on many of these calls. Some of these records were incredibly large.

Friday, August 18, 2023

Recovering a Corrupted NSX-T Manager

If your NSX-T Manager runs as a cluster of VMs and one is corrupted, there is a good chance they all are if the issue was related to storage connectivity. Or maybe it is just one. If you are running a cluster and don't have a backup to restore, these steps can be used to repair the file system. Mileage varies on repairing file systems, so there is no guarantee this will work, but this is the process to attempt nonetheless.

1.  Connect to the console of the appliance.

2.  Reboot the system.

3.  When the GRUB boot menu appears, press the left SHIFT or ESC key quickly. If you wait too long and the boot sequence does not pause, you must reboot the system again. Press e to edit the menu.

4.  Keep the cursor on the Ubuntu selection.

5.  Press e to edit the selected option.

6.  Enter the user name (root) and the GRUB password for root (not the same as the password of the appliance's root user). The password is "VMware1" before release 3.2 and "NSX@VM!WaR10" for 3.2 and beyond.

7.  Search for the line starting with "linux" that contains the boot command.

8. Remove all options after root= (Starting from UUID) and add "rw single init=/bin/bash".

9.  Press Ctrl-X to boot.

10.  When the log messages stop, press Enter. You will see the prompt root@(none):/#.

11.  Run the following commands to repair the file system.

  • e2fsck -y /dev/sda1
  • e2fsck -y /dev/sda2
  • e2fsck -y /dev/sda3
  • e2fsck -y /dev/mapper/nsx-config
  • e2fsck -y /dev/mapper/nsx-image
  • e2fsck -y /dev/mapper/nsx-var+log
  • e2fsck -y /dev/mapper/nsx-repository
  • e2fsck -y /dev/mapper/nsx-secondary

The Linux XFS File System - How Resilient Is It?

We are using VMware datastores over NFS version 3.x. The storage was routed, which is never a good thing to do because, let's face it, if your VMs all lose their storage simultaneously, that constitutes a disaster. Having dependencies on a router, which can lose its routing prefixes due to a maintenance or configuration problem, is architecturally deficient (a polite way of putting it). To solve this, you need to make sure there are no routing hops between the hypervisor and its storage (storage on the same segment as the hypervisor's storage interface).

So, after our storage routers went AWOL due to a maintenance event, I noticed some VMs came back and appeared to be fine. They had rebooted and were at a login prompt.  Other VMs, however, did not come back, and had some nasty things printing on the console (you could not log into these VMs).


What we noticed was that any Linux virtual machine running the XFS file system on boot or root (/boot or /) was unrecoverable. VMs that were using ext3 or ext4 seemed to be able to recover and start running their services, although some were still echoing messages to the console.

There is a lesson here: the file system matters when it comes to resiliency in a virtualized environment.

I did some searching around for discussions on file system types, and of course there are many. This one in particular, I found interesting:  ext4-vs-xfs-vs-btrfs-vs-zfs-for-nas


Wednesday, August 9, 2023

Artificial Intelligence Book 1 - Crash Course in AI - Q Learning

The Crash Course in AI presents a fun project for the purpose of developing a familiarization with the principles of Q Learning.  

It presents the situation of Robots in a Maze (Warehouse), with the idea that the AI will learn the optimal path through a maze.

To do this, the following procedure is followed:

  1. Define a Location-to-State Mapping where each location (alphabetic; A, B, C, etc) is correlated to an Integer value (A=0,B=1,C=2,D=3, et al).
  2. Define the Actions (each location is an action, represented by its integer value), which are represented as an array of integers.
  3. Define the Rewards - here, each square in the maze has certain adjacent squares that constitute a valid "move".
     The set of Reward arrays is an "array of arrays", and an "array of arrays" is
     essentially a matrix! Hence, we can refer to this large "rule set" as a Rewards Matrix.
        
     import numpy as np

     # Defining the rewards: row = current location, column = destination,
     # 1 = valid move, 0 = not a valid move
     R = np.array([
         [0,1,0,0,0,0,0,0,0,0,0,0],  # A's only valid move is to B
         [1,0,1,0,0,1,0,0,0,0,0,0],  # B's valid moves are to A, C, F
         [0,1,0,0,0,0,1,0,0,0,0,0],  # C's valid moves are to B, G
         [0,0,0,0,0,0,0,1,0,0,0,0],  # D's only valid move is to H
         [0,0,0,0,0,0,0,0,1,0,0,0],  # E's only valid move is to I
         [0,1,0,0,0,0,0,0,0,1,0,0],  # F's valid moves are to B, J
         [0,0,1,0,0,0,1,1,0,0,0,0],  # G's valid moves are to C, G, H
         [0,0,0,1,0,0,1,0,0,0,0,1],  # H's valid moves are to D, G, L
         [0,0,0,0,1,0,0,0,0,1,0,0],  # I's valid moves are to E, J
         [0,0,0,0,0,1,0,0,1,0,1,0],  # J's valid moves are to F, I, K
         [0,0,0,0,0,0,0,0,0,1,0,1],  # K's valid moves are to J, L
         [0,0,0,0,0,0,0,1,0,0,1,0],  # L's valid moves are to H, K
     ])

So this array, these "ones and zeroes" govern the "rules of the road" in terms of the maze. In fact, you could draw the maze out graphically based on these rules.

Now - from a simple perspective, armed with this information, you can feed a starting and ending location into the "Engine", and it will compute the optimal path for you. In cases where there are two optimal paths, it may give you one or the other.

But how does it do this? How does it "know"?

This gets into two key concepts that feed an equation known as the Bellman Equation.
  • Temporal Difference - how far off the current estimate was: the reward just received, plus the discounted best Q-Value of the next state, minus the current Q-Value
  • Q-Value - an indicator of which choices (state/action combinations) led to greater rewards

If we consider that models like this might have thousands or even millions of X/Y coordinate data points (remember, it is a geographic warehouse model), it is not scalable for the AI to store all of the permutations of these as it works through the model. What the Bellman Equation does is let each state/action combination carry a single running value that summarizes the rewards reachable from it, so that when we hit coordinate X,Y we know the best next step without storing every path.

Before we start traversing the maze, all Q values are initialized to zero. As we traverse the maze, the model calculates the Temporal Difference; if it is high, the model treats it as a reward, while if it is low, it is treated as a "frustration". High values early on are "pleasant surprises" to the model. In summary, as the maze is traversed, the TD is calculated, followed by a Q value adjustment (to be precise, the Q value for the state/action combination).
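As a minimal sketch of that single step (my own illustration, not the book's code), the function below computes the Temporal Difference and the adjusted Q value for one state/action combination. The function name q_update is hypothetical; alpha is the learning rate and gamma is the discount factor, set here to the .9 and .75 used in the book's code.

```python
def q_update(q_current, reward, max_q_next, alpha=0.9, gamma=0.75):
    """One Q-learning step for a single state/action combination.

    q_current  - current Q value for (state, action)
    reward     - reward received for taking the action
    max_q_next - best Q value available from the next state
    Returns (new_q, td).
    """
    # Temporal Difference: what we just observed minus what we expected
    td = reward + gamma * max_q_next - q_current
    # Move the Q value a fraction (alpha) of the way toward the observation
    new_q = q_current + alpha * td
    return new_q, td


# First visit: everything is still zero, so a big reward is a "pleasant surprise"
q1, td1 = q_update(q_current=0.0, reward=1000, max_q_next=0.0)
# Second visit to the same state/action: the surprise is already smaller
q2, td2 = q_update(q_current=q1, reward=1000, max_q_next=q1)
print(td1, q1, td2, q2)
```

With these inputs the first TD is 1000 and the Q value becomes 900; on the second pass the TD drops to 775, which is the "learning" showing up as shrinking surprise.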

Now...before I forget to mention this, the Rewards Matrix needs to be adjusted to reflect the ideal ending location. For example, if the maze was to begin at point E and end at point G, the cell of the Rewards Matrix at row G, column G (ending location, ending location) would need to have a huge value that would tell the AI to stop there and go no further. You can see this in the coding example of the book:

# Optimize the ending state with the ultimate reward
R_new[ending_state, ending_state] = 1000

I have to admit - I started coding, before I fully read and digested what was in the book. I got tripped up by two variables in the code: Alpha, Gamma. Alpha was coded as .9, while Gamma was coded as .75. I was very confused by these constants; what they were, why they were used. 

I had to go back into the book.
  • Alpha - Learning Rate
  • Gamma - Discount Factor
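To tie Alpha and Gamma back to the maze, here is a rough end-to-end sketch of the training loop and route lookup. This is my own condensed version, not the book's code: it uses plain Python lists instead of NumPy so it has no dependencies, and the function names train and route, the iteration count, and the random seed are all assumptions made for the example.

```python
import random

LOCATIONS = "ABCDEFGHIJKL"
location_to_state = {loc: i for i, loc in enumerate(LOCATIONS)}

# Rewards matrix from above (1 = valid move, 0 = not a valid move)
R = [
    [0,1,0,0,0,0,0,0,0,0,0,0],  # A
    [1,0,1,0,0,1,0,0,0,0,0,0],  # B
    [0,1,0,0,0,0,1,0,0,0,0,0],  # C
    [0,0,0,0,0,0,0,1,0,0,0,0],  # D
    [0,0,0,0,0,0,0,0,1,0,0,0],  # E
    [0,1,0,0,0,0,0,0,0,1,0,0],  # F
    [0,0,1,0,0,0,1,1,0,0,0,0],  # G
    [0,0,0,1,0,0,1,0,0,0,0,1],  # H
    [0,0,0,0,1,0,0,0,0,1,0,0],  # I
    [0,0,0,0,0,1,0,0,1,0,1,0],  # J
    [0,0,0,0,0,0,0,0,0,1,0,1],  # K
    [0,0,0,0,0,0,0,1,0,0,1,0],  # L
]


def train(ending_location, alpha=0.9, gamma=0.75, iterations=5000, seed=0):
    """Learn a Q table for reaching ending_location."""
    rng = random.Random(seed)
    n = len(R)
    end = location_to_state[ending_location]
    r_new = [row[:] for row in R]
    r_new[end][end] = 1000          # the "ultimate reward" at the ending state
    q = [[0.0] * n for _ in range(n)]
    for _ in range(iterations):
        state = rng.randrange(n)    # pick a random location to learn from
        playable = [a for a in range(n) if r_new[state][a] > 0]
        action = rng.choice(playable)   # taking the action lands us in that state
        td = r_new[state][action] + gamma * max(q[action]) - q[state][action]
        q[state][action] += alpha * td
    return q


def route(start, end, q):
    """Follow the highest Q value from start until end is reached."""
    state = location_to_state[start]
    goal = location_to_state[end]
    path = [start]
    while state != goal and len(path) <= len(LOCATIONS):
        state = max(range(len(q)), key=lambda a: q[state][a])
        path.append(LOCATIONS[state])
    return path


q_table = train("G")
print(route("E", "G", q_table))
```

From E to G there are two equally short routes (through F, B, C or through K, L, H), so which one is printed depends on the random draws during training.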

Hey, this AI stuff - these algorithms, they're all about Mathematics (as well as Statistics and Probability). I am not a Mathematician, and only took intermediate Calculus, so some of us really need to concentrate and put our thinking caps on if we truly want to follow the math behind these models and equations.


Artificial Intelligence Book 1 - Crash Course in AI - Thompson Sampling

I bought this book in Oct 2020. Maybe due to holidays and other distractions, I never picked the book up until 2021, at which point I decided it took mental energy, and set it back down.

Well, now that AI is the rave in 2023, I decided to pick this book up and push the education along.

I love that this book uses Python. It wants you to use some user interface called Colab, which I initially looked at but quickly abandoned in favor of the tried-and-true vi editor.

The book starts off with the Multi-Armed-Bandit "problem".

What is that? Well, the name stems from the One-Armed-Bandit; a slot-machine, which "steals" from the players of the machine.  

The Multi-Armed-Bandit, I presume, turns this on its head as it represents a slot machine player that is playing a single machine with multiple N number of arms (or, perhaps a bank of N number of single-armed machines). By using a binary system of rewards (0/1) this problem feeds into a Reinforcement Learning example where the optimal sequence of handle pulls results in the maximum rewards. 

This "use case" of the Multi-Armed-Bandit problem (slot machines or single slot machines with multiple arms), is solved by the use of Thompson Sampling. 

Thompson Sampling (the term Sampling should give this away) is a Statistics-based approach that solves some interesting problems. For example, take the case of the Multi-Armed Bandit problem just described. Just because a slot machine has paid out the most money historically, it does not mean that the particular slot machine will continue to be the best choice for the future gambler (or future pulls of the arms on the slot machine/s).  Thompson Sampling, through continual redistribution and training, accommodates the idea of exploiting the results of the past, while exploring the results of the future.  
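Here is a minimal sketch of the idea, assuming the binary (0/1) rewards described above and a Beta distribution per arm, which is the usual way Thompson Sampling is set up for this case. It is my own illustration rather than the book's code; the function name, the payout rates, the number of rounds, and the seed are made up for the example.

```python
import random


def thompson_sampling(true_rates, rounds=2000, seed=42):
    """Simulate Thompson Sampling over slot-machine arms with 0/1 rewards.

    true_rates - hypothetical payout probability of each arm (unknown to
                 the algorithm; used only to simulate the pulls)
    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    wins = [0] * n_arms     # times each arm paid out (reward 1)
    losses = [0] * n_arms   # times each arm did not pay out (reward 0)
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Explore: draw a random guess for each arm from its Beta distribution
        samples = [rng.betavariate(wins[i] + 1, losses[i] + 1)
                   for i in range(n_arms)]
        # Exploit: pull the arm whose guess is highest this round
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < true_rates[arm] else 0
        if reward:
            wins[arm] += 1
        else:
            losses[arm] += 1
        pulls[arm] += 1
    return pulls


pull_counts = thompson_sampling([0.05, 0.30, 0.10])
print(pull_counts)
```

Arms with few pulls have wide distributions, so they still get sampled occasionally (exploring), while arms with a strong track record win most of the draws (exploiting).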

The fascinating thing about Thompson Sampling is that it was developed back in 1933 and largely ignored until more recently. The algorithm (or a rendition of it) has been applied in a number of areas recently, by a growing number of larger-sized companies, to solve interesting issues.

In this book, the problem that employs Thompson Sampling is one in which a paid subscription is offered, and the company needs to figure out the price point that optimizes revenue.

Sources: 

Weber, Richard (1992), "On the Gittins index for multiarmed bandits", Annals of Applied Probability

Russo, Daniel J. (Columbia University); Van Roy, Benjamin (Stanford University); Kazerouni, Abbas (Stanford University); Osband, Ian (Google DeepMind); Wen, Zheng (Adobe Research), "A Tutorial on Thompson Sampling"

Friday, July 21, 2023

Do you need BGP with a VxLAN WAN?

I presumed that VxLAN technology would supersede/replace the need for E-VPNs.

This led me to do some additional research on E-VPN vs VxLAN, and what I am finding is that there are some benefits to using both together.

This link from Cisco discusses this:

VXLAN Network with MP-BGP EVPN Control Plane Design Guide

This guide lists some specific benefits to using MP-BGP for the Control Plane of a VxLAN tunneled overlay network:

  1. The MP-BGP EVPN protocol is based on industry standards, allowing multivendor interoperability.
  2. It enables control-plane learning of end-host Layer-2 and Layer-3 reachability information, enabling organizations to build more robust and scalable VXLAN overlay networks.
  3. It uses the decade-old MP-BGP VPN technology to support scalable multi-tenant VXLAN overlay networks.
  4. The EVPN address family carries both Layer-2 and Layer-3 reachability information, thus providing integrated bridging and routing in VXLAN overlay networks.
  5. It minimizes network flooding through protocol-based host MAC/IP route distribution and Address Resolution Protocol (ARP) suppression on the local VTEPs.
  6. It provides optimal forwarding for east-west and north-south traffic and supports workload mobility with the distributed anycast function.
  7. It provides VTEP peer discovery and authentication, mitigating the risk of rogue VTEPs in the VXLAN overlay network.
  8. It provides mechanisms for building active-active multihoming at Layer-2.

Wednesday, June 28, 2023

VMware Storage - Hardware Acceleration Status

Today, we had a customer call in and tell us that they couldn't do Thick provisioning from a vCenter template. We went into vCenter (the GUI), and sure enough, we could only provision Thin virtual machines from it.

But - apparently on another vCenter cluster, they COULD provision Thick virtual machines. There seemed to be no difference between the virtual machines. Note that we are using NFS and not Block or iSCSI or Fibre Channel storage.

We went into vCenter, and lo and behold, we saw this situation...


NOTE: To get to this screen in VMware's very cumbersome GUI, you have to click on the individual datastore, then click "Configure", and then a tab called "Hardware Acceleration" appears.

So, what we have here, is one datastore that says "Not Supported" on a host, and another datastore in the same datastore cluster that says "Supported" on the exact same host. This sounds bad. This looks bad. Inconsistency. Looks like a problem.

So what IS hardware acceleration when it comes to Storage? To find this out, I located this KnowledgeBase:

 Storage Hardware Acceleration

There is also a link for running storage HW acceleration on NAS devices:

Storage Hardware Acceleration on NAS Devices 

The pages behind these two links also list some additional related links in the left-hand navigation.

For each storage device and datastore, the vSphere Client displays the hardware acceleration support status.

The status values are Unknown, Supported, and Not Supported. The initial value is Unknown.

For block devices, the status changes to Supported after the host successfully performs the offload operation. If the offload operation fails, the status changes to Not Supported. The status remains Unknown if the device provides partial hardware acceleration support.

With NAS, the status becomes Supported when the storage can perform at least one hardware offload operation.

When storage devices do not support or provide partial support for the host operations, your host reverts to its native methods to perform unsupported operations.

NFS = NAS, I am pretty darned sure. 

So this is classic VMware confusion. They are using a "Status" field with values of Supported / Not Supported, when in fact Supported means "Working" and Not Supported means "Not Working" based on (only) the last operation attempted.

So, apparently, if a failure on this offload operation occurs, this flag gets turned to Not Supported, and guess what? That means you cannot do *any* Thick Provisioning.

In contacting VMware, they want us to re-load the storage plugin. Yikes. Stay Tuned....

VMware also has some Best Practices for running iSCSI Storage, and the link to that is found at:

VMWare iSCSI Storage Best Practices
