Wednesday, August 9, 2023

Artificial Intelligence Book 1 - Crash Course in AI - Thompson Sampling

I bought this book in October 2020. Maybe due to holidays and other distractions, I never picked it up until 2021, at which point I decided it took more mental energy than I had to give, and set it back down.

Well, now that AI is all the rage in 2023, I decided to pick this book back up and push the education along.

I love that this book uses Python. It wants you to use a hosted notebook environment called Google Colab, which I initially looked at but quickly abandoned in favor of the tried-and-true vi editor.

The book starts off with the Multi-Armed Bandit "problem".

What is that? Well, the name stems from the One-Armed Bandit: a slot machine, which "steals" from the players of the machine.

The Multi-Armed Bandit, I presume, turns this on its head: it represents a player who is playing a single machine with N arms (or, perhaps, a bank of N one-armed machines). By using a binary system of rewards (0/1), this problem feeds into a Reinforcement Learning example where the optimal sequence of handle pulls results in the maximum reward.

This "use case" of the Multi-Armed-Bandit problem (slot machines or single slot machines with multiple arms), is solved by the use of Thompson Sampling. 

Thompson Sampling (the term Sampling should give this away) is a statistics-based approach that solves some interesting problems. Take the Multi-Armed Bandit problem just described: just because a slot machine has paid out the most money historically does not mean it will continue to be the best choice for future pulls. Thompson Sampling, through continual sampling and updating of each arm's payout distribution, balances exploiting what has worked in the past with exploring arms that might work better in the future.
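To make the exploit-versus-explore idea concrete, here is a minimal sketch of Thompson Sampling for a Bernoulli (0/1 reward) bandit, using one Beta distribution per arm. This is my own illustrative example, not code from the book, and the payout probabilities are made up:

import random

# True (hidden) payout probabilities for each "arm" -- made-up numbers for illustration.
TRUE_PAYOUT = [0.05, 0.12, 0.08]

# One Beta(successes + 1, failures + 1) posterior per arm.
successes = [0] * len(TRUE_PAYOUT)
failures = [0] * len(TRUE_PAYOUT)

total_reward = 0
for pull in range(10000):
    # Sample a plausible payout rate for each arm from its Beta posterior,
    # then pull the arm whose sample is highest (explore and exploit in one step).
    samples = [random.betavariate(successes[i] + 1, failures[i] + 1)
               for i in range(len(TRUE_PAYOUT))]
    arm = samples.index(max(samples))

    # Simulate the pull: reward is 0 or 1.
    reward = 1 if random.random() < TRUE_PAYOUT[arm] else 0
    total_reward += reward

    # Update the posterior for the arm we pulled.
    if reward:
        successes[arm] += 1
    else:
        failures[arm] += 1

print("total reward:", total_reward)
print("pulls per arm:", [successes[i] + failures[i] for i in range(len(TRUE_PAYOUT))])

Run it a few times and you will see the pulls concentrate on the best arm, while the other arms still get occasional exploratory pulls.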

The fascinating thing about Thompson Sampling is that it was developed back in 1933 and largely ignored until recently. The algorithm (or a rendition of it) has since been applied in a number of areas, by a growing number of large companies, to solve interesting problems.

In this book, the problem that employs Thompson Sampling is one in which a paid subscription is offered, and the company needs to figure out which price point optimizes revenue.

Sources: 

Weber, Richard (1992), "On the Gittins Index for Multiarmed Bandits", Annals of Applied Probability.

Russo, Daniel J.; Van Roy, Benjamin; Kazerouni, Abbas; Osband, Ian; Wen, Zheng, "A Tutorial on Thompson Sampling" (Columbia University, Stanford University, Google DeepMind, Adobe Research).

Friday, July 21, 2023

Do you need BGP with a VxLAN WAN?

I presumed that VxLAN technology would supersede or replace the need for E-VPNs.

This led me to do some additional research on E-VPN vs VxLAN, and what I am finding is that there are some benefits to using the two together.

This link from Cisco discusses it:

VXLAN Network with MP-BGP EVPN Control Plane Design Guide

That design guide lists some specific benefits to using MP-BGP for the control plane of a VxLAN tunneled overlay network:

  1. The MP-BGP EVPN protocol is based on industry standards, allowing multivendor interoperability.
  2.  It enables control-plane learning of end-host Layer-2 and Layer-3 reachability information, enabling organizations to build more robust and scalable VXLAN overlay networks.
  3. It uses the decade-old MP-BGP VPN technology to support scalable multi-tenant VXLAN overlay networks.
  4. The EVPN address family carries both Layer-2 and Layer-3 reachability information, thus providing integrated bridging and routing in VXLAN overlay networks.
  5. It minimizes network flooding through protocol-based host MAC/IP route distribution and Address Resolution Protocol (ARP) suppression on the local VTEPs.
  6. It provides optimal forwarding for east-west and north-south traffic and supports workload mobility with the distributed anycast function.
  7. It provides VTEP peer discovery and authentication, mitigating the risk of rogue VTEPs in the VXLAN overlay network.
  8. It provides mechanisms for building active-active multihoming at Layer-2.

Wednesday, June 28, 2023

VMWare Storage - Hardware Acceleration Status

Today, we had a customer call in and tell us that they couldn't do Thick provisioning from a vCenter template. We went into vCenter (the GUI), and sure enough, we could only provision Thin virtual machines from it.

But - apparently on another vCenter cluster, they COULD provision Thick virtual machines. There seemed to be no difference between the virtual machines. Note that we are using NFS, not block storage (iSCSI or Fibre Channel).

We went into vCenter, and lo and behold, we saw this situation...


NOTE: To get to this screen in VMWare's very cumbersome GUI, you have to click on the individual datastore, then click "Configure", and a tab called "Hardware Acceleration" appears.

So what we have here is one datastore that says "Not Supported" on a host, and another datastore in the same datastore cluster that says "Supported" on the exact same host. This sounds bad. This looks bad. Inconsistency. Looks like a problem.

So what IS hardware acceleration when it comes to storage? To find out, I located this knowledge base article:

 Storage Hardware Acceleration

There is also a link for running storage HW acceleration on NAS devices:

Storage Hardware Acceleration on NAS Devices 

When these two (above) links are referenced, there are some additional links (on the left-hand side) as well.

For each storage device and datastore, the vSphere Client displays the hardware acceleration support status.

The status values are Unknown, Supported, and Not Supported. The initial value is Unknown.

For block devices, the status changes to Supported after the host successfully performs the offload operation. If the offload operation fails, the status changes to Not Supported. The status remains Unknown if the device provides partial hardware acceleration support.

With NAS, the status becomes Supported when the storage can perform at least one hardware offload operation.

When storage devices do not support or provide partial support for the host operations, your host reverts to its native methods to perform unsupported operations.

NFS = NAS, I am pretty darned sure. 

So this is classic VMWare confusion. They are using a "Status" field with values of Supported / Not Supported, when in fact Supported means "working" and Not Supported means "not working", based only on the last operation attempted.

So, apparently, if a failure on this offload operation occurs, the flag gets flipped to Not Supported, and guess what? That means you cannot do *any* Thick Provisioning.
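For what it's worth, you can also pull the per-device status programmatically instead of clicking through the GUI. Below is a rough pyVmomi sketch (mine, not from the documentation) that walks each host's SCSI devices and prints the vStorageSupport value. Note that this covers block devices; NFS/NAS datastores get their status from the NAS VAAI plugin instead, so treat it as a starting point only. The vCenter hostname and credentials are placeholders.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholder connection details -- adjust for your environment.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        print(host.name)
        for lun in host.config.storageDevice.scsiLun:
            # vStorageSupport is vStorageSupported / vStorageUnsupported / vStorageUnknown
            print("  ", lun.displayName, getattr(lun, "vStorageSupport", "n/a"))
    view.Destroy()
finally:
    Disconnect(si)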

When we contacted VMWare, they wanted us to re-load the storage plugin. Yikes. Stay tuned....

VMWare also has some Best Practices for running iSCSI Storage, and the link to that is found at:

VMWare iSCSI Storage Best Practices

Wednesday, April 19, 2023

Colorizing Text in Linux

I went hunting today for a package that I had used to colorize text. There are tons of those out there, of course. But what if you want to filter the text and colorize it based on a set of rules?

There's probably a lot of stuff out there for that, too. Colord, for example, runs as a daemon in Linux (though it is aimed more at device color management than at colorizing text).

Another package is grc, found at this GitHub site: https://github.com/garabik/grc

Use Case: 

I had a log that was printing information related to exchanges with different servers. I decided to color these so that messages from Server A were green, Server B blue, and so on. That way, I could do really useful things like suppress messages from Server B (no colorization), or take Control Plane messages from, say, Server C and highlight those in yellow.
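As a rough idea of what that rule-based coloring looks like (this is a simplified sketch, not the exact script I used, and the patterns are made-up examples), you can match each log line against a list of regex rules and wrap the matches in ANSI color codes:

import re
import sys

# ANSI escape codes
GREEN, YELLOW, RESET = "\033[32m", "\033[33m", "\033[0m"

# Rules are checked in order; first match wins. Patterns here are examples only.
RULES = [
    (re.compile(r"server-a", re.IGNORECASE), GREEN),        # Server A -> green
    (re.compile(r"control plane", re.IGNORECASE), YELLOW),  # Control Plane -> yellow
    (re.compile(r"server-b", re.IGNORECASE), None),         # Server B -> suppress (no color)
]

for line in sys.stdin:
    for pattern, color in RULES:
        if pattern.search(line):
            if color:
                line = color + line.rstrip("\n") + RESET + "\n"
            break  # a rule matched (even a "no color" one), stop checking
    sys.stdout.write(line)

Then pipe the log through it, something like: tail -f app.log | python3 colorize.py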

This came in very handy during a Demo, where people were watching the messages display in rapid succession on a large screen.

Monday, February 27, 2023

Hyperthreading vs Non-Hyperthreading on an ESXi Hypervisor

We started to notice that several VNF (Virtual Network Function) vendors were recommending turning off (disabling) Hyper-threading on hypervisors. But why? They claimed it helped their performance.

Throwing a switch and disabling this means that the number of logical cores exposed to users is cut in half. So a 24-core CPU presents 48 logical cores if Hyper-threading is enabled, and only 24 if it is disabled.

This post isn't meant to go into the depths of Hyper-threading itself. The question we had was whether disabling or enabling it affected performance, and to what degree.

We ran a benchmark that consisted of three "layers":

  • Non-Hyperthreaded (24 cores) vs Hyperthreaded (48 cores)
  • Increasing vCPU of the Benchmark VM (increments of eight: 1,8,16,24)
  • Each test ran several Sysbench tests with increasing threads (1,2,4,8,16,32) 

The servers we ran on were Cisco M5s (512 GB RAM, 24 physical cores).
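The harness for each run was simple; it looked something like the sketch below (a re-creation for illustration, not the exact script we used), which runs the sysbench CPU test at each thread count and scrapes out the events-per-second figure for the spreadsheet:

import csv
import re
import subprocess

THREAD_COUNTS = [1, 2, 4, 8, 16, 32]
RESULTS_FILE = "sysbench_results.csv"   # imported into Excel for the pivot chart

with open(RESULTS_FILE, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["threads", "events_per_second"])
    for threads in THREAD_COUNTS:
        # Run the sysbench CPU benchmark for 60 seconds at this thread count.
        out = subprocess.run(
            ["sysbench", "cpu", "--threads=%d" % threads, "--time=60", "run"],
            capture_output=True, text=True, check=True).stdout
        # sysbench 1.0 prints a line like "events per second:  1234.56"
        match = re.search(r"events per second:\s+([\d.]+)", out)
        eps = float(match.group(1)) if match else 0.0
        writer.writerow([threads, eps])
        print("threads=%d  events/sec=%.2f" % (threads, eps))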

We collected the results in Excel, ran a Pivot Chart on them, and this is what we found (below).

(Chart: VM with 1, 8, 16, 24 vCPU running Sysbench with increasing threads, on a Hyperthread-disabled system (24 cores) vs a Hyperthread-enabled system (48 cores).)

It appears to me that Hyperthreading starts to look promising when two things happen:

  1. vCPU resources on the VM increase past a threshold of about 8 vCPU.
  2. an application is multi-threaded, and is launching 16 or more threads.

Notice that on an 8 vCPU virtual machine, the "magic number" is 8 threads. On a 16 vCPU virtual machine, you do not see hyperthreading become an advantage until 16 threads are launched. On a 24 vCPU system, we start to see hyperthreading become favorable at about 16 threads and higher.

BUT - if the thread count is low, between 1 and about 8, hyperthreading works against you.

Thursday, February 16, 2023

Morpheus API - pyMorpheus Python API Wrapper

I have been working on some API development in the Morpheus CMP tool.

The first thing I do when I need to use an API is to see if there is a good API wrapper. I found one out on GitHub, called pyMorpheus.

With this wrapper, I was up and running in absolutely no time, making calls to the API, parsing JSON responses, etc.

The use case I am working on is a "reconciliator" that will do two things:

  • Remove Orphaned VMs - find, and delete (upon user confirmation), those VMs that have had the rug pulled out from under them (deleted in vCenter but still sitting in Morpheus as an Instance)
  • Convert Certain Discovered VMs into managed Morpheus Instances

This part sorta kinda worked. The call to https://<applianceurl>/servers/id/make-managed did take a Discovered VM and convert it to an instance, with a "VMWare" logo on it.

But I was unable to set the advanced attributes of the VMs - Instance Type, Layout, Plan, etc. - and this made it only a partial success.
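For reference, the conversion call itself looks roughly like the sketch below. I am showing plain requests rather than pyMorpheus so the endpoint is visible; the PUT verb and the /api prefix are my assumptions about the appliance's REST API, and the token, server ID, and payload are placeholders:

import requests

APPLIANCE = "https://<applianceurl>"   # Morpheus appliance URL (placeholder)
TOKEN = "REPLACE_ME"                   # API bearer token (placeholder)
SERVER_ID = 1234                       # id of the Discovered VM (placeholder)

headers = {"Authorization": "Bearer " + TOKEN, "Content-Type": "application/json"}

# Convert a Discovered VM into a managed instance. Attributes like Instance Type,
# Layout, and Plan are exactly the ones we could not get to stick via the API.
payload = {"server": {}}

resp = requests.put(
    APPLIANCE + "/api/servers/%d/make-managed" % SERVER_ID,
    headers=headers, json=payload, verify=False)
resp.raise_for_status()
print(resp.json())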

Maybe if we can get the API fixed up a bit, we can get this to work.

One issue, is the "Cloud Sync". When we call the API, we do a cloud sync, to find Discovered VMs. We do the same cloud sync, to determine whether any of the VM's fields in Morpheus change their state, if someone deletes a VM in vCenter (such a state change gives us the indicator that the VM is, in fact, now an orphan).  The Cloud Sync is an asynchronous call. You have to wait for an indefinite amount of time, to ensure that the results you are looking for in vCenter, are reflected in Morpheus. It's basically polling, which is not an exact art. For this reason, the reconciliator tool needs to be run as an operations tool, manually, as opposed to some kind of batch scheduled job.


Tuesday, January 17, 2023

Trying to get RSS (Receive Side Scaling) to work on an Intel X710 NIC

 

Cisco M5 server, with 6 nics on it. The first two are 1G nics that are unused. 

The last 4, are:

  • vmnic2 - 10G nic, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up

  • vmnic3 - 10G nic, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up

  • vmnic4 - 10G nic, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up

  • vmnic5 - 10G nic, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up

Worth mentioning:

  • vmnic 2 and 4 are uplinks, using a standard Distributed Switch (virtual switch) for those uplinks.

  • vmnic 3 and 5 are connected to an N-VDS virtual switch (used with NSX-T) and don't have uplinks.

In ESXi (VMWare Hypervisor, v7.0), we have set the RSS values accordingly:

UPDATED: how we set the RSS Values!

First, make sure that the RSS parameter is unset, because DRSS and RSS should not be set together.

> esxcli system module parameters set -m i40en -p RSS=""

Next, set the DRSS parameter. We are setting 4 Rx queues per relevant vmnic.

> esxcli system module parameters set -m i40en -p DRSS=4,4,4,4

Now we list the parameters to make sure they took effect:

> esxcli system module parameters list -m i40en
Name           Type          Value    Description
-------------  ------------  -------  -----------
DRSS           array of int           Enable/disable the DefQueue RSS(default = 0 )
EEE            array of int           Energy Efficient Ethernet feature (EEE): 0 = disable, 1 = enable, (default = 1)
LLDP           array of int           Link Layer Discovery Protocol (LLDP) agent: 0 = disable, 1 = enable, (default = 1)
RSS            array of int  4,4,4,4  Enable/disable the NetQueue RSS( default = 1 )
RxITR          int                    Default RX interrupt interval (0..0xFFF), in microseconds (default = 50)
TxITR          int                    Default TX interrupt interval (0..0xFFF), in microseconds, (default = 100)
VMDQ           array of int           Number of Virtual Machine Device Queues: 0/1 = disable, 2-16 enable (default =8)
max_vfs        array of int           Maximum number of VFs to be enabled (0..128)
trust_all_vfs  array of int           Always set all VFs to trusted mode 0 = disable (default), other = enable

But we are seeing this when we look at the individual adapters in the ESXi kernel:

> vsish -e get /net/pNics/vmnic3/rxqueues/info
rx queues info {
   # queues supported:1
   # rss engines supported:0
   # filters supported:0
   # active filters:0
   # filters moved by load balancer:0
   RX filter classes: 0 -> No matching defined enum value found.
   Rx Queue features: 0 -> NONE
}

Nics 3 and 5, connected to the N-VDS virtual switch, show only a single supported Rx queue, even though the kernel module is configured properly.

> vsish -e get /net/pNics/vmnic2/rxqueues/info
rx queues info {
   # queues supported:9
   # rss engines supported:1
   # filters supported:512
   # active filters:0
   # filters moved by load balancer:0
   RX filter classes: 0x1f -> MAC VLAN VLAN_MAC VXLAN Geneve
   Rx Queue features: 0x482 -> Pair Dynamic GenericRSS
}

But Nics 2 and 4, which are connected to the standard distributed switch, show 9 supported Rx queues, as expected.
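To eyeball all four nics at once instead of running vsish by hand, a quick wrapper like the one below (ESXi ships a Python interpreter, so it can run right in the ESXi shell) prints the queue counts side by side. It is just a convenience sketch around the same vsish command shown above:

import re
import subprocess

VMNICS = ["vmnic2", "vmnic3", "vmnic4", "vmnic5"]

for nic in VMNICS:
    # Same command we ran manually above, one vmnic at a time.
    out = subprocess.check_output(
        ["vsish", "-e", "get", "/net/pNics/%s/rxqueues/info" % nic],
        universal_newlines=True)
    queues = re.search(r"queues supported:(\d+)", out)
    engines = re.search(r"rss engines supported:(\d+)", out)
    print("%s  rx queues=%s  rss engines=%s" % (
        nic,
        queues.group(1) if queues else "?",
        engines.group(1) if engines else "?"))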

Is this related to the virtual switch we are connecting to (meaning we need to be looking at VMWare)? Or is this somehow related to the i40en driver being used (in which case we need to go to the server vendor, or to Intel, who makes the XL710 nic)?
