
Sunday, January 19, 2025

NUMA PreferHT VM setting on a Hyperthread-Enabled ESXi Hypervisor

This could be a long post, because things like NUMA can get complicated.

For background, we are running servers - hypervisors - that have 24 physical cores. There are two sockets (chips - or "wafers" as I like to refer to them), each with 12 cores, giving a total of 24 physical cores.

When you enable hyperthreading, you get 48 logical cores, and this is what is presented to the operating system and CPU scheduler (somewhat - more on this later).  But you don't get an effective doubling of capacity when you enable hyperthreading. What is really happening is that each of the 24 physical cores is split into two hardware threads, so another 24 logical cores can "fit in", giving you 48 logical cores.

Also worth mentioning is that each logical core has a "sibling" - and this matters from a scheduling perspective when things like CPU pinning are used, because if you pin something to a specific core, that core's sibling cannot be used for something else.  For example, with hyperthreading enabled, the cores pair up like this:

0 | 1

2 | 3

4 | 5

... and so on. So if someone pinned to core 4, core 5 is also "off the table" now from a scheduling perspective because pinning is a physical core concept, not a logical core concept.

So with this background, we had a tenant who wanted to enable a "preferHT" setting. This setting can be applied to an entire hypervisor by setting numa.PreferHT=1, affecting all VMs deployed on it.

Or, one can selectively add this setting to a specific virtual machine by going into its Advanced Settings and configuring numa.vcpu.preferHT=TRUE.  
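For reference, here is roughly how the two variants look - treat this as a sketch rather than an exact procedure. The host-wide setting goes in the ESXi host's advanced settings, and the per-VM setting goes into the VM's advanced configuration parameters (which land in the .vmx):

# per-host (affects every VM deployed on the hypervisor):
numa.PreferHT = 1

# per-VM (VM Options > Advanced > Configuration Parameters):
numa.vcpu.preferHT = "TRUE"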

In our case, it was the VM setting being requested - not the hypervisor setting.  Now, this tenant is the "anchor tenant" on the platform, and their workloads are very latency-sensitive, so it was important to jump through this hoop when it was requested. First, we tested the setting by powering a VM off, adding the setting, and powering the VM back on. No problems with this. We then migrated the VM to another hypervisor, and had no issues with that either. Aside from that, though, how do you know that the VM setting "took" - meaning that it was picked up and recognized?

It turns out that there are a couple of ways to do this:

1. esxtop

When you load esxtop, it shows the CPU view by default. Hitting the "m" key switches to the memory view, and from there hitting the "f" key brings up a list of fields. One of them is NUMA Statistics. Selecting it gets you a ton of interesting information about NUMA. The fields you are most interested in are:

NHN (NUMA Home Node) - current home node for the virtual machine or resource pool. In our case this was 0 or 1 (we had two NUMA nodes, as there is usually one per physical CPU socket).

NMIG (NUMA Migrations) - number of NUMA migrations between two snapshot samples.

NRMEM (NUMA Remote Memory) - amount of remote memory allocated to the virtual machine, in MB.

NLMEM (NUMA Local Memory) - amount of local memory allocated to the virtual machine, in MB.

L%D - the percentage of the VM's memory that is local. You want this number to be 100%, but seeing it in the 90s is probably okay too; it shows that memory access is mostly not traversing the NUMA interconnect, which adds latency.

GST_NDx (Guest Node x) - guest memory being allocated for the VM on NUMA node x, where x is the node number.

MEMSZ (Memory Size) - total amount of physical memory allocated to the virtual machine.
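If you would rather capture these counters over a period of time than watch them interactively, esxtop also has a batch mode. A quick sketch (the delay and sample count here are arbitrary):

# 20 samples, 5 seconds apart, dumped to CSV for offline analysis
esxtop -b -d 5 -n 20 > esxtop-numa-samples.csv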

2. vmdumper command

I found this command on a blog post, which I will list in my sources at the end of this post. This useful command can show you a lot of interesting information about how NUMA is working "under the hood" (in practice): a logical-processor-to-NUMA-node map, how many home nodes are utilized for a given VM, and the assignment of NUMA clients to their respective NUMA nodes.

One of the examples covered in that blog post walks through the situation where a VM has 12 vCPUs on a 10-core system, and then shows what it would look like if the VM had 10 vCPUs instead.
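The variant of the command I ended up using is essentially the one-liner from the referenced posts - roughly this, run in the ESXi shell; it pulls the NUMA-related lines out of each running VM's vmware.log:

vmdumper -l | cut -d \/ -f 2-5 | while read path; do
   egrep -oi "DICT.*(displayname.*|numa.*|cores.*|vcpu.*|memsize.*|affinity.*)= .*|numa:.*|numaHost:.*" "/$path/vmware.log"
   echo -e
done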


Sources:

http://www.staroceans.org/ESXi_VMkernel_NUMA_Constructs.htm

https://frankdenneman.nl/2010/02/03/sizing-vms-and-numa-nodes/

https://frankdenneman.nl/2010/10/07/numa-hyperthreading-and-numa-preferht/

https://docs.pexip.com/server_design/vmware_numa_affinity.htm

https://docs.pexip.com/server_design/numa_best_practices.htm#hyperthreading

https://knowledge.broadcom.com/external/article?legacyId=2003582


 

Wednesday, June 28, 2023

VMWare Storage - Hardware Acceleration Status

Today, we had a customer call in and tell us that they couldn't do Thick provisioning from a vCenter template. We went into vCenter (the GUI), and sure enough, we could only provision Thin virtual machines from it.

But - apparently on another vCenter cluster, they COULD provision Thick virtual machines. There seemed to be no difference between the virtual machines. Note that we are using NFS and not Block or iSCSI or Fibre Channel storage.

We went into vCenter, and lo and behold, we saw this situation...


NOTE: To get to this screen in VMWare's very cumbersome GUI, you have to click on the individual datastore, then click "Configure", and a tab called "Hardware Acceleration" appears.

So, what we have here, is one datastore that says "Not Supported" on a host, and another datastore in the same datastore cluster that says "Supported" on the exact same host. This sounds bad. This looks bad. Inconsistency. Looks like a problem.

So what IS hardware acceleration when it comes to Storage? To find this out, I located this KnowledgeBase:

 Storage Hardware Acceleration

There is also a link for running storage HW acceleration on NAS devices:

Storage Hardware Acceleration on NAS Devices 

When you reference the two links above, there are some additional related links on the left-hand side as well.

For each storage device and datastore, the vSphere Client displays the hardware acceleration support status.

The status values are Unknown, Supported, and Not Supported. The initial value is Unknown.

For block devices, the status changes to Supported after the host successfully performs the offload operation. If the offload operation fails, the status changes to Not Supported. The status remains Unknown if the device provides partial hardware acceleration support.

With NAS, the status becomes Supported when the storage can perform at least one hardware offload operation.

When storage devices do not support or provide partial support for the host operations, your host reverts to its native methods to perform unsupported operations.

NFS = NAS, I am pretty darned sure. 

So this is classic VMWare confusion. They are using a "Status" field with values of Supported / Not Supported, when in fact Supported means "Working" and Not Supported means "Not Working" - based only on the last offload operation attempted.

So, apparently, if a failure on this offload operation occurs, this flag gets flipped to Not Supported, and guess what? That means you cannot do *any* Thick Provisioning.

In contacting VMWare, they want us to re-load the storage plugin. Yikes. Stay Tuned....
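While waiting on that, the same status can be poked at from the ESXi shell instead of the GUI. A hedged sketch - the exact VAAI-NAS plugin VIB name depends on the storage vendor:

# NFS datastores - the Hardware Acceleration column shows Supported / Not Supported per mount
esxcli storage nfs list

# block devices - per-device VAAI primitive status (not applicable to our NFS case, but useful)
esxcli storage core device vaai status get

# the VAAI-NAS plugin itself is a vendor-supplied VIB; verify it is actually installed
esxcli software vib list | grep -i vaai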

VMWare also has some Best Practices for running iSCSI Storage, and the link to that is found at:

VMWare iSCSI Storage Best Practices

Tuesday, January 17, 2023

Trying to get RSS (Receive Side Scaling) to work on an Intel X710 NIC

 

Cisco M5 server, with 6 nics on it. The first two are 1G nics that are unused. 

The last 4, are:

  • vmnic2 - 10G nic, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up

  • vmnic3 - 10G nic, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up

  • vmnic4 - 10G nic, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up

  • vmnic5 - 10G nic, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up
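(For reference, the driver and firmware details above can be pulled per NIC from the ESXi shell - roughly like this, using the vmnic names on this host:)

esxcli network nic list
esxcli network nic get -n vmnic2 | grep -iE "driver|version"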

Worth mentioning:

  • vmnic 2 and 4 are uplinks, using a standard Distributed Switch (virtual switch) for those uplinks.

  • vmnic 3 and 5 are connected to an N-VDS virtual switch (used with NSX-T) and don't have uplinks.

In ESXi (VMWare Hypervisor, v7.0), we have set the RSS values accordingly:

UPDATED: how we set the RSS Values!

First, make sure that the RSS parameter is unset, because DRSS and RSS should not be set together.

> esxcli system module parameters set -m i40en -p RSS=""

Next, set the DRSS parameter. We are setting 4 Rx queues per relevant vmnic.

> esxcli system module parameters set -m i40en -p DRSS=4,4,4,4

Now we list the parameters to ensure they took correctly:

> esxcli system module parameters list -m i40en
Name           Type          Value    Description
-------------  ------------  -------  -----------
DRSS           array of int           Enable/disable the DefQueue RSS(default = 0 )
EEE            array of int           Energy Efficient Ethernet feature (EEE): 0 = disable, 1 = enable, (default = 1)
LLDP           array of int           Link Layer Discovery Protocol (LLDP) agent: 0 = disable, 1 = enable, (default = 1)
RSS            array of int  4,4,4,4  Enable/disable the NetQueue RSS( default = 1 )
RxITR          int                    Default RX interrupt interval (0..0xFFF), in microseconds (default = 50)
TxITR          int                    Default TX interrupt interval (0..0xFFF), in microseconds, (default = 100)
VMDQ           array of int           Number of Virtual Machine Device Queues: 0/1 = disable, 2-16 enable (default =8)
max_vfs        array of int           Maximum number of VFs to be enabled (0..128)
trust_all_vfs  array of int           Always set all VFs to trusted mode 0 = disable (default), other = enable

But, we are seeing this when we look at the individual adaptors in the ESXi kernel:

> vsish -e get /net/pNics/vmnic3/rxqueues/info
rx queues info {
   # queues supported:1
   # rss engines supported:0
   # filters supported:0
   # active filters:0
   # filters moved by load balancer:0
   RX filter classes: 0 -> No matching defined enum value found.
   Rx Queue features: 0 -> NONE
}

NICs 3 and 5, connected to the N-VDS virtual switch, only show a single supported Rx queue, even though the kernel module is configured properly.

> vsish -e get /net/pNics/vmnic2/rxqueues/info
rx queues info {
   # queues supported:9
   # rss engines supported:1
   # filters supported:512
   # active filters:0
   # filters moved by load balancer:0
   RX filter classes: 0x1f -> MAC VLAN VLAN_MAC VXLAN Geneve
   Rx Queue features: 0x482 -> Pair Dynamic GenericRSS
}

But NICs 2 and 4, which are connected to the standard distributed switch, show 9 supported Rx queues, configured properly.

Is this related to the virtual switch we are connecting to (meaning we need to be looking at VMWare)? Or is this somehow related to the i40en driver that is being used (in which case we need to be going to the server vendor, or to Intel, who makes the XL710 NIC)?

Tuesday, January 10, 2023

VMWare NSX-T Testing - Dropped Packets

We have been doing some performance testing with a voice system.

In almost all cases, these tests are failing. They are failing for two reasons:

  1. Rx Missed counters on the physical adaptors of the hypervisors that are used to send the test traffic. These adaptors are connected to the E-NVDS virtual switch on one side, and to an upstream Arista data center switch on the other.
  2. Dropped Packets - mostly media (RTP UDP), with less than 2% of the drops being RTCP traffic (TCP).

Lately, I used the "Performance Best Practices for VMWare vSphere 7.0" guide as a method for trying to reduce the dropped packets we were seeing.

We attempted several things that were mentioned in this document:

 

  • ESXi NIC - enable Receive Side Scaling (RSS)
    • Actually, to be technical, we enabled DRSS (Default Queue RSS) rather than the RSS (NetQueue RSS) which the i40en driver also supports for this Intel X710 adaptor.
  • LatencySensitivity=High - and we checked the "Reserve all Memory" checkbox
  • Interrupt Coalescing (see the sketch after this list)
    • Disabling it, to see what effect disabling it had
    • Setting it from its rate-based scheme (the default, rbc) to static, with 64 packets per interrupt
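The interrupt coalescing changes are made per vNIC in the VM's advanced configuration. A minimal sketch of the two variants we tried, assuming ethernet1 is the data-plane vNIC being tuned:

ethernet1.coalescingScheme = "disabled"

...or, for the static scheme with 64 packets per interrupt:

ethernet1.coalescingScheme = "static"
ethernet1.coalescingParams = "64"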

We didn't really see any noticeable improvement from the Receive Side Scaling or the Latency Sensitivity settings, which was a surprise, actually. We did see some perhaps minor improvement on the interrupt coalescing when we set it to static.

Tuesday, June 7, 2022

VMWare Network Debugging - Trex Load Generation and Ring Buffer Overflow

We began running Trex Traffic Generator testing, sending load to a couple of virtual machines running on ESXi vSphere-managed hypervisors, and ran into some major problems.

First, the Trex Traffic Generator:

  • Cent7 OS virtual machine
  • 3 ports
    • eth0 used for ssh connectivity and to run Trex and Trex Console (with screen utility)
    • eth1 for sending traffic (Trex will put this port into DPDK-mode so OS cannot see it)
    • eth2 for sending traffic (Trex will put this port into DPDK-mode so OS cannot see it)
  • 4 cores 
    • the VM actually has 6, but two are used for running OS and Trex Admin
    • Traffic Tests utilize 4 cores

Next, the Device(s) Under Test (DUT):

  1. Juniper vSRX which is a router VM (based on JUNOS but Berkeley Unix under the hood?)
  2. Standard CentOS7 Virtual Machine
     

We ran the stateless imix test, at 20% and 100% line utilization.

We noticed that the Trex VM was using 80-90% core usage in the test (Trex Stats from console), and was using 20-25% line utilization, sending 4Gbps per port (8Gbps total) to the DUT virtual machines.

On the receiving side, the router was only processing about 1/4 to 1/6 of the packets sent by Trex.  The Cent7 VM, also, could not receive more than about 3.5Gbps maximum.

So what is happening? This led us to a Deep Dive, into the VMWare Statistics.

By logging into the ESXi host that the receiving VM was running on, we could first find out what virtual switch and port the VM interface was assigned to, by running:

# net-stats -l

This produces a list, like this:

PortNum          Type SubType SwitchName       MACAddress         ClientName
50331650            4       0 DvsPortset-0     40:a6:b7:51:18:60  vmnic4
50331652            4       0 DvsPortset-0     40:a6:b7:51:1e:fc  vmnic2
50331654            3       0 DvsPortset-0     40:a6:b7:51:1e:fc  vmk0
50331655            3       0 DvsPortset-0     00:50:56:63:75:bd  vmk1
50331663            5       9 DvsPortset-0     00:50:56:8a:af:c1  P6NPNFVNDPKVMA.eth1
50331664            5       9 DvsPortset-0     00:50:56:8a:cc:74  P6NPNFVNDPKVMA.eth2
50331669            5       9 DvsPortset-0     00:50:56:8a:e3:df  P6NPNFVNRIV0009.eth0
67108866            4       0 DvsPortset-1     40:a6:b7:51:1e:fd  vmnic3
67108868            4       0 DvsPortset-1     40:a6:b7:51:18:61  vmnic5
67108870            3       0 DvsPortset-1     00:50:56:67:c5:b4  vmk10
67108871            3       0 DvsPortset-1     00:50:56:65:2d:92  vmk11
67108873            3       0 DvsPortset-1     00:50:56:6d:ce:0b  vmk50
67108884            5       9 DvsPortset-1     00:50:56:8a:80:3c  P6NPNFVNDPKVMA.eth0

A couple of nifty commands will show you the statistics:
# vsish -e get /net/portsets/DvsPortset-0/ports/50331669/clientStats
port client stats {
   pktsTxOK:115
   bytesTxOK:5582
   droppedTx:0
   pktsTsoTxOK:0
   bytesTsoTxOK:0
   droppedTsoTx:0
   pktsSwTsoTx:0
   droppedSwTsoTx:0
   pktsZerocopyTxOK:0
   droppedTxExceedMTU:0
   pktsRxOK:6595337433
   bytesRxOK:2357816614826
   droppedRx:2934191332 <-- lots of dropped packets
   pktsSwTsoRx:0
   droppedSwTsoRx:0
   actions:0
   uplinkRxPkts:0
   clonedRxPkts:0
   pksBilled:0
   droppedRxDueToPageAbsent:0
   droppedTxDueToPageAbsent:0
}

# vsish -e get /net/portsets/DvsPortset-0/ports/50331669/vmxnet3/rxSummary
stats of a vmxnet3 vNIC rx queue {
   LRO pkts rx ok:0
   LRO bytes rx ok:0
   pkts rx ok:54707478
   bytes rx ok:19544123192
   unicast pkts rx ok:54707448
   unicast bytes rx ok:19544121392
   multicast pkts rx ok:0
   multicast bytes rx ok:0
   broadcast pkts rx ok:30
   broadcast bytes rx ok:1800
   running out of buffers:9325862
   pkts receive error:0
   1st ring size:4096 <-- this is a very large ring buffer size!
   2nd ring size:256
   # of times the 1st ring is full:9325862 <-- WHY packets are being dropped
   # of times the 2nd ring is full:0
   fail to map a rx buffer:0
   request to page in a buffer:0
   # of times rx queue is stopped:0
   failed when copying into the guest buffer:0
   # of pkts dropped due to large hdrs:0
   # of pkts dropped due to max number of SG limits:0
   pkts rx via data ring ok:0
   bytes rx via data ring ok:0
   Whether rx burst queuing is enabled:0
   current backend burst queue length:0
   maximum backend burst queue length so far:0
   aggregate number of times packets are requeued:0
   aggregate number of times packets are dropped by PktAgingList:0
   # of pkts dropped due to large inner (encap) hdrs:0
   number of times packets are dropped by burst queue:0
   number of packets delivered by burst queue:0
   number of packets dropped by packet steering:0
   number of packets dropped due to pkt length exceeds vNic mtu:0 <-- NOT the issue!
}

We also noticed that this VM had one Rx queue per vCPU (no additional settings were made to this specific Cent7 VM):

# vsish -e ls /net/portsets/DvsPortset-0/ports/50331669/vmxnet3/rxqueues
0/
1/
2/
3/
4/
5/
6/
7/

Each of the queues can be dumped individually to check its ring buffer size (we did this, and they were all 4096):

# vsish -e get /net/portsets/DvsPortset-0/ports/50331669/vmxnet3/rxqueues/1/status
status of a vmxnet3 vNIC rx queue {
   intr index:1
   stopped:0
   error code:0
   ring #1 size:4096 <-- if you use ethtool -G eth0 rx 4096 inside the VM it updates ALL queues
   ring #2 size:256
   data ring size:0
   next2Use in ring0:33
   next2Use in ring1:0
   next2Write:1569
}

# vsish -e get /net/portsets/DvsPortset-0/ports/50331669/vmxnet3/rxqueues/7/status
status of a vmxnet3 vNIC rx queue {
   intr index:7
   stopped:0
   error code:0
   ring #1 size:4096 <-- if you use ethtool -G eth0 rx 4096 inside the VM it updates ALL queues
   ring #2 size:256
   data ring size:0
   next2Use in ring0:1923
   next2Use in ring1:0
   next2Write:3458
}

So, that is where we are. We see the problem. Now, to fix it - that might be a separate post altogether.
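On the guest side, the vmxnet3 ring sizes can be inspected and adjusted with ethtool, as the annotations above hint - though in our case the first ring was already at 4096, so this alone is not the fix:

ethtool -g eth0          # show current and maximum ring sizes for the vNIC
ethtool -G eth0 rx 4096  # raise the first rx ring (on vmxnet3 this updates all rx queues)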


Some additional knowledgebase sources of information on troubleshooting in VMWare environments:

  • MTU Problem

            https://kb.vmware.com/s/article/75213

  • Ring Buffer Problem

            https://kb.vmware.com/s/article/2039495

            https://vswitchzero.com/2017/09/26/vmxnet3-rx-ring-buffer-exhaustion-and-packet-loss/

Tuesday, April 12, 2022

DPDK Testing using TestPMD on a VMWare Virtual Machine

 

Testing and verifying DPDK is NOT easy. And it is even more challenging in VM environments.

After investing in VMWare hypervisors that supposedly run DPDK, we wanted to test and verify that a) it worked, and b) the performance was as advertised.

Below is a list of steps we took to get the host and a DPDK-enabled VM ready:

  • Hypervisor(s)
    • Enabled the ixgben_ens drivers on the host. There are some ESXi CLI commands you can run to ensure that these are loaded and working. 
  • VM Settings
    • VMXNET3 adaptor type on the VM
    • Latency Sensitivity = High (sched.cpu.latencySensitivity = "high")
    • Hugepages enabled in the VM (sched.mem.lpage.enable1GPage = TRUE)
    • Reserve all Guest Memory
    • Enable Multiple Cores for High I/O Workloads (ethernetX.ctxPerDev = "1")
    • CPU Reservation
    • NUMA Affinity (numaNodeAffinity=X)

After this, I launched the VM. I was smart enough to launch the VM with 3 NICs on it.

  1. eth0 - used as a management port, for ssh and such.
  2. eth1 - this one to be used for DPDK testing
  3. eth2 - this one to be used for DPDK testing

Launching a VM (e.g. a RHEL Linux VM) with these settings does NOT mean that you are ready for DPDK!! You still need a DPDK-compiled application on your OS. DPDK applications need DPDK-capable NIC drivers in the VM, and on a Linux VM these drivers are typically loaded as kernel modules. There are several different kinds of DPDK drivers (kernel modules), such as vfio-pci, uio_pci_generic, igb_uio, et al.
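A quick way to check which of these modules (if any) are already loaded in the guest - none were on our freshly built VM, which led to the snag described below:

lsmod | grep -E "vfio|uio"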

To prepare your VM for testing, we decided to install DPDK, and then run the TestPMD application.

Installing DPDK

To get DPDK, you can go to dpdk.org and download the source, which comes as a tar.gz file that can be unpacked. Or, there is a git repository that you can clone:

# git clone http://dpdk.org/git/dpdk

It is important to read the instructions when building DPDK, because the old-style "configure", "make", "make install" process has been replaced by fancier build tools - meson and ninja - that you need to install. I chose to build by going to the top of the directory tree and typing:

# meson -Dexamples=all build

This does not actually compile the code. It sets the table for you to use Ninja to build the code. So the next step was to cd into the build directory and type:

# ninja

Followed by:

# ninja install

The "ninja install" puts a resultant set of DPDK executables (some ELF, some Python), in /usr/local/bin directory (maybe installs some stuff in other places too).

Right away, I hit a snag. When I tried to run dpdk_setup.py to bind the VM's adaptors, the kernel module igb_uio.ko was nowhere to be found.

I was completely at a loss about this, until I realized that some other DPDK packages (test load generators) compile DPDK and the igb_uio.ko driver themselves, either by including the sources outright or by copying them into the build process. Trex, for example, builds the driver, and so does a package called DTS (DPDK Testing Suite). After I git cloned the DTS package, I stumbled upon some documentation: in /opt/github/dts/doc/dts_gsg/usr_guide there is a file called igb_uio.rst, which describes how to compile the igb_uio.ko driver for use with DTS. This was the missing link. The section at the front of the file explained that the driver has been moved into a different repository - it is now separated from DPDK!

Get Source Code - note: assumption is that you are doing this in /opt directory.
---------------

Get igb_uio::

   git clone http://dpdk.org/git/dpdk-kmods
   git clone git://dpdk.org/dpdk-kmods


Get DPDK::

   git clone git://dpdk.org/dpdk
   git clone http://dpdk.org/git/dpdk

The author of this igb_uio.rst file described the process that can be used to fuse DPDK and the drivers back together into a single build - the way it used to be. How convenient. Here is how that is done.

Integrate igb_uio into DPDK
---------------------------

Assume you have cloned the dpdk and dpdk-kmods source code
in /opt/dpdk and /opt/dpdk-kmods.

Step 1
# Copy dpdk-kmods/linux/igb_uio/ to dpdk/kernel/linux/:

    [root@dts linux]# cp -r /opt/dpdk-kmods/linux/igb_uio /opt/dpdk/kernel/linux/

you should see igb_uio in your output:

    [root@dts linux]# ls /opt/dpdk/kernel/linux/
    igb_uio  kni  meson.build

Step 2:
# enable igb_uio build in meson:

since we have copied the directory over to /opt/dpdk, we will edit the meson.build there.

*   add igb_uio in /opt/dpdk/kernel/linux/meson.build subdirs as below:

     subdirs = ['kni', 'igb_uio']

NOTE: this is an important step not to miss because it will not build if you don't do this.

Step 3:
*   create a file of meson.build in /opt/dpdk/kernel/linux/igb_uio/ as below:

     # SPDX-License-Identifier: BSD-3-Clause
     # Copyright(c) 2017 Intel Corporation

     mkfile = custom_target('igb_uio_makefile',
             output: 'Makefile',
             command: ['touch', '@OUTPUT@'])

     custom_target('igb_uio',
             input: ['igb_uio.c', 'Kbuild'],
             output: 'igb_uio.ko',
             command: ['make', '-C', kernel_dir + '/build',
                     'M=' + meson.current_build_dir(),
                     'src=' + meson.current_source_dir(),
                     'EXTRA_CFLAGS=-I' + meson.current_source_dir() +
                             '/../../../lib/librte_eal/include',
                     'modules'],
             depends: mkfile,
             install: true,
             install_dir: kernel_dir + '/extra/dpdk',
             build_by_default: get_option('enable_kmods'))

How wonderful. To recap, here is what we did:

  1. copy the source files from dpdk-kmods into the proper directory of dpdk
  2. snap in the proper meson build file (which the author graciously provides)
  3. uninstall (previous build, assuming you built DPDK before doing all of this) 
  4. rebuild
  5. reinstall

Step 3:

# cd /opt/dpdk/build

# ninja uninstall

Step 4:
# ninja

Step 5:

# ninja install

A quick find command shows that the kernel module was built.

[root@acdcchndnfvdpk0001 dpdk]# find . -print | grep ko
./build/drivers/net/octeontx/base/libocteontx_base.a.p/octeontx_pkovf.c.o
./build/lib/librte_table.a.p/table_rte_table_hash_cuckoo.c.o
./build/lib/librte_hash.a.p/hash_rte_cuckoo_hash.c.o
./kernel/linux/igb_uio/igb_uio.ko
./kernel/linux/igb_uio/.igb_uio.ko.cmd
./drivers/net/octeontx/base/octeontx_pkovf.c
./drivers/net/octeontx/base/octeontx_pkovf.h
./lib/hash/rte_cuckoo_hash.h
./lib/hash/rte_cuckoo_hash.c
./lib/table/rte_table_hash_cuckoo.h
./lib/table/rte_table_hash_cuckoo.c

Now, we have something we can use to bind our adaptors to the drivers!!! 

You can bind the adaptors to the drivers using a couple of different methods. You can use a utility that is supplied by DPDK to do it (dpdk-devbind.py), or you can use a nifty Linux utility called driverctl, which I prefer (this typically needs to be installed with a package manager, as it generally does not come with a default OS installation). 

A script I use to do the binding looks like this:

# cat bind-pci.sh
#!/bin/bash

# show the PCI addresses of all network interfaces, to help pick one
lshw -class network -businfo | grep pci

while :
do
   echo "Linux Interface to override (e.g. p1p1, p1p2, p1p3, p1p4):"
   read iface
   # type "skip" to bail out without binding anything
   if [ "${iface}" = "skip" ]; then
      break
   fi
   lshw -class network -businfo | grep pci | grep "${iface}"
   if [ $? -eq 0 ]; then
      # extract the PCI address - the part after the "@" in the businfo column (e.g. pci@0000:0b:00.0)
      pci=`lshw -class network -businfo | grep pci | grep "${iface}" | awk '{printf $1}' | cut -f2 -d"@"`
      echo "We will override the kernel driver with igb_uio for PCI address: ${pci}"
      driverctl set-override "${pci}" igb_uio
      break
   fi
done 

When you run this script, you can check to see if the binding was successful by running a DPDK command:

# python3 /usr/local/bin/dpdk-devbind.py --status

And this command will show you whether the binding worked or not.

Network devices using DPDK-compatible driver
============================================
0000:0b:00.0 'VMXNET3 Ethernet Controller 07b0' drv=igb_uio unused=vmxnet3
0000:13:00.0 'VMXNET3 Ethernet Controller 07b0' drv=igb_uio unused=vmxnet3


Network devices using kernel driver
===================================
0000:03:00.0 'VMXNET3 Ethernet Controller 07b0' if=eth0 drv=vmxnet3 unused=igb_uio *Active*

NOTE: To remove a driver binding, "driverctl unset-override ${pci address}" would be used, in which case the interface becomes visible to the Linux OS in the virtual machine again.

So we now have one adaptor that the Linux networking kernel sees (eth0), but the two adaptors Linux saw prior to the binding (eth1 and eth2) have now been "reassigned" to DPDK, and the OS no longer sees them at all. 

If we run an ifconfig, or an "ip a" command to see the Linux network interfaces in the VM, this is what it now looks like.

# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:b7:83:1a brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.10/24 brd 192.168.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:feb7:831a/64 scope link
       valid_lft forever preferred_lft forever

NOTE: No eth1 or eth2 shows up in Linux anymore, as they have been handed over to DPDK, which bypasses the Linux Kernel entirely.

Okay, now we have adaptors set up for DPDK. Now what? In our next step, we will do some simple verification with TestPMD.
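As a preview of that next step, the TestPMD invocation ends up looking something like the sketch below - the binary path, core list, and queue counts are assumptions for this particular VM rather than a definitive recipe:

# run testpmd interactively against the two igb_uio-bound ports, using cores 1-3
/usr/local/bin/dpdk-testpmd -l 1-3 -n 4 -- -i --nb-cores=2 --rxq=1 --txq=1
testpmd> start
testpmd> show port stats all
testpmd> stop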

Tuesday, March 1, 2022

VMWare Clustered File Systems - VMFS5 vs VMFS6

 

 A nice table that describes the differences between VMWare's VMFS5 and the new VMFS 6.

Source: http://www.vmwarearena.com/difference-between-vmfs-5-vmfs-6/


For the difference in 512n versus 512e:


VMFSsparse:

VMFSsparse is a virtual disk format used when a VM snapshot is taken or when linked clones are created off the VM. VMFSsparse is implemented on top of VMFS and I/Os issued to a snapshot VM are processed by the VMFSsparse layer. VMFSsparse is essentially a redo-log that grows from empty (immediately after a VM snapshot is taken) to the size of its base VMDK (when the entire VMDK is re-written with new data after the VM snapshotting). This redo-log is just another file in the VMFS namespace and upon snapshot creation the base VMDK attached to the VM is changed to the newly created sparse VMDK.

SEsparse (space efficient):

SEsparse is a new virtual disk format that is similar to VMFSsparse (redo-logs) with some enhancements and new functionality. One of the differences of SEsparse with respect to VMFSsparse is that the block size is 4KB for SEsparse compared to 512 bytes for VMFSsparse. Most of the performance aspects of VMFSsparse discussed above - impact of I/O type, snapshot depth, physical location of data, base VMDK type, etc. - apply to the SEsparse format as well.

Monday, October 4, 2021

The first Accelerated VNF on our NFV platform

 I haven't posted anything since April but that isn't because I haven't been busy.

We have our new NFV Platform up and running, and it is NOT on OpenStack. It is NOT on VMWare VIO. It also, is NOT on VMWare Telco Cloud!

We are using ESXi, vCenter, NSX-T for the SD-WAN, and Morpheus as a Cloud Management solution. Morpheus has a lot of different integrations, and a great user interface that gives tenants a place to log in and call home and self-manage their resources.

The diagram below depicts what this looks like from a Reference Architecture perspective.

The OSS, which is not covered in the diagram, is a combination of Zabbix and VROPS, both working in tandem to ensure that the clustered hosts and management functions are behaving properly.

The platform is optimized with E-NVDS, commonly referred to as Enhanced Datapath, which requires special DPDK drivers to be loaded on the ESXi hosts, as well as additional settings in the hypervisors to ensure that the E-NVDS is configured properly (separate upcoming post).

Now that the platform is up and running, it is time to start discussing workload types. There are a number of Workload Categories that I tend to use:

  1. Enterprise Workloads - Enterprise Applications, 3-Tier Architectures, etc.
  2. Telecommunications Workloads
    • Control Plane Workloads
    • Data Plane Workloads

Control Plane workloads have more tolerance for latency and for constrained system resources than Data Plane workloads do. 

Why? Because Control Plane workloads are typically TCP-based, frequently use (RESTful) APIs, and tend to be more periodic in their behavior (periodic updates).  Most of the time, when you see issues related to the Control Plane, it is related to back-hauling a lot of measurements and statistics (Telemetry Data). But generally speaking, this data in and of itself does not have stringent requirements.

From a VM perspective, there are a few key things you need to do to ensure your VNF behaves as a true VNF and not as a standard workload VM. These include:

  • setting Latency Sensitivity to High, which turns off interrupts and ensures that poll mode drivers are used.
  • Enable Huge Pages on the VM by going into VM Advanced Settings and adding the parameter: sched.mem.lpage.enable1GHugePage = TRUE

Note: Another setting worth checking, although we did not actually set this parameter ourselves, is: sched.mem.pin = TRUE

Note: Another setting, sched.mem.maxmemctl, ensures that ballooning is turned off. We do NOT have this setting, but it was mentioned to us, and we are researching it.
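Pulled together, the parameters mentioned above would look roughly like this in the VM's advanced configuration - note that on our platform only the first two are actually set; the last two are included only as assumptions of how they would be applied:

sched.cpu.latencySensitivity = "high"
sched.mem.lpage.enable1GHugePage = "TRUE"
# the following two were mentioned to us but are NOT set on our platform:
sched.mem.pin = "TRUE"
sched.mem.maxmemctl = "0"    # assumption: a zero balloon limit effectively disables ballooning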

One issue we seemed to continually run into, was a vCenter alert called Virtual Machine Memory Usage, displaying in vCenter as a red banner with "Acknowledge and Reset to Green" links. The VM was in fact running, but vCenter seemed to have issues with it. The latest change we made that seems to have fixed this error, was to check the "Reserve all guest memory (All locked)" option checkbox.

This checkbox to Reserve all guest memory seemed intimidating at first, because the concern was that the VM could reserve all memory on the host. That is NOT what this setting does!!! What it does is allow the VM to reserve all of its own configured memory up front - just the memory specified for the VM (i.e. 24G). If the VM has HugePages enabled, it makes sense that one would want the entire allotment of VM memory to be reserved up front and be contiguous. When we enabled this, our vCenter alerts disappeared.

Lastly, we decided to change DRS to Manual in VM Overrides. To find this setting amongst the huge number of settings hidden in vCenter, you go to the Cluster (not the Host, not the VM, not the Datacenter) and the option for VM Overrides is there, and you have four options:

  • None
  • Manual
  • Partial
  • Full

The thinking here, is that VMs with complex settings may not play well with vMotion. I will be doing more research on DRS for VNFs before considering setting this (back) to Partial or Full.

Monday, April 26, 2021

Tenancy is Critical on a Cloud Platform

With this new VMWare platform, it was ultimately decided to go with ESXi hypervisors, managed by vCenter, and NSX-T.  

During the POC, it was pointed out that this combination of solutions had some improvements and enhancements over OpenStack (DRS, vMotion, et al). But one thing seemed to be overlooked, and we pointed it out: Tenancy

VMWare attempts to address Tenancy with vertical-stack point solutions, like vCloud Director (positioned at Service Providers) or vRealize Automation. The latter is going through a complete transformation in its latest version. These solutions are also expensive. And if you don't have the budget, what are your options?

One option is to set up Resource Pools and Folders in vCenter. Not the cleanest solution because you cannot set policies, workflows, etc.

What else can you do? Well, you can use a Cloud Management solution.

We had Cloudify as an Orchestrator. And we evaluated that as a Cloud Management solution. But what we found in the end, was that Cloudify excelled at complex orchestration, but it was not designed and built, ground-up, to be a Cloud Management Platform.

This (lack of) Tenancy seemed to become apparent to everyone all at once - once the platform came up on VMWare.  And with Cloudify, we lacked the Blueprint development to cover the scores to hundreds of tasks that we needed. It needed integrations with NSX-T, vCenter, and a host of other solutions.

We looked at a couple of other solutions, and settled on a solution called Morpheus.

I will blog a bit more about Morpheus in upcoming posts. I have been very hands-on with it lately. 

Friday, April 10, 2020

VMWare Forged Transmits - and how it blocks Nested Virtualization


Nested Virtualization is probably never a good idea in general, but there are certain cases where you need it. We happened to be in one of those certain cases.

After creating a VM on VMWare (CentOS7), we installed libVirtD.

The first issue we ran into was that nobody had checked a checkbox called "Expose Hardware Virtualization to GuestOS". As a result, we were able to install libVirtD and launch a nested VM, but when created with virt-install, the VM ran in qemu mode (full emulation) rather than kvm mode (hardware-accelerated).

We also needed to change the LibVirtD default storage pool to point to a volume, so that it had enough space to run a large qcow2 vendor-provided image.

After running virt-install, we were able to get a virtual machine up and running, and get to the console (we had to toy with serial console settings in virt-install to get this to work).

The adaptor in the nested VM was a host bridge, and what we found was that we could - from the nested VM - ping the CentOS7 host VM (and vice versa). But we couldn't ping anything further than that. The libVirtD VM that was hosting the nested VM had no problem pinging anything; it could ping the VM it was hosting, the default gateway on the subnet, other hosts on the subnet, and out to the internet.

So the packets - frames, rather - were not getting out to the VMWare vSwitch. Or were they?

In doing some arp checks, we actually saw that the CentOS7 libVirtD host had a populated arp table. But the tenant nested VM only had a partially populated arp table.
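(The "arp checks" were nothing fancy - roughly this, run on the libVirtD host VM and again inside the nested VM, to compare what each side had learned:)

ip neigh show    # or "arp -an" if net-tools is installed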

After pulling in some additional network expertise to work alongside us in troubleshooting, this one fellow sent in a link to a blog article about a security policy feature on VMWare vSwitches called Forged Transmits.

I will drop a link to that article, but also post the picture from that article, because the diagram so simply and perfectly describes what is happening.

https://wahlnetwork.com/2013/04/29/how-the-vmware-forged-transmits-security-policy-works/


Not being a VMWare Administrator, I don't know how enabling this works; if it is at the entire vSwitch level, or if it is at a port or port group level, etc.

But - if you ever plan on running nested virtualization on a VMWare Type 1 Hypervisor, this setting will kill you. Your networking won't work for your nested virtual machine, unless you can find some clever way of tunneling or using a proxy.
