Thursday, November 10, 2022

Zabbix log file parsing - IPtables packet drops

I recently had to create some iptables rules for a clustered system, restricting traffic from RabbitMQ, Percona and other services to just the nodes participating in the cluster.  The details of this could be a blog post in and of itself.

I set up logging on my iptables drops, initially just for debugging, and had planned on removing those log rules. But once I got all the rules working perfectly, the volume of logged drops fell to almost zero. At that point, the only things that get logged are things you would want to know about. So I decided to leave the logging for dropped packets in place (if you do this, make sure your logs rotate and manage your disk space!).
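For context, the drop logging itself comes from LOG rules placed just ahead of the final DROP. The excerpt below is an illustrative sketch in iptables-save format, not my production rule set - the chain layout and rate limit are assumptions - but the log prefixes are the strings the Zabbix items will search for later.

```
*filter
# Log, then drop, at the end of each chain. The prefixes are what we
# will search for from Zabbix. Rate limit keeps the log from flooding.
-A INPUT  -m limit --limit 5/min -j LOG --log-prefix "IPtables Input Dropped: "
-A INPUT  -j DROP
-A OUTPUT -m limit --limit 5/min -j LOG --log-prefix "IPtables Output Dropped: "
-A OUTPUT -j DROP
COMMIT
```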

When you put up a firewall that governs not only inbound traffic but also outbound traffic, you learn a LOT about what your system is receiving and sending, and it is always interesting to see traffic being dropped. I had to continually investigate traffic that was being dropped (especially outbound), and whitelist those services once I discovered what they were. NTP, DNS, all the normal things, but also some services that don't come to mind easily.

Logging onto systems to see what traffic is being dropped is a pain. You might be interested and do it initially, but eventually you will get tired of having to follow an operational routine.

I decided to do something more proactive. Send dropped packets to Zabbix.

To do this, here are the steps:

1. Add 2 new items to the hosts that employed the IPTables rules.

  • Iptables Dropped Packet Inbound
  • Iptables Dropped Packet Outbound

2. Configure your items

Configure each item properly, as shown below (the example is for Inbound drops).

Zabbix Log File Item


First, and this is important, note the Type. It is "Zabbix agent (active)", not the default "Zabbix agent". This is because a log file check requires an active agent, and will not work if the agent is configured as passive. There are plenty of documents and blogs on the web that discuss the difference between Active and Passive.

A passive agent configuration means Zabbix will open the TCP connection and fetch back the measurements it wants on port 10050. An active agent, means that the monitored node will take the responsibility of initiating the connection to the Zabbix server, on port 10051. So passive is "over and back" (server to target and back to server), while active is "over" (target to server). There are firewall repercussions for sure, as the traffic flow is different source to destination, and the two approaches use different ports.

Note that the 2nd parameter of the key is the search string for the log. This can be a regular expression if you want to grab only a portion of the full line, and a regular expression can be wrapped in quotes. I imagine a lot of people get this 2nd parameter wrong, which leads to troubleshooting. And I think many people don't realize that you don't need carets and wildcards to get what you need. For instance, I initially had "^.*IPtables Input Dropped.*$" in this field, before realizing that "IPtables Input Dropped" alone was enough. You only need to be slicing and dicing with complex regular expressions if you want the output trimmed up.
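To convince yourself the plain substring is enough, you can test it with grep against a sample line (the log entry below is made up, but shaped like a typical iptables LOG message):

```shell
# A made-up sample of what an iptables LOG entry might look like
line='Nov 10 09:14:02 node1 kernel: IPtables Input Dropped: IN=eth0 OUT= SRC=10.0.0.9 DST=10.0.0.1 PROTO=TCP DPT=445'

# The plain substring matches the whole line -- no anchors needed
echo "$line" | grep -E 'IPtables Input Dropped'

# The anchored version matches the same line; it buys you nothing here
echo "$line" | grep -E '^.*IPtables Input Dropped.*$'
```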

The third parameter to be concerned with is the Type of Information. This is often overlooked. It needs to be Log. Overlook this, which is easy to do, and the item won't work!

I went with an update interval of 1m (1 minute), and a history storage period of 30d (thirty days). This was to be conservative on space lest we get a ton of log activity.

Also note that the item is Disabled (Enabled unchecked). This is because we cannot actually turn it on yet! We need to configure the agent for Active first. THEN the item can be verified and tested.

3. Next, you need to configure your agent for Active.

I wasn't sure initially whether a Zabbix agent had to be one or the other (Active and Passive being mutually exclusive). My agent was initially set to Passive, which is the default when you install a Zabbix agent: in Passive mode, the Zabbix server reaches out on an interval and gathers up the measurements.

To configure the agent for Active checks, the only thing I needed to do was add a ServerActive=x.x.x.x entry (pointing at the Zabbix server) in the active checks section of the zabbix_agentd.conf file, in the /etc/zabbix directory of the monitored VM.
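For reference, a minimal sketch of the relevant zabbix_agentd.conf lines, with placeholder addresses and hostname (Server= governs passive checks; ServerActive= is the directive that enables active checks):

```
# /etc/zabbix/zabbix_agentd.conf (excerpt) - addresses and hostname are placeholders

# Passive checks: which server may poll this agent (port 10050)
Server=192.168.1.50

# Active checks: where the agent initiates connections (port 10051)
ServerActive=192.168.1.50

# Must match the host name configured in the Zabbix GUI
Hostname=mynode1
```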

Don't forget to restart the zabbix-agent service!

4. Next, you need to make sure the firewalls on both sides are allowing Zabbix

On the monitored node (running the Zabbix agent), I was running iptables - not FirewallD. So I had to add an iptables rule for port 10051.

iptables -A OUTPUT -p tcp -d ${ZABBIX} -m tcp --dport 10051 -j ACCEPT

On the Zabbix server itself, which happens to be running FirewallD, we simply added port 10051 to the /etc/firewalld/services/zabbix.xml file:

<?xml version="1.0" encoding="utf-8"?>
<service>
  <short>Zabbix</short>
  <description>Allow services for Zabbix server and agent</description>
  <port protocol="tcp" port="10050"/>
  <port protocol="tcp" port="10051"/>
</service>

But you are not done when you add this! Make sure you restart the firewall, which can be done by restarting the service, or, if you don't want a gap in coverage, you can "reload" the rules on a running firewall with:

# firewall-cmd --reload

which generates "success" if the firewall restarts properly with the newly added modifications.

5. Now it is time to go back and enable your items!

Go back to the Zabbix GUI, select Configuration-->Hosts, and choose the host(s) that you added your 2 new items for (IPtables-Input-Dropped, IPtables-Output-Dropped). Select Items on each one of these hosts, choose the item and click the Enabled checkbox.

6. Check for Data

Make sure you wait a bit, because the interval time we set is 1 minute. 

After a reasonable length of time (suggested 5 minutes), go into the Zabbix GUI and select Monitoring-->Latest Data. In the hosts field, put one of the hosts into the field to filter the items for that particular host.

Find the two items, which are probably at the last page (the end), in a section called "other", since manually added Items tend to be grouped in the "other" category. 

On the right hand side, you should see "History". When you click History, your log file entries show up in the GUI!

Friday, October 28, 2022

Moving a LVM file system to a new disk in Linux

I had to dive back into Linux disk partitioning, file systems, and volumes when I got an alert from Zabbix that a cluster of 3 VMs was running out of space. As the alert said disk usage was greater than 88 percent, I grew concerned and took a look.

In the labs, we had 3 x CentOS7 Virtual Machines, each deployed with a 200G VMDK file.  But inside the VM, in the Linux OS, there were logical volumes (centos-root, centos-swap, centos-home) that were mounted as XFS file systems on a 30G partition. There was no separate volume for /var (centos-var). And /var was the main culprit of the disk space usage. 

The decision was made to put /var on a separate disk as a good practice, because the var file system was used to store large virtual machine images.

The following steps were taken to move the /var file system to the new disk:

1. Add new Disk in vCenter to VM - create new VMDK file (100G in this particular case)

2. If the disk is seen, a /dev/sdb device will be present in the Linux OS of the virtual machine. We need to create a partition on it (/dev/sdb1).
 
# fdisk /dev/sdb

n is the option to create a new partition, then p to select primary, then a series of questions that don't matter for this case (partition number, first and last sector) - just accept the defaults.
This creates a Linux primary partition; you then need the t command to change the partition type to 8e (Linux LVM).
Finally, w writes everything to the disk and exits fdisk.
# fdisk -l /dev/sdb

Will return something like this:

Device Boot Start End Sectors Size Id Type
/dev/sdb1 2048 20971519 20969472 10G 8e Linux LVM

3. Initialize the partition as an LVM physical volume
# pvcreate /dev/sdb1

NOTE: to delete a device from a physical volume, use vgreduce first, then pvremove!
vgreduce centos /dev/sdb1
pvremove /dev/sdb1

4. display volume group
# vgdisplay

--- Volume group ---
VG Name centos
[... more detail …]

5. display physical volumes in volume group
 
# pvdisplay -C --separator '  |  ' -o pv_name,vg_name

6. Extend the volume group so it can contain the new disk (partition)

# vgextend centos /dev/sdb1

You will get info like this:

VG Size 29.75 GiB
PE Size 4.00 MiB
Total PE 7617
Alloc PE / Size 5058 / 19.75 GiB
Free PE / Size 2559 / 10.00 GiB
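Those PE figures are easy to sanity-check, since each size is just the extent count times the 4 MiB PE size. A quick awk check of the numbers above:

```shell
# Each physical extent (PE) is 4 MiB, so size in GiB = PE count * 4 / 1024
awk 'BEGIN {
    pe_mib = 4
    printf "Total: %.2f GiB\n", 7617 * pe_mib / 1024   # matches VG Size 29.75 GiB
    printf "Alloc: %.2f GiB\n", 5058 * pe_mib / 1024   # ~ the reported 19.75 GiB
    printf "Free:  %.2f GiB\n", 2559 * pe_mib / 1024   # ~10 GiB: the new disk
}'
```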

7. Create new logical volume

NOTE: this command can be tricky. You either need to know extents and the semantics, or you can keep it simple, such as:
# lvcreate -n var -l 100%FREE centos

8. Create file system - NOTE that XFS is the preferred type, not ext4!
# mkfs -t xfs /dev/centos/var

9. Mount the new target var directory as newvar
# mkdir /mnt/newvar
# mount /dev/centos/var /mnt/newvar

10. Copy the files

NOTE: Lots of issues can occur during this, depending on what directory you are copying (i.e. var is notorious because of run and lock dirs).

I found this command to work:
# cp -apxv /var/* /mnt/newvar

Another one people seem to like is the rsync command, but the one below hung when I attempted it:
# rsync -avHPSAX /var/ /mnt/newvar/

11. You can do a diff, or try to, to see how sane the copy went:
# diff -r /var /mnt/newvar/

12. Update fstab for reboot
/dev/mapper/centos-var /var xfs defaults 0 0

Note that we used the logical volume centos-var here, not centos (the volume group). LVM calls the volumes centos-swap, centos-home, etc.

13. Move the old /var aside on the root file system
# mv /var /varprev

14. Create a new, empty /var mount point and mount it (this picks up the fstab entry)
# mkdir /var
# mount /var

15. Use the df command to verify all the mounts
# df -h | grep /dev/

16. Decide whether you want to remove the old var file system and reclaim that disk space.

NOTE: Do not do this until you’re damned sure the new one is working fine. I recommend rebooting the system, inspecting all services that need to be running, etc.  

The only thing left to consider now is that, after moving /var to a new 100G VMDK disk, we have a 200G boot/swap/root disk that is only using a small fraction of its space. Well, shrinking disks is even MORE daunting, and is not the topic of this post. But if I decide to reclaim some space, expect another post that documents how I tackled that effort (or attempted to).

For now, no more alerts about a root file system running out of space is good news, and this VM can now run peacefully for quite a while.

Tuesday, June 7, 2022

VMWare Network Debugging - Trex Load Generation and Ring Buffer Overflow

We began running Trex Traffic Generator testing, sending load to a couple of virtual machines running on ESXi vSphere-managed hypervisors, and ran into some major problems.

First, the Trex Traffic Generator:

  • Cent7 OS virtual machine
  • 3 ports
    • eth0 used for ssh connectivity and to run Trex and Trex Console (with screen utility)
    • eth1 for sending traffic (Trex will put this port into DPDK-mode so OS cannot see it)
    • eth2 for sending traffic (Trex will put this port into DPDK-mode so OS cannot see it)
  • 4 cores 
    • the VM actually has 6, but two are used for running OS and Trex Admin
    • Traffic Tests utilize 4 cores

Next, the Device(s) Under Test (DUT):

  1. Juniper vSRX which is a router VM (based on JUNOS but Berkeley Unix under the hood?)
  2. Standard CentOS7 Virtual Machine
     

We ran the stateless imix test, at 20% and 100% line utilization.

We noticed that the Trex VM was using 80-90% core usage in the test (Trex Stats from console), and was using 20-25% line utilization, sending 4Gbps per port (8Gbps total) to the DUT virtual machines.

On the receiving side, the router was only processing about 1/4 to 1/6 of the packets sent by Trex.  The Cent7 VM, also, could not receive more than about 3.5Gbps maximum.

So what is happening? This led us to a Deep Dive, into the VMWare Statistics.

By logging into the ESXi host that the receiving VM was running on, we could first find out which virtual switch and port the VM interface was assigned to, by running:

# net-stats -l

This produces a list, like this:

PortNum          Type SubType SwitchName       MACAddress         ClientName
50331650            4       0 DvsPortset-0     40:a6:b7:51:18:60  vmnic4
50331652            4       0 DvsPortset-0     40:a6:b7:51:1e:fc  vmnic2
50331654            3       0 DvsPortset-0     40:a6:b7:51:1e:fc  vmk0
50331655            3       0 DvsPortset-0     00:50:56:63:75:bd  vmk1
50331663            5       9 DvsPortset-0     00:50:56:8a:af:c1  P6NPNFVNDPKVMA.eth1
50331664            5       9 DvsPortset-0     00:50:56:8a:cc:74  P6NPNFVNDPKVMA.eth2
50331669            5       9 DvsPortset-0     00:50:56:8a:e3:df  P6NPNFVNRIV0009.eth0
67108866            4       0 DvsPortset-1     40:a6:b7:51:1e:fd  vmnic3
67108868            4       0 DvsPortset-1     40:a6:b7:51:18:61  vmnic5
67108870            3       0 DvsPortset-1     00:50:56:67:c5:b4  vmk10
67108871            3       0 DvsPortset-1     00:50:56:65:2d:92  vmk11
67108873            3       0 DvsPortset-1     00:50:56:6d:ce:0b  vmk50
67108884            5       9 DvsPortset-1     00:50:56:8a:80:3c  P6NPNFVNDPKVMA.eth0

A couple of nifty commands will show you the statistics:
# vsish -e get /net/portsets/DvsPortset-0/ports/50331669/clientStats
port client stats {
   pktsTxOK:115
   bytesTxOK:5582
   droppedTx:0
   pktsTsoTxOK:0
   bytesTsoTxOK:0
   droppedTsoTx:0
   pktsSwTsoTx:0
   droppedSwTsoTx:0
   pktsZerocopyTxOK:0
   droppedTxExceedMTU:0
   pktsRxOK:6595337433
   bytesRxOK:2357816614826
   droppedRx:2934191332 <-- lots of dropped packets
   pktsSwTsoRx:0
   droppedSwTsoRx:0
   actions:0
   uplinkRxPkts:0
   clonedRxPkts:0
   pksBilled:0
   droppedRxDueToPageAbsent:0
   droppedTxDueToPageAbsent:0
}
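It is worth turning that droppedRx counter into a percentage, because the raw number hides the scale of the problem. A quick awk check using the counters above:

```shell
# droppedRx as a share of everything that arrived for this port
awk 'BEGIN {
    rx_ok   = 6595337433   # pktsRxOK from the clientStats dump
    dropped = 2934191332   # droppedRx from the clientStats dump
    printf "%.1f%% of received packets dropped\n", 100 * dropped / (rx_ok + dropped)
}'
# -> 30.8% of received packets dropped
```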

# vsish -e get /net/portsets/DvsPortset-0/ports/50331669/vmxnet3/rxSummary
stats of a vmxnet3 vNIC rx queue {
   LRO pkts rx ok:0
   LRO bytes rx ok:0
   pkts rx ok:54707478
   bytes rx ok:19544123192
   unicast pkts rx ok:54707448
   unicast bytes rx ok:19544121392
   multicast pkts rx ok:0
   multicast bytes rx ok:0
   broadcast pkts rx ok:30
   broadcast bytes rx ok:1800
   running out of buffers:9325862
   pkts receive error:0
   1st ring size:4096 <-- this is a very large ring buffer size!
   2nd ring size:256
   # of times the 1st ring is full:9325862 <-- WHY packets are being dropped
   # of times the 2nd ring is full:0
   fail to map a rx buffer:0
   request to page in a buffer:0
   # of times rx queue is stopped:0
   failed when copying into the guest buffer:0
   # of pkts dropped due to large hdrs:0
   # of pkts dropped due to max number of SG limits:0
   pkts rx via data ring ok:0
   bytes rx via data ring ok:0
   Whether rx burst queuing is enabled:0
   current backend burst queue length:0
   maximum backend burst queue length so far:0
   aggregate number of times packets are requeued:0
   aggregate number of times packets are dropped by PktAgingList:0
   # of pkts dropped due to large inner (encap) hdrs:0
   number of times packets are dropped by burst queue:0
   number of packets delivered by burst queue:0
   number of packets dropped by packet steering:0
   number of packets dropped due to pkt length exceeds vNic mtu:0 <-- NOT the issue!
}
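The detail that matters in that dump: "running out of buffers" and "# of times the 1st ring is full" are the same number, 9325862, which is what points at ring exhaustion rather than MTU or anything else. A small sketch that pulls both counters from a captured copy of the dump and prints them side by side:

```shell
# Captured lines from the vsish rxSummary dump above
cat > /tmp/rxsummary.txt <<'EOF'
   running out of buffers:9325862
   # of times the 1st ring is full:9325862
   # of times the 2nd ring is full:0
EOF

# Pull both counters; when they match, every buffer shortage filled the 1st ring
buffers=$(awk -F: '/running out of buffers/ {print $2}' /tmp/rxsummary.txt)
ringfull=$(awk -F: '/1st ring is full/ {print $2}' /tmp/rxsummary.txt)
echo "out-of-buffers=$buffers  first-ring-full=$ringfull"
# -> out-of-buffers=9325862  first-ring-full=9325862
```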

We also noticed that this VM had one Rx queue per vCPU (no additional settings were made to this specific Cent7 VM):

# vsish -e ls /net/portsets/DvsPortset-0/ports/50331669/vmxnet3/rxqueues
0/
1/
2/
3/
4/
5/
6/
7/

Each of the queues, can be dumped individually, to check Ring Buffer size (we did this and they were all 4096):

# vsish -e get /net/portsets/DvsPortset-0/ports/50331669/vmxnet3/rxqueues/1/status
status of a vmxnet3 vNIC rx queue {
   intr index:1
   stopped:0
   error code:0
   ring #1 size:4096 <-- if you use ethtool -G eth0 rx 4096 inside the VM it updates ALL queues
   ring #2 size:256
   data ring size:0
   next2Use in ring0:33
   next2Use in ring1:0
   next2Write:1569
}

# vsish -e get /net/portsets/DvsPortset-0/ports/50331669/vmxnet3/rxqueues/7/status
status of a vmxnet3 vNIC rx queue {
   intr index:7
   stopped:0
   error code:0
   ring #1 size:4096 <-- if you use ethtool -G eth0 rx 4096 inside the VM it updates ALL queues
   ring #2 size:256
   data ring size:0
   next2Use in ring0:1923
   next2Use in ring1:0
   next2Write:3458
}

So, that is where we are. We see the problem. Now, to fix it - that might be a separate post altogether.


Some additional knowledgebase sources of information on troubleshooting in VMWare environments:

  • MTU Problem

            https://kb.vmware.com/s/article/75213

  • Ring Buffer Problem

            https://kb.vmware.com/s/article/2039495

            https://vswitchzero.com/2017/09/26/vmxnet3-rx-ring-buffer-exhaustion-and-packet-loss/

Tuesday, May 3, 2022

T-Rex Traffic Generator - Stateless vs Stateful

I am at the beginning of learning the T-Rex Traffic Generator. Cisco developed it initially, but it is now an open source traffic generator. As with all traffic generators, there is a learning curve.

The first major question I had, was the modes that T-Rex can work in:

  • Stateless (STL)
  • Stateful (STF)
  • Advanced Stateful (ASTF)

There are two T-Rex doc pages that discuss these distinctions, but they are not written from a comparative perspective. I will list those links here.

Trex Website: Trex Stateless 

Trex Website: Trex Stateful

While those pages have good information, it was a discussion on Reddit that I found most useful:

Reddit Discussion: STF vs. STL vs. ASTF

In the event that this Reddit thread becomes archived, I will (re) post that discussion here:

----------------------------------------------------------------------------------------------------------------------------

Stateless STL - there is no IP stack so it can't communicate with another standard IP node. The framed packets are pre-built and just pumped out the NIC. Because there is no normal dynamic protocol stack STL mode is run between a TRex NIC pair where they just pass the framed packets between each other and track statistics.

Stateful [A]STF - there is an actual TCP stack running with some L7 support so the stream can communicate to a non t-rex node; or through a stateful firewall with NAT or load balancer etc.

More info and a quick comparison table is here - https://trex-tgn.cisco.com/trex/doc/trex_stateless.html#_stateful_vs_stateless

There is a fairly active community at https://groups.google.com/g/trex-tgn The developers are usually very responsive and will patch bugs usually within a day or two.

t-rex is an engineering tool that seems to be run by the developers and engineers so the documentation can be a little frustrating and the learning curve can be steep. It is however a flexible, powerful and extremely cost effective tool when compared to commercial equivalents.

Then making use of the API combined with your imagination you can also build things other than just stress testing hardware.

---------------------------------------------------------------------------------------------------------------------------- 


Thursday, April 14, 2022

IP MAC Discovery on NSX-T

We had a deployment where two customer VMs were deployed as an Active Standby cluster. And the failover wasn't working when they tested it.

I had already deployed a fully working pair of Active-Standby Virtual Machines using KeepaliveD, so I knew that VRRP worked. Now, I am not sure that the customer is using VRRP per se, but the concept of Active Standby failover remains a constant whether both of us were using a strict RFC-compliant VRRP or not. 

So what was the difference between these customer VMs and our VMs?

Well, the difference was that I was running my VMs on VLAN-backed network segments that were jacked into (legacy) vCenter / ESXi Distributed Port Groups. The customer's VMs, were jacked into NSX-T virtual switches (overlay segments).

So after re-verifying my VRRP failover (which worked flawlessly in both multicast and unicast peering configurations), the problem seemed to be traced back to NSX-T.

Was it Mac Spoofing? Was it a Firewall? NSX-T does run an Overlay Firewall! And these Firewalls are at the segment level, but also the Transport Zone (Tier 1 router) level.  Sure enough, we realized that the Tier 1 Firewall was dropping packets on failover attempts.

After much testing, it was concluded that it was related to TOFU on the IP Discovery Switching Profile.

From this VMWare link, we get some insight on this:

Understanding IP Discovery Switching Profile

By default, the discovery methods ARP snooping and ND snooping operate in a mode called trust on first use (TOFU). In TOFU mode, when an address is discovered and added to the realized bindings list, that binding remains in the realized list forever. TOFU applies to the first 'n' unique <IP, MAC, VLAN> bindings discovered using ARP/ND snooping, where 'n' is the binding limit that you can configure. You can disable TOFU for ARP/ND snooping. The methods will then operate in trust on every use (TOEU) mode. In TOEU mode, when an address is discovered, it is added to the realized bindings list and when it is deleted or expired, it is removed from the realized bindings list. DHCP snooping and VM Tools always operate in TOEU mode.

So guess what? After disabling this profile, and effectively disabling TOFU mode, TOEU mode kicked in and lo and behold, the customer's failover started working. 

Tuesday, April 12, 2022

DPDK Testing using TestPMD on a VMWare Virtual Machine

 

Testing and verifying DPDK is NOT easy. And it is even more challenging in VM environments.

After investing in VMWare hypervisors that supposedly run DPDK, we wanted to test and verify that a) it worked, and b) the performance was as advertised.

Below is a list of steps we took to get the host and a DPDK-enabled VM ready:

  • Hypervisor(s)
    • Enabled the ixgben_ens drivers on the host. There are some ESXi CLI commands you can run to ensure that these are loaded and working.
  • VM Settings
    • VMXNET3 adaptor type on the VM
    • Latency Sensitivity = High (sched.cpu.latencySensitivity=TRUE)
    • Hugepages enabled in the VM (sched.mem.lpage.enable1GPage=TRUE)
    • Reserve all Guest Memory
    • Enable Multiple Cores for High I/O Workloads (ethernetX.ctxPerDev="1")
    • CPU Reservation
    • NUMA Affinity (numaNodeAffinity=X)

After this, I launched the VM. I was smart enough to launch the VM with 3 NICs on it.

  1. eth0 - used as a management port, for ssh and such.
  2. eth1 - this one to be used for DPDK testing
  3. eth2 - this one to be used for DPDK testing

Launching a VM (i.e. a RHEL Linux VM) with these settings, does NOT mean that you are ready for DPDK!! You still need a DPDK-compiled application on your OS. DPDK applications need to use DPDK-enabled NIC drivers on the VM, and on a Linux VM, these drivers are typically run as kernel modules. There are several different types and kinds of DPDK drivers (kernel modules), such as vfio, uio-pci-generic, igb_uio, et al.

To prepare your VM for testing, we decided to install DPDK, and then run the TestPMD application.

Installing DPDK

To get DPDK, you can go to dpdk.org, and download the drivers, which comes as a tar.gz file that can be unpacked. Or, there is a github site that you can use the clone the directories and files.

# git clone http://dpdk.org/git/dpdk

It is important to read the instructions when building DPDK, because the old-style "configure", "make", "make install" process has been replaced by fancier build tools, meson and ninja, which you need to install. I chose to install by going to the top of the directory tree and typing:

# meson -Dexamples=all build

This does not actually compile the code. It sets the table for you to use Ninja to build the code. So the next step was to type:

# ninja

Followed by:

# ninja install

The "ninja install" puts a resultant set of DPDK executables (some ELF, some Python), in /usr/local/bin directory (maybe installs some stuff in other places too).

Right away, I hit a snag. When I tried to run dpdk_setup.py to bind the VM's interfaces, the kernel module igb_uio.ko was nowhere to be found.

I was completely at a loss about this, until I realized that some other DPDK packages (test load generators) compile DPDK and the igb_uio.ko drivers themselves, either by including them outright, or by copying the sources into their build process. Trex, for example, builds the drivers. So does DTS (the DPDK Testing Suite). After deciding to git clone the DTS package, I stumbled upon some documentation in it: in /opt/github/dts/doc/dts_gsg/usr_guide there is a file called igb_uio.rst which describes how to compile the igb_uio.ko driver for use with DTS. This was the missing link. The section up front explains that the drivers have been moved into a different github repository - they are now separated from DPDK!

Get Source Code - note: assumption is that you are doing this in /opt directory.
---------------

Get igb_uio::

   git clone http://dpdk.org/git/dpdk-kmods
   git clone git://dpdk.org/dpdk-kmods


Get DPDK::

   git clone git://dpdk.org/dpdk
   git clone http://dpdk.org/git/dpdk

The author of this igb_uio.rst file described the process that can be used to fuse DPDK and the drivers back together into a single build - the way it used to be. How convenient. Here is how that is done.

Integrate igb_uio into DPDK
---------------------------

Assume you have cloned the dpdk and dpdk-kmods source code
in opt/dpdk and opt/dpdk-kmods.

Step 1
# Copy dpdk-kmods/linux/igb_uio/ to dpdk/kernel/linux/:

    [root@dts linux]# cp -r /opt/dpdk-kmods/linux/igb_uio /opt/dpdk/kernel/linux/

you should see igb_uio in your output:

    [root@dts linux]# ls /opt/dpdk/kernel/linux/
    igb_uio  kni  meson.build

Step 2:
# enable igb_uio build in meson:

since we have copied the directory over to /opt/dpdk, we will edit the meson.build there.

*   add igb_uio in /opt/dpdk/kernel/linux/meson.build subdirs as below:

     subdirs = ['kni', 'igb_uio']

NOTE: this is an important step not to miss because it will not build if you don't do this.

Step 3:
*   create a file of meson.build in /opt/dpdk/kernel/linux/igb_uio/ as below:

     # SPDX-License-Identifier: BSD-3-Clause
     # Copyright(c) 2017 Intel Corporation

     mkfile = custom_target('igb_uio_makefile',
             output: 'Makefile',
             command: ['touch', '@OUTPUT@'])

     custom_target('igb_uio',
             input: ['igb_uio.c', 'Kbuild'],
             output: 'igb_uio.ko',
             command: ['make', '-C', kernel_dir + '/build',
                     'M=' + meson.current_build_dir(),
                     'src=' + meson.current_source_dir(),
                     'EXTRA_CFLAGS=-I' + meson.current_source_dir() +
                             '/../../../lib/librte_eal/include',
                     'modules'],
             depends: mkfile,
             install: true,
             install_dir: kernel_dir + '/extra/dpdk',
             build_by_default: get_option('enable_kmods'))

How wonderful. To recap, here is what we did:

  1. copy the source files from dpdk-kmods into the proper directory of dpdk
  2. snap in the proper meson build file (which the author graciously provides)
  3. uninstall (previous build, assuming you built DPDK before doing all of this) 
  4. rebuild
  5. reinstall

Step 3:

# cd /opt/dpdk/build

# ninja uninstall

Step 4:
# ninja

Step 5:

# ninja install

A quick find command shows that the kernel module was built.

[root@acdcchndnfvdpk0001 dpdk]# find . -print | grep ko
./build/drivers/net/octeontx/base/libocteontx_base.a.p/octeontx_pkovf.c.o
./build/lib/librte_table.a.p/table_rte_table_hash_cuckoo.c.o
./build/lib/librte_hash.a.p/hash_rte_cuckoo_hash.c.o
./kernel/linux/igb_uio/igb_uio.ko
./kernel/linux/igb_uio/.igb_uio.ko.cmd
./drivers/net/octeontx/base/octeontx_pkovf.c
./drivers/net/octeontx/base/octeontx_pkovf.h
./lib/hash/rte_cuckoo_hash.h
./lib/hash/rte_cuckoo_hash.c
./lib/table/rte_table_hash_cuckoo.h
./lib/table/rte_table_hash_cuckoo.c

Now, we have something we can use to bind our adaptors to the drivers!!! 

You can bind the adaptors to the drivers using a couple of different methods. You can use the utility supplied by DPDK (dpdk-devbind.py), or a nifty Linux utility called driverctl, which I prefer (driverctl typically needs to be installed with a package manager, as it generally is not part of a default installation).

A script I use to do the binding looks like this:

# cat bind-pci.sh
#!/bin/bash

# Show candidate NICs along with their PCI addresses
lshw -class network -businfo | grep pci

while :
do
   echo "Linux Interface to override (e.g. p1p1, p1p2, p1p3, p1p4):"
   read iface
   if [ "${iface}" == "skip" ]; then
      break
   fi
   lshw -class network -businfo | grep pci | grep "${iface}"
   if [ $? -eq 0 ]; then
      # Extract the PCI address (the pci@... field) for the chosen interface
      pci=$(lshw -class network -businfo | grep pci | grep "${iface}" | awk '{printf $1}' | cut -f2 -d"@")
      echo "We will override the kernel driver with igb_uio for PCI address: ${pci}"
      driverctl set-override "${pci}" igb_uio
      break
   fi
done

When you run this script, you can check to see if the binding was successful by running a DPDK command:

# python3 /usr/local/bin/dpdk-devbind.py --status

And this command will show you whether the binding worked or not.

Network devices using DPDK-compatible driver
============================================
0000:0b:00.0 'VMXNET3 Ethernet Controller 07b0' drv=igb_uio unused=vmxnet3
0000:13:00.0 'VMXNET3 Ethernet Controller 07b0' drv=igb_uio unused=vmxnet3


Network devices using kernel driver
===================================
0000:03:00.0 'VMXNET3 Ethernet Controller 07b0' if=eth0 drv=vmxnet3 unused=igb_uio *Active*

NOTE: To remove a driver binding, "driverctl unset-override ${pci address}" would be used. In which case, the driver will now become visible to the Linux OS in the Virtual Machine again.

So we now have one adaptor that the Linux networking kernel sees (eth0), while the two adaptors Linux saw prior to the binding (eth1 and eth2) have been "reassigned" to DPDK. The OS no longer sees them at all.

If we run an ifconfig, or an "ip a" command to see the Linux network interfaces in the VM, this is what it now looks like.

# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:b7:83:1a brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.10/24 brd 192.168.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:feb7:831a/64 scope link
       valid_lft forever preferred_lft forever

NOTE: No eth1 or eth2 shows up in Linux anymore, as they have been handed over to DPDK, which bypasses the Linux Kernel entirely.

Okay, now we have adaptors set up for DPDK. Now what? In our next step, we will do some simple verification with TestPMD.

Friday, March 4, 2022

ESXi is NOT Linux

ESXi is not built upon the Linux kernel; it uses VMware's own proprietary kernel (the VMkernel) and software, and it lacks most of the applications and components commonly found in Linux distributions.

Because ESXi uses "-ix" style commands (Unix, Linux, POSIX), it "looks and smells" like Linux. In that respect it is similar to Cygwin, the package one can run on a Windows system to get a Linux-like terminal and command-line interpreter. ESXi does not use Cygwin, however; it runs something called BusyBox.

BusyBox is used on a lot of small-form-factor home networking gear. pfSense, for example, runs on FreeBSD. But many small routers (Ubiquiti EdgeMax comes to mind) use different chipsets and different OS kernels, and then use BusyBox to abstract the kernel away from users by providing a common interface - meaning users don't need to learn a whole slew of new OS commands.

 ESXi has a LOT of things that Linux does NOT have:

1. File systems - VMFS6, for example, is the newest revision of VMFS.

2. Process Scheduler - and algorithms

3. Kernel hooks that tools like esxtop use (think system activity reporting in Unix and Linux) 

 

This article (the source for this post) discusses some nice facts comparing ESXi to Linux:

ESXi-is-not-based-on-Linux

I learned some interesting things from this article, such as:

ESXi even uses the same binary format for executables (ELF) than Linux does, so it is really not a big surprise anymore that you can run some Linux binaries in an ESXi shell - provided that they are statically linked or only use libraries that are also available in ESXi! (I exploited this "feature" when describing how to run HP's hpacucli tool in ESXi and when building the ProFTPD package for ESXi).

...You cannot use binary Linux driver modules in ESXi. Lots of Linux device drivers can be adapted to ESXi though by modifying their source code and compiling them specifically for ESXi. That means that the VMkernel of ESXi implements a sub-set of the Linux kernel's driver interfaces, but also extends and adapts them to its own hypervisor-specific needs.

In my opinion this was another very clever move of the VMware ESXi architects and developers, because it makes it relatively easy to port an already existing Linux driver of a hardware device to ESXi. So the partners that produce such devices do not need to develop ESXi drivers from scratch. And it also enables non-commercial community developers to write device drivers for devices that are not supported by ESXi out-of-the-box!

A PDF describing the ESXi architecture can be downloaded here:

 https://www.vmware.com/techpapers/2007/architecture-of-vmware-esxi-1009.html

Tuesday, March 1, 2022

VMware Clustered File Systems - VMFS5 vs VMFS6

 

A nice table describes the differences between VMware's VMFS5 and the newer VMFS6.

Source: http://www.vmwarearena.com/difference-between-vmfs-5-vmfs-6/


For the difference between 512n and 512e: 512n drives use native 512-byte sectors, while 512e drives have 4K physical sectors exposed to the host as emulated 512-byte logical sectors.


VMFSsparse:

VMFSsparse is a virtual disk format used when a VM snapshot is taken or when linked clones are created off the VM. VMFSsparse is implemented on top of VMFS and I/Os issued to a snapshot VM are processed by the VMFSsparse layer. VMFSsparse is essentially a redo-log that grows from empty (immediately after a VM snapshot is taken) to the size of its base VMDK (when the entire VMDK is re-written with new data after the VM snapshotting). This redo-log is just another file in the VMFS namespace and upon snapshot creation the base VMDK attached to the VM is changed to the newly created sparse VMDK.

SEsparse (space efficient):

SEsparse (space efficient) is a new virtual disk format that is similar to VMFSsparse (redo-logs), with some enhancements and new functionality. One of the differences of SEsparse with respect to VMFSsparse is the block size: 4KB for SEsparse compared to 512 bytes for VMFSsparse. Most of the performance aspects of VMFSsparse discussed above—impact of I/O type, snapshot depth, physical location of data, base VMDK type, etc.—apply to the SEsparse format also.

Wednesday, February 9, 2022

Jinja2 Templating in Ansible

Lately, I have been playing around with Jinja2 Templating in Ansible. Let me explain the context of that.

The Morpheus CMP solution has an Automation Workflow engine that can be used to run Tasks, or whole sets of Workflows, in a variety of different technologies (scripting languages, Ansible, Chef, Puppet, et al).

To access variables about your Virtual Machine - say, after you launch it - you put tags into your script that reference those variables. The tag syntax differs subtly depending on whether it is a bash script, a Python script, or an Ansible playbook.

This post relates to Ansible specifically.

If you need an explicit, specific value, the tag in an Ansible playbook looks as follows:
    - name: "set fact hostname"
      set_fact:
        dnsrecord: |
          {{ morpheus["instance"]["hostname"] | trim }}

Really strange and confusing syntax - not to mention the pipe to a function called trim.

What language is this? At first I thought it was Groovy, or some kind of Groovy scripting language. Then I thought it was a form of JavaScript. Finally, after some web research, I learned that this markup is Jinja2, the templating language that Ansible uses (and trim is one of its filters).

First, I had to understand how Morpheus worked. I realized that I could use a Jinja2 tag to dump the entire object (in JSON) about a launched virtual machine (tons of data actually). Once I understood how Morpheus worked, and the JSON it generates, I was able to go to work snagging values that I needed in my scripts.
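The object dump described above can be sketched in plain Python with the jinja2 library, using a tiny mock of the Morpheus payload (the real one is far larger, and the dict contents here are hypothetical). Core Jinja2 spells the filter tojson; in Ansible/Morpheus you would typically use to_json instead.

```python
import json
from jinja2 import Template

# Hypothetical, minimal stand-in for the object Morpheus exposes to a Task.
morpheus = {"instance": {"hostname": "web01.example.com"}}

# Render the whole object as JSON - this reveals every key you can reference.
dump = Template("{{ morpheus | tojson }}").render(morpheus=morpheus)
print(dump)
```

Dumping the full object once, then reading the JSON to find the keys you need, is much faster than guessing at variable paths one tag at a time.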

But - eventually, my needs (use cases) became more complex. I needed to loop through all of the interfaces of a virtual machine! How do you do THAT??

Well, I discovered that to do more sophisticated logic structures (i.e. loops), the markup and tagging are different, and the distinctions are important. You can wind up pulling your hair out if you don't understand them.

Let's take an example where we loop through a VM's interfaces with Jinja2.

In this example, we loop through all interfaces of a virtual machine. But - we use an if statement to only grab the first interface's ip address. 

Note: For efficiency, we should break out as soon as we get that first IP address, but breaking out of loops is not straightforward in Jinja2, and there are only a handful of interfaces, so we let the loop run on, albeit wastefully.

    - name: set fact morpheusips
      set_fact:
         morpheusips: |
           {% for interface in morpheus['instance']['container']['server']['interfaces'] %}
             {% if loop.first %}
                {{ interface['ipAddress'] }}
             {% endif %}
           {% endfor %}

Note that a tag yielding an explicit, specific value has NO percent signs - just double curly braces.

But the "logic" - for loops, if statements, et al - DOES use percent signs in the tag!

This distinction is extremely important to understand!
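The two tag styles can be exercised outside of Ansible, since Jinja2 is a standalone Python library. This sketch (with a hypothetical, mocked-up morpheus structure) renders both an expression tag and a loop/if statement tag:

```python
from jinja2 import Template

# Hypothetical mock of the Morpheus-provided structure, for illustration only.
morpheus = {
    "instance": {"container": {"server": {"interfaces": [
        {"ipAddress": "192.168.2.10"},
        {"ipAddress": "10.0.0.7"},
    ]}}}
}

# {{ ... }} - an expression tag that yields a value directly.
expr = Template(
    "{{ morpheus['instance']['container']['server']"
    "['interfaces'][0]['ipAddress'] }}"
)

# {% ... %} - statement tags for logic (for loop, if), with {{ ... }} inside
# to emit the value. loop.first is true only on the first iteration.
stmt = Template(
    "{% for interface in morpheus['instance']['container']['server']['interfaces'] %}"
    "{% if loop.first %}{{ interface['ipAddress'] }}{% endif %}"
    "{% endfor %}"
)

print(expr.render(morpheus=morpheus))  # 192.168.2.10
print(stmt.render(morpheus=morpheus))  # 192.168.2.10
```

Both render the same first IP address; the difference is purely whether you are evaluating one expression or running control-flow logic around expressions.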

Now, the variable we get - morpheusips - is a string, which contains leading and trailing spaces, and has newlines - including an annoying newline at the end of the string which wreaked havoc when I needed to convert this string to an array (using the split function).  

I found myself having to write MORE code, to clean this up, and found more useful Jinja2 functions for doing this kind of string manipulation and conversion (to an array).

    - name: "Replace newlines and tabs with commas so we can split easier"
      set_fact:
         commasep: "{{ morpheusips | regex_replace('[\\r\\n\\t]+',',') | trim }}"


    - name: "Remove comma at the end of the string"
      set_fact:
         notrailcomma: "{{ commasep | regex_replace(',$','') | trim }}"

    - name: "convert the IP delimiter string to a list so we can iterate it"
      set_fact:
         morpheusiplst: "{{ notrailcomma.split(',') }}"

    - name: "Loop and Print variable out for morpheusiplst"
      ansible.builtin.debug:
         var: morpheusiplst
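The same cleanup chain can be traced in plain Python with the standard re module, using a hypothetical rendered string of the shape the loop above produces (indented lines plus a trailing newline):

```python
import re

# Hypothetical output of the Jinja2 loop: indented IPs, one per line.
morpheusips = "   192.168.2.10\n   10.0.0.7\n"

# Equivalent of: regex_replace('[\r\n\t]+', ',') | trim
commasep = re.sub(r"[\r\n\t]+", ",", morpheusips).strip()

# Equivalent of: regex_replace(',$', '') - drop the trailing comma.
notrailcomma = re.sub(r",$", "", commasep)

# Equivalent of: split(','), plus a per-element strip, since the loop's
# indentation spaces survive the newline/tab replacement.
morpheusiplst = [p.strip() for p in notrailcomma.split(",")]

print(morpheusiplst)  # ['192.168.2.10', '10.0.0.7']
```

The per-element strip is an extra safeguard not in the playbook above; without it, elements after the first can keep leading spaces from the template's indentation.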

I am NOT a guru or an SME on Jinja2 Templating. This is just a blog post sharing what I have been poking at as I get used to it and use it to solve some problems.

