
Tuesday, October 13, 2020

DPDK Hands-On - Part IX - Launching a VM with DPDK Ports from a bash script

When you launch a virtual machine on Linux, your hypervisor can run as pure software emulation (qemu), or with hardware-assisted virtualization (qemu-kvm). This post won't go into the details of the differences between those.

The options one can pass into qemu or qemu-kvm can be daunting.

But, if you want to launch a virtual machine just for the purpose of testing a DPDK adaptor, this can be done in a lightweight manner.

Remember that a vhostuser port connects to a socket owned by OpenVSwitch. This is the original, legacy manner in which DPDK connectivity was implemented between VMs and OpenVSwitch. 

The drawback to this, of course, is that if the switch were to die, the VMs would be stranded with no connectivity (a bad thing). 

This is why they came out with vhostuserclient ports, where OpenVSwitch connects to a socket that is managed by qemu when it starts the VM. This way, if the switch dies or is restarted, ports on the switch will reconnect to the VM-managed sockets. If a VM dies, it only removes its own sockets.

So we will cover the scenarios of launching a VM in both modes: vhostuser and vhostuserclient.

VhostUser

With a vhostuser port, OpenVSwitch creates the socket that qemu connects to. So this port needs to be created ahead of time on OpenVSwitch (see DPDK Hands-On Part VIII - creating DPDK ports).
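For reference, creating that socket ahead of time looks like this (a hedged example; it assumes the br-tun bridge and the vhostport1 socket name used in the script below - full details are in Part VIII):

# ovs-vsctl add-port br-tun vhostport1 -- set Interface vhostport1 type=dpdkvhostuser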

Below is a snippet of a bash script that cranks a VM after some parameters are defined.

if [ "$PORT_TYPE" == "vhost-user" ]; then
   # vhost-user - openvswitch binds to the socket and acts as server
   export VHOST_SOCK_DIR=/var/run/openvswitch

   export VHOST_PORT="${VHOST_SOCK_DIR}/vhostport1"

   # note: the vhostuser chardev does not have the server parameter.
   /usr/libexec/qemu-kvm -name $VM_NAME -cpu host -enable-kvm \
   -m $GUEST_MEM -drive file=$QCOW2_IMAGE --nographic -snapshot \
   -numa node,memdev=mem -mem-prealloc -smp sockets=1,cores=2 \
   -object memory-backend-file,id=mem,size=$GUEST_MEM,mem-path=/dev/hugepages,share=on \
   -chardev socket,id=char${MAC},path=${VHOST_PORT} \
   -netdev type=vhost-user,id=default,chardev=char${MAC},vhostforce,queues=2 \
   -device virtio-net-pci,mac=00:00:00:00:00:0${MAC},netdev=default,mrg_rxbuf=off,mq=on,vectors=6
   # append to the qemu-kvm line above when debugging:  1>qemu.kvm.vhostuser.out 2>&1

fi

So let's discuss these parameters:

- The NUMA node parameters tell qemu how to lay out the guest's CPUs and memory. A good link on this can be found at https://futurewei-cloud.github.io/ARM-Datacenter/qemu/how-to-configure-qemu-numa-nodes/ 

Now on my machine, I only have one NUMA socket, which equates to one NUMA node (NUMA node 0), with 4 cores on it. So with the -smp sockets=1,cores=2 directive, I can give 2 cores to my virtual machine.

- Mem Path

This parameter is specified to put the VM on hugepages. Any VM using DPDK ports should be backed by HugePages.

- Socket Parameters

The socket parameters are important. OpenVSwitch, if it creates the socket, will place the socket in /var/run/openvswitch by default. You have to create the port there manually, then tell the VM to connect to it specifically by the path and socket name. 

- Multi queuing

You will notice that mq=on, and 2 queues are specified. Having more than one queue for sending and receiving data unlocks a bottleneck. 

NOTE: I am not clear, actually, if by specifying 2 on "queues=2", that this means 2 x Tx queues AND 2 x Rx queues. I need to look into that and perhaps update this post (or if someone can comment on this, great).
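One check that might help, from inside the guest (assuming the virtio interface shows up as eth0), is to look at the NIC's channel counts:

# ethtool -l eth0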

The proof is in the pudding on this. When you launch the VM, you will need to check TWO places to ensure your networking initialized properly:

First, check /var/log/openvswitch/ovs-vswitchd.log
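A hedged shortcut for this is to grep for vhost connection messages right after the VM starts:

# grep -i vhost /var/log/openvswitch/ovs-vswitchd.log | tail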

Second, check the output of the VM itself, which means redirecting stdout and stderr to a file and analyzing it accordingly. The commented-out redirect at the end of the qemu-kvm command above shows this. But redirecting does affect starting and stopping the VM interactively, so I comment it out when running in normal mode.

VhostUserClient

The vhostuserclient port works very similarly to vhostuser, except that the socket is created and managed by qemu rather than by OpenVSwitch!


# vhost-user-client - qemu binds to socket and acts as server
export VHOST_SOCK_DIR="/var/lib/libvirt/qemu/vhost_sockets"

export VHOST_PORT="${VHOST_SOCK_DIR}/dpdkvhostclt1"

echo "Starting VM..."
/usr/libexec/qemu-kvm -name $VM_NAME -cpu host -enable-kvm \
  -m $GUEST_MEM -drive file=$QCOW2_IMAGE --nographic -snapshot \
  -numa node,memdev=mem -mem-prealloc -smp sockets=1,cores=2 \
  -object memory-backend-file,id=mem,size=$GUEST_MEM,mem-path=/dev/hugepages,share=on \
  -chardev socket,id=char1,path=${VHOST_PORT},server \
  -netdev type=vhost-user,id=default,chardev=char1,vhostforce,queues=2 \
  -device virtio-net-pci,mac=00:00:00:00:00:0${MAC},netdev=default,mrg_rxbuf=off,mq=on,vectors=6

Note the difference between the "chardev" directive above and the vhostuser "chardev" directive: the vhostuserclient version includes the additional "server" parameter! This is the KEY to specifying that qemu owns the socket, as opposed to OpenVSwitch.

One might wonder, how this works in practice. If you create a VM and it binds to a socket, how does OpenVSwitch know to connect to it? The answer is, OpenVSwitch won't - until you create the port on OpenVSwitch in vhostuserclient mode! At which point, OVS will know it is the client, and create the connection to the qemu process!

So a script that launches a VM with vhostuserclient interfaces, should probably create the OVS port first, then launch the VM. Otherwise, the VM won't have the connectivity it needs in a timely manner when it boots up (i.e. for DHCP to determine its IP Address and settings).
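A hedged sketch of that ordering (the launch script name is hypothetical; the add-port command is covered in Part VIII, further down this page):

# ovs-vsctl add-port br-tun dpdkvhostclt1 -- set Interface dpdkvhostclt1 type=dpdkvhostuserclient options:vhost-server-path=/var/lib/libvirt/qemu/vhost_sockets/dpdkvhostclt1
# ./launch_vhostuserclient_vm.sh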

Tuesday, September 29, 2020

DPDK Hands-On - Part VIII - Creating Virtual DPDK Ports on OpenVSwitch

Before we get into the procedure of adding virtual ports to the switch, it is important to understand the two types of DPDK virtual ports, and their differences.

In the earlier versions of DPDK+OVS, virtual interfaces were defined with a type called vhostuser. These interfaces would connect to OpenVSwitch. Meaning, from a socket perspective, that OpenVSwitch managed the socket. More technically, OVS binds to a socket in /var/run/openvswitch/<portname>  behaving as a server, while the VMs connect to this socket as a client.

There is a fundamental flaw in this design. A rather major one! Picture a situation where a dozen virtual machines launch with ports that are connected to the OVS, and the OVS is rebooted! All of those sockets are destroyed on the OVS, leaving all of the VMs "stranded".

To address this flaw, the socket model needed to be reversed. The virtual machine (i.e. qemu) needed to act as the server, and the switch needed to be the client! 

Hence, a new port type was created: vhostuserclient.

A more graphical and elaborative explanation of this can be found on this link:

https://software.intel.com/content/www/us/en/develop/articles/data-plane-development-kit-vhost-user-client-mode-with-open-vswitch.html 

Now because there are two sides to any socket connection, it makes sense that BOTH sides need to be configured properly with the proper port type for this communication to work.

This post deals with simply adding the right DPDK virtual port type (vhostuser or vhostuserclient) to the switch. But configuring the VM properly is also necessary, and will be covered in a follow-up post after this one is published.

I think the easiest way to explain is to simply show how these two port types are added, with some discussion along the way.

VhostUser

To add a vhostuser port, the following command can be run:

# ovs-vsctl add-port br-tun dpdkvhost1 -- set Interface dpdkvhost1 type=dpdkvhostuser ofport_request=2

It is as simple as adding a port to a bridge, giving it a name, and using the appropriate type for a legacy virtual DPDK port (dpdkvhostuser). We also give it port number 2 (in our earlier post, we added a physical DPDK PCI NIC to port 1, so we will assume port 1 is used by that).

Notice, that there is no socket information included in this. OpenVSwitch will create a socket, by default, in /var/run/openvswitch/<portname> once the vhostuser port is added to the switch. 

NOTE: The OVS socket location can be overridden but for simplicity we will assume default location. Another issue is socket permissions. When the VM launches under a different userid such as qemu, the socket will need to be writable by qemu!
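One hedged way to handle those permissions (the qemu user and group names are assumptions for your distribution's libvirt/qemu setup):

# chown qemu:qemu /var/run/openvswitch/dpdkvhost1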

The virtual machine, with a vhostuser interface defined on it, will need to be instructed which socket to connect to. Because it is the VM that has to know where to connect, the OVS configuration is actually somewhat simpler in this model: OVS just creates the socket wherever it is configured to, the default being /var/run/openvswitch.

So after adding a port to the bridge, we can do a quick show on our bridge to ensure it created properly.

# ovs-vsctl show 

 Bridge br-tun
        fail_mode: standalone
        Port "dpdkvhost1"
            Interface "dpdkvhost1"
                type: dpdkvhostuser

        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
                options: {dpdk-devargs="0000:01:00.0"}
        Port br-tun
            Interface br-tun
                type: internal

With this configuration, we can do a test between a physical interface and a virtual interface, or the virtual interface can attempt to reach something outside of the host (i.e. a ping test to a default gateway and/or an internet address). A virtual machine could also attempt a DHCP request to obtain its IP address for the segment it is on, if a DHCP server indeed exists.

If we wanted to test between two virtual machines, another such interface would need to be added:

# ovs-vsctl add-port br-tun dpdkvhost2 -- set Interface dpdkvhost2 type=dpdkvhostuser ofport_request=3

And, this would result in the following configuration:

  Bridge br-tun
        fail_mode: standalone
        Port "dpdkvhost2"
            Interface "dpdkvhost2"
                type: dpdkvhostuser
        Port "dpdkvhost1"
            Interface "dpdkvhost1"
                type: dpdkvhostuser

        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
                options: {dpdk-devargs="0000:01:00.0"}
        Port br-tun
            Interface br-tun
                type: internal

With this configuration, TWO virtual machines would connect to their respective OVS switch sockets:

VM1 connects to OVS socket for dpdkvhost1 --> /var/run/openvswitch/dpdkvhost1

VM2 connects to OVS socket for dpdkvhost2 --> /var/run/openvswitch/dpdkvhost2

Thanks to the PCI port we added earlier, these two VMs can "reach outside" to request an IP address, and they can ping each other on the same segment once they both have an IP address.

 dpdkvhost1     dpdkvhost2
      =|==============|=
      OVS Bridge (br-tun)
      ========|========
            dpdk0
              |
      Upstream Router

VhostUserClient

This configuration looks similar to the vhostuser configuration, but with a subtle difference. In this case, the VM is the server in the client-server socket model, so the OVS port, as a client, needs to know where the socket is in order to connect to it!

# ovs-vsctl add-port br-tun dpdkvhostclt1 -- set Interface dpdkvhostclt1 type=dpdkvhostuserclient "options:vhost-server-path=/var/lib/libvirt/qemu/vhost_sockets/dpdkvhostclt1" ofport_request=4

In this directive, the only thing that changes is the addition of the parameter telling OVS where the socket is to connect to, and of course the type of port needs to be set to dpdkvhostuserclient (instead of the vhostuser).

And, if we run our ovs-vsctl show command, we will see that the port looks similar to the vhostuser ports, except for two differences:

  • the type is now vhostuserclient, rather than vhostuser
  • the option parameter which instructs OVS (the socket client) where to connect to.

Bridge br-tun
        fail_mode: standalone
        Port "dpdkvhostclt1"
            Interface "dpdkvhostclt1"
                type: dpdkvhostuserclient
                options: {vhost-server-path="/var/lib/libvirt/qemu/vhost_sockets/dpdkvhostclt1"}

        Port "dpdkvhost2"
            Interface "dpdkvhost2"
                type: dpdkvhostuser
        Port "dpdkvhost1"
            Interface "dpdkvhost1"
                type: dpdkvhostuser
        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
                options: {dpdk-devargs="0000:01:00.0"}
        Port br-tun
            Interface br-tun
                type: internal

Setting up Flows

Just because we have added these ports does not necessarily mean traffic will flow after creation. The next step is to enable flows (rules) for traffic forwarding between these ports.

Setting up switch flows is an in-depth topic in and of itself, and one we won't cover in this post. There are advanced OpenVSwitch tutorials on Flow Programming (OpenFlow).

The first thing you can generally do, if you don't have special flow requirements that you're aware of, is to set the traffic processing to "normal", as seen below for the br-tun bridge/switch.

# ovs-ofctl add-flow br-tun actions=normal 

This should give normal L2/L3 packet processing. But, if you can't ping or your network forwarding behavior isn't as desired, you may need to program more detailed or sophisticated flows.

For simplicity, I can show you a couple of examples of how one could attempt to enable some traffic to flow between ports:

The following flows allow traffic that the host sends into a bridge (via its LOCAL port) to go out through the physical DPDK PCI interfaces: 

# ovs-ofctl add-flow br-tun in_port=LOCAL,actions=output:dpdk0

# ovs-ofctl add-flow br-prv in_port=LOCAL,actions=output:dpdk1

The following flows forward packets to the proper VM when they come into the host:

# ovs-ofctl add-flow br-tun ip,nw_dst=192.168.30.202,actions=output:dpdkvhost1

# ovs-ofctl add-flow br-prv ip,nw_dst=192.168.20.202,actions=output:dpdkvhost0

To debug the packet flows, you can dump them with the "dump-flows" command. There is a similarity between iptables rules (iptables -nvL) and openvswitch flows, and debugging is somewhat similar in that you can dump flows, and look for packet counts.

# ovs-ofctl dump-flows br-prv
 cookie=0xd2e1f3bff05fa3bf, duration=153844.320s, table=0, n_packets=0, n_bytes=0, priority=2,in_port="phy-br-prv" actions=drop
 cookie=0xd2e1f3bff05fa3bf, duration=153844.322s, table=0, n_packets=10224168, n_bytes=9510063469, priority=0 actions=NORMAL

In the example above, we have two flows on the bridge br-prv. And we do not see any packets being dropped. So, presumably, anything connected to this bridge should be able to communicate from a flow perspective.

After setting these kinds of flows, ping tests and traffic verification tests will need to be done. 

I refer to this as "port plumbing" and these rules indeed can get very advanced, sophisticated and complex - potentially. 

If you are launching a VM on Linux, via KVM (a script usually), or using Virsh Manager (which drives off of an xml file that describes the VM), you will need to set these "port plumbing" rules up manually, and you would probably start with the basic normal processing unless you want to do something sophisticated.

If you are using OpenStack, however, OpenStack does a lot of things automatically, and the things it does are influenced by your underlying OpenStack configuration (files). For example, if you are launching a DPDK VM on an OpenStack that is using OpenVSwitch, each compute node will be running a neutron-openvswitch-agent service. This service is actually a Ryu OpenFlow controller, and when you start it, it plumbs ports on behalf of OpenStack Neutron based on your Neutron configuration. So you may look at your flows with just OpenVSwitch running and see a smaller subset of flows than you would if the neutron-openvswitch-agent were running! I may get into some of this in a subsequent post, if time allows.

Wednesday, June 17, 2020

DPDK Hands-On - Part VII - Creating a Physical DPDK NIC Port on OpenVSwitch

First off, you can NOT add a DPDK port for a physical NIC onto OpenVSwitch, if the NIC is not using a DPDK driver! And that driver must be working!!!

Now, when I got started, I was using vfio as my DPDK driver. Why? Because, per the DPDK and DPDK+OVS websites, vfio is the newer driver and the one everyone is encouraged to use.

In order to use a poll mode driver (on Linux), the appropriate kernel module(s) need(s) to be loaded. Once this kernel module is loaded, you can use a couple of methods (at least) to tell Linux to use the vfio driver for your PCI NIC. The two tools I use, are:

  • driverctl - this ships with the OS (CentOS at least)
  • dpdk-devbind - to use this utility, dpdk-tools package needs to be installed, or if you have downloaded and compiled DPDK by source code you can use it that way.

Using driverctl, you can set your driver by "overriding" the default driver, as follows:

# driverctl set-override 0000:01:00.0 vfio-pci

or, if you are using uio_pci_generic, 

# driverctl set-override 0000:01:00.0 uio_pci_generic

NOTE: I had to use uio_pci_generic with my cards. The vfio driver was registering an error that I finally found in the openvswitch log.

I find the dpdk-devbind tool best for checking status, but you can also use driverctl for that. I will show both commands and their output. The driverctl output format is not as clean as that of the dpdk-devbind command.

# driverctl -v list-devices | grep -i net
0000:00:19.0 e1000e (Ethernet Connection I217-LM)
0000:01:00.0 uio_pci_generic [*] (82571EB/82571GB Gigabit Ethernet Controller D0/D1 (copper applications) (PRO/1000 PT Dual Port Server Adapter))
0000:01:00.1 uio_pci_generic [*] (82571EB/82571GB Gigabit Ethernet Controller D0/D1 (copper applications) (PRO/1000 PT Dual Port Server Adapter))
0000:03:00.0 e1000e (82571EB/82571GB Gigabit Ethernet Controller D0/D1 (copper applications) (PRO/1000 PT Dual Port Server Adapter))
0000:03:00.1 e1000e (82571EB/82571GB Gigabit Ethernet Controller D0/D1 (copper applications) (PRO/1000 PT Dual Port Server Adapter))

 # dpdk-devbind --status

Network devices using DPDK-compatible driver
============================================
0000:01:00.0 '82571EB Gigabit Ethernet Controller 105e' drv=uio_pci_generic unused=e1000e
0000:01:00.1 '82571EB Gigabit Ethernet Controller 105e' drv=uio_pci_generic unused=e1000e


Network devices using kernel driver
===================================
0000:00:19.0 'Ethernet Connection I217-LM 153a' if=em1 drv=e1000e unused=uio_pci_generic
0000:03:00.0 '82571EB Gigabit Ethernet Controller 105e' if=p1p1 drv=e1000e unused=uio_pci_generic
0000:03:00.1 '82571EB Gigabit Ethernet Controller 105e' if=p1p2 drv=e1000e unused=uio_pci_generic

After you load your driver for the NIC of choice, you need to check a few things to ensure that the driver is working properly. Otherwise, you might pull your hair out trying to debug problems, without realizing that the underlying driver is the culprit.

Next, we will add the port to OpenVSwitch, but before you do, there are two important things to make sure are done:
 
1. make sure the openvswitch you are running has been compiled for DPDK, and initialized.

The best way to do this is to dump the switch parameters, as shown below. The key settings to check on a properly compiled and configured DPDK OpenVSwitch are dpdk_initialized, dpdk_version, the dpdk entries under iface_types, and the dpdk-related keys in other_config.

# ovs-vsctl list open_vswitch
_uuid               : 2d46de50-e5b8-47be-84b4-a7e85ce29526
bridges             : [14049db7-07eb-4b00-89d6-b05201d2978e, 221b954d-c2dc-48b2-923b-86a87765ed7b, db7a8c3c-5a03-4063-a5e8-23686f319473]
cur_cfg             : 936
datapath_types      : [netdev, system]
db_version          : "7.16.1"
dpdk_initialized    : true
dpdk_version        : "DPDK 18.11.8"
external_ids        : {hostname=maschinen, rundir="/var/run/openvswitch", system-id="35b95ef5-fd71-491f-8623-5ccbbc1eca6b"}
iface_types         : [dpdk, dpdkr, dpdkvhostuser, dpdkvhostuserclient, erspan, geneve, gre, internal, "ip6erspan", "ip6gre", lisp, patch, stt, system, tap, vxlan]
manager_options     : [a4329e63-5b63-4e8b-8dac-0e9bc8492c28]
next_cfg            : 936
other_config        : {dpdk-init="true", dpdk-lcore-mask="0x2", dpdk-socket-limit="2048", dpdk-socket-mem="1024", pmd-cpu-mask="0xC"}
ovs_version         : "2.11.1"
ssl                 : []
statistics          : {}
system_type         : centos
system_version      : "7"

2. make sure that your specific bridge, is created for DPDK (netdev datapath rather than system)

Trying to add a PCI NIC that is using poll mode drivers (vfio or uio) to an OpenVSwitch bridge that does not have the datapath set correctly (to netdev) will result in an error when you add the port. This error will be reported both on an "ovs-vsctl show" command, and also reflected in the log file /var/log/openvswitch/ovs-vswitchd.log file.

This was one of the most common and confusing errors to debug, actually. In newer versions of OVS, they have added the datapath to the "ovs-vsctl show" command. My switch, however, predates this patch, so I have to examine my datapath another way:

# ovs-vsctl list bridge br-tun | grep datapath
datapath_id         : "0000001b21c57204"
datapath_type       : netdev
datapath_version    : "<built-in>"

If your datapath on the bridge is NOT netdev, you can change it with the following command:

# ovs-vsctl set bridge br-tun datapath_type=netdev

So if your switch looks good, and your bridge is set with the right netdev datapath type, adding your NIC should be successful with the following command:

# ovs-vsctl add-port br-tun dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-devargs=0000:01:00.0 ofport_request=1

Let me comment a bit on this command to shed some additional insight.

1. I always add my dpdk NICs - whether they are physical or virtual ports - with the name "dpdk" as a prefix as a matter of practice.

2.  A physical PCI NIC is going to use type dpdk. Virtual interfaces do NOT use that type; they use the vhostuser or vhostuserclient types. The distinction between these two is a subsequent discussion.

3. The dpdk-devargs is where you map the port to a specific PCI address. You need to be SURE you are using the correct one, and not inadvertently switching these. A lot of mistakes are made by using the wrong PCI addresses, especially in cases where there are multiple ports on a NIC card where one may be 0000:01:00.0 and another 0000:01:00.1!

4. The ofport_request is where you give your port a port number on the switch. If you set this to 1, and 1 is taken, you will get an error. But in general, I try to always make a physical NIC port 1. It is extremely rare to add more than one physical NIC to a bridge.

So after adding your NIC to the bridge, you can check the status of it with the ovs-vsctl show command:

# ovs-vsctl show

    Bridge br-tun
        fail_mode: standalone
        Port "dpdk0" --> you will see an error here if the port did not add correctly!
            Interface "dpdk0"
                type: dpdk
                options: {dpdk-devargs="0000:01:00.0"}

        Port br-tun
            Interface br-tun
                type: internal

Another useful command, is to dump your specific bridge, by port number. We can use the "ofctl" command, rather than "vsctl" command.

# ovs-ofctl show br-tun
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000001b21c57204
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(dpdk0): addr:00:1b:21:c5:72:04
     config:     0
     state:      0
     current:    1GB-FD AUTO_NEG
     speed: 1000 Mbps now, 0 Mbps max

 LOCAL(br-tun): addr:00:1b:21:c5:72:04 --> LOCAL port maps to the bridge port in ovs-vsctl command.
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0

And this concludes our post on adding a physical DPDK PCI NIC to an OpenVSwitch bridge.

DPDK Hands-On - Part VI - Configuring OpenVSwitch DPDK Parameters

In earlier posts, we talked about the fact that we needed to check our compatibility with NUMA, HugePages, and IOMMU. And when we decided we had that compatibility, we "enabled" these technologies by passing them into the Linux kernel through grub.

But - now that everything is enabled, OpenVSwitch needs to be configured with a number of parameters that, to the layman's eye, look quite scary and intimidating.  Fortunately, there is a website that elaborates on these parameters much better than I could, so I will list that here:    OVS-DPDK Parameters

The first thing to do is initialize OpenVSwitch for DPDK. Without this, OpenVSwitch is deaf, dumb and blind as to DPDK-related directives.

# ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true

Next, we will reserve memory, in Hugepages, for DPDK sockets. The commands below will reserve 1 Hugepage with an upper limit of 2 Hugepages (each Hugepage = 1G).

# ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024"
# ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-limit="2048"

NOTE: The Hugepages use can be checked a couple of ways. See prior post on Hugepages.

Next, we instruct OVS to place DPDK threads on specific cores the way we want to. We will do that by specifying a CPU mask for each of two settings. First, let's discuss the OVS directives (settings):

  • dpdk-lcore-mask - the cores where the non-datapath lcore threads (handler, revalidator, etc.) run
  • pmd-cpu-mask - the cores where the poll mode driver (datapath) threads run

The dpdk-lcore-mask is a core bitmask that is used during DPDK initialization and it is where the non-datapath OVS-DPDK threads such as handler and revalidator threads run.
 
The pmd-cpu-mask is a core bitmask that sets which cores are used by OVS-DPDK for datapath packet processing.

 
The masks in particular are quite esoteric, and need to be understood properly.

Calculating Mask Values

I found a pmd (poll mode driver) mask calculator out on GitHub, at the following link:

Another calculator I found, is at this link:

So let's take our box as an example. We have one Numa Node, with 4 cores. One thread per core. This makes things a bit simpler since we don't have to consider sibling pairs.

This mask will set the lcore threads to run on CPU core 1.
# ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x2
This mask will set the pmd threads to run on CPU cores 2 and 3.
# ovs-vsctl --no-wait set Open_vSwitch . other_config:pmd-cpu-mask="0xC"
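If you would rather compute a mask by hand, it is just a bitmask with one bit set per CPU core you want to use. A minimal bash sketch (the core list is just an example):

cores="2 3"
mask=0
for c in $cores; do mask=$(( mask | (1 << c) )); done
printf "0x%X\n" $mask     # prints 0xC for cores 2 and 3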

It will be EASY to see if your pmd cores are set correctly, because if you run top or htop, these cpus will be consumed 100% because the poll mode drivers consume those cores polling for packets!

So per the screen shot below, we can see that cores 2-3 are fully consumed. This is a classic example of why affinity rules matter! The OS scheduler will avoid placing processes on CPUs that are fully consumed, so with the pmd threads pinned to cores 2-3 and the lcore threads on core 1, that really only leaves core 0 (on this system anyway) for the OS and any VMs, unless you are willing to share cores 0-1 between the OS, the lcore threads, and other userspace processes.

NOTE: in htop, cores 2-3 are shown as 3-4 because htop starts with 1 and not with 0.



Monday, May 11, 2020

DPDK Hands-On - Part IV - Setting Up the DPDK Package


This post is where the fun really started.

I failed to mention, probably, in any of my earlier posts, that I was using CentOS7 as my operating system. Why? Because I am more familiar with the Red Hat Linux than I am with Ubuntu. Going forward, it will be difficult to ride multiple horses with Linux, because RHEL and Ubuntu (especially in 18.04) are deviating quite a bit on System Administration and Tools.

So - when you go to the dpdk.org website, all of the instructions are written in such a way that they assume that you will download and compile the packages. This sounded scary to me. I mean, not really, because I have compiled packages in the past, but I know that when you compile packages from scratch, you typically don't get conveniences like SystemD unit files to set up the service for easy start/stop/restart.  We are in 2020 now. Why can't I just use yum and install the damned package?

As it turns out, on CentOS7, there are yum repositories for DPDK. And, when you run "yum install dpdk", you can get a package that is labeled 18.11.2 as the release. After installing this package, I quickly figured out that there are no scripts, tools or utilities. So, having some general experience with CentOS 7 and its package convention, I looked for the complement packages "devel and tools", and indeed, located those and installed them with "yum install dpdk-tools" and "yum install dpdk-devel".
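For reference, the installs described above (package names as I found them on CentOS 7):

# yum install dpdk
# yum install dpdk-tools
# yum install dpdk-devel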

NOTE: I only really needed the dpdk-tools, as I was not doing any dataplane kit development at this point.

But, things were going smooth so far. Knowing that VFIO is the new driver, I decided to go ahead and load the kernel module for VFIO.

Now, as I was looking over a couple of guys' scripts for doing this (both of them were using DPDK documentation as a basis), I saw that one fellow was installing two packages. Since we were dealing with PCI (more on this later), and Kernel Modules, it made sense to install these:
  1. pciutils
  2. kernel-devel
After all, these are just tools and utilities and there should be no risk in installing these on your system.
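Again, hedged to the CentOS 7 package names as I understand them:

# yum install pciutils
# yum install kernel-devel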

Next, in his script, I saw him changing permissions on a file in /dev - and this really had me concerned. Why would someone be doing this? Is this a bug that he is fixing? 

Turns out, these are instructions on the dpdk.org website, which I found later after looking at his script.


#chmod a+x /dev/vfio
#chmod 0666 /dev/vfio/*

So basically, what this guy was doing was pretty much following the standard "cookbook" for binding NICs to the VFIO driver. VFIO, as mentioned in Part II, is the driver family that DPDK's poll mode drivers use to reach the hardware from userspace.

The link for binding NICs can be found at: https://dpdk-guide.gitlab.io/dpdk-guide/setup/binding.html

This documentation gives you two methods for binding your NICs.
  1. driverctl - which was a utility I had not heard of and had to install (e.g. yum install driverctl)
  2. a DPDK utility script, written in Python, called dpdk-devbind.py
The script I was following as a reference was using #1, driverctl, and I figured I would use that instead of the utility script.

NOTE: Later, I decided to start using the utility script and found it to be wonderful.

But first - before you bind your NIC, you need to load a kernel module.

I am familiar with kernel modules, so to load the module, you can use modprobe (or insmod, but modprobe is preferred). Again, this is via the DPDK website above on binding NICs.
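For example:

# modprobe vfio-pci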

If all goes well loading the vfio_pci kernel module, this module will in fact load 2 more kernel modules, and what I wound up with when I did "lsmod | grep vfio" was this:

# lsmod | grep vfio
vfio_pci               41412  2
vfio_iommu_type1       22440  1
vfio                   32657  6 vfio_iommu_type1,vfio_pci
irqbypass              13503  8 kvm,vfio_pci

So - there is that irqbypass - to disable interrupt processing. And, there is an IOMMU module that gets loaded, probably the result of having IOMMU enabled on the kernel command line!

It looks like DPDK is loaded nicely, and the kernel modules are in place. Now, it is time to bind the NICs to the driver.

Let's talk about that in Part V.

DPDK Hands-On - Part III - Huge Pages

The last post discussed IOMMU, which was something I had never heard of. So I had to research it.

Next, in the DPDK Getting Started Guide, http://doc.dpdk.org/spp/setup/getting_started.html, the following is stated about the requirement for HugePages.

Hugepages must be enabled for running DPDK with high performance. Hugepage support is required to reserve large amount size of pages, 2MB or 1GB per page, to less TLB (Translation Lookaside Buffers) and to reduce cache miss. Less TLB means that it reduce the time for translating virtual address to physical.

This SOUNDS like a requirement, but honestly, I am not sure if this is a requirement, or just an optimization. Will DPDK not function at all without HugePages? There is another guide that also discusses Huge Pages.  https://dpdk-guide.gitlab.io/dpdk-guide/setup/hugepages.html .

Why chance it? Let's go ahead and set up HugePages. But - how do you know if your system even supports HugePages? As it turns out, if the kernel has support for HugePages, you can check /proc/meminfo, which shows not only whether HugePages are supported, but also statistics regarding the size and use of your HugePages.
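For example:

# grep -i HugePages /proc/meminfo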

So - we have 16G of memory in this T1700, 4G of which are allocated to HugePages. Hugepages in Linux can be set by passing the parameters into the kernel, as shown below:

  1. default_hugepagesz=1G
  2. hugepagesz=1G
  3. hugepages=4
  4. transparent_hugepage=never 
NOTE: I am not fully abreast of transparent hugepages, but I see that a lot of people recommend turning this off, which is why I have it disabled above.
 
In the file /etc/default/grub, these can be added, and then the following command run to make it stick:
# grub2-mkconfig -o /boot/grub2/grub.cfg 

NOTE: This assumes Grub is being used as your bootloader.
 
And, just as with IOMMU kernel command line parameters, after a reboot you can verify that these parameters are set by running:

# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-1127.el7.x86_64 root=UUID=4102ab69-f71a-4dd0-a14e-8695aa230a0d ro rhgb quiet iommu=pt intel_iommu=on default_hugepagesz=1G hugepagesz=1G hugepages=4 transparent_hugepage=never LANG=en_US.UTF-8

After booting the system up with Hugepages, you can check the summary of Hugepages with the following command:

# grep -i HugePages_ /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:       4
HugePages_Free:        3
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB

When we run this command, we can see that out of 4 those, 3 are free. So why is one of them in use???

The answer to why 1G (1 Hugepage) is used has to do with the initialization of OpenVSwitch. The initialization of OpenVSwitch has the following directives:

ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024"
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-limit="2048"

This tells OpenVSwitch to grab 1 Hugepage at startup, with a limit of 2 Hugepages. 

NOTE: Calculating the number of Hugepages you need in OVS, is a topic in and of itself - the subject of a separate post.

Another nifty command, is the ability to see where your Hugepages are distributed, across your NUMA nodes!

# numactl --hardware
# echo ""
# numastat -cm | egrep 'Node|Huge' 
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 15954 MB
node 0 free: 1808 MB
node distances:
node 0
0: 10

                  Node 0   Total
AnonHugePages          0       0
HugePages_Total     8192    8192
HugePages_Free      7168    7168
HugePages_Surp         0       0

In this example above, we only have a single NUMA node with 4 cores on it. But had this been a 2 x NUMA Node system with 3 cores apiece, you would be able to see how Hugepages are allocated across your NUMA nodes.


DPDK Hands-On - Part II - Poll Mode Drivers and IOMMU


In our last post Hands On with DPDK - Part I we chose a box to try and install DPDK on.

This box, was a circa 2015 Dell T-1700. A bit long in the tooth (it is now 2020), and it is not and never was, a data center grade server.

And, looking forward, this will bite us. But will help us learn a LOT about DPDK due to the persistence and troubleshooting.

So - to get started, I did something rather unconventional. Rather than read all of the documentation (there is a LOT of documentation), I took a cursory look at the dpdk.org site (Getting Started), and then went looking for a couple of blogs where someone else tried to get DPDK working with OVS.

Poll Mode Drivers

Using DPDK requires using a special type of network interface card driver known as a poll mode driver. This means that the driver has to be available (custom compiled and installed with rpm, or perhaps pre-compiled and installed with package managers like yum). 

Poll Mode drivers continuously poll for packets, as opposed to using the classic interrupt-driven approach that the standard vendor drivers use. Using interrupts to process packets is considered less efficient than polling for packets. But - to poll for packets continuously is cpu intensive, so there is a trade-off! 

There are two poll mode drivers listed on the dpdk.org website:
https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html

  1. UIO (legacy)
    1. uio_pci_generic
    2. igb_uio
  2. VFIO (current recommended driver)

The DPDK website has this to say about the two driver families (UIO and VFIO). 

"VFIO is the new or next-gen poll mode driver, that is a more robust and secure driver in comparison to the UIO driver, relying on IOMMU protection". 

So perhaps it makes sense to discuss IOMMU, as it will need to be disabled for UIO drivers, and enabled for VFIO drivers.

 IOMMU

Covering IOMMU would be a blog series in its own right. So I will simply list the Wikipedia site on IOMMU.  Wikipedia IOMMU Link

What does IOMMU have to do with DPDK? DPDK has this to say in their up-front pre-requisites for DPDK.

"An input-output memory management unit (IOMMU) is required for safely driving DMA-capable hardware from userspace and because of that it is a prerequisite for using VFIO. Not all systems have one though, so you’ll need to check that the hardware supports it and that it is enabled in the BIOS settings (VT-d or Virtualization Technology for Directed I/O on Intel systems)"
 

So there you have it. It took getting down to the poll mode drivers, but IOMMU provides memory protection - specifically for the newer-generation VFIO drivers. Without this protection, one rogue NIC could affect the memory of all NICs, or jeopardize the memory of the system in general.

So - how do you enable IOMMU?

Well, first you need to make sure your system even supports IOMMU.

To do this, you can do one of two things (suggested: do both) - Linux system assumed here.
  1. Check and make sure there is a file called /sys/class/iommu 
  2. type (as root) dmesg | grep IOMMU
On #2, you should see something like this
IOMMU                                                          
[    0.000000] DMAR: IOMMU enabled
[    0.049734] DMAR-IR: IOAPIC id 8 under DRHD base  0xfbffc000 IOMMU 0
[    0.049735] DMAR-IR: IOAPIC id 9 under DRHD base  0xfbffc000 IOMMU 0
Now in addition to this, you will need to edit your kernel command line so that two IOMMU directives can be passed in:  iommu=pt intel_iommu=on
 
The typical way these directives are added is using the grub2 utility.

NOTE: Many people forget that once they add the parameters, they need to do a mkconfig to actually apply these parameters!!!
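A hedged example of that workflow (BIOS-boot grub.cfg path shown; UEFI systems use a different path):

# vi /etc/default/grub     <-- append iommu=pt intel_iommu=on to GRUB_CMDLINE_LINUX
# grub2-mkconfig -o /boot/grub2/grub.cfg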

After adding these kernel parameters, you can check your kernel command line by running the following command:

# cat /proc/cmdline

And you should see your iommu parameters showing up:

BOOT_IMAGE=/vmlinuz-3.10.0-1127.el7.x86_64 root=UUID=4102ab69-f71a-4dd0-a14e-8695aa230a0d ro rhgb quiet iommu=pt intel_iommu=on

Next Step: Part III - Huge Pages

Saturday, May 9, 2020

DPDK Hands-On - Part I - Getting Started


I decided to try and enable DPDK on my computer.

This computer is a Dell T1700 Precision, circa 2015, which is a very very nice little development workstation server.

The VERY FIRST thing anyone needs to do, with DPDK, is ensure that their server has supported NICs. It all starts with the NIC cards. You cannot do DPDK without DPDK-compatible NICs.

There is a link at the DPDK website, which shows the list of NICs that are (or should be, as it always comes down to the level of testing, right?) compatible with DPDK.
That website is: DPDK Supported NICs

This T-1700 has an onboard NIC, and two ancillary NIC cards that ARE listed as DPDK-compatible NICs. These NICs are listed as:
82571EB/82571GB Gigabit Ethernet Controller and are part of the Intel e1000e family of NICs.

I was excited that I could use this server without having to invest in and install new NIC cards!

Let's first start, with specs on the computer. First, our CPU specifications.

CPU:
# lscpu
Architecture:         x86_64
CPU op-mode(s):       32-bit, 64-bit
Byte Order:           Little Endian
CPU(s):               4
On-line CPU(s) list:  0-3
Thread(s) per core:   1
Core(s) per socket:   4
Socket(s):            1
NUMA node(s):         1
Vendor ID:            GenuineIntel
CPU family:           6
Model:                60
Model name:           Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz
Stepping:             3
CPU MHz:              1183.471
CPU max MHz:          3900.0000
CPU min MHz:          800.0000
BogoMIPS:             6983.91
Virtualization:       VT-x
L1d cache:            32K
L1i cache:            32K
L2 cache:             256K
L3 cache:             6144K
NUMA node0 CPU(s):    0-3
Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d

Let's take a look at our NUMA capabilities on this box. It says up above, we have one Numa Node. There is a utility called numactl on Linux, and we will run that with the "-H" option to get more information.

# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 16019 MB
node 0 free: 7554 MB
node distances:
node  0
 0: 10

From this, we see we have 1 Numa Node. Numa Nodes equate to CPU sockets. And since we have one CPU socket, we have one Numa Node. All 4 cores of the CPU are on this node (Node 0 per above). Having just one Numa Node is not an optimal scenario for DPDK testing, but as long we are NUMA-capable, we can proceed. 

Next, we will look at Memory.


Memory:
# lsmem --summary
Memory block size:      128M
Total online memory:     16G
Total offline memory:     0B

16G memory. Should be more than enough for this exercise.

So how to get started?

Obviously the right way, would be to sit and read reams of documentation from both DPDK and OpenVSwitch. But, what fun is that? Booooring. I am one of those people who like to start running and run my head into the wall.

So, I did some searching, and found a couple of engineers who had scripts that enabled DPDK. I decided to study these, pick them apart, and use them as a basis to get started. I saw a lot of stuff in these scripts that had me googling stuff - IOMMU, HugePages, CPU and Masking, PCI, Poll Mode Drivers, etc.

In order to fully comprehend what was needed to enable DPDK, I would have to familiarize myself with these concepts. Then, hopefully, I could tweak this script, or even write new scripts, and get DPDK working on my box. That's the strategy.

I did realize, as time went on, that the scripts were essentially referring back to the DPDK and OpenVSwitch websites, albeit at different points in time as the content on these sites changes release by release.

Monday, March 30, 2020

How to run OpenStack on a single server - using veth pair



I decided I wanted to implement OpenStack using OpenVSwitch. On one server.

The way I decided to do this, was to spin up a KVM virtual machine (VM) as an OpenStack controller, and have it communicate to the bare metal CentOS7 Linux host (that runs the KVM hypervisor libvirt/qemu).

I did not realize how difficult this would be, until I realized that OpenVSwitch cannot leverage Linux bridges (bridges on the host).

OpenVSwitch allows you to create, delete and otherwise manipulate bridges - but ONLY bridges that are under the control of OpenVSwitch. So, if you happen to have a bridge on the Linux host (we will call it br0), you cannot snap that bridge into OpenVSwitch.

What you would normally do, is to create a new bridge on OpenVSwitch (i.e. br-ex), and migrate your connections from br0, to br-ex.

That's all well and good - and straightforward, most of the time. But, if you want to run a virtual machine (i.e. an OpenStack Controller VM), and have that virtual machine communicate to OpenStack Compute processes running on the bare metal host, abandoning the host bridges becomes a problem.

Virt-Manager, does NOT know anything about OpenVSwitches, nor OpenVSwitch bridges that OpenVSwitch controls. So when you create your VM, if everything is under an OpenVSwitch bridge (i.e. br-ex), Virt-Manager will only provide you a series of macvtap interfaces (macvtap, and  for that matter macvlan, are topics in and of themselves that we won't get into here).

So. I did not want to try and use macvtap interfaces - and jump through hoops to get that to communicate to the underlying host (yes, there are some tricks with macvlan that can do this presumably, but the rabbit hole was getting deeper).

As it turns out, "you can have your cake, and eat it too". You can create a Linux bridge (br0), and plumb that into OpenVSwitch with a veth pair. A veth pair is used just for this very purpose. It is essentially a virtual patch cable between two bridges, since you cannot join bridges (joining bridges is called cascading bridges, and this is not allowed in Linux Networking).
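A minimal sketch of that plumbing, assuming a Linux bridge br0 and an OVS bridge br-ex, with hypothetical veth endpoint names:

# ip link add veth-br0 type veth peer name veth-ovs
# ip link set veth-br0 up
# ip link set veth-ovs up
# ip link set veth-br0 master br0
# ovs-vsctl add-port br-ex veth-ovs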

So here is what we wound up doing.


Wednesday, January 1, 2020

SDN- NFV Certified from Metro Ethernet Forum

It has been a while since I have blogged any updates, so I'll knock out a few!

First, I just completed the course and certification from Metro Ethernet Forum for SDN-NFV.

This was a 3 day course, and it was surprisingly hands-on as it focused heavily on OpenFlow and OpenDaylight. I was always wanting to learn more about these, so I found this quite rewarding.

One interesting stumbling block in the labs was the fact that there is a -O option that needs to be used to specify the proper version of OpenFlow. 
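For example, when the switch is speaking a newer protocol version than the ovs-ofctl default (the bridge name here is just illustrative):

# ovs-ofctl -O OpenFlow13 dump-flows br-tun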

The course seemed to focus on the use case and context of using OpenFlow (and OpenDaylight) to configure switches - but not "everything else" out there in the field of networking that could be configured with something like OpenFlow.

For example, it was my understanding that the primary beneficiary of something like OpenFlow (and OpenDaylight) was in the Wireless (802.11x) domain, where people had scores, hundreds or even thousands of access points that had to be configured or upgraded, and it was extremely difficult to do this by hand.

But, the course focused on switches - OpenVSwitch switches to be precise. And that was probably because the OpenVSwitch keeps things simple enough for the course and instructor.

Problem is, in my shop, everyone is using Juniper switches, and Juniper does not play ball with OpenFlow and OpenVSwitch. So I'm not sure how much this can or will be put to use in our specific environment. I do, however, use OpenVSwitch in my own personal OpenVSwitch-based OpenStack environment, and since OpenVSwitch works well with DPDK and VPP, this knowledge can come in handy as I need to start doing more sophisticated things with packet flows.

Nonetheless, I found the course interesting and valuable. And the exam also centered around the ETSI-MANO Reference Architecture. I was familiar with this architecture, but like all exams like this, I missed questions because of time, or overthinking things, or picking the wrong of two correct answers (not the best answer), et al. But, I passed the exam, and I guess that's what matters most.

Tuesday, September 24, 2019

Vector Packet Processing - Part I

Yesterday, I was reading up on something called Vector Packet Processing (VPP). I had not heard of this, nor the organization called Fd.io (pronounced Fido), which can be found at the following link: http://fd.io

Chasing links to get more up to speed, I found this article, which provides a very good introduction to these newer networking technologies, which have emerged to support virtualization due to the overhead (and redundancy) associated with forwarding packets from NICs, to virtualization hosts, and into the virtual machines.

https://software.intel.com/en-us/articles/an-overview-of-advanced-server-based-networking-technologies

I like how the article progresses from the old-style interrupt processing, to OpenVSwitch (OVS), to SR-IOV, to DPDK, and then, finally, to VPP.

I am familiar with OpenVSwitch, which I came into contact with through OpenStack, which had OpenVSwitch drivers (and required you to install OpenVSwitch on the controller and compute nodes).

I was only familiar with SR-IOV because I stumbled upon it and took the time to read up on what it was. I think it was a virtual Palo Alto Firewall that had SR-IOV NIC Types, if I'm not mistaken. I spent some time trying to figure out if these servers I am running support SR-IOV and they don't seem to have it enabled, that's for sure. Whether they support it would take more research.

And DPDK I had read up on, because a lot of hardware vendors were including FastPath Data switches that were utilizing DPDK for their own in-house virtual switches, or using the DPDK-OpenVSwitch implementation.

But Vector Packet Processing (VPP), this somehow missed me. So I have been doing some catch-up on VPP, which I won't go into detail on in this post or share additional resources on such a large topic. But the link above to Fido is essentially touting VPP.

UPDATE:
I found this link, which is also spectacularly written:
https://www.metaswitch.com/blog/accelerating-the-nfv-data-plane

And, same blog with another link for those wanting the deep dive into VPP:
https://www.metaswitch.com/blog/fd.io-takes-over-vpp

Thursday, June 6, 2019

The Network Problem From Hell - Fixed - Circuitous Routing



Life is easy when you use a single network interface adaptor.  But when you start using multiple adaptors, you start running into complexities because packets can start taking multiple paths. 

One particular thing most network engineers want to avoid, is situations where a packet leaves through door #1 (e.g. NIC 1), and arrives through door #2.  To fix this, though, requires some more advanced network techniques and tricks (separate routing tables per NIC, and corresponding rules to direct packets to use those separate routing tables).
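A hedged sketch of those per-NIC routing tables and rules (the table name, interface, and addresses are assumptions for illustration):

# echo "200 nic2" >> /etc/iproute2/rt_tables
# ip route add 172.22.0.0/24 dev em2 src 172.22.0.10 table nic2
# ip route add default via 172.22.0.1 dev em2 table nic2
# ip rule add from 172.22.0.10/32 table nic2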

So, I had this problem where an OpenStack-managed virtual machine stopped working because it could not reach OpenStack itself, which was running on the SAME machine that the virtual machine was running on. It was driving me insane.

I thought the problem might be iptables on the host machine. I disabled those. Nope. 

I thought the problem might be OpenVSwitch. I moved the cable to a new NIC, and changed the bridge the virtual machine was using. Nope.

Compounding the problem, was that the OpenStack Host could ping the virtual machine. But the virtual machine could not ping the host. Why would it work one way, and not the other?

The Virtual Machine could ping the internet. It could ping the IP of the OpenStack router. It could ping the router that the host was connected to.

OpenStack uses Linux IP Namespaces, and in our case was using the Neutron OpenVSwitch Agent. An examination of these showed that the networking seemed to be configured just as it showed up in the Horizon Dashboard "Network Topology" visual interface.

One thing that is worth mentioning is that the bridge mappings for provider networks are in the ml2_conf.ini file and the openvswitch_agent.ini file. But the EXTERNAL OpenStack networks use a bridge based on a parameter setting in the l3_agent.ini file! So if the l3_agent.ini file has a bridge setting of, say, "br-ex" for external networks, and you don't have that bridge correspondingly configured in the other files, OpenStack will give you a message when you create the external network that it cannot reach the external network. We did run into this when trying to create different external networks on different bridges to solve the problem.

At wits end, I finally called over one of the more advanced networking guys in the company, and we began troubleshooting it using tcpdump. We finally realized that when the VM pinged the OpenStack host, the ICMP request packets were arriving on the expected NIC (em1 below), but no responses were going out on em1. When we changed tcpdump to use "any" interface, we saw no responses at all. Clearly the host was dropping the packets. But iptables was flushed! WHO/HOW were the packets getting dropped? (to be honest, we still aren't sure about this - more research required on that). But - we did figure out that the problem was a "circuitous routing" problem.

We figured maybe reverse path filtering was causing the issue. So we disabled that in the kernel. Didn't fix it.  
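For reference, the kind of knob we toggled (a blanket example; per-interface settings may also apply):

# sysctl -w net.ipv4.conf.all.rp_filter=0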

Finally, we realized that what was happening, is that the VM sends all of its packets through the external network bridge, which was attached to a 172.22.0.0/24 network, and the packet went to the router, which routed it to its 172.20.0.0/24 port, and then to the host machine. But because the host machine had TWO NICs on BOTH those networks, the host machine did not send replies back the same way they came in. It sent the replies to its em2 NIC which was bridged to br-compute. And it was HERE that the packets were getting dropped. Since that NIC is managed by OpenVSwitch, we believe a loop-prevention flow rule in OpenVSwitch, or perhaps Spanning Tree Protocol, caused the packets to get dropped.

Illustration of the Circuitous Routing Issue & How it was solved

The final solution was to put in a host route, so that any packet to that particular VM would be sent outside of the host, upstream to the router, and back in through the appropriate 172.22.0.0/24 port on the host/bridge, to the OpenStack Router, where it would be NATd back to the 192.168.100.19 IP of the virtual machine. 
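A hedged illustration of that host route (the VM's externally reachable address, the upstream gateway, and the interface are assumptions for illustration only):

# ip route add 172.22.0.50/32 via 172.20.0.1 dev em1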

Firewalls and Routing. These two seem to be where most problems occur in networking.
