Tuesday, October 13, 2020

DPDK Hands-On - Part IX - Launching a VM with DPDK Ports from a bash script

When you launch a virtual machine on Linux, the hypervisor can be plain qemu (pure emulation, generally considered a Type 2 hypervisor) or qemu-kvm (hardware-assisted via the KVM kernel module, commonly treated as a Type 1 hypervisor). This post won't go into the details of the differences between those.

The options one can pass into qemu or qemu-kvm can be daunting.

But, if you want to launch a virtual machine just for the purpose of testing a DPDK adaptor, this can be done in a lightweight manner.

Remember that a vhostuser port connects to a socket created by OpenVSwitch. This is the original, legacy way DPDK networking was wired up between VMs and OpenVSwitch.

The drawback to this, of course, is that if the switch were to die, the VMs would be stranded with no connectivity (a bad thing). 

This is why they came out with vhostuserclient ports, where OpenVSwitch connects to a socket that is managed by qemu when it starts the VM. This way, if the switch dies or is restarted, ports on the switch will reconnect to the VM-managed sockets. If a VM dies, it only removes its own sockets.

So we will cover launching a VM in both modes: vhostuser and vhostuserclient.

VhostUser

With a vhostuser port, OpenVSwitch creates the socket that qemu connects to. So this port needs to be created ahead of time on OpenVSwitch (see DPDK Hands-On Part VIII - creating DPDK ports).

Below is a snippet of a bash script that cranks a VM after some parameters are defined.

if [ "$PORT_TYPE" == "vhost-user" ]; then
   # vhost-user - openvswitch binds to the socket and acts as the server
   export VHOST_SOCK_DIR=/var/run/openvswitch

   export VHOST_PORT="${VHOST_SOCK_DIR}/vhostport1"

   # note: the vhostuser chardev does NOT have the server parameter.
   /usr/libexec/qemu-kvm -name $VM_NAME -cpu host -enable-kvm \
   -m $GUEST_MEM -drive file=$QCOW2_IMAGE --nographic -snapshot \
   -numa node,memdev=mem -mem-prealloc -smp sockets=1,cores=2 \
   -object memory-backend-file,id=mem,size=$GUEST_MEM,mem-path=/dev/hugepages,share=on \
   -chardev socket,id=char${MAC},path=${VHOST_PORT} \
   -netdev type=vhost-user,id=default,chardev=char${MAC},vhostforce,queues=2 \
   -device virtio-net-pci,mac=00:00:00:00:00:0${MAC},netdev=default,mrg_rxbuf=off,mq=on,vectors=6
   # uncomment the redirect below (append it to the qemu command) to capture the VM console output:
   #1>qemu.kvm.vhostuser.out 2>&1

fi

So let's discuss these parameters:

- The -numa node parameter tells qemu how to lay out the guest's CPUs and memory across NUMA nodes. A good link on this can be found at https://futurewei-cloud.github.io/ARM-Datacenter/qemu/how-to-configure-qemu-numa-nodes/ 

Now my machine only has one NUMA socket, which equates to one NUMA node (node 0), with 4 cores on it. So with the -smp sockets=1,cores=2 directive, I give my virtual machine 2 of those cores.
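
If you are not sure what your own host's topology looks like, two quick ways to check it (numactl may need to be installed from your distribution's repositories):

# lscpu | grep -iE "socket|core|thread|numa"
# numactl --hardware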

- Mem Path

This parameter is specified to put the VM on hugepages. Any VM using DPDK ports should be backed by HugePages.

- Socket Parameters

The socket parameters are important. When OpenVSwitch creates the socket (as it does for vhostuser ports), it places it in /var/run/openvswitch by default. You have to create the port on the switch ahead of time, then point the VM at that exact path and socket name.

- Multi queuing

You will notice that mq=on, and 2 queues are specified. Having more than one queue for sending and receiving data unlocks a bottleneck. 

NOTE: I am not clear, actually, if by specifying 2 on "queues=2", that this means 2 x Tx queues AND 2 x Rx queues. I need to look into that and perhaps update this post (or if someone can comment on this, great).

The proof is in the pudding on this. When you launch the VM, you will need to check TWO places to ensure your networking initialized properly:

First, check /var/log/openvswitch/ovs-vswitchd.log

Second, check the output of the VM itself, which means redirecting stdout and stderr to a file and analyzing them accordingly. The commented-out redirect at the end of the qemu command above shows this. Capturing the output this way does change how you start and stop the VM, so I comment it out when running in normal mode.
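
For example, assuming the default log location and the commented-out redirect file name from the script above:

# grep -i vhost /var/log/openvswitch/ovs-vswitchd.log | tail
# grep -iE "vhost|virtio" qemu.kvm.vhostuser.out

In the OVS log, look for messages referencing the vhostport1 socket; in the VM output, look for the virtio-net device coming up without errors.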

VhostUserClient

The vhostuserclient mode works very similarly to vhostuser, except that the socket is created and managed by qemu rather than by OpenVSwitch!


# vhost-user-client - qemu binds to socket and acts as server
export VHOST_SOCK_DIR="/var/lib/libvirt/qemu/vhost_sockets"

export VHOST_PORT="${VHOST_SOCK_DIR}/dpdkvhostclt1"

echo "Starting VM..."
/usr/libexec/qemu-kvm -name $VM_NAME -cpu host -enable-kvm \
  -m $GUEST_MEM -drive file=$QCOW2_IMAGE --nographic -snapshot \
  -numa node,memdev=mem -mem-prealloc -smp sockets=1,cores=2 \
  -object memory-backend-file,id=mem,size=$GUEST_MEM,mem-path=/dev/hugepages,share=on \
  -chardev socket,id=char1,path=${VHOST_PORT},server \
  -netdev type=vhost-user,id=default,chardev=char1,vhostforce,queues=2 \
  -device virtio-net-pci,mac=00:00:00:00:00:0${MAC},netdev=default,mrg_rxbuf=off,mq=on,vectors=6

Note the difference between the "chardev" directive above and the vhostuser "chardev" directive: the vhostuserclient version includes the additional "server" parameter! This is the KEY to specifying that qemu owns the socket, as opposed to OpenVSwitch.

One might wonder how this works in practice. If you create a VM and it binds to a socket, how does OpenVSwitch know to connect to it? The answer is, OpenVSwitch won't - until you create the port on OpenVSwitch in vhostuserclient mode! At that point, OVS knows it is the client, and it creates the connection to the qemu process.

So a script that launches a VM with vhostuserclient interfaces should probably create the OVS port first, then launch the VM. Otherwise, the VM won't have the connectivity it needs in a timely manner when it boots up (i.e. for DHCP to determine its IP address and settings).
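
A minimal sketch of that ordering, reusing the port name and socket path from the qemu snippet above:

# create the OVS side first; as the client, OVS will keep retrying the
# connection until qemu creates the socket
ovs-vsctl add-port br-tun dpdkvhostclt1 -- set Interface dpdkvhostclt1 \
    type=dpdkvhostuserclient options:vhost-server-path=${VHOST_PORT} ofport_request=4

# then launch the VM, which creates ${VHOST_PORT} and listens on it
/usr/libexec/qemu-kvm ... (the full command from the snippet above)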

Tuesday, September 29, 2020

DPDK Hands-On - Part VIII - Creating Virtual DPDK Ports on OpenVSwitch

Before we get into the procedure of adding virtual ports to the switch, it is important to understand the two types of DPDK virtual ports, and their differences.

In the earlier versions of DPDK+OVS, virtual interfaces were defined with a type called vhostuser. These interfaces would connect to OpenVSwitch, meaning, from a socket perspective, that OpenVSwitch managed the socket. More technically, OVS binds to a socket in /var/run/openvswitch/<portname> and behaves as the server, while the VMs connect to this socket as clients.

There is a fundamental flaw in this design. A rather major one! Picture a situation where a dozen virtual machines launch with ports that are connected to the OVS, and the OVS is rebooted! All of those sockets are destroyed on the OVS, leaving all of the VMs "stranded".

To address this flaw, the socket model needed to be reversed. The virtual machine (i.e. qemu) needed to act as the server, and the switch needed to be the client! 

Hence, a new port type was created: vhostuserclient.

A more graphical and elaborative explanation of this can be found on this link:

https://software.intel.com/content/www/us/en/develop/articles/data-plane-development-kit-vhost-user-client-mode-with-open-vswitch.html 

Now because there are two sides to any socket connection, it makes sense that BOTH sides need to be configured properly with the proper port type for this communication to work.

This post deals with simply adding the right DPDK virtual port type (vhostuser or vhostuserclient) to the switch. But configuring the VM properly is also necessary, and will be covered in a follow-up post after this one is published.

I think the easiest way to explain is simply to show how these two port types are added, with some discussion.

VhostUser

To add a vhostuser port, the following command can be run:

# ovs-vsctl add-port br-tun dpdkvhost1 -- set Interface dpdkvhost1 type=dpdkvhostuser ofport_request=2

It is as simple as adding a port to a bridge, giving it a name, and using the appropriate type for a legacy virtual DPDK port (dpdkvhostuser). We also give it port number 2 (in our earlier post, we added a physical DPDK PCI NIC as port 1, so we will assume port 1 is taken by that).

Notice that there is no socket information included in this. OpenVSwitch will create a socket, by default, in /var/run/openvswitch/<portname> once the vhostuser port is added to the switch. 

NOTE: The OVS socket location can be overridden, but for simplicity we will assume the default location. Another issue is socket permissions: when the VM launches under a different userid such as qemu, the socket will need to be writable by qemu! (One way to handle that is shown below.)
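
One crude way to handle the permissions in a lab (assuming the VM process runs as the qemu user and the default socket location) is to simply change ownership after the port is created:

# chown qemu:qemu /var/run/openvswitch/dpdkvhost1

On a real deployment you would want ovs-vswitchd and qemu to share a group rather than chown-ing sockets by hand, but the one-liner is enough to get a test VM connected.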

The virtual machine, with a vhostuser interface defined on it, needs to be told which socket to connect to. Because it is the VM that does the connecting, the OVS side of the configuration is actually somewhat simpler in this model: OVS just creates the socket wherever it is configured to, the default being /var/run/openvswitch.

So after adding a port to the bridge, we can do a quick show on our bridge to ensure it created properly.

# ovs-vsctl show 

 Bridge br-tun
        fail_mode: standalone
        Port "dpdkvhost1"
            Interface "dpdkvhost1"
                type: dpdkvhostuser

        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
                options: {dpdk-devargs="0000:01:00.0"}
        Port br-tun
            Interface br-tun
                type: internal

With this configuration, we can test between a physical interface and a virtual interface, or the virtual interface can attempt to reach something outside of the host (i.e. a ping test to a default gateway and/or an internet address). With this configuration, a virtual machine could also attempt a DHCP request to obtain its IP address for the segment it is on, if a DHCP server indeed exists.

If we wanted to test between two virtual machines, another such interface would need to be added:

# ovs-vsctl add-port br-tun dpdkvhost2 -- set Interface dpdkvhost2 type=dpdkvhostuser ofport_request=3

And, this would result in the following configuration:

  Bridge br-tun
        fail_mode: standalone
        Port "dpdkvhost2"
            Interface "dpdkvhost2"
                type: dpdkvhostuser
        Port "dpdkvhost1"
            Interface "dpdkvhost1"
                type: dpdkvhostuser

        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
                options: {dpdk-devargs="0000:01:00.0"}
        Port br-tun
            Interface br-tun
                type: internal

With this configuration, TWO virtual machines would connect to their respective OVS switch sockets:

VM1 connects to the OVS socket for dpdkvhost1 --> /var/run/openvswitch/dpdkvhost1

VM2 connects to the OVS socket for dpdkvhost2 --> /var/run/openvswitch/dpdkvhost2

Thanks to the physical DPDK PCI port (dpdk0) we added earlier, these two VMs can "reach outside" to request an IP address, and they can ping each other on the same segment once they both have one.

 dpdkvhost1   dpdkvhost2
      |            |
 =========================
    OVS Bridge (br-tun)
 =========================
            |
          dpdk0
            |
     Upstream Router
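
Once both VMs are up, a quick sanity check from inside VM1 might look like this (the interface name eth0 and the router address are just examples, not from the configuration above):

# dhclient eth0
# ip addr show eth0
# ping -c 3 192.168.1.1
# ping -c 3 <the address VM2 received>

If DHCP times out, go back and check the vhostuser socket connection in the OVS log before suspecting the guest.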

VhostUserClient

This configuration looks similar to the vhostuser configuration, but with a subtle difference. In this case, the VM is the server in the client-server socket model, so the OVS port, as a client, needs to know where the socket is in order to connect to it!

# ovs-vsctl add-port br-tun dpdkvhostclt1 -- set Interface dpdkvhostclt1 type=dpdkvhostuserclient "options:vhost-server-path=/var/lib/libvirt/qemu/vhost_sockets/dpdkvhostclt1" ofport_request=4

In this directive, the only things that change are the addition of the parameter telling OVS which socket to connect to, and of course the port type, which needs to be set to dpdkvhostuserclient (instead of dpdkvhostuser).

And if we run our ovs-vsctl show command, we will see that the port looks similar to the vhostuser ports, except for two differences:

  • the type is now vhostuserclient, rather than vhostuser
  • the options parameter, which instructs OVS (the socket client) where to connect.

Bridge br-tun
        fail_mode: standalone
        Port "dpdkvhostclt1"
            Interface "dpdkvhostclt1"
                type: dpdkvhostuserclient
                options: {vhost-server-path="/var/lib/libvirt/qemu/vhost_sockets/dpdkvhostclt1"}

        Port "dpdkvhost2"
            Interface "dpdkvhost2"
                type: dpdkvhostuser
        Port "dpdkvhost1"
            Interface "dpdkvhost1"
                type: dpdkvhostuser
        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
                options: {dpdk-devargs="0000:01:00.0"}
        Port br-tun
            Interface br-tun
                type: internal

Setting up Flows

Just because we have added these ports does not necessarily mean they'll work after creation. The next step is to enable flows (rules) for traffic forwarding between these ports.

Setting up switch flows is an in-depth topic in and of itself, and one we won't cover in this post. There are advanced OpenVSwitch tutorials on Flow Programming (OpenFlow).

The first thing you can generally do, if you don't have special flow requirements that you're aware of, is to set the traffic processing to "normal", as seen below for the br-tun bridge/switch.

# ovs-ofctl add-flow br-tun actions=normal 

This should give normal L2/L3 packet processing. But, if you can't ping or your network forwarding behavior isn't as desired, you may need to program more detailed or sophisticated flows.

For simplicity, I can show you a couple of examples of how one could attempt to enable some traffic to flow between ports:

These first two flows allow traffic from the host (the bridge's LOCAL port) to go out the physical DPDK interfaces, so you can ping out from the bridges:

# ovs-ofctl add-flow br-tun in_port=LOCAL,actions=output:dpdk0

# ovs-ofctl add-flow br-prv in_port=LOCAL,actions=output:dpdk1

These next two forward packets to the proper VM when they come into the host, based on destination IP:

# ovs-ofctl add-flow br-tun "ip,nw_dst=192.168.30.202,actions=output:dpdkvhost1"

# ovs-ofctl add-flow br-prv "ip,nw_dst=192.168.20.202,actions=output:dpdkvhost0"

To debug the packet flows, you can dump them with the "dump-flows" command. There is a similarity between iptables rules (iptables -nvL) and openvswitch flows, and debugging is somewhat similar in that you can dump flows, and look for packet counts.

# ovs-ofctl dump-flows br-prv
 cookie=0xd2e1f3bff05fa3bf, duration=153844.320s, table=0, n_packets=0, n_bytes=0, priority=2,in_port="phy-br-prv" actions=drop
 cookie=0xd2e1f3bff05fa3bf, duration=153844.322s, table=0, n_packets=10224168, n_bytes=9510063469, priority=0 actions=NORMAL

In the example above, we have two flows on the bridge br-prv. And we do not see any packets being dropped. So, presumably, anything connected to this bridge should be able to communicate from a flow perspective.

After setting these kinds of flows, ping tests and traffic verification tests will need to be done. 
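
One habit that helps here: run the ping in one window and watch the flow counters in another, confirming that the n_packets count on the flow you expect is the one incrementing:

# watch -n1 'ovs-ofctl dump-flows br-tun'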

I refer to this as "port plumbing", and these rules can potentially get very advanced, sophisticated, and complex. 

If you are launching a VM on Linux via KVM (usually with a script), or using virsh (which drives off of an XML file that describes the VM), you will need to set up these "port plumbing" rules manually, and you would probably start with the basic normal action unless you want to do something sophisticated.

If you are using OpenStack, however, OpenStack does a lot of this automatically, and what it does is influenced by your underlying OpenStack configuration (files). For example, if you are launching a DPDK VM on an OpenStack that is using OpenVSwitch, each compute node will be running a neutron-openvswitch-agent service. This service is actually a Ryu-based OpenFlow controller, and when you start it, it plumbs ports on behalf of OpenStack Neutron based on your Neutron configuration. So you may look at your flows with just OpenVSwitch running and see a smaller subset of flows than you would if the neutron-openvswitch-agent were running! I may get into some of this in a subsequent post, if time allows.

Thursday, August 13, 2020

NIC Teaming vs Active-Active NIC Bonding - differences - which is better?

I just went down a path of discovery trying to fully understand the differences between Bonding and NIC Teaming.

Bonding, of course, has a concept of "bonding modes" that allow you to use NICs together for a failover purpose in active-standby mode, and even active-active failover. When using these modes, however, the focus is not gluing the NICs together to achieve linear increases in bandwidth (i.e. 10G + 10G = 20G). To get into true link aggregation, you need to use different bonding modes that are specifically for that purpose. I will include a link that discusses in detail the bonding modes in Linux:

Linux Bonding Modes 

So what is the difference, between using Bonding Mode 4 (LACP Link Aggregation), or Bonding Mode 6 (Adaptive Load Balancing), and NIC Teaming? 

I found a great link that covers the performance differences between the two.

https://www.redhat.com/en/blog/if-you-bonding-you-will-love-teaming

At the end of the day, it comes down to the drivers and how well they're written of course. But for Red Hat 7 at least, we can see the following.

The performance is essentially "six of one, half a dozen of the other" on RHEL 7, on a smaller machine with a 10G fiber interface. But if you look carefully, while NIC Teaming gives you gains on smaller packet sizes, as packet sizes get large (64KB or higher), Bonding starts to give you some gains.

I'll include the link and the screen snapshot. Keep in mind, I did not run this benchmark myself. I am citing an external source for this information, found at the Red Hat link above.

Friday, August 7, 2020

LACP not working on ESXi HOST - unless you use a Distributed vSwitch

Today we were trying to configure two NICs on an ESXi host in an Active-Active state, such that they would participate in a LAG using LACP with one NIC connected to one TOR (Top of Rack) Switch and the other connected to another separate TOR switch.

It didn't work.

There was no way to "bond" the two NICs (as you would typically do in Linux). ESXi only supported NIC Teaming. Perhaps only the most advanced networking folks realize that NIC Teaming is not the same as NIC Bonding (we won't get into the weeds on that), and that NIC Teaming and NIC Bonding are not the same as Link Aggregation.

So after configuring NIC Teaming, and enabling the second NIC on vSwitch0, poof! Lost connectivity.

Why? Well, the ESXi standard vSwitch speaks Cisco Discovery Protocol (CDP), but not LACP, which the switch requires. So without LACP, there is no effective LAG, and the switch gets confused.

Finally, we read that in order to use LACP, you needed to use vDS - the VMware Distributed Switch. 

Huh? Another product? To do something we could do on a Linux box with no problems whatsoever?

Turns out that to run vDS, you need to run vCenter Server. So they put the Distributed Switch on vCenter Server?

Doesn't that come at a performance cost? Just so they can charge licensing?

I was not impressed that I needed to use vCenter Server just to put 2 NICs on a box on a Link Aggregation Group.

Wednesday, June 17, 2020

DPDK Hands-On - Part VII - Creating a Physical DPDK NIC Port on OpenVSwitch

First off, you can NOT add a DPDK port for a physical NIC onto OpenVSwitch, if the NIC is not using a DPDK driver! And that driver must be working!!!

Now, when I got started, I was using vfio as my DPDK driver. Why? Because, per the DPDK and DPDK+OVS websites, vfio is the newer driver and the one everyone is encouraged to use.

In order to use a poll mode driver on Linux, the appropriate kernel module(s) need to be loaded (I show loading them right after the list below). Once the module is loaded, you can use a couple of methods (at least) to tell Linux to use the vfio driver for your PCI NIC. The two tools I use are:

  • driverctl - this ships with the OS (CentOS at least)
  • dpdk-devbind - to use this utility, the dpdk-tools package needs to be installed, or, if you have downloaded and compiled DPDK from source code, you can use the copy in the source tree.
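
Before either tool can bind the NIC, the kernel module itself has to be loaded. A minimal sketch for the two drivers discussed in this post (both ship with the stock CentOS 7 kernel):

# modprobe vfio-pci
# modprobe uio_pci_generic
# lsmod | grep -E "vfio|uio"

The vfio-pci route also expects IOMMU/VT-d to be enabled in the BIOS and on the kernel command line, which was covered in the earlier posts in this series.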

Using driverctl, you can set your driver by "overriding" the default driver, as follows:

# driverctl set-override 0000:01:00.0 vfio-pci

or, if you are using uio_pci_generic, 

# driverctl set-override 0000:01:00.0 uio_pci_generic

NOTE: I had to use uio_pci_generic with my cards. The vfio driver was registering an error that I finally found in the openvswitch log.

I find the dpdk-devbind tool best for checking status, but you can use driverctl for that as well. I will show both commands and their output. The driverctl output is not as clean in its format as the dpdk-devbind output is.

# driverctl -v list-devices | grep -i net
0000:00:19.0 e1000e (Ethernet Connection I217-LM)
0000:01:00.0 uio_pci_generic [*] (82571EB/82571GB Gigabit Ethernet Controller D0/D1 (copper applications) (PRO/1000 PT Dual Port Server Adapter))
0000:01:00.1 uio_pci_generic [*] (82571EB/82571GB Gigabit Ethernet Controller D0/D1 (copper applications) (PRO/1000 PT Dual Port Server Adapter))
0000:03:00.0 e1000e (82571EB/82571GB Gigabit Ethernet Controller D0/D1 (copper applications) (PRO/1000 PT Dual Port Server Adapter))
0000:03:00.1 e1000e (82571EB/82571GB Gigabit Ethernet Controller D0/D1 (copper applications) (PRO/1000 PT Dual Port Server Adapter))

 # dpdk-devbind --status

Network devices using DPDK-compatible driver
============================================
0000:01:00.0 '82571EB Gigabit Ethernet Controller 105e' drv=uio_pci_generic unused=e1000e
0000:01:00.1 '82571EB Gigabit Ethernet Controller 105e' drv=uio_pci_generic unused=e1000e


Network devices using kernel driver
===================================
0000:00:19.0 'Ethernet Connection I217-LM 153a' if=em1 drv=e1000e unused=uio_pci_generic
0000:03:00.0 '82571EB Gigabit Ethernet Controller 105e' if=p1p1 drv=e1000e unused=uio_pci_generic
0000:03:00.1 '82571EB Gigabit Ethernet Controller 105e' if=p1p2 drv=e1000e unused=uio_pci_generic

After you load your driver for the NIC of choice, you need to check a few things to ensure that the driver is working properly. Otherwise, you might pull your hair out trying to debug problems, without realizing that the underlying driver is the culprit.
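
The few things I check are along these lines (the log path assumes CentOS 7 defaults):

# lsmod | grep -E "vfio|uio"
# dmesg | grep -iE "vfio|uio" | tail
# dpdk-devbind --status
# tail -f /var/log/openvswitch/ovs-vswitchd.log

The last one is the most valuable: leave it running in a second terminal while you add the port in the next step, and driver problems show up immediately.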

Next, we will add the port to OpenVSwitch, but before you do, there are two important things to make sure are done:
 
1. make sure the openvswitch you are running has been compiled for DPDK, and initialized.

The best way to do this is to dump the switch parameters, as shown below. For a properly compiled and configured DPDK OpenVSwitch, the settings to look for are dpdk_initialized=true, a dpdk_version string, netdev in datapath_types, the dpdk* entries in iface_types, and the dpdk-* keys in other_config.

# ovs-vsctl list open_vswitch
_uuid               : 2d46de50-e5b8-47be-84b4-a7e85ce29526
bridges             : [14049db7-07eb-4b00-89d6-b05201d2978e, 221b954d-c2dc-48b2-923b-86a87765ed7b, db7a8c3c-5a03-4063-a5e8-23686f319473]
cur_cfg             : 936
datapath_types      : [netdev, system]
db_version          : "7.16.1"
dpdk_initialized    : true
dpdk_version        : "DPDK 18.11.8"
external_ids        : {hostname=maschinen, rundir="/var/run/openvswitch", system-id="35b95ef5-fd71-491f-8623-5ccbbc1eca6b"}
iface_types         : [dpdk, dpdkr, dpdkvhostuser, dpdkvhostuserclient, erspan, geneve, gre, internal, "ip6erspan", "ip6gre", lisp, patch, stt, system, tap, vxlan]
manager_options     : [a4329e63-5b63-4e8b-8dac-0e9bc8492c28]
next_cfg            : 936
other_config        : {dpdk-init="true", dpdk-lcore-mask="0x2", dpdk-socket-limit="2048", dpdk-socket-mem="1024", pmd-cpu-mask="0xC"}
ovs_version         : "2.11.1"
ssl                 : []
statistics          : {}
system_type         : centos
system_version      : "7"

2. make sure that your specific bridge is created for DPDK (netdev datapath rather than system)

Trying to add a PCI NIC that is using poll mode drivers (vfio or uio) to an OpenVSwitch bridge that does not have the datapath set correctly (to netdev) will result in an error when you add the port. This error will be reported both in the "ovs-vsctl show" output and in the log file /var/log/openvswitch/ovs-vswitchd.log.

This was one of the most common and confusing errors to debug, actually. In newer versions of OVS, they have added the datapath to the "ovs-vsctl show" command. My switch, however, predates this patch, so I have to examine my datapath another way:

# ovs-vsctl list bridge br-tun | grep datapath
datapath_id         : "0000001b21c57204"
datapath_type       : netdev
datapath_version    : "<built-in>"

If your bridge is NOT using the netdev datapath, you can change it:

# ovs-vsctl set bridge br-tun datapath_type=netdev

So if your switch looks good, and your bridge is set with the right netdev datapath type, adding your NIC should be successful with the following command:

# ovs-vsctl add-port br-tun dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-devargs=0000:01:00.0 ofport_request=1

Let me comment a bit on this command to shed some additional insight.

1. I always add my dpdk NICs - whether they are physical or virtual ports - with the name "dpdk" as a prefix as a matter of practice.

2.  A physical PCI NIC is going to use type dpdk. Virtual interfaces do NOT use that type; they use the vhostuser or vhostuserclient types. The distinction between these two is a subsequent discussion.

3. The dpdk-devargs is where you map the port to a specific PCI address. You need to be SURE you are using the correct one, and not inadvertently switching these. A lot of mistakes are made by using the wrong PCI addresses, especially in cases where there are multiple ports on a NIC card where one may be 0000:01:00.0 and another 0000:01:00.1!

4. The ofport_request is where you give your port a port number on the switch. If you set this to 1, and 1 is taken, you will get an error (a quick way to check which numbers are in use is shown below). But in general, I try to always make a physical NIC port 1. It is extremely rare to add more than one physical NIC to a bridge.
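
A quick way to check which port numbers are already taken (the names and ofport values will obviously differ on your system):

# ovs-vsctl --columns=name,ofport list Interface
# ovs-ofctl show br-tun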

So after adding your NIC to the bridge, you can check the status of it with the ovs-vsctl show command:

# ovs-vsctl show

    Bridge br-tun
        fail_mode: standalone
        Port "dpdk0" --> you will see an error here if the port did not add correctly!
            Interface "dpdk0"
                type: dpdk
                options: {dpdk-devargs="0000:01:00.0"}

        Port br-tun
            Interface br-tun
                type: internal

Another useful command, is to dump your specific bridge, by port number. We can use the "ofctl" command, rather than "vsctl" command.

# ovs-ofctl show br-tun
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000001b21c57204
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(dpdk0): addr:00:1b:21:c5:72:04
     config:     0
     state:      0
     current:    1GB-FD AUTO_NEG
     speed: 1000 Mbps now, 0 Mbps max

 LOCAL(br-tun): addr:00:1b:21:c5:72:04 --> LOCAL port maps to the bridge port in ovs-vsctl command.
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0

And this concludes our post on adding a physical DPDK PCI NIC to an OpenVSwitch bridge.

DPDK Hands-On - Part VI - Configuring OpenVSwitch DPDK Parameters

In earlier posts, we talked about the fact that we needed to check our compatibility with NUMA, HugePages, and IOMMU. And when we decided we had that compatibility, we "enabled" these technologies by passing them into the Linux kernel through grub.

But - now that everything is enabled, OpenVSwitch needs to be configured with a number of parameters that, to the layman's eye, look quite scary and intimidating.  Fortunately, there is a website that elaborates on these parameters much better than I could, so I will list that here:    OVS-DPDK Parameters

The first thing to do is initialize OpenVSwitch for DPDK. Without this, OpenVSwitch is deaf, dumb, and blind to DPDK-related directives.

# ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true

Next, we will reserve memory, in hugepages, for DPDK sockets. The values are in MB, so the commands below reserve 1024 MB (one 1G hugepage) with an upper limit of 2048 MB (two 1G hugepages).

# ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024"
# ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-limit="2048"

NOTE: The Hugepages use can be checked a couple of ways. See prior post on Hugepages.
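
For convenience, the quick checks I use (the second path assumes 1G hugepages, matching this setup):

# grep -i huge /proc/meminfo
# cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages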

Next, we instruct OVS to place DPDK threads on specific cores the way we want to. We will do that by specifying a CPU mask for each of two settings. First, let's discuss the OVS directives (settings):

  • dpdk-lcore-mask - which cores the non-datapath DPDK (lcore) threads run on
  • pmd-cpu-mask - which cores the poll mode driver (datapath) threads run on

The dpdk-lcore-mask is a core bitmask that is used during DPDK initialization and it is where the non-datapath OVS-DPDK threads such as handler and revalidator threads run.
 
The pmd-cpu-mask is a core bitmask that sets which cores are used by OVS-DPDK for datapath packet processing.

 
The masks in particular are quite esoteric, and need to be understood properly.

Calculating Mask Values

I found a pmd (poll mode driver) mask calculator out on GitHub, at the following link:

Another calculator I found is at this link:

So let's take our box as an example. We have one Numa Node, with 4 cores. One thread per core. This makes things a bit simpler since we don't have to consider sibling pairs.
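
The arithmetic behind the masks is simply "core N = bit N", so you can compute the hex values yourself in bash rather than trusting a calculator:

# printf '0x%X\n' $(( 1 << 1 ))
0x2
# printf '0x%X\n' $(( (1 << 2) | (1 << 3) ))
0xC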

This mask (0x2 = binary 0010) pins the lcore threads to CPU core 1.
# ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x2
This mask (0xC = binary 1100) pins the pmd threads to CPU cores 2 and 3.
# ovs-vsctl --no-wait set Open_vSwitch . other_config:pmd-cpu-mask="0xC"

It will be EASY to see if your pmd cores are set correctly, because if you run top or htop, these cpus will be consumed 100% because the poll mode drivers consume those cores polling for packets!

So per the screen shot below, we can see that cores 2-3 are fully consumed. This is a classic example of why CPU affinity matters! The OS scheduler will generally avoid placing work on CPUs that are already fully consumed, so with the pmd threads pinned to cores 2-3 and the lcore threads on core 1, you really only have core 0 left (on this system anyway) for the OS and any VMs, unless you are willing to share cores 0-1 with the other userspace processes that get launched.

NOTE: in htop, cores 2-3 are shown as 3-4 because htop starts with 1 and not with 0.
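
Besides top/htop, OVS itself will report what the pmd threads are doing, which is a nice cross-check that the mask landed on the cores you intended:

# ovs-appctl dpif-netdev/pmd-stats-show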



Friday, June 12, 2020

DPDK Hands-On - Part V - Custom Compiling DPDK and OpenVSwitch

So in our last blog, I pointed out that you could "just use yum" to install your DPDK and your OpenVSwitch packages.

There are several problems with this. Let me go through them.

OpenVSwitch is not compiled for DPDK by default

First, and most importantly, when you use yum to install OpenVSwitch, you do NOT get an OpenVSwitch that has been compiled with DPDK support. And this is a very time-consuming and painful lesson to learn.

I just assumed, at first, that the reason nothing seemed to be working was that maybe DPDK wasn't enabled. No. It in fact has to have a special compile flag --with-dpdk in order to support DPDK.

Versions Matter - for BOTH DPDK and OpenVSwitch

At this OpenVSwitch link, http://docs.openvswitch.org/en/latest/faq/releases/ there are two very important tables you need to examine when choosing your DPDK and OpenVSwitch versions:
  • Kernel version - DPDK version compatibility
  • DPDK version - OpenVSwitch compatibility
There are also some specific feature tables as well, so that you don't use the wrong version for a specific feature you want.

So while yum makes it easy to install DPDK and OpenVSwitch, the versions are pinned to whatever is in the repository, the OVS is not compiled for DPDK, and neither version may line up with the kernel you happen to be running.

For example, I am still on a 3.x kernel on this system. Not a 4.x kernel. I wound up choosing:
  1. DPDK 17.11.10
  2. OVS 2.10.2
Note: I learned how important these versions are, because I have older Intel e1000e NICs on this little Dell Precision T1700 box. Apparently these NICs were all the rage early on in the development cycles for DPDK and OpenVSwitch. But after a while, these NICs are no longer tested, and in fact may no longer work as new drivers are introduced. So in my case, I was advised to go backwards on DPDK and OVS to ensure that I could find a driver that worked and was tested on the e1000e (more on that in a later post).

Uninstalling OpenVSwitch will break your OpenStack!!!
Before compiling a new from-scratch version of DPDK and OpenVSwitch, you need to remove the DPDK and OpenVSwitch that yum had installed from the default repositories (CentOS 7 in my case). When I did a "yum remove" on DPDK, that went smoothly enough, but when I ran "yum remove openvswitch", there are a plethora of packages that have dependencies on this, and yum removed those as well. All of my Neutron OpenVSwitch packages, for example, were removed. So I saved off the names of these packages, so that I could install them later after I custom compiled my DPDK and OpenVSwitch.
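
A minimal way to capture that list before removing anything (the file name is just my own choice), or you can simply copy the dependency list that yum prints before you confirm the removal:

# rpm -qa | grep -iE "openvswitch|dpdk" > /root/pkgs-before-remove.txt
# yum remove dpdk openvswitch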

Compiling DPDK
I read the documentation on how to download and compile DPDK and OpenVSwitch. How hard could it be, right? configure, make, make install. Bang bang bang. 

And, within a couple of hours, I had compiled DPDK and OpenVSwitch (for DPDK). I was able to bind a NIC using the vfio driver, initialize OpenVSwitch for DPDK, and add ports to a bridge.
 
NOTE: Later, I will learn that the drivers did not work properly and revert to igb_uio drivers.

Then I realized that there was no way to start up OpenVSwitch like a typical service. And, as mentioned, my OpenStack was not there anymore because I had run a yum remove on openvswitch.

And THAT is why you want to install packages instead of just compiling stuff.

When we install packages on Linux, we take for granted the rather esoteric and complicated process of compiling, linking, and copying the resultant files into a package that can be installed on any compatible system. 
 
To make an rpm, the rpmbuild tool is required (shipped in the rpm-build package). rpmbuild parses something called a "spec file" and uses that as the map for creating an rpm.

As I was googling around how to build a spec file for these two hand-compiled packages, I stumbled onto the fact that most "responsible" projects already include a spec file, so that you can just run the rpmbuild command against it.

One issue I had was that I had no experience running rpmbuild. I didn't know what options to use. I got some tutelage from a developer on the OpenVSwitch Users Group, so let me cover that here.

For DPDK:
rpmbuild -bb --with shared  pkg/dpdk.spec

For OpenVSwitch:
rpmbuild --with dpdk --without check --with autoenable -bb openvswitch-fedora.spec

I did not know that you could pass compile flags into rpmbuild. At first, I had hacked up the Makefiles until I learned this. 

Unfortunately, DPDK did not compile, initially. And neither did OpenVSwitch. The reason for both of these, is that the spec file was not being maintained properly, and had to be tweaked and patched. DPDK is big on documentation, it being a development kit, and it does all kinds of doc-related stuff, and some of that wasn't working. I just needed the drivers, not the documentation. I wasn't writing packet sniffers.

DPDK:
  • removed inkscape and doxygen dependencies
  • removed texlive-collection-latexextra dependency 
  • removed %package doc and %description doc
  • removed the line make O=%{target} doc 
  • removed %files doc and %doc %{docdir}/dpdk
OpenVSwitch
  • Changed the BuildRequires for dpdk
    • dpdk-devel changed to dpdk-stable-devel
  • Commented out some man pages that were causing issues
    • ovs-test.8*
    • ovs-vlan-test.8*
    • ovsdb.5*
    • ovsdb.7*
    • ovsdb-server.7*
Finally, after all this tweaking of the spec files, we got a couple of successful rpms, and could install those with: 
# yum localinstall <rpm file>
