Monday, May 11, 2020

DPDK Hands-On - Part IV - Setting Up the DPDK Package


This post is where the fun really started.

I failed to mention, probably, in any of my earlier posts, that I was using CentOS7 as my operating system. Why? Because I am more familiar with Red Hat Linux than I am with Ubuntu. Going forward, it will be difficult to ride multiple horses with Linux, because RHEL and Ubuntu (especially in 18.04) are deviating quite a bit when it comes to system administration and tooling.

So - when you go to the dpdk.org website, all of the instructions assume that you will download and compile the packages yourself. This sounded scary to me. I mean, not really, because I have compiled packages in the past, but I know that when you compile packages from scratch, you typically don't get conveniences like systemd unit files that set up the service for easy start/stop/restart. We are in 2020 now. Why can't I just use yum and install the damned package?

As it turns out, on CentOS7, there are yum repositories for DPDK. And, when you run "yum install dpdk", you get a package labeled with 18.11.2 as the release. After installing this package, I quickly figured out that it contains no scripts, tools or utilities. So, having some general experience with CentOS 7 and its packaging conventions, I looked for the complementary "devel" and "tools" packages, and indeed located those and installed them with "yum install dpdk-tools" and "yum install dpdk-devel".
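
In plain terms, the full sequence on CentOS 7 was simply this (assuming the repos carrying these packages are enabled):

# yum install dpdk
# yum install dpdk-tools dpdk-devel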

NOTE: I only really needed the dpdk-tools, as I was not doing any dataplane kit development at this point.

But, things were going smoothly so far. Knowing that VFIO is the new driver, I decided to go ahead and load the kernel module for VFIO.

Now, as I was looking over a couple of guys' scripts for doing this (both of them were using DPDK documentation as a basis), I saw that one fellow was installing two packages. Since we were dealing with PCI (more on this later) and kernel modules, it made sense to install these:
  1. pciutils
  2. kernel-devel
After all, these are just tools and utilities, and there should be no risk in installing them on your system.
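
On CentOS 7, that is just:

# yum install pciutils kernel-devel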

Next, in his script, I saw him changing permissions on a file in /dev - and this really had me concerned. Why would someone be doing this? Is this a bug that he is fixing? 

Turns out, these are instructions on the dpdk.org website, which I found later after looking at his script.


#chmod a+x /dev/vfio
#chmod 0666 /dev/vfio/*

So basically, what this guy was doing was pretty much following the standard "cookbook" for binding NICs to the VFIO driver. VFIO, as mentioned in Part II, is the driver family that DPDK's poll mode drivers rely on; the vfio-pci kernel module itself ships with the kernel, while the poll mode drivers come with the DPDK install.

The link for binding NICs can be found at: https://dpdk-guide.gitlab.io/dpdk-guide/setup/binding.html

This documentation gives you two methods for binding your NICs.
  1. driverctl - which was a utility I had not heard of and had to install (e.g. yum install driverctl)
  2. a DPDK utility script, written in Python, called dpdk-devbind.py
The script I was following as a reference was using #1, driverctl, and I figured I would use that instead of the utility script.
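
For reference, a driverctl bind looks roughly like this - a sketch, assuming your NIC sits at PCI address 0000:01:00.0 (substitute your own address, as shown by lspci):

# driverctl set-override 0000:01:00.0 vfio-pci
# driverctl list-overrides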

NOTE: Later, I decided to start using the utility script and found it to be wonderful.
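
The utility script equivalent is something like this (again, the PCI address is a placeholder; depending on the package, the script may be installed on your PATH or live under /usr/share/dpdk/usertools):

# dpdk-devbind.py --status
# dpdk-devbind.py --bind=vfio-pci 0000:01:00.0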

But first - before you bind your NIC, you need to load a kernel module.

I am familiar with kernel modules, so to load the module, you can use modprobe (or insmod, but modprobe is preferred because it resolves module dependencies automatically). Again, this is via the DPDK link above on binding NICs.
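
In this case, that is simply:

# modprobe vfio-pci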

If all goes well loading the vfio_pci kernel module, it will in fact pull in two more kernel modules, and what I wound up with when I ran "lsmod | grep vfio" was this:

# lsmod | grep vfio
vfio_pci               41412  2
vfio_iommu_type1       22440  1
vfio                   32657  6 vfio_iommu_type1,vfio_pci
irqbypass              13503  8 kvm,vfio_pci

So - there is that irqbypass module, which lets interrupts be routed around the usual processing (shared with KVM, per the lsmod output). And, there is an IOMMU module that gets loaded, probably the result of having IOMMU enabled on the kernel command line!

It looks like the VFIO modules are loaded nicely. Now, it is time to bind the NICs to the drivers.

Let's talk about that in Part V.

DPDK Hands-On - Part III - Huge Pages

The last post discussed IOMMU, which was something I had never heard of. So I had to research it.

Next, in the DPDK Getting Started Guide, http://doc.dpdk.org/spp/setup/getting_started.html, the following is stated about the requirement for HugePages.

Hugepages must be enabled for running DPDK with high performance. Hugepage support is required to reserve large amount size of pages, 2MB or 1GB per page, to less TLB (Translation Lookaside Buffers) and to reduce cache miss. Less TLB means that it reduce the time for translating virtual address to physical.

This SOUNDS like a requirement, but honestly, I am not sure if this is a requirement, or just an optimization. Will DPDK not function at all without HugePages? There is another guide that also discusses Huge Pages.  https://dpdk-guide.gitlab.io/dpdk-guide/setup/hugepages.html .

Why chance it? Let's go ahead and set up HugePages. But - how do you know if your system even supports HugePages? As it turns out, if the kernel has support for HugePages, you can run the following grep (the same one whose output appears later in this post), and it will show you not only whether HugePages are supported, but also statistics regarding the size and use of your HugePages:
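
# grep -i huge /proc/meminfo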

So - we have 16G of memory in this T1700, 4G of which are allocated to HugePages. Hugepages in Linux can be set by passing the parameters into the kernel, as shown below:

  1. default_hugepagesz=1G
  2. hugepagesz=1G
  3. hugepages=4
  4. transparent_hugepage=never 
NOTE: I am not fully abreast of transparent hugepages, but I see that a lot of people recommend turning this off, which is why I have it disabled above.
 
In the file /etc/default/grub, these can be added, and then the following command run to make it stick:
#grub2-mkconfig -o /boot/grub2/grub.cfg

NOTE: This assumes Grub is being used as your bootloader.
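
For example, the options line in /etc/default/grub might end up looking like this (a sketch - keep whatever options you already have; it mirrors the cmdline output shown below):

GRUB_CMDLINE_LINUX="rhgb quiet iommu=pt intel_iommu=on default_hugepagesz=1G hugepagesz=1G hugepages=4 transparent_hugepage=never"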
 
And, just as with IOMMU kernel command line parameters, after a reboot you can verify that these parameters are set by running:

# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-1127.el7.x86_64 root=UUID=4102ab69-f71a-4dd0-a14e-8695aa230a0d ro rhgb quiet iommu=pt intel_iommu=on default_hugepagesz=1G hugepagesz=1G hugepages=4 transparent_hugepage=never LANG=en_US.UTF-8

After booting the system up with Hugepages, you can check the summary of Hugepages with the following command:

# grep -i HugePages_ /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:       4
HugePages_Free:        3
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB

When we run this command, we can see that out of those 4, 3 are free. So why is one of them in use???

The answer to why 1G (one Hugepage) is in use has to do with the initialization of OpenVSwitch, which has the following directives:

ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024"
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-limit="2048"

This tells OpenVSwitch to grab 1024 MB (one 1G Hugepage, on our single NUMA socket) at startup, with a limit of 2048 MB (two Hugepages).

NOTE: Calculating the number of Hugepages you need in OVS, is a topic in and of itself - the subject of a separate post.

Another nifty trick is the ability to see how your Hugepages are distributed across your NUMA nodes!

# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 15954 MB
node 0 free: 1808 MB
node distances:
node   0
  0:  10

# numastat -cm | egrep 'Node|Huge'
                Node 0 Total
AnonHugePages        0     0
HugePages_Total   8192  8192
HugePages_Free    7168  7168
HugePages_Surp       0     0

(numastat -cm reports its values in MB.)

In this example above, we only have a single NUMA node with 4 cores on it. But had this been a 2 x NUMA Node system with 3 cores apiece, you would be able to see how Hugepages are allocated across your NUMA nodes.


DPDK Hands-On - Part II - Poll Mode Drivers and IOMMU


In our last post, Hands On with DPDK - Part I, we chose a box to try and install DPDK on.

This box was a circa-2015 Dell T-1700. A bit long in the tooth (it is now 2020), and it is not, and never was, a data center grade server.

And, looking forward, this will bite us. But it will help us learn a LOT about DPDK, thanks to the persistence and troubleshooting required.

So - to get started, I did something rather unconventional. Rather than read all of the documentation (there is a LOT of documentation), I took a cursory look at the dpdk.org site (Getting Started), and then went looking for a couple of blogs where someone else tried to get DPDK working with OVS.

Poll Mode Drivers

Using DPDK requires a special type of network interface card driver known as a poll mode driver. This means that the driver has to be available (custom compiled and installed with rpm, or pre-compiled and installed with a package manager like yum).

Poll Mode drivers continuously poll for packets, as opposed to using the classic interrupt-driven approach that the standard vendor drivers use. Using interrupts to process packets is considered less efficient than polling for packets. But - to poll for packets continuously is cpu intensive, so there is a trade-off! 

There are two families of drivers for poll mode operation listed on the dpdk.org website:
https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html

  1. UIO (legacy)
    1. uio_pci_generic
    2. igb_uio
  2. VFIO (current recommended driver)

The DPDK website has this to say about the two driver families (UIO and VFIO). 

"VFIO is the new or next-gen poll mode driver, that is a more robust and secure driver in comparison to the UIO driver, relying on IOMMU protection". 

So perhaps it makes sense to discuss IOMMU, as it will need to be disabled for UIO drivers, and enabled for VFIO drivers.

IOMMU

Covering IOMMU would be a blog series in its own right. So I will simply list the Wikipedia site on IOMMU.  Wikipedia IOMMU Link

What does IOMMU have to do with DPDK? DPDK has this to say in their up-front prerequisites:

"An input-output memory management unit (IOMMU) is required for safely driving DMA-capable hardware from userspace and because of that it is a prerequisite for using VFIO. Not all systems have one though, so you’ll need to check that the hardware supports it and that it is enabled in the BIOS settings (VT-d or Virtualization Technology for Directed I/O on Intel systems)"
 

So there you have it. It took getting down to the poll mode drivers, but IOMMU provides memory security...but for the newer-generation VFIO drivers. Without this security, one rogue NIC could affect the memory for all NICs, or jeopardize the memory of the system in general.

So - how do you enable IOMMU?

Well, first you need to make sure your system even supports IOMMU.

To do this, you can do one of two things (suggested: do both) - Linux system assumed here.
  1. Check and make sure the directory /sys/class/iommu exists (and is populated)
  2. type (as root) dmesg | grep IOMMU
On #2, you should see something like this
[    0.000000] DMAR: IOMMU enabled
[    0.049734] DMAR-IR: IOAPIC id 8 under DRHD base  0xfbffc000 IOMMU 0
[    0.049735] DMAR-IR: IOAPIC id 9 under DRHD base  0xfbffc000 IOMMU 0
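
And for check #1, listing that directory on a working Intel box should show at least one IOMMU unit (entry names will vary):

# ls /sys/class/iommu
dmar0
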
Now in addition to this, you will need to edit your kernel command line so that two IOMMU directives can be passed in:  iommu=pt intel_iommu=on
 
The typical way these directives are added is using the grub2 utility.

NOTE: Many people forget that once they add the parameters, they need to do a mkconfig to actually apply these parameters - as shown below!!!
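
On CentOS 7 with Grub2, the whole sequence is something like this (BIOS boot assumed; on UEFI systems, grub.cfg lives under /boot/efi/EFI/centos/):

# vi /etc/default/grub     (append iommu=pt intel_iommu=on to GRUB_CMDLINE_LINUX)
# grub2-mkconfig -o /boot/grub2/grub.cfg
# reboot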

After adding these kernel parameters, you can check your kernel command line by running the following command:

# cat /proc/cmdline

And you should see your iommu parameters showing up:

BOOT_IMAGE=/vmlinuz-3.10.0-1127.el7.x86_64 root=UUID=4102ab69-f71a-4dd0-a14e-8695aa230a0d ro rhgb quiet iommu=pt intel_iommu=on

Next Step: Part III - Huge Pages

Saturday, May 9, 2020

DPDK Hands-On - Part I - Getting Started


I decided to try and enable DPDK on my computer.

This computer is a Dell T1700 Precision, circa 2015, which is a very very nice little development workstation server.

The VERY FIRST thing anyone needs to do, with DPDK, is ensure that their server has supported NICs. It all starts with the NIC cards. You cannot do DPDK without DPDK-compatible NICs.

There is a link at the DPDK website, which shows the list of NICs that are (or should be, as it always comes down to the level of testing, right?) compatible with DPDK.
That website is: DPDK Supported NICs

This T-1700 has an onboard NIC, and two ancillary NIC cards that ARE listed as DPDK-compatible NICs. These NICs are listed as 82571EB/82571GB Gigabit Ethernet Controller, and are part of the Intel e1000e family of NICs.

I was excited that I could use this server without having to invest in and install new NIC cards!

Let's first start, with specs on the computer. First, our CPU specifications.

CPU:
# lscpu
Architecture:         x86_64
CPU op-mode(s):       32-bit, 64-bit
Byte Order:           Little Endian
CPU(s):               4
On-line CPU(s) list:  0-3
Thread(s) per core:   1
Core(s) per socket:   4
Socket(s):            1
NUMA node(s):         1
Vendor ID:            GenuineIntel
CPU family:           6
Model:                60
Model name:           Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz
Stepping:             3
CPU MHz:              1183.471
CPU max MHz:          3900.0000
CPU min MHz:          800.0000
BogoMIPS:             6983.91
Virtualization:       VT-x
L1d cache:            32K
L1i cache:            32K
L2 cache:             256K
L3 cache:             6144K
NUMA node0 CPU(s):    0-3
Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d

Let's take a look at our NUMA capabilities on this box. It says up above that we have one NUMA node. There is a utility called numactl on Linux, and we will run it with the "-H" option to get more information.

# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 16019 MB
node 0 free: 7554 MB
node distances:
node  0
 0: 10

From this, we see we have 1 NUMA node. NUMA nodes equate to CPU sockets, and since we have one CPU socket, we have one NUMA node. All 4 cores of the CPU are on this node (node 0 per above). Having just one NUMA node is not an optimal scenario for DPDK testing, but as long as we are NUMA-capable, we can proceed.

Next, we will look at Memory.


Memory:
# lsmem --summary
Memory block size:      128M
Total online memory:     16G
Total offline memory:     0B

16G memory. Should be more than enough for this exercise.

So how to get started?

Obviously the right way, would be to sit and read reams of documentation from both DPDK and OpenVSwitch. But, what fun is that? Booooring. I am one of those people who like to start running and run my head into the wall.

So, I did some searching, and found a couple of engineers who had scripts that enabled DPDK. I decided to study these, pick them apart, and use them as a basis to get started. I saw a lot of stuff in these scripts that had me googling - IOMMU, HugePages, CPU masking, PCI, Poll Mode Drivers, etc.

In order to fully comprehend what was needed to enable DPDK, I would have to familiarize myself with these concepts. Then, hopefully, I could tweak this script, or even write new scripts, and get DPDK working on my box. That's the strategy.

I did realize, as time went on, that the scripts were essentially referring back to the DPDK and OpenVSwitch websites, albeit at different points in time as the content on these sites changes release by release.

Saturday, April 25, 2020

Configuring Persistent Bans with Fail2Ban


Someone gave me a network to put a Virtual Machine on, and I thought that network was behind a NAT. It wasn't. I was extremely lucky the VM did not get hacked. I immediately shut down the public-facing interface, installed FirewallD, and locked ssh down to key authentication only.

That is NOT enough. In examining logs, this VM was getting pounded on all day, every day.

So, I took an extra measure of installing Fail2Ban. Initially, I configured a 24 hour jail time. But after seeing the same IPs come after the VM time and time again, I decided to reconfigure for a permanent ban.

To configure a permanent ban, I used -1 for the ban time (which in the old days was specified in seconds, but they now accept the "365d", "52w", "1y" formats).

Now from there, things get more interesting. Wanting to get this configured quickly, I took the measures explained in this blog post for configuring Persistent Bans on Fail2Ban.

Configuring Persistent Bans with Fail2Ban

First, let's discuss what he assumes. He assumes that you are configuring your jail to use iptables-multiport actions. Indeed, I have read (in another blog) that using the iptables-multiport actions might be a bit safer than using firewalld-multiport rules, even though you might be running FirewallD!

So that is exactly what I did. My jail.local file has a default ban of 52w. My ssh-specific rules use a -1 value on ban time (permanent ban), and use the iptables-multiport action rules.
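
As a sketch, the relevant pieces of that jail.local look something like this (the [sshd] jail name is the stock one; adjust for your setup):

[DEFAULT]
bantime   = 52w
banaction = iptables-multiport

[sshd]
enabled = true
bantime = -1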

I backed up this iptables-multiport file, and added a line in the actionstart section to loop through all of the hosts (IP addresses) in the /etc/fail2ban/persistent.bans file and block them (refer to the blog link above for the specific rule). Then, in the actionban section, a simple print statement echoes each permanent ban to a log file, so that we can see, incrementally, who is banned.

Now later, I did check out the firewallcmd-multiport file, which would essentially attempt the same things that iptables-multiport does, except with firewall-cmd statements instead.

To do that, I would do the same thing. I would back up the firewallcmd-multiport file, and make the following changes.

1. The action to ban an IP is: firewall-cmd --direct --add-rule <family> filter f2b-<name> 0 -s <ip> -j <blocktype>

So I would take this, and add in the actionstart section, a loop rule that looks like this:
cat /etc/fail2ban/persistent.bans | awk '/^fail2ban-<name>/ {print $2}' | while read IP; do \
firewall-cmd --direct --add-rule <family> filter f2b-<name> 0 -s $IP -j <blocktype>; done

2. Then, I would add in the actionban section the same print statement that resides in the iptables-multiport.conf file, so that as new bans are added, they get logged:

echo "fail2ban-<name>  <ip>" >> /etc/fail2ban/persistent.bans

Of course, a restart of fail2ban needs to be made for these to kick in, and this needs to be verified before you walk away after the change!

The only thing that has me wondering now is that as the list of banned IPs grows, your rules will grow, and this could have performance impacts on packet processing. But protecting your box is imperative, and should be the first priority! You could, if your list grows too long, periodically release some prisoners from jail, I suppose, and see if they behave, or perhaps move on to better things.

Friday, April 10, 2020

VMWare Forged Transmits - and how it blocks Nested Virtualization


Nested Virtualization is probably never a good idea in general, but there are certain cases where you need it. We happened to be in one of those certain cases.

After creating a VM on VMWare (CentOS7), we installed libVirtD.

The first issue we ran into, was that nobody had checked a checkbox called "Expose Hardware Virtualization to GuestOS". As a result, we were able to install libVirtD and launch a nested VM, but when creating the VM with virt-install, it was generated to run in qemu-mode rather than kvm-mode.

We also needed to change the LibVirtD default storage pool to point to a volume, so that it had enough space to run a large qcow2 vendor-provided image.

After running virt-install, we were able to get a virtual machine up and running, and get to the console (we had to toy with serial console settings in virt-install to get this to work).

The adaptor in the nested VM was a host-bridge, and what we found was that we could - from the nested VM - ping the CentOS7 host VM (and vice-versa). But we couldn't ping anything further than that. The LibVirtD VM that was hosting the nested VM had no problem pinging anything; it could ping the VM it was hosting, it could ping the default gateway on the subnet, ping other hosts on the subnet, and it could ping out to the internet.

So, the packets - er, FRAMES - were not getting out to the VMWare vSwitch. Or were they?

In doing some arp checks, we actually saw that the CentOS7 LibVirtD host had a populated arp table. But the tenant nested VM, only had a partially full arp table.

After pulling in some additional network expertise to work alongside us in troubleshooting, this one fellow sent in a link to a blog article about a security policy feature on VMWare vSwitches called Forged Transmits.

I will drop a link to that article, but also post the picture from that article, because the diagram so simply and perfectly describes what is happening.

https://wahlnetwork.com/2013/04/29/how-the-vmware-forged-transmits-security-policy-works/


Not being a VMWare Administrator, I don't know how enabling this works; if it is at the entire vSwitch level, or if it is at a port or port group level, etc.

But - if you ever plan on running nested virtualization on a VMWare Type 1 Hypervisor, this setting will kill you. Your networking won't work for your nested virtual machine, unless you can find some clever way of tunneling or using a proxy.

Wednesday, April 1, 2020

Enabling Jumbo Frames on Tenant Virtual Machines - Should We?

I noticed that all of our OpenStack virtual machines had a 1500 MTU on their interfaces. This seemed wasteful to me, since I knew that everything upstream (private MPLS network) was using jumbo frames.

I went looking for answers as to why the tenants were enabled with only 1500 MTU. Which led to me looking into who was responsible for setting the MTU.

  • OpenStack?
  • Neutron?
  • LibVirt?
  • Contrail?
  • something else?
As it turns out, Contrail, which kicks Neutron out of the way and manages the networking with its L3 VPN solution (MPLS over GRE/UDP), works in tandem with Neutron via a bi-directional plugin (so you can administer your networks and ports from Horizon, or through the Contrail GUI).

But, as I have learned from a web discussion thread, Contrail takes no responsibility for setting the MTU of the virtual machine interfaces. It pleads the 5th.

The thread mentions that the MTU can be set in the Contrail DHCP server. I am not sure if that would work if you used pre-defined ports, though (do those still use a DHCP MAC reservation approach to getting an assigned IP address?). Do other DHCP servers assign MTUs? DHCP can do a lot of stuff (it cannot cook you a good breakfast, unfortunately). I didn't realize DHCP servers could set MTUs, too, until I read that.
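
For what it's worth, the standard knob here is DHCP option 26, "Interface MTU". As a sketch with a generic dnsmasq-style DHCP server (not necessarily what Contrail runs underneath), pushing a 9000 MTU to clients would look like:

dhcp-option=option:mtu,9000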

Now - the big question. If we can set the MTU on virtual machines, should we? Just because you can, doesn't necessarily mean you should, right?

I set about looking into that. And I ran into some really interesting discussions (and slide decks) on this very topic, and some outright debates on it.

This link below, was pretty informative, I thought.

Discussion: What advantage does enabling Jumbo Frames provide?

Make sure you expand the discussion out with "Read More Comments" - that is where the good stuff lies!

He brings up considerations:
  • Everything in front of you, including WAN Accelerators and Optimizers, would need to support the larger MTUs.
  • Your target VM on the other side of the world would need to support the larger MTU - unless you use Path MTU Discovery, and I read a lot of bad things about PMTUD.
  • Your MTU setting in a VM would need to consider any encapsulation that would be done to the frames - and Contrail, being an L3 VPN, does indeed encapsulate the packets.
  • On any OpenStack compute host running Contrail, the Contrail vRouter already places the payload into 9000 MTU frames to send over the transport network - maybe making it unnecessary to use jumbo frames at the VM level?
Interesting stuff.

