Saturday, May 9, 2020

DPDK Hands-On - Part I - Getting Started


I decided to try and enable DPDK on my computer.

This computer is a Dell Precision T1700, circa 2015, which is a very nice little development workstation.

The VERY FIRST thing anyone needs to do, with DPDK, is ensure that their server has supported NICs. It all starts with the NIC cards. You cannot do DPDK without DPDK-compatible NICs.

There is a link at the DPDK website, which shows the list of NICs that are (or should be, as it always comes down to the level of testing, right?) compatible with DPDK.
That website is: DPDK Supported NICs

This T1700 has an onboard NIC, and two ancillary NIC cards that ARE listed as DPDK-compatible. These NICs show up as:
82571EB/82571GB Gigabit Ethernet Controller, and are part of the Intel e1000e family of NICs.
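To verify this yourself, list the Ethernet controllers and match them against that list - and if DPDK is already installed, its dpdk-devbind.py utility will show each NIC and the driver it is bound to:

# lspci | grep -i ethernet
# dpdk-devbind.py --status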

I was excited that I could use this server without having to invest in and install new NIC cards!

Let's start with the specs on the computer, beginning with the CPU.

CPU:
# lscpu
Architecture:         x86_64
CPU op-mode(s):       32-bit, 64-bit
Byte Order:           Little Endian
CPU(s):               4
On-line CPU(s) list:  0-3
Thread(s) per core:   1
Core(s) per socket:   4
Socket(s):            1
NUMA node(s):         1
Vendor ID:            GenuineIntel
CPU family:           6
Model:                60
Model name:           Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz
Stepping:             3
CPU MHz:              1183.471
CPU max MHz:          3900.0000
CPU min MHz:          800.0000
BogoMIPS:             6983.91
Virtualization:       VT-x
L1d cache:            32K
L1i cache:            32K
L2 cache:             256K
L3 cache:             6144K
NUMA node0 CPU(s):    0-3
Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d

Let's take a look at the NUMA capabilities on this box. The lscpu output above says we have one NUMA node. There is a utility on Linux called numactl, and we will run it with the "-H" option to get more information.

# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 16019 MB
node 0 free: 7554 MB
node distances:
node  0
 0: 10

From this, we see we have 1 NUMA node. NUMA nodes typically equate to CPU sockets, and since we have one CPU socket, we have one NUMA node. All 4 cores of the CPU are on this node (node 0, per the output above). Having just one NUMA node is not an optimal scenario for DPDK testing, but as long as we are NUMA-capable, we can proceed.

Next, we will look at Memory.


Memory:
# lsmem --summary
Memory block size:      128M
Total online memory:     16G
Total offline memory:     0B

16G of memory. That should be more than enough for this exercise.

So how to get started?

Obviously the right way would be to sit and read reams of documentation from both DPDK and OpenVSwitch. But what fun is that? Booooring. I am one of those people who like to start running and run head-first into the wall.

So I did some searching, and found a couple of engineers who had scripts that enabled DPDK. I decided to study these, pick them apart, and use them as a basis to get started. I saw a lot of things in these scripts that had me googling - IOMMU, HugePages, CPU masking, PCI, Poll Mode Drivers, etc.

In order to fully comprehend what was needed to enable DPDK, I would have to familiarize myself with these concepts. Then, hopefully, I could tweak this script, or even write new scripts, and get DPDK working on my box. That's the strategy.
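As a preview of where those concepts lead: most of them ultimately land on the kernel boot line. A typical set of GRUB settings for DPDK looks something like this (a sketch with illustrative values, not yet tuned for this box - note the pdpe1gb flag in the lscpu output above means 1G pages are supported here):

# In /etc/default/grub - enable the IOMMU and reserve HugePages:
GRUB_CMDLINE_LINUX="... intel_iommu=on iommu=pt default_hugepagesz=1G hugepagesz=1G hugepages=4"
# Then rebuild the grub config and reboot:
# grub2-mkconfig -o /boot/grub2/grub.cfg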

I did realize, as time went on, that the scripts were essentially referring back to the DPDK and OpenVSwitch websites, albeit at different points in time as the content on these sites changes release by release.

Saturday, April 25, 2020

Configuring Persistent Bans with Fail2Ban


Someone gave me a network to put a virtual machine on, and I thought that network was behind a NAT. It wasn't. I was extremely lucky the VM did not get hacked. I immediately shut down the public-facing interface, installed FirewallD, and locked ssh down to key-based authentication only.

That is NOT enough. In examining the logs, I could see this VM was getting pounded on all day, every day.

So I took the extra measure of installing Fail2Ban. Initially, I configured a 24-hour jail time. But after seeing the same IPs come after the VM time and time again, I decided to reconfigure for a permanent ban.

To configure a permanent ban, I used -1 for the ban time (which in the old days was specified in seconds, but it now also accepts the "365d", "52w", and "1y" formats).

Now from there, things get more interesting. Wanting to get this configured quickly, I took the measures explained in this blog post for configuring Persistent Bans on Fail2Ban.

Configuring Persistent Bans with Fail2Ban

First, let's discuss what the author assumes: that you are configuring your jail to use the iptables-multiport actions. Indeed, I have read (in another blog) that using the iptables-multiport actions might be a bit safer than using the firewallcmd-multiport rules, even though you might be running FirewallD!

So that is exactly what I did. My jail.local file has a default ban of 52w. My ssh-specific rules use a -1 ban time (permanent ban), and use the iptables-multiport action rules.
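For reference, a minimal jail.local sketch of that arrangement (stock sshd jail assumed):

[DEFAULT]
bantime = 52w
banaction = iptables-multiport

[sshd]
enabled = true
bantime = -1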

I backed up this iptables-multiport file, and added a line to "actionstart" that loops through all of the hosts (IP addresses) in the /etc/fail2ban/persistent.bans file and blocks them (refer to the blog link above for the specific rule). Then, in "actionban", a simple print statement echoes each permanent ban to a log file, so that we can see, incrementally, who is banned.
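For concreteness, that actionstart loop looks something like this (a sketch of the linked blog's approach; newer Fail2Ban versions name the chain f2b-<name>, older ones fail2ban-<name>):

cat /etc/fail2ban/persistent.bans | awk '/^fail2ban-<name>/ {print $2}' | while read IP; do
    iptables -I f2b-<name> 1 -s $IP -j <blocktype>
done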

Later on, I did check out the firewallcmd-multiport file, which essentially attempts the same things that iptables-multiport does, except with firewall-cmd statements instead.

To do that, I would do the same thing. I would back up the firewallcmd-multiport file, and make the following changes.

1. The action to ban an IP is: firewall-cmd --direct --add-rule <family> filter f2b-<name> 0 -s <ip> -j <blocktype>

So I would take this and add, in the actionstart section, a loop rule that looks like this (note that Fail2Ban only substitutes <ip> inside actionban, so the shell loop variable $IP carries the address here):

cat /etc/fail2ban/persistent.bans | awk '/^fail2ban-<name>/ {print $2}' | while read IP; do
    firewall-cmd --direct --add-rule <family> filter f2b-<name> 0 -s $IP -j <blocktype>
done

2. Then, I would add, in the actionban section, the same print statement that resides in the iptables-multiport.conf file, so that as new bans are added, they get logged:

echo "fail2ban-<name>  <ip>" >> /etc/fail2ban/persistent.bans

Of course, fail2ban needs to be restarted for these changes to kick in - and this needs to be verified before you walk away from the change!
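Something along these lines (assuming the stock sshd jail name):

# systemctl restart fail2ban
# fail2ban-client status sshd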

The only thing that has me wondering now is that, as the list of banned IPs grows, your rules will grow, and this could have performance impacts on packet processing. But protecting your box is imperative, and should be the first priority! If your list grows too long, you could, I suppose, periodically release some prisoners from jail, and see if they behave - or perhaps move on to better things.

Friday, April 10, 2020

VMWare Forged Transmits - and how it blocks Nested Virtualization


Nested Virtualization is probably never a good idea in general, but there are certain cases where you need it. We happened to be in one of those certain cases.

After creating a VM on VMware (CentOS7), we installed libvirtd.

The first issue we ran into was that nobody had checked a checkbox called "Expose Hardware Virtualization to Guest OS". As a result, we were able to install libvirtd and launch a nested VM, but when creating the VM with virt-install, it was generated to run in qemu mode (software emulation) rather than kvm mode (hardware-accelerated).
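An easy way to tell which mode you will get is to check for the virtualization CPU flags inside the guest before installing anything (a quick sketch; CentOS7 guest assumed):

# egrep -c '(vmx|svm)' /proc/cpuinfo
(a count of 0 means the extensions are not exposed, and virt-install will fall back to qemu mode)
# virt-host-validate
(among other things, this checks whether /dev/kvm exists)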

We also needed to change the libvirt default storage pool to point to a larger volume, so that it had enough space to hold a large, vendor-provided qcow2 image.
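Repointing the default pool can be done with virsh; a sketch, assuming the new volume is mounted at /data/images (the path is hypothetical):

# virsh pool-destroy default
# virsh pool-undefine default
# virsh pool-define-as default dir --target /data/images
# virsh pool-build default
# virsh pool-autostart default
# virsh pool-start default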

After running virt-install, we were able to get a virtual machine up and running, and get to the console (we had to toy with serial console settings in virt-install to get this to work).
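For the record, the serial console settings amounted to flags along these lines (a sketch - the VM name is hypothetical, and --extra-args only applies to kickstart/--location based installs):

virt-install --name nested-vm ... --graphics none \
    --console pty,target_type=serial \
    --extra-args 'console=ttyS0,115200n8'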

The adaptor in the nested VM was a host-bridge, and what we found was that we could - from the nested VM - ping the CentOS7 host VM (and vice-versa). But we couldn't ping anything beyond that. The CentOS7 VM that was hosting the nested VM had no problem pinging anything: it could ping the VM it was hosting, the default gateway on the subnet, other hosts on the subnet, and out to the internet.

So the packets - frames, rather - were not getting out to the VMware vSwitch. Or were they?

In doing some arp checks, we actually saw that the CentOS7 libvirt host had a fully populated arp table, while the nested tenant VM had only a partially populated one.

After pulling in some additional network expertise to work alongside us in troubleshooting, one fellow sent in a link to a blog article about a security policy feature on VMware vSwitches called Forged Transmits.

I will drop a link to that article, but also post the picture from it, because the diagram so simply and perfectly describes what is happening.

https://wahlnetwork.com/2013/04/29/how-the-vmware-forged-transmits-security-policy-works/


Not being a VMware administrator, I don't know how enabling this works - whether it is at the entire vSwitch level, at a port or port group level, etc.

But if you ever plan on running nested virtualization on a VMware Type 1 hypervisor, this setting will kill you. Your nested virtual machine's networking won't work unless you can find some clever way of tunneling or using a proxy.

Wednesday, April 1, 2020

Enabling Jumbo Frames on Tenant Virtual Machines - Should We?

I noticed that all of our OpenStack virtual machines had a 1500 MTU on their interfaces. This seemed wasteful to me, since I knew that everything upstream (a private MPLS network) was using jumbo frames.

I went looking for answers as to why the tenants were enabled with only a 1500 MTU, which led me to look into who is responsible for setting it:

  • OpenStack?
  • Neutron?
  • LibVirt?
  • Contrail?
  • something else?
As it turns out, Contrail - which kicks Neutron out of the way and manages the networking with its L3 VPN solution (MPLS over GRE/UDP) - works in tandem with Neutron via a bi-directional plugin (so you can administer your networks and ports from Horizon, or through the Contrail GUI).

But, as I learned from a web discussion thread, Contrail takes no responsibility for setting the MTU of the virtual machine interfaces. It pleads the 5th.

The thread mentions that the MTU can be set in the Contrail DHCP server. I am not sure if that would work with pre-defined ports, though (do those still use a DHCP MAC-reservation approach to getting an assigned IP address?). Do other DHCP servers assign MTUs? DHCP can do a lot of things (it cannot cook you a good breakfast, unfortunately). I didn't realize DHCP servers could set MTUs, too, until I read that.
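The mechanism, for what it's worth, is standard: DHCP option 26 ("Interface MTU"). In dnsmasq, for example, handing tenants a 9000 MTU would be a one-line option (illustrative):

# in dnsmasq.conf - option 26 is Interface MTU:
dhcp-option=26,9000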

Now - the big question. If we can set the MTU on virtual machines, should we? Just because you can, doesn't necessarily mean you should, right?

I set about looking into that. And I ran into some really interesting discussions (and slide decks) on this very topic, and some outright debates on it.

This link below, was pretty informative, I thought.

Discussion: What advantage does enabling Jumbo Frames provide?

Make sure you expand the discussion out with "Read More Comments" - that is where the good stuff lies!

He brings up several considerations:
  • Everything in front of you, including WAN Accelerators and Optimizers, would need to support the larger MTUs.
  • Your target VM on the other side of the world, would need to support the larger MTU.
    Unless you use Path MTU Discovery - and I read a lot of bad things about PMTUD.
  • Your MTU setting in a VM would need to account for any encapsulation applied to the frames - and Contrail, being an L3 VPN, does indeed encapsulate the packets (see the rough arithmetic after this list).
  • On any OpenStack compute host running Contrail, the Contrail vRouter already places the payload into 9000-MTU frames to send over the transport network - maybe making jumbo frames at the VM level unnecessary?
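To make that encapsulation point concrete, here is some rough overhead arithmetic (numbers are illustrative; the exact overhead depends on which encapsulation Contrail negotiates - MPLS over GRE vs. MPLS over UDP):

  outer IPv4 header:  20 bytes
  UDP header:          8 bytes
  MPLS label:          4 bytes
  total:              32 bytes

So with a 9000-byte transport MTU, the largest tenant MTU that avoids fragmentation would be roughly 9000 - 32 = 8968. A 1500-MTU tenant frame, by contrast, fits inside the 9000-byte transport frames with room to spare.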
Interesting stuff.


Monday, March 30, 2020

How to run OpenStack on a single server - using veth pair



I decided I wanted to implement OpenStack using OpenVSwitch. On one server.

The way I decided to do this, was to spin up a KVM virtual machine (VM) as an OpenStack controller, and have it communicate to the bare metal CentOS7 Linux host (that runs the KVM hypervisor libvirt/qemu).

I did not realize how difficult this would be until I learned that OpenVSwitch cannot leverage Linux bridges (bridges on the host).

OpenVSwitch allows you to create, delete, and otherwise manipulate bridges - but ONLY bridges that are under the control of OpenVSwitch. So if you happen to have a bridge on the Linux host (we will call it br0), you cannot snap that bridge into OpenVSwitch.

What you would normally do, is to create a new bridge on OpenVSwitch (i.e. br-ex), and migrate your connections from br0, to br-ex.

That's all well and good - and straightforward, most of the time. But, if you want to run a virtual machine (i.e. an OpenStack Controller VM), and have that virtual machine communicate to OpenStack Compute processes running on the bare metal host, abandoning the host bridges becomes a problem.

Virt-Manager does NOT know anything about OpenVSwitch, nor about the bridges OpenVSwitch controls. So when you create your VM, if everything is under an OpenVSwitch bridge (i.e. br-ex), Virt-Manager will only offer you a series of macvtap interfaces (macvtap - and, for that matter, macvlan - are topics in and of themselves that we won't get into here).

So. I did not want to try and use macvtap interfaces - and jump through hoops to get them to communicate with the underlying host (yes, there are presumably some tricks with macvlan that can do this, but the rabbit hole was getting deeper).

As it turns out, "you can have your cake, and eat it too". You can create a Linux bridge (br0), and plumb that into OpenVSwitch with a veth pair. A veth pair is used just for this very purpose. It is essentially a virtual patch cable between two bridges, since you cannot join bridges (joining bridges is called cascading bridges, and this is not allowed in Linux Networking).

So here is what we wound up doing.
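A minimal sketch of that veth plumbing (assuming br0 is the existing Linux bridge and br-ex is the OpenVSwitch bridge):

# ip link add veth-br0 type veth peer name veth-ovs
# ip link set veth-br0 master br0
# ovs-vsctl add-port br-ex veth-ovs
# ip link set veth-br0 up
# ip link set veth-ovs up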


Monday, March 9, 2020

CPU Isolation - and how dangerous it can be

I noticed that an implementation of OpenStack had CPU pinning configured. I wasn't sure why, so I asked, and I was told that it allowed an application (comprised of several VMs on several compute hosts in an availability zone) to achieve near bare-metal performance.

I didn't think too much about it.

THEN - when I finally DID start to look into it - I realized that the feature was not turned on.

CPU pinning, as they were configuring it, was comprised of 3 parameters:
  1. isolcpus - a Linux kernel setting, passed to the kernel on the grub command line.
  2. vcpu_pin_set - defined in nova.conf, an OpenStack configuration file.
  3. reserved_host_cpus - defined in nova.conf, an OpenStack configuration file.
These settings have tremendous impact. For instance, they can impact how many CPUs OpenStack sees on the host. 

isolcpus takes a comma-delimited list of CPUs. vcpu_pin_set is also a list of CPUs, and what it does is allow OpenStack Nova to place VMs (qemu processes), via libvirt APIs, on all or a subset of the full bank of isolated CPUs.

So, for example, you might isolate 44 CPUs on a 48-CPU system (24 cores x 2 threads per core). Then you might specify 24 of those 44 to be pinned by Nova/libvirt - and perhaps the remaining 20 are used for non-OpenStack userland processes (i.e. OpenContrail vRouter processes that broker packets in and out of the virtual machines and compute hosts).
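On paper, that layout would look something like this (the CPU numbers here are illustrative, not our actual assignment):

# /etc/default/grub - isolate CPUs 4-47 (44 of the 48) from the kernel scheduler:
GRUB_CMDLINE_LINUX="... isolcpus=4-47"

# /etc/nova/nova.conf - let Nova/libvirt pin instances onto 24 of those 44:
vcpu_pin_set = 4-27
reserved_host_cpus = 4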

So. In a lab environment, with isolcpus isolating 44 CPUs, and those same 44 CPUs listed in the vcpu_pin_set array, a customer emailed and complained about sluggish performance. I logged in, started up htop, added the PROCESSOR column, and noticed that everything was running on a single CPU core.

Ironically enough, I had just read this interesting article that helped me realize very quickly what was happening.


Obviously, running every single userland process on a single processor core is a killer.

So why was everything running on one core?

It turned out that, when launching the images, there is a property that needs to be attached to the flavors, called hw:cpu_policy=dedicated.

When specified on the flavor, this property causes Nova to pass the information along to libvirt, which then knows to pin the virtual machine to specific isolated CPUs.
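Attaching the property to a flavor is a one-liner with the OpenStack CLI (flavor name hypothetical):

# openstack flavor set m1.pinned --property hw:cpu_policy=dedicated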

When NOT specified, it appears that libvirt just shoves the task onto the first available CPU on the system - CPU 0. And CPU 0 was indeed an isolated CPU, because the CPUs left out of the isolcpus and vcpu_pin_set arrays were 2, 4, 26 and 28.

So the qemu virtual machine process wound up on an isolated CPU (as it should have). But since the kernel does no load balancing across isolated CPUs, the unpinned tasks all fell onto CPU 0.

Apparently, the flavor property hw:cpu_policy=dedicated is CRITICAL in telling libvirt to map an instance to a vCPU in the array.

Changing the flavor properties was not an option in this case, so what wound up happening was that we removed the vcpu_pin_set array from /etc/nova/nova.conf, and removed the isolcpus setting from the grub boot loader. This fixed the issue of images with no property landing on a single CPU. We also noticed that if a flavor STILL used the flavor property hw:cpu_policy=dedicated, a CPU assignment would still get generated into the libvirt XML file - and the OS would place (and manage) the task on that CPU.

Thursday, March 5, 2020

Mounting a Linux Volume over a File System - the bind mount trick

Logged into a VM today trying to help troubleshoot issues. There was nothing in /var/log! No Syslog!

It turns out that a classic phenomenon had occurred: Linux will indeed let you mount on top of pretty much any directory, because, as far as Linux is concerned, any directory can serve as a mount point.

But what happens to the files in the original directory? I used to think they were lost. They're not. They're still there, just shielded - and they can be recovered with a neat trick called a bind mount!

All described here! Learn something new every day.

A snippet of dialog from the link below:
https://unix.stackexchange.com/questions/198542/what-happens-when-you-mount-over-an-existing-folder-with-contents

Q. Right now /tmp has some temporary files in it. When I mount my hard drive (/dev/sdc1) on top of /tmp, I can see the files on the hard drive. What happens to the actual content of /tmp when my hard drive is mounted? 

 A. Pretty much nothing. They're just hidden from view, not reachable via normal filesystem traversal.

Q. Is it possible to perform r/w operations on the actual content of /tmp while the hard drive is mounted?

A. Yes. Processes that had open file handles inside your "original" /tmp will continue to be able to use them. You can also make them "reappear" somewhere else by bind-mounting / elsewhere.

# mount -o bind / /somewhere/else
# ls /somewhere/else/tmp  

 
