Showing posts with label Libvirt. Show all posts

Wednesday, February 3, 2021

DPDK Hands-On Part X - Launching a DPDK virtual machine with Virsh Libvirt

In DPDK Hands-On Part IX, we launched a virtual machine with a bash script, calling qemu with sufficient command line options that it would launch a virtual machine with DPDK ports.

To launch a virtual machine that has DPDK ports using virsh (LibVirt), the first thing you need to know is that you cannot use the virt-manager GUI. Why? The GUI does not understand vhostuser (DPDK) ports - at least not in the version I happen to be running. Aside from this, though, the process remains the same: your virtual machine needs to be defined in xml, which virsh parses when the VM is launched. To launch a DPDK virtual machine, that xml needs to be crafted properly, and this post will discuss its important sections.

I made several attempts at finding an xml file that would work properly. It took quite a while. There are a few examples on OpenVSwitch websites, as well as Intel-sponsored DPDK websites.  In the end, I found a link from an R&D engineer named Tomek Osinski that helped me more than any other, and I will share that here:

Configuring OVS-DPDK with VM for performance testing 

That link contains a full-fledged example xml file.

Let me pick through that file and comment on the important sections.

Memory

<currentMemory unit='KiB'>8399608</currentMemory>
<memoryBacking>
  <hugepages>
    <page size='1' unit='G' nodeset='0'/>
  </hugepages>
</memoryBacking>

This section specifies that the VM will use just over 8G of RAM, and tells virsh (libvirt/KVM) to back the VM's memory with HugePages.

If you try to launch this VM and the VM host (the server running KVM) does not have HugePages enabled, or does not have enough HugePages available, the launch of the VM will fail.
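A quick way to check this on the KVM host before attempting to start the VM (the 1G sysfs path is only one example; the directory name varies by page size):

```shell
# Check that hugepages are enabled and that enough are free to back
# the VM's memory:
grep -i '^hugepages' /proc/meminfo

# If the host supports 1G pages, the per-size counter lives here:
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages 2>/dev/null
```

If HugePages_Free multiplied by Hugepagesize is smaller than the VM's memory, the launch will fail.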

CPU

<vcpu placement='static'>8</vcpu> 
<cpu mode='host-model'>
  <model fallback='allow'/>
  <topology sockets='2' cores='4' threads='1'/>
  <numa>
    <cell id='0' cpus='0-1' memory='4194304' unit='KiB' memAccess='shared'/>
  </numa>
</cpu> 
<cputune>
  <shares>4096</shares>
  <vcpupin vcpu='0' cpuset='14'/>
  <vcpupin vcpu='1' cpuset='15'/>
  <emulatorpin cpuset='11,13'/>
</cputune>

In this section, 8 virtual CPUs (2 sockets x 4 cores each) are specified for the virtual machine. A NUMA cell is defined containing virtual CPUs 0 and 1, with memAccess='shared' - shared memory access is required for vhostuser ports, since OpenVSwitch and the guest exchange packets over shared hugepage memory.

Of these 8, two virtual CPUs are pinned to the last two cores of the host (assuming a 16 core host that has cores 0-15).

The CPU pinning is optional; I elected to avoid it on my own VM, in part because I am already pinning my OpenVSwitch to specific cores using the pmd mask. But I am still covering it here because it is worth pointing out.

Network Interface (vhostuser)

    <interface type='vhostuser'>
      <mac address='00:00:00:00:00:01'/>
      <source type='unix' path='/usr/local/var/run/openvswitch/dpdkvhostuser0' mode='client'/>
      <model type='virtio'/>
      <driver queues='2'>
        <host mrg_rxbuf='off'/>
      </driver>
    </interface>
 
    <interface type='vhostuser'>
      <mac address='00:00:00:00:00:02'/>
      <source type='unix' path='/usr/local/var/run/openvswitch/dpdkvhostuser1' mode='client'/>
      <model type='virtio'/>
      <driver queues='2'>
        <host mrg_rxbuf='off'/>
      </driver>
    </interface>

In this section, the user has defined two network interfaces of type vhostuser.

As you may recall from my earlier blogs on the topic, a vhostuser port means that the VM acts as the client and OpenVSwitch acts as the server. (OpenVSwitch no longer recommends this - they prefer vhostuserclient ports, so that a restart of the switch does not strand the VMs connected to it.)

Tip: The Virt-Manager KVM GUI does not allow you to pick this kind of network interface from the menu when you set up an adaptor. So you need to put this in your xml - and if you do, it will indeed show up in the GUI as a "vhost" interface.

Each mac address is set to a distinct value (loading two interfaces with the same mac address is obviously going to cause trouble). Each NIC is configured for multi-queuing (2 Rx/Tx queues per NIC), and each specifies the OpenVSwitch unix socket it will connect to.

The super important thing to know about these interface directives is that, before launching the VM, the following prerequisites need to be in place:

  • OpenVSwitch needs to be running
  • The DPDK vhostuser ports need to be created on the appropriate bridges on OpenVSwitch.
  • The datapath on the bridges of the OpenVSwitch that host the vhostuser ports need to be set to netdev (DPDK) rather than the default of system.
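As a sketch of those prerequisites (the bridge name br0 is an assumption here; the port names match the socket paths in the xml above):

```
# Create a DPDK-capable bridge (datapath_type=netdev rather than the
# default "system"), then add the two vhostuser ports the VM expects.
ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
ovs-vsctl add-port br0 dpdkvhostuser0 \
    -- set Interface dpdkvhostuser0 type=dpdkvhostuser
ovs-vsctl add-port br0 dpdkvhostuser1 \
    -- set Interface dpdkvhostuser1 type=dpdkvhostuser
```

When the ports are created, OpenVSwitch creates the unix socket files under its run directory - the paths referenced in the xml's source elements.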

Once your xml is crafted properly, boot the VM with "virsh start [ vm name ]", and see if the VM boots with network connectivity. 

Tip: When your VM boots, KVM should number the interfaces eth0, eth1, eth2 in the order they appear in the xml file. But I have found that it doesn't always do this, so it is a good idea to map the mac addresses in the xml file to those that show up in the VM, to ensure you are putting the right IP addresses on the right interfaces!!! Doing this will save you a TON OF DEBUGGING!
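A quick way to do that mapping from inside the guest is to print each interface next to its mac address and compare against the <mac address='...'/> entries in the xml:

```shell
# Print every interface with its MAC address; match these against the
# <mac address='...'/> values in the domain xml.
for nic in /sys/class/net/*; do
    printf '%s %s\n' "$(basename "$nic")" "$(cat "$nic/address")"
done
```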

Friday, April 10, 2020

VMWare Forged Transmits - and how it blocks Nested Virtualization


Nested Virtualization is probably never a good idea in general, but there are certain cases where you need it. We happened to be in one of those certain cases.

After creating a VM on VMWare (CentOS7), we installed libVirtD.

The first issue we ran into was that nobody had checked a checkbox called "Expose Hardware Virtualization to GuestOS". As a result, we were able to install libVirtD and launch a nested VM, but when creating the VM with virt-install, it ran in qemu (software emulation) mode rather than kvm mode.
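For what it's worth, you can check from inside the guest whether hardware virtualization made it through - if the vmx/svm flags are missing, virt-install quietly falls back to qemu mode:

```shell
# Count vmx (Intel) / svm (AMD) CPU flags visible in the guest; zero
# means hardware virtualization was not exposed and KVM won't work.
grep -Ec 'vmx|svm' /proc/cpuinfo

# /dev/kvm only exists when KVM acceleration is actually usable:
ls -l /dev/kvm 2>/dev/null || echo "no /dev/kvm - qemu mode only"
```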

We also needed to change the LibVirtD default storage pool to point to a volume, so that it had enough space to run a large qcow2 vendor-provided image.

After running virt-install, we were able to get a virtual machine up and running, and get to the console (we had to toy with serial console settings in virt-install to get this to work).

The adaptor in the nested VM was a host-bridge, and what we found was that we could - from the nested VM - ping the CentOS7 host VM (and vice-versa). But we couldn't ping anything beyond that. The LibVirtD VM that was hosting the nested VM had no problem pinging anything: it could ping the VM it was hosting, the default gateway on the subnet, other hosts on the subnet, and out to the internet.

So the packets - the frames, rather - were not getting out to the VMWare vSwitch. Or were they?

In doing some arp checks, we actually saw that the CentOS7 LibVirtD host had a fully populated arp table. But the nested tenant VM only had a partially populated arp table.

After pulling in some additional network expertise to work alongside us in troubleshooting, this one fellow sent in a link to a blog article about a security policy feature on VMWare vSwitches called Forged Transmits.

I will drop a link to that article, but also post the picture from that article, because the diagram so simply and perfectly describes what is happening.

https://wahlnetwork.com/2013/04/29/how-the-vmware-forged-transmits-security-policy-works/


Not being a VMWare administrator, I don't know how enabling this works - whether it is at the entire vSwitch level, at a port or port group level, etc.

But - if you ever plan on running nested virtualization on a VMWare Type 1 Hypervisor, this setting will kill you. Your networking won't work for your nested virtual machine, unless you can find some clever way of tunneling or using a proxy.

Wednesday, April 1, 2020

Enabling Jumbo Frames on Tenant Virtual Machines - Should We?

I noticed that all of our OpenStack virtual machines had a 1500 MTU on their interfaces. This seemed wasteful to me, since I knew that everything upstream (a private MPLS network) was using jumbo frames.

I went looking for answers as to why the tenants were enabled with only 1500 MTU. Which led to me looking into who was responsible for setting the MTU.

  • OpenStack?
  • Neutron?
  • LibVirt?
  • Contrail?
  • something else?
As it turns out, Contrail - which kicks Neutron out of the way and manages the networking with its L3 VPN solution (MPLS over GRE/UDP) - works in tandem with Neutron via a bi-directional plugin (so you can administer your networks and ports from Horizon, or through the Contrail GUI).

But, as I have learned from a web discussion thread, Contrail takes no responsibility for setting the MTU of the virtual machine interfaces. It pleads the 5th.

The thread mentions that the MTU can be set in the Contrail DHCP server. I am not sure that would work if you used pre-defined ports, though (do those still use a DHCP mac reservation approach to getting an assigned IP address?). I didn't realize DHCP servers could assign MTUs until I read that - DHCP can do a lot of things (though it cannot cook you a good breakfast, unfortunately).
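For the record, the standard mechanism is DHCP option 26 ("Interface MTU"). In dnsmasq syntax, for example (a sketch - whether Contrail's own DHCP implementation exposes an equivalent knob is a separate question):

```
# dnsmasq.conf: advertise a 9000-byte MTU to clients via DHCP option 26
dhcp-option=option:mtu,9000
```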

Now - the big question. If we can set the MTU on virtual machines, should we? Just because you can, doesn't necessarily mean you should, right?

I set about looking into that. And I ran into some really interesting discussions (and slide decks) on this very topic, and some outright debates on it.

This link below, was pretty informative, I thought.

Discussion: What advantage does enabling Jumbo Frames provide?

Make sure you expand the discussion out with "Read More Comments" - that is where the good stuff lies!

He brings up several considerations:
  • Everything in front of you, including WAN Accelerators and Optimizers, would need to support the larger MTUs.
  • Your target VM on the other side of the world would need to support the larger MTU.
    Unless you use Path MTU Discovery - and I have read a lot of bad things about PMTUD.
  • Your MTU setting in a VM would need to account for any encapsulation done to the frames - and Contrail, being an L3 VPN, does indeed encapsulate the packets.
  • On any OpenStack compute host running Contrail, the Contrail vRouter already places the payload into 9000-MTU frames to send over the transport network - maybe making jumbo frames unnecessary at the VM level?
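The encapsulation point is just arithmetic. As an illustrative sketch (the overhead figure assumes plain MPLS-over-GRE: a 20-byte outer IPv4 header, a 4-byte GRE header, and one 4-byte MPLS label - actual overhead depends on the encapsulation mode):

```shell
# If the underlay carries 9000-byte frames, the guest MTU must leave
# room for the encapsulation headers (illustrative MPLSoGRE numbers):
UNDERLAY_MTU=9000
OVERHEAD=$((20 + 4 + 4))   # outer IPv4 + GRE + one MPLS label
echo "max guest MTU: $((UNDERLAY_MTU - OVERHEAD))"
```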
Interesting stuff.


Monday, March 9, 2020

CPU Isolation - and how dangerous it can be

I noticed that in an implementation of OpenStack, they had CPU Pinning configured. I wasn't sure why, so I asked, and I was told that it allowed an application (comprised of several VMs on several Compute hosts in an Availability Zone), to achieve bare-metal performance.

I didn't think too much about it.

THEN - when I finally DID start to look into it - I realized that the feature was not turned on.

CPU Pinning, as they were configuring it, was comprised of 3 parameters:
  1. isolcpus - a Linux kernel parameter, passed to the kernel on the grub command line.
  2. vcpu_pin_set - defined in nova.conf - an OpenStack configuration file
  3. reserved_host_cpus - defined in nova.conf - an OpenStack configuration file
These settings have tremendous impact. For instance, they can impact how many CPUs OpenStack sees on the host. 

isolcpus takes a comma-delimited list of CPUs. vcpu_pin_set is also a list of CPUs, and what it does is allow OpenStack Nova to place VMs (qemu processes), via libvirt APIs, on all or a subset of the full bank of isolated CPUs.

So for example, you might isolate 44 CPUs on a 48 core system (24 cores x 2 threads per core). Then you might specify 24 of those 44 to be pinned by Nova/libvirt - and perhaps the remaining 20 are used for non-OpenStack userland processes (i.e. OpenContrail vrouter processes that broker packets in and out of virtual machines and the compute hosts).
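As a sketch of where those knobs live (the CPU ranges here are hypothetical, not the lab's actual values):

```
# /etc/default/grub - kernel side: hide these CPUs from the scheduler
GRUB_CMDLINE_LINUX="... isolcpus=4-23,28-47"

# /etc/nova/nova.conf - OpenStack side: which of those Nova may pin to
[DEFAULT]
vcpu_pin_set = 4-23,28-47
reserved_host_cpus = 2
```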

So. In a lab environment, with isolcpus isolating 44 cpus, and these same 44 cpus listed in vcpu_pin_set, a customer emailed and complained about sluggish performance. I logged in, started htop, added the PROCESSOR column, and noticed that everything was running on a single cpu core.

Ironically enough, I had just read this interesting article that helped me realize very quickly what was happening.


Obviously, running every single userland process on a single processor core is a killer.

So why was everything running on one core?

It turned out that when launching the images, there is a property that needed to be attached to the flavors, called hw:cpu_policy=dedicated.

When specified on the flavor, this property causes Nova to pass this information to libvirt, which knows to assign the virtual machine to one of the specific isolated CPUs.

When NOT specified, it appears that libvirt just shoves the task onto the first available CPU on the system - cpu 0. And cpu 0 was indeed an isolated CPU, because the CPUs left out of the isolcpus and vcpu_pin_set lists were 2, 4, 26 and 28.

So the qemu virtual machine processes wound up on an isolated CPU (as they should have). But since the kernel does not load-balance tasks across isolated CPUs, they all piled up on CPU 0.

Apparently, the flavor property hw:cpu_policy=dedicated is CRITICAL in telling libvirt to map an instance to a vcpu in the list.

Changing the flavor properties was not an option in this case, so what wound up happening was to remove the vcpu_pin_set list from /etc/nova/nova.conf, and to remove the isolcpus parameter from the grub boot loader. This fixed the issue of images with no property landing on a single CPU. We also noticed that if a flavor STILL used the property hw:cpu_policy=dedicated, a cpu assignment would still be generated into the libvirt xml file - and the OS would place (and manage) the task on that CPU.
