Grasping Technology: KVM

Showing posts with label KVM. Show all posts

Wednesday, February 3, 2021

DPDK Hands-On Part X - Launching a DPDK virtual machine with Virsh Libvirt

In DPDK Hands-On Part IX, we launched a virtual machine with a bash script, calling qemu with sufficient command line options that it would launch a virtual machine with DPDK ports.

To launch a virtual machine that has DPDK ports using virsh (LibVirt), the first thing you need to know, is that you can not use the virt-manager GUI. Why? The GUI does not understand vhostuser (DPDK) ports - at least not in the version I happen to be running. But aside of this, the process remains the same. Your virtual machine needs to be defined by xml, which is parsed by virsh when the VM is launched. In order to launch a DPDK virtual machine, the xml will need to be crafted properly, and this post will discuss the important sections of that xml.

I made several attempts at finding an xml file that would work properly. It took quite a while. There are a few examples on OpenVSwitch websites, as well as Intel-sponsored DPDK websites. In the end, I found a link from an R&D engineer named Tomek Osinski that helped me more than any other, and I will share that here:

Configuring OVS-DPDK with VM for performance testing

In this link, is a full-fledged xml (file).

Let me pick through the relevant sections of this file, and comment on the important sections.

Memory

<currentMemory unit='KiB'>8399608</currentMemory>
<memoryBacking>
  <hugepages>
    <page size='1' unit='G' nodeset='0'/>
  </hugepages>
</memoryBacking>

This section specifies that the VM will use in excess of 8G of RAM, but tells virsh (libvirt/KVM) to place the VM's memory on HugePages.

If you try to launch this VM, and the VM Host (server running KVM) does not have Hugepages enabled, or have enough Hugepages available, the launch of the VM will fail.

CPU

<vcpu placement='static'>8</vcpu>

<cpu mode='host-model'>
  <model fallback='allow'/>
  <topology sockets='2' cores='4' threads='1'/>
  <numa>
    <cell id='0' cpus='0-1' memory='4194304' unit='KiB' memAccess='shared'/>
  </numa>
</cpu>

<cputune>
  <shares>4096</shares>
  <vcpupin vcpu='0' cpuset='14'/>
  <vcpupin vcpu='1' cpuset='15'/>
  <emulatorpin cpuset='11,13'/>
</cputune>

In this section, 8 virtual CPU (2 sockets x 4 cores each) are specified by the virtual machine. A NUMA cell is specified with virtual CPUs 0 and 1.

Of these 8, two virtual CPUs are pinned to the last two cores of the host (assuming a 16 core host that has cores 0-15).

The CPU pinning is optional. I elected to avoid CPU pinning on my own VM. One reason I avoided it is because I am already pinning my OpenVSwitch to specific cores using the pmd mask. But I am still covering this here because it is worth pointing out.

Network Interface (vhostuser)

    <interface type='vhostuser'>
      <mac address='00:00:00:00:00:01'/>
      <source type='unix' path='/usr/local/var/run/openvswitch/dpdkvhostuser0' mode='client'/>
       <model type='virtio'/>
      <driver queues='2'>
        <host mrg_rxbuf='off'/>
      </driver>
    </interface>

 
    <interface type='vhostuser'>
      <mac address='00:00:00:00:00:02'/>
      <source type='unix' path='/usr/local/var/run/openvswitch/dpdkvhostuser1' mode='client'/>
      <model type='virtio'/>
      <driver queues='2'>
        <host mrg_rxbuf='off'/>
      </driver>
    </interface>

In this section, the user has defined a network interface of type vhostuser.

As you may recall from my earlier blogs on the topic, a vhostuser port means that the VM acts as a client and OpenVSwitch acts as a server (which is not recommended by OpenVSwitch these days - they prefer vhostuserclient ports be used instead so that a reboot of the switch does not strands multiple VMs connected to it).

Tip: The Virt-Manager KVM GUI does not allow you to pick this kind of network interface from the menu when you set up an adaptor. So you need to put this in your xml, and if you do so, it indeed will show up in the GUI as a "vhost" interface.

Each mac address is set distinctively (loading two interfaces with same mac address is obviously going to cause trouble). Each NIC is specified to use multi-queuing (2 Rx/Tx queues per NIC). And, the specific socket to connect to on the OpenVSwitch is specified.

The super important thing to know about these interface directives, is that, prior to launching the VM, the following needs to be in place as a pre-requisite.

OpenVSwitch needs to be running
The DPDK vhostuser ports need to be created on the appropriate bridges on OpenVSwitch.
The datapath on the bridges of the OpenVSwitch that host the vhostuser ports need to be set to netdev (DPDK) rather than the default of system.

Once your xml is crafted properly, boot the VM with "virsh start [ vm name ]", and see if the VM boots with network connectivity.

Tip: When your VM boots, KVM should number the interfaces as eth0, eth1, eth2 according to the order they are in the xml file. But I have found that it doesn't always do this, so it is a good idea to map the xml file mac addresses to those which show up in the VM so that you can ensure you are putting the right IP addresses on the right interfaces!!! Doing this will save you a TON OF DEBUGGING!

Tuesday, October 13, 2020

DPDK Hands-On - Part IX - Launching a VM with DPDK Ports from a bash script

When one launches a virtual machine on Linux, you can configure your virtual machine hypervisor to run as a Type 1 hypervisor (qemu), or a Type 2 hypervisor (qemu-kvm). This post won't go into the details of the differences between those.

The options one can pass into qemu or qemu-kvm can be daunting.

But, if you want to launch a virtual machine just for the purpose of testing a DPDK adaptor, this can be done in a lightweight manner.

Remember, that a vhostuser port connects to a socket on the OpenVSwitch. This is the original, legacy manner in which DPDK was implemented between VMs and OpenVSwitch.

The drawback to this, of course, is that if the switch were to die, the VMs would be stranded with no connectivity (a bad thing).

This is why they came out with vhostuserclient ports, where OpenVSwitch connects to a socket that is managed by qemu when it starts the VM. This way, if the switch dies or is restarted, ports on the switch will reconnect to the VM-managed sockets. If a VM dies, it only removes its own sockets.

So we will cover the scenarios of launching a VM in both modes; vhostuser, and vhostuserclient.

VhostUser

With a vhostuser port, OpenStack creates the socket that qemu connects to. So this port needs to be created ahead of time on OpenVSwitch (see DPDK Hands-On Part VIII - creating DPDK ports).

Below is a snippet of a bash script that cranks a VM after some parameters are defined.

if [ $PORT_TYPE == "vhost-user" ]; then
# vhost-user - openvswitch binds to socket and acts as server
export VHOST_SOCK_DIR=/var/run/openvswitch

export VHOST_PORT="${VHOST_SOCK_DIR}/vhostport1"

   # the vhostuser does not have the server parameter.
   /usr/libexec/qemu-kvm -name $VM_NAME -cpu host -enable-kvm \
   -m $GUEST_MEM -drive file=$QCOW2_IMAGE --nographic -snapshot \
   -numa node,memdev=mem -mem-prealloc -smp sockets=1,cores=2 \
   -object memory-backend-file,id=mem,size=$GUEST_MEM,mem-path=/dev/hugepages,share=on \
   -chardev socket,id=char${MAC},path=${VHOST_PORT} \
   -netdev type=vhost-user,id=default,chardev=char${MAC},vhostforce,queues=2 \
   -device virtio-net-pci,mac=00:00:00:00:00:0${MAC}, netdev=default, mrg_rxbuf=off, mq=on, vectors=6
#1>qemu.kvm.vhostuser.out 2>&1

So let's discuss these parameters:

- The Numa Node parameter tells you how to specify your cpus. A good link on this can be found at https://futurewei-cloud.github.io/ARM-Datacenter/qemu/how-to-configure-qemu-numa-nodes/

Now on my VM, I only have one Numa slot, which equates to one Numa Node (Numa Node 0), with 4 cores on it. So with the -smp sockets=1 cores=2 directive, I can specify 2 cores for my Virtual Machine.

- Mem Path

This parameter is specified to put the VM on hugepages. Any VM using DPDK ports should be backed by HugePages.

- Socket Parameters

The socket parameters are important. OpenVSwitch, if it creates the socket, will place the socket in /var/run/openvswitch by default. You have to create the port there manually, then tell the VM to connect to it specifically by the path and socket name.

- Multi queuing

You will notice that mq=on, and 2 queues are specified. Having more than one queue for sending and receiving data unlocks a bottleneck.

NOTE: I am not clear, actually, if by specifying 2 on "queues=2", that this means 2 x Tx queues AND 2 x Rx queues. I need to look into that and perhaps update this post (or if someone can comment on this, great).

The proof is in the pudding on this. When you launch the VM, you will need to check TWO places to ensure your networking initialized properly:

First, check /var/log/openvswitch/ovs-vswitchd.log

Second, check the output of the VM. Which means that you would need to redirect your output to a file for stdout and stderr and analyze accordingly. The green highlighted section above shows this. But this does affect your start stop of the VM, so I comment this out when running in normal mode.

VhostUserClient

The vhostuserclient works very similar to vhostuser, except that the socket is managed by qemu rather than OpenVSwitch!

# vhost-user-client - qemu binds to socket and acts as server
export VHOST_SOCK_DIR="/var/lib/libvirt/qemu/vhost_sockets"

export VHOST_PORT="${VHOST_SOCK_DIR}/dpdkvhostclt1"

echo "Starting VM..."
/usr/libexec/qemu-kvm -name $VM_NAME -cpu host -enable-kvm \
-m $GUEST_MEM -drive file=$QCOW2_IMAGE --nographic -snapshot \
-numa node,memdev=mem -mem-prealloc -smp sockets=1,cores=2 \
-object memory-backend-file,id=mem,size=$GUEST_MEM,mem-path=/dev/hugepages,share=on \
-chardev socket,id=char1,path=${VHOST_PORT},server \
-netdev type=vhost-user,id=default,chardev=char1,vhostforce,queues=2 \
-device virtio-netpci,mac=00:00:00:00:00:0${MAC},netdev=default,mrg_rxbuf=off,mq=on,vectors=6

Note the difference between the "chardev" directive above, in comparison with vhostuser "chardev" directive. The vhostuserclient directive has the additional "server" parameter included!!! This is the KEY to specifying that QEMU owns the socket, as opposed to OpenVSwitch.

One might wonder, how this works in practice. If you create a VM and it binds to a socket, how does OpenVSwitch know to connect to it? The answer is, OpenVSwitch won't - until you create the port on OpenVSwitch in vhostuserclient mode! At which point, OVS will know it is the client, and create the connection to the qemu process!

So a script that launches a VM with vhostuserclient interfaces, should probably create the OVS port first, then launch the VM. Otherwise, the VM won't have the connectivity it needs in a timely manner when it boots up (i.e. for DHCP to determine its IP Address and settings).

Friday, April 10, 2020

VMWare Forged Transmits - and how it blocks Nested Virtualization

Nested Virtualization is probably never a good idea in general, but there are certain cases where you need it. We happened to be in one of those certain cases.

After creating a VM on VMWare (CentOS7), we installed libVirtD.

The first issue we ran into, was that nobody had checked a checkbox called "Expose Hardware Virtualization to GuestOS". As a result, we were able to install libVirtD and launch a nested VM, but when creating the VM with virt-install, it was generated to run in qemu-mode rather than kvm-mode.

We also needed to change the LibVirtD default storage pool to point to a volume, so that it had enough space to run a large qcow2 vendor-provided image.

After running virt-install, we were able to get a virtual machine up and running, and get to the console (we had to toy with serial console settings in virt-install to get this to work).

The adaptor in the nested VM was a host-bridge, and what we found, was that we could - from the nested VM - ping the CentOS7 host VM (and vice-versa). But we couldn't ping anything further than that. The LibVirtD VM, that was hosting the nested VM had no problem pinging anything; it could ping the VM is was hosting, it could ping the default gateway on the subnet, ping other hosts on the subnet, and it could ping out to the internet.

So, the ~~packets~~ FRAMES, were not getting out to the VMWare vSwitch. Or were they?

In doing some arp checks, we actually saw that the CentOS7 LibVirtD host had a populated arp table. But the tenant nested VM, only had a partially full arp table.

After pulling in some additional network expertise to work alongside us in troubleshooting, this one fellow sent in a link to a blog article about a security policy feature on VMWare vSwitches called Forged Transmits.

I will drop a link to that article, but also post the picture from that article, because the diagram so simply and perfectly describes what is happening.

https://wahlnetwork.com/2013/04/29/how-the-vmware-forged-transmits-security-policy-works/

Not being a VMWare Administrator, I don't know how enabling this works; if it is at the entire vSwitch level, or if it is at a port or port group level, etc.

But - if you ever plan on running nested virtualization on a VMWare Type 1 Hypervisor, this setting will kill you. Your networking won't work for your nested virtual machine, unless you can find some clever way of tunneling or using a proxy.

Wednesday, April 1, 2020

Enabling Jumbo Frames on Tenant Virtual Machines - Should We?

I noticed that all of our OpenStack virtual machines had 1500 MTU on the interfaces. These seemed wasteful to me, since I knew that everything upstream (private MPLS network) was using jumbo frames.

I went looking for answers as to why the tenants were enabled with only 1500 MTU. Which led to me looking into who was responsible for setting the MTU.

OpenStack?
Neutron?
LibVirt?
Contrail?
something else?

As it turns out, Contrail, which kicks Neutron out of the way and manages the networking with is L3 VPN solution (MPLS over GRE/UDP), works in tandem with Neutron via a bi-directional Plugin (so you can administer your networks and ports from Horizon, or through a Contrail GUI.

But, as I have learned from a web discussion thread, Contrail takes no responsibility for setting the MTU of the virtual machine interfaces. It pleads the 5th.

The thread mentions that the MTU can be set in the Contrail DHCP server. I am not sure, if that would work if you used pre-defined ports, though (do those still use a DHCP mac reservation approach to getting an assigned IP Address?). Do other DHCP servers assign MTUs? DHCP can do a lot of stuff (they cannot cook you a good breakfast unfortunately). I didn't realize DHCP servers could set MTUs, too, until I read that.

Now - the big question. If we can set the MTU on virtual machines, should we? Just because you can, doesn't necessarily mean you should, right?

I set about looking into that. And I ran into some really interesting discussions (and slide decks) on this very topic, and some outright debates on it.

This link below, was pretty informative, I thought.

Discussion: What advantage does enabling Jumbo Frames provide?

Make sure you expand the discussion out with "Read More Comments! That is where the good stuff lies!"

He brings up considerations:

Everything in front of you, including WAN Accelerators and Optimizers, would need to support the larger MTUs.
Your target VM on the other side of the world, would need to support the larger MTU.
Unless you use MTU Path Discovery, and I read a lot of bad things about MTU-PD.
Your MTU setting in a VM, would need to consider any encapsulation that would be done to the frames - and Contrail, being a L3 VPN, does indeed encapsulate the packets.
On any OpenStack compute host running Contrail, the Contrail vRouter already places the payload into 9000 MTU frames, to send over the transport network. Maybe making it not necessary to use jumbo frames at the VM level?

Interesting stuff.

Monday, March 30, 2020

How to run OpenStack on a single server - using veth pair

I decided I wanted to implement OpenStack using OpenVSwitch. On one server.

The way I decided to do this, was to spin up a KVM virtual machine (VM) as an OpenStack controller, and have it communicate to the bare metal CentOS7 Linux host (that runs the KVM hypervisor libvirt/qemu).

I did not realize how difficult this would be, until I realized that OpenVSwitch cannot leverage Linux bridges (bridges on the host).

OpenVSwitch allows you to create, delete and otherwise manipulate bridges - but ONLY bridges that are under the control of OpenVSswitch. So, if you happen to have a bridge on the Linux host (we will call it br0), you cannot snap that bridge into OpenVSwitch.

What you would normally do, is to create a new bridge on OpenVSwitch (i.e. br-ex), and migrate your connections from br0, to br-ex.

That's all well and good - and straightforward, most of the time. But, if you want to run a virtual machine (i.e. an OpenStack Controller VM), and have that virtual machine communicate to OpenStack Compute processes running on the bare metal host, abandoning the host bridges becomes a problem.

Virt-Manager, does NOT know anything about OpenVSwitches, nor OpenVSwitch bridges that OpenVSwitch controls. So when you create your VM, if everything is under an OpenVSwitch bridge (i.e. br-ex), Virt-Manager will only provide you a series of macvtap interfaces (macvtap, and for that matter macvlan, are topics in and of themselves that we won't get into here).

So. I did not want to try and use macvtap interfaces - and jump through hoops to get that to communicate to the underlying host (yes, there are some tricks with macvlan that can do this presumably, but the rabbit hole was getting deeper).

As it turns out, "you can have your cake, and eat it too". You can create a Linux bridge (br0), and plumb that into OpenVSwitch with a veth pair. A veth pair is used just for this very purpose. It is essentially a virtual patch cable between two bridges, since you cannot join bridges (joining bridges is called cascading bridges, and this is not allowed in Linux Networking).

So here is what we wound up doing.

Grasping Technology