Friday, April 10, 2020

VMware Forged Transmits - and how it blocks Nested Virtualization


Nested Virtualization is probably never a good idea in general, but there are certain cases where you need it. We happened to be in one of those certain cases.

After creating a VM (CentOS7) on VMware, we installed libvirtd.

The first issue we ran into was that nobody had checked a checkbox called "Expose Hardware Virtualization to Guest OS". As a result, we were able to install libvirtd and launch a nested VM, but when we created the VM with virt-install, it was generated to run in qemu mode (software emulation) rather than kvm mode (hardware-accelerated).
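Once that checkbox is enabled, it is worth verifying from inside the guest that hardware virtualization actually made it through, before reaching for virt-install. A quick sanity check (these are standard Linux/libvirt tools, nothing specific to our environment):

```shell
# Count hardware-virtualization flags visible to the guest (Intel VT-x or AMD-V).
# 0 means the hypervisor is not exposing hardware virtualization.
grep -cE 'vmx|svm' /proc/cpuinfo

# The kvm device node should exist once the flag is exposed and the modules load.
ls -l /dev/kvm

# libvirt ships a readiness checker that summarizes all of the above.
virt-host-validate qemu
```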

We also needed to change the libvirt default storage pool to point to a larger volume, so that it had enough space to hold a large vendor-provided qcow2 image.
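For reference, repointing the default pool can be done with virsh. This is a sketch, and the /data/images path is just a placeholder for wherever the larger volume is mounted:

```shell
# Tear down the existing default pool definition.
virsh pool-destroy default
virsh pool-undefine default

# Recreate it as a directory pool on the larger volume.
virsh pool-define-as default dir --target /data/images
virsh pool-build default
virsh pool-start default
virsh pool-autostart default
```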

After running virt-install, we were able to get a virtual machine up and running, and get to the console (we had to toy with serial console settings in virt-install to get this to work).
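For the record, the serial-console dance in virt-install looked roughly like this (names, sizes and paths are illustrative, not our exact invocation; the guest kernel also needs console=ttyS0 on its command line for the console to show anything):

```shell
virt-install \
  --name nested-vm \
  --memory 8192 \
  --vcpus 4 \
  --disk path=/data/images/vendor.qcow2 \
  --import \
  --network bridge=br0 \
  --os-variant centos7.0 \
  --graphics none \
  --console pty,target_type=serial
```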

The adapter in the nested VM was a host bridge, and what we found was that we could - from the nested VM - ping the CentOS7 host VM (and vice-versa). But we couldn't ping anything beyond that. The libvirt VM that was hosting the nested VM had no problem pinging anything: it could ping the VM it was hosting, the default gateway on the subnet, other hosts on the subnet, and out to the internet.

So the packets - the frames, rather - were not getting out to the VMware vSwitch. Or were they?

In doing some ARP checks, we actually saw that the CentOS7 libvirt host had a fully populated ARP table. But the nested tenant VM only had a partially populated ARP table.
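The checks themselves are nothing exotic - comparing neighbor tables and watching whether ARP replies ever make it back into the nested VM (eth0 here stands in for whatever the relevant interface is):

```shell
# Neighbor (ARP) table - run on both the libvirt host and the nested VM.
ip neigh show

# Interface counters: are frames leaving at all?
ip -s link show eth0

# Watch ARP on the wire; requests with no replies point at a filter upstream.
tcpdump -ni eth0 arp
```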

After pulling in some additional network expertise to work alongside us in troubleshooting, one fellow sent a link to a blog article about a security policy feature on VMware vSwitches called Forged Transmits.

I will drop a link to that article, but also post the picture from that article, because the diagram so simply and perfectly describes what is happening.

https://wahlnetwork.com/2013/04/29/how-the-vmware-forged-transmits-security-policy-works/


Not being a VMware administrator, I don't know how enabling this works: whether it is at the entire vSwitch level, at a port or port group level, etc.
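From what I can tell, on a standard vSwitch the policy can be inspected and set per vSwitch (and overridden per port group) with esxcli on the ESXi host. This is a sketch from the esxcli documentation rather than something I ran myself, and vSwitch0 is a placeholder:

```shell
# Show the current security policy (Forged Transmits among them).
esxcli network vswitch standard policy security get -v vSwitch0

# Allow forged transmits on the whole standard vSwitch.
esxcli network vswitch standard policy security set -v vSwitch0 -f true
```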

But - if you ever plan on running nested virtualization on a VMware Type 1 hypervisor, this setting will kill you. Your networking won't work for your nested virtual machine, unless you can find some clever way of tunneling or using a proxy.

Wednesday, April 1, 2020

Enabling Jumbo Frames on Tenant Virtual Machines - Should We?

I noticed that all of our OpenStack virtual machines had a 1500 MTU on their interfaces. This seemed wasteful to me, since I knew that everything upstream (a private MPLS network) was using jumbo frames.

I went looking for answers as to why the tenants were enabled with only a 1500 MTU, which led me to look into who was responsible for setting the MTU:

  • OpenStack?
  • Neutron?
  • LibVirt?
  • Contrail?
  • something else?
As it turns out, Contrail - which kicks Neutron out of the way and manages the networking with its L3 VPN solution (MPLS over GRE/UDP) - works in tandem with Neutron via a bi-directional plugin (so you can administer your networks and ports from Horizon, or through the Contrail GUI).

But, as I have learned from a web discussion thread, Contrail takes no responsibility for setting the MTU of the virtual machine interfaces. It pleads the 5th.

The thread mentions that the MTU can be set in the Contrail DHCP server. I am not sure whether that would work if you used pre-defined ports, though (do those still use a DHCP MAC-reservation approach to getting an assigned IP address?). I didn't realize DHCP servers could assign MTUs, too, until I read that - DHCP can do a lot of stuff (it cannot cook you a good breakfast, unfortunately).
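For what it's worth, the mechanism is standard DHCP: option 26 is "Interface MTU". With dnsmasq (which several OpenStack components use for tenant DHCP), handing out a jumbo MTU looks like this - a sketch, not our Contrail configuration:

```shell
# DHCP option 26 (interface MTU): offer a 9000-byte MTU to clients.
echo 'dhcp-option=option:mtu,9000' >> /etc/dnsmasq.conf
systemctl restart dnsmasq
```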

Now - the big question. If we can set the MTU on virtual machines, should we? Just because you can, doesn't necessarily mean you should, right?

I set about looking into that. And I ran into some really interesting discussions (and slide decks) on this very topic, and some outright debates on it.

This link below was pretty informative, I thought.

Discussion: What advantage does enabling Jumbo Frames provide?

Make sure you expand the discussion out with "Read More Comments"! That is where the good stuff lies!

He brings up several considerations:
  • Everything in front of you, including WAN accelerators and optimizers, would need to support the larger MTU.
  • Your target VM on the other side of the world would need to support the larger MTU.
    Unless you use Path MTU Discovery - and I have read a lot of bad things about PMTUD.
  • Your MTU setting in a VM would need to account for any encapsulation applied to the frames - and Contrail, being an L3 VPN, does indeed encapsulate the packets.
  • On any OpenStack compute host running Contrail, the Contrail vRouter already places the payload into 9000 MTU frames to send over the transport network - maybe making jumbo frames at the VM level unnecessary?
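The encapsulation point is easy to put numbers on. If the underlay MTU is 9000 and Contrail encapsulates with MPLS over UDP (outer IPv4 header 20 bytes, UDP 8, MPLS label 4), the largest safe VM MTU is 9000 minus that overhead; MPLS over GRE swaps the UDP header for a 4-byte GRE header. Quick shell arithmetic:

```shell
underlay_mtu=9000

# MPLS over UDP: outer IPv4 (20) + UDP (8) + MPLS label (4) = 32 bytes overhead
echo $((underlay_mtu - 20 - 8 - 4))   # 8968

# MPLS over GRE: outer IPv4 (20) + GRE (4) + MPLS label (4) = 28 bytes overhead
echo $((underlay_mtu - 20 - 4 - 4))   # 8972
```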
Interesting stuff.


Monday, March 30, 2020

How to run OpenStack on a single server - using veth pair



I decided I wanted to implement OpenStack using OpenVSwitch. On one server.

The way I decided to do this, was to spin up a KVM virtual machine (VM) as an OpenStack controller, and have it communicate to the bare metal CentOS7 Linux host (that runs the KVM hypervisor libvirt/qemu).

I did not realize how difficult this would be, until I realized that OpenVSwitch cannot leverage Linux bridges (bridges on the host).

OpenVSwitch allows you to create, delete and otherwise manipulate bridges - but ONLY bridges that are under the control of OpenVSwitch. So, if you happen to have a bridge on the Linux host (we will call it br0), you cannot snap that bridge into OpenVSwitch.

What you would normally do, is to create a new bridge on OpenVSwitch (i.e. br-ex), and migrate your connections from br0, to br-ex.

That's all well and good - and straightforward, most of the time. But if you want to run a virtual machine (i.e. an OpenStack Controller VM), and have that virtual machine communicate with OpenStack Compute processes running on the bare metal host, abandoning the host bridges becomes a problem.

Virt-Manager does NOT know anything about OpenVSwitch, nor about the bridges that OpenVSwitch controls. So when you create your VM, if everything is under an OpenVSwitch bridge (i.e. br-ex), Virt-Manager will only offer you a series of macvtap interfaces (macvtap - and, for that matter, macvlan - are topics in and of themselves that we won't get into here).

So. I did not want to try to use macvtap interfaces - and jump through hoops to get them to communicate with the underlying host (yes, there are presumably some tricks with macvlan that can do this, but the rabbit hole was getting deeper).

As it turns out, "you can have your cake, and eat it too". You can create a Linux bridge (br0) and plumb it into OpenVSwitch with a veth pair. A veth pair exists for just this purpose: it is essentially a virtual patch cable between two bridges, since you cannot join bridges directly (joining bridges is called cascading, and this is not allowed in Linux networking).

So here is what we wound up doing.
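In outline, the veth plumbing looks like this (the bridge names match the ones above; the veth names are placeholders, and this is a sketch of the idea rather than a drop-in script):

```shell
# Create the veth pair - a virtual patch cable with two ends.
ip link add veth-br0 type veth peer name veth-ovs

# One end goes into the Linux bridge the VM is attached to...
ip link set veth-br0 master br0

# ...and the other end becomes a port on the OVS bridge.
ovs-vsctl add-port br-ex veth-ovs

# Bring both ends up.
ip link set veth-br0 up
ip link set veth-ovs up
```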


Monday, March 9, 2020

CPU Isolation - and how dangerous it can be

I noticed that in an implementation of OpenStack, they had CPU Pinning configured. I wasn't sure why, so I asked, and I was told that it allowed an application (comprised of several VMs on several Compute hosts in an Availability Zone), to achieve bare-metal performance.

I didn't think too much about it.

THEN - when I finally DID start to look into it - I realized that the feature was not turned on.

CPU Pinning, as they were configuring it, was comprised of 3 parameters:
  1. isolcpus - a Linux kernel setting, passed to the kernel on the grub command line.
  2. vcpu_pin_set - defined in nova.conf - an OpenStack configuration file.
  3. reserved_host_cpus - defined in nova.conf - an OpenStack configuration file.
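As a sketch of where those three settings live (the CPU ranges are illustrative; grubby and crudini are just convenient CentOS 7 tools for editing them):

```shell
# 1. isolcpus - kernel command line (reboot required).
grubby --update-kernel=ALL --args="isolcpus=4-25,28-47"

# 2. and 3. - nova.conf on the compute host.
crudini --set /etc/nova/nova.conf DEFAULT vcpu_pin_set 4-25
crudini --set /etc/nova/nova.conf DEFAULT reserved_host_cpus 4
```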
These settings have tremendous impact. For instance, they can impact how many CPUs OpenStack sees on the host. 

isolcpus takes a comma-delimited list of CPUs (ranges are also allowed). vcpu_pin_set is also a list of CPUs, and what it does is allow OpenStack Nova to place VMs (qemu processes), via the libvirt APIs, on all or a subset of the full bank of isolated CPUs.

So for example, you might isolate 44 CPUs on a 48-core system (24 cores x 2 threads per core). Then you might specify 24 of those 44 to be pinned by Nova/libvirt - and perhaps the remaining 20 are used for non-OpenStack userland processes (i.e. OpenContrail vRouter processes that broker packets in and out of the virtual machines and compute hosts).

So. In a lab environment, with isolcpus isolating 44 CPUs, and those same 44 CPUs listed in vcpu_pin_set, a customer emailed and complained about sluggish performance. I logged in, started htop, added the PROCESSOR column, and noticed that everything was running on a single CPU core.

Ironically enough, I had just read this interesting article that helped me realize very quickly what was happening.


Obviously, running every single userland process on a single processor core is a killer.

So why was everything running on one core?

It turned out that, when launching the images, there is a property that needed to be attached to the flavors, called hw:cpu_policy=dedicated.

When specified on the flavor, this property causes Nova to pass the information to libvirt, which then knows to assign the virtual machine to specific isolated CPUs.

When NOT specified, it appears that libvirt just shoves the task onto the first available CPU on the system - CPU 0. CPU 0 was indeed an isolated CPU, because the ones left out of the isolcpus and vcpu_pin_set lists were 2, 4, 26 and 28.

So the qemu virtual machine processes wound up on an isolated CPU (as they should have). But since the kernel does not load-balance tasks across isolated CPUs, they all piled up on CPU 0.

Apparently, the flavor property hw:cpu_policy=dedicated is CRITICAL in telling libvirt to map an instance to a vCPU in the set.
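Setting that property on a flavor is a one-liner with the OpenStack CLI (the flavor name here is illustrative):

```shell
# Mark a flavor so Nova/libvirt pin its instances to dedicated CPUs.
openstack flavor set --property hw:cpu_policy=dedicated m1.large.pinned

# Verify the property took.
openstack flavor show m1.large.pinned -c properties
```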

Changing the flavor properties was not an option in this case, so what wound up happening was to remove the vcpu_pin_set array from /etc/nova/nova.conf, and to remove the isolcpus setting from the grub boot loader. This fixed the issue of images with no property landing on a single CPU. We also noticed that if a flavor STILL used the flavor property hw:cpu_policy=dedicated, a CPU assignment would still get generated into the libvirt XML file - and the OS would place (and manage) the task on that CPU.

Thursday, March 5, 2020

Mounting a Linux Volume over a File System - the bind mount trick

Logged into a VM today trying to help troubleshoot issues. There was nothing in /var/log! No Syslog!

Turns out that a volume had been mounted on top of /var/log. Linux will indeed let you mount on top of pretty much any directory, because, after all, a directory is just a mount point as far as Linux is concerned.

But what happens to the files in the original directory? I used to think they were lost. They're not. They're still there, just shielded. They can be recovered with a neat trick called a bind mount!

All described here! Learn something new every day.

A snippet of dialog from the link below:
https://unix.stackexchange.com/questions/198542/what-happens-when-you-mount-over-an-existing-folder-with-contents

Q. Right now /tmp has some temporary files in it. When I mount my hard drive (/dev/sdc1) on top of /tmp, I can see the files on the hard drive. What happens to the actual content of /tmp when my hard drive is mounted? 

 A. Pretty much nothing. They're just hidden from view, not reachable via normal filesystem traversal.

Q. Is it possible to perform r/w operations on the actual content of /tmp while the hard drive is mounted?

A. Yes. Processes that had open file handles inside your "original" /tmp will continue to be able to use them. You can also make the files "reappear" somewhere else by bind-mounting / elsewhere.

# mount -o bind / /somewhere/else
# ls /somewhere/else/tmp  

 

Wednesday, January 1, 2020

Cloudify

I have been doing some work with Cloudify.

First, someone gave me access to an instance. Before spending up-front time reading up on Cloudify, I always try to see if I can intuitively figure a product out without reading anything.

Not the case with Cloudify.

I had to take some steps to "get into" Cloudify, and I will recap some of those.

1. I went to YouTube, and watched a couple of Videos.

This was somewhat valuable, but I felt this was "hands-on" technology. It was clear from watching the videos that I would need to install it in my home lab to get proficient with it.

2. I logged onto a Cloudify Instance, and looked through the UI

I saw the Blueprints, but couldn't read any of the meta information. Finally I figured out that if I switched browsers, I could scroll down and see the descriptors.

3. Reading up on TOSCA - and Cloudify TOSCA specifically

In examining the descriptors, I realized they were Greek to me, and I had to take a step back to read and learn. So I first started reading up on some of the TOSCA standards - and standards like these are tedious and, frankly, quite boring after a while. But as a result of doing this, I started to realize that Cloudify has extended the TOSCA descriptors. So there is a proprietary aspect to Cloudify. From reading a few blogs: Cloudify "sorta kinda" follows the ETSI MANO standards, but in extending (and probably changing) some of the TOSCA YAML descriptors, they are going to create some vendor lock-in. They tout this as "value add" and "innovation", of course. Hey - that is how people try to make money with standards.

4. Finally, I decided to stick my toe in the water

I logged onto Cloudify Manager, and decided I would try the openstack-example-network.

It wouldn't upload, so I had to look into why. We had the v3.x version of the OpenStack plugin, which requires a compat.xml file that was not loaded. In figuring this out, I realized we probably shouldn't even be using that version of the plugin, since it is supported on version 5.x of Cloudify Manager and we were running version 4.6.

So, I uninstalled version 3.x of the OpenStack plugin, tried to upload the sample example blueprint again, and - voilà - success. I stopped there, because I wanted to see if I could create my own blueprint.

5. Created my own Blueprint

Our initial interest, as a use case, was not to deploy services per se, but to onboard new customers onto an OpenStack platform. I saw the palette in Cloudify Composer for the OpenStack Plugin v2.14.7, and it allowed you to create all kinds of OpenStack objects. I decided to put a User and a Project on the palette. I used some web documentation to learn about secrets (which had already been created by a Cloudify consultant who set up the system), and used those to configure the openstack_config items on the project and user. I then configured the user and project sections.
  1. I saved the blueprint, 
  2. validated the blueprint (no complaints from Composer on that), 
  3. and then uploaded the blueprint to Cloudify Manager.
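The same validate/upload/deploy cycle can also be driven from the cfy CLI; a sketch with placeholder blueprint and deployment names:

```shell
# Validate the blueprint locally before shipping it to the manager.
cfy blueprints validate blueprint.yaml

# Upload it, then create and install a deployment from it.
cfy blueprints upload -b onboard-customer blueprint.yaml
cfy deployments create -b onboard-customer customer1
cfy executions start install -d customer1
```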
6. I then attempted to Deploy the Blueprint

This seemed to work, but I did not see a new project or user on the system. I saw a bunch of console messages on the Cloudify Manager GUI, but didn't really see any errors.

It is worth noting that I don't see any examples on the web of people trying to "onboard" an OpenStack tenant. Just about all examples are people instantiating some kind of VM on an already-configured OpenStack platform (tenants, users, projects, et al already created).

7. Joined the Cloudify Slack Community

At this point, I signed up for the Cloudify Slack community, and am trying to seek some assistance there in figuring out why my little blueprint did not seem to execute on the target system.

...Meanwhile, I spun up a new thread, and did some additional things:

8. Installed the Cloudify qcow2 image

If you try to do this, it directs you to the Cloudify Sales page. But there is a link to get the back versions, and I downloaded version 4.4 of the qcow2 image. 

NOTE: I did not launch this in OpenStack. It surprised me that this was what they seemed to want you to do, because most Orchestrators I have seen operate from outside the OpenStack domain (as a VM outside of OpenStack).

This qcow2 is a CentOS7 image, and I could not find a password to get into the operating system itself (i.e. as root). What they instead ask you to do is just hit the IP address from a browser using http (not https!) and see if you get a GUI for Cloudify Manager (I did). Then use the default login. I did log in successfully, and that is as far as I have gotten for now.

9. Installed the CLI

The CLI is an rpm, which installed successfully. I plan to configure it and use it to learn the CLI and interact with Cloudify Manager.
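Pointing the CLI at the manager is the first step; something along these lines (the IP address and credentials are placeholders):

```shell
# Create a profile pointing at the Cloudify Manager and make it active.
cfy profiles use 10.10.10.10 -u admin -p admin -t default_tenant

# Sanity check: query manager status over the new profile.
cfy status
```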

So, let's see what I learn to get to the next steps. More on this later.

SDN-NFV Certified from Metro Ethernet Forum

It has been a while since I have blogged any updates, so I'll knock out a few!

First, I just completed the course and certification from Metro Ethernet Forum for SDN-NFV.

This was a 3-day course, and it was surprisingly hands-on, as it focused heavily on OpenFlow and OpenDaylight. I had always wanted to learn more about these, so I found it quite rewarding.

One interesting stumbling block in the labs was the fact that ovs-ofctl has a -O option that needs to be used to specify the proper version of OpenFlow.
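Concretely, ovs-ofctl speaks OpenFlow 1.0 by default, and -O negotiates a newer version (br0 here is a placeholder bridge name):

```shell
# Default is OpenFlow 1.0 - this fails against a bridge configured for 1.3+.
ovs-ofctl dump-flows br0

# Ask for OpenFlow 1.3 explicitly.
ovs-ofctl -O OpenFlow13 dump-flows br0

# Or widen the set of protocols the bridge itself will accept.
ovs-vsctl set bridge br0 protocols=OpenFlow10,OpenFlow13
```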

The course seemed to focus on the use case and context of using OpenFlow (and OpenDaylight) to configure switches - but not "everything else" out there in the field of networking that could be configured with something like OpenFlow.

For example, it was my understanding that the primary beneficiary of something like OpenFlow (and OpenDaylight) was the wireless (802.11x) domain, where people have scores, hundreds or even thousands of access points that have to be configured or upgraded, and it is extremely difficult to do this by hand.

But, the course focused on switches - OpenVSwitch switches to be precise. And that was probably because the OpenVSwitch keeps things simple enough for the course and instructor.

Problem is, in my shop, everyone is using Juniper switches, and Juniper does not play ball with OpenFlow and OpenVSwitch. So I'm not sure how much this can or will be put to use in our specific environment. I do, however, use OpenVSwitch in my own personal OpenVSwitch-based OpenStack environment, and since OpenVSwitch works well with DPDK and VPP, this knowledge can come in handy as I need to start doing more sophisticated things with packet flows.

Nonetheless, I found the course interesting and valuable. The exam also centered around the ETSI MANO Reference Architecture. I was familiar with this architecture, but like all exams like this, I missed questions because of time, or overthinking things, or picking the wrong of two correct answers (not the best answer), et al. But I passed the exam, and I guess that's what matters most.
