Showing posts with label Virtualization.

Friday, April 4, 2025

Optimizing for NUMA in a Virtualized Environment

Back in January, we had a platform tenant come to us requesting that the vcpu.numa.preferHT setting be set on all of their VMs. As we weren't completely familiar with this setting, it naturally concerned us, since this is a shared on-prem cloud platform that runs VNFs. This particular tenant had some extremely latency-sensitive workloads, so they were interested in reducing the latency and jitter that one of their customers had apparently complained about.

Diving into NUMA in a virtualized environment is not for lightweights. It requires you to understand SMP and the latest advances in NUMA architecture, and on top of that, how all of it maps onto VMs running operating systems like Linux, which themselves sit on top of the hypervisor's own proprietary, Unix-like operating system (ESXi's VMkernel).

I have found this diagram extremely helpful in showing how NUMA works to platform tenants who don't understand it, or don't understand it well.


In this diagram, we have a theoretical dual-socket CPU system with 24 cores on each socket, for 48 physical cores in total. Each socket is a physical NUMA node, so there is a NUMA node 0 and a NUMA node 1. It is not a "law" that a socket equals a physical NUMA node, but in this scenario that is indeed the case (and on many low- to mid-grade servers, depending on the chipset and architecture, you will usually see sockets map one-to-one to NUMA nodes).

Each NUMA node usually gets its own "memory real estate". To use a kitchen analogy: processes (VMs) that access memory should ideally hit their local cache (the kitchen table); on a cache miss, they can pull from their local memory allotment (a run to the basement pantry). If that is not possible, there is an expensive "trip to the store" - remote memory - which takes much longer to retrieve from. Ideally, in a high-performing environment, 100% of memory requests are local.
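If you want to see this locality model concretely, a Linux host (or guest) will show it with the standard NUMA tooling. A minimal sketch, assuming the numactl package is installed:

# Show NUMA nodes, the CPUs and memory attached to each, and the relative
# access "distances" (higher = more expensive, i.e. the trip to the store)
numactl --hardware

# Show per-node allocation counters; numa_miss / numa_foreign climbing over
# time means memory is being served from a remote node
numastat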

If one were to go into the BIOS and enable Hyperthreading, contrary to what people seem to think, this does NOT double the number of processor cores you have! What happens is that each physical core presents two logical processors, so each socket exposes 48 "logical" cores on top of its 24 physical cores.

Now, hyperthreading does give some advantages - especially for certain types of workloads. There is a high (and always growing) amount of non-blocking parallelism that SMT can exploit, so having 48 logical cores can certainly boost performance - although that is a deep topic this post won't cover in detail. The important takeaway is that going from 24 physical cores to 48 logical cores is not a pure linear increase. There is considerably more overhead in managing 48 cores than 24, for starters, and then context switching and everything else factors in.

A virtual machine that is NOT NUMA aware and is provisioned with 52 vCPUs does NOT fit onto a single NUMA node (which has 24 cores, or 48 with HT enabled). It may be provisioned by default with 1 core per socket, resulting in 52 virtual sockets, which lets the hypervisor "sprinkle" the vCPUs across available slots on all of the NUMA nodes - at the expense of managing all of that. Or perhaps the provisioner steps in, overrides the default behavior, and specifies 52 cores on a single socket. Neither situation will contain the VM on a single NUMA node, for the simple reason that it does not fit: there are only 24 physical cores - 48 logical with Hyperthreading enabled, as it is in this particular example.

Now, to elaborate on this example, there really aren't even 24 usable cores, because the hypervisor's OS is going to reserve cores for itself - perhaps two physical cores on each socket. So in reality, one may think they can fit a 48-core VM (people like to provision in powers of two in a binary world) onto a single NUMA node in this example, only to discover after deployment that this isn't the case; they in fact needed to provision 44 cores for the NUMA placement algorithm to "home" the VM on one NUMA home node (0) rather than spilling onto the adjacent one (1).
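Once such a VM is deployed, it is worth verifying what topology the guest actually ended up seeing. A quick check from inside a Linux guest, using standard util-linux tooling:

# Sockets, cores per socket, threads per core, and the NUMA layout as
# presented to the guest OS
lscpu | grep -Ei 'socket|core|thread|numa'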

So "rightsizing" a VM to 44 cores on a single socket will give some performance benefit in most cases, because the VM will be scheduled on one NUMA node or the other in this example. If you provision two such VMs, one might get NUMA node 0 while the other is assigned NUMA node 1. But they should stay put on that home node - UNLESS more VMs are provisioned and contention begins. In that case, the NUMA scheduler may decide to shift them from one NUMA node to another. When this happens, it is known as a "NUMA migration".

As stated earlier, when a VM is on a specific NUMA node, it is best to have its memory localized. But it IS possible to provision a 44-core VM that sits on a NUMA home node (say, 0) with only 90% of its memory localized instead of all of it. In that case, the VM may have more memory provisioned than it should for proper optimization. This is why it is so important to ensure that the memory is actually being utilized - not just reserved! And frankly, the same goes for the vCPU resources. In general, making a VM "NUMA aware" by sizing it to fit on a NUMA node will cut down on migrations and, in most cases (but not all), improve performance.
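On the ESXi side, esxtop exposes per-VM NUMA counters that show exactly this kind of locality. A rough sketch of where to look (esxtop is menu-driven, so the steps are shown as comments):

# From an SSH session on the ESXi host
esxtop
# Press 'm' for the memory view, then 'f' and enable the NUMA statistics fields.
# Columns of interest:
#   NHN  - the VM's current NUMA home node
#   NMIG - NUMA migrations since power-on
#   N%L  - percentage of the VM's memory that is node-local (ideally ~100)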

VMware has a "Rightsizing" report in Aria Operations that figures this out and exposes VMs that were over-provisioned with memory and cores they are not using. This is a big problem in a shared environment, because the net effect of scores of underutilized VMs on a hypervisor is that the CPU Ready percentage goes through the roof. You will look at the stats, see low CPU and memory utilization, and think "everything looks under control". But then you look at the latency on datastores (read and write - write especially), or the CPU Ready percentage (which shows how much time a process waits for CPU runtime - time on the die), and see that they are unsatisfactory to downright abysmal. This is why the Rightsizing report exists: instead of buying more hardware, there is the option to downsize (rightsize) the virtual machines without making additional capital investments in servers.
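For reference, the CPU Ready counter that vCenter reports is a "summation" value in milliseconds, and it only becomes meaningful once converted to a percentage of the sampling interval. A hypothetical example of the arithmetic for the 20-second realtime chart:

# 4000 ms of ready time accumulated in one 20 s (20000 ms) realtime sample
ready_ms=4000
interval_ms=20000
echo "scale=1; $ready_ms / $interval_ms * 100" | bc   # -> 20.0% ready, which is ugly
# Many people then divide by the vCPU count to get a per-vCPU figure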

The last topic to discuss is NUMA node affinity. In Advanced Properties, a VM can be given a NUMA node affinity. This is a very dangerous setting, because with it set, a VM literally stays put, potentially at the expense of everything else in a shared infrastructure, since the NUMA load balancer can no longer touch it or move it.

As I understand it, the vCPU "Hot Add" feature also nullifies what the NUMA load balancer is trying to do, because the elasticity of vCPUs creates an "all bets are off" situation that it simply cannot manage effectively. In this age of automation and auto-scaling, engines that monitor CPU and add it "hot" (on the fly, without shutting a VM down and restarting it) look super cool. But they can result in severe performance degradation.

Lastly, the vcpu.numa.preferHT setting: what this does is align the virtual machine's vCPU topology with the underlying physical topology. This gets into the topic of VPDs (Virtual Proximity Domains) and PPDs (Physical Proximity Domains). With preferHT, the NUMA scheduler takes more aggressive advantage of hyperthreads when making core placements than it would when using physical cores as the basis for placement (which prioritizes memory locality).
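Since this ends up as an advanced configuration option on the VM, a quick way to confirm a tenant VM actually has it is to check the .vmx after the option has been added through the UI. A sketch with a hypothetical datastore path:

# Hypothetical path - adjust to your datastore and VM name
grep -i preferht /vmfs/volumes/datastore1/tenant-vm01/tenant-vm01.vmx
# The line added under VM Options > Advanced > Edit Configuration should come
# back with the preferHT option set to "TRUE"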

If a VM is cache/memory intensive, memory locality is important - say you are running a VM with a TimesTen in-memory database. But if it is a packet-pumping, CPU-intensive VM that doesn't do a lot of memory access (reads and writes), the computational boost of hyperthreading might give it more of an advantage.

Friday, August 18, 2023

The Linux XFS File System - How Resilient Is It?

We are using VMware datastores over NFS version 3.x. The storage traffic was routed, which is never a good thing to do, because let's face it: if your VMs all lose their storage simultaneously, that constitutes a disaster. Having a dependency on a router, which can lose its routing prefixes due to a maintenance or configuration problem, is architecturally deficient (a polite way of putting it). To avoid this, make sure there are no routing hops between the storage and the storage interface on the hypervisor (keep them on the same segment).

So, after our storage routers went AWOL due to a maintenance event, I noticed that some VMs came back and appeared to be fine: they had rebooted and were sitting at a login prompt. Other VMs, however, did not come back, and had some nasty things printing on the console (you could not log into these VMs).


What we noticed was that any Linux virtual machine running the XFS file system on boot or root (/boot or /) had this issue of being unrecoverable. VMs using ext3 or ext4 seemed able to recover and start running their services - although some were still echoing messages to the console.
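For the XFS machines that would not boot, the usual recovery path is to attach a rescue ISO (or mount the virtual disk on another VM) and run xfs_repair against the damaged filesystem while it is unmounted. A minimal sketch, assuming the root filesystem is on /dev/sda1:

# From a rescue environment, with the filesystem unmounted
xfs_repair /dev/sda1

# If xfs_repair refuses to run because the log is dirty and cannot be replayed,
# -L zeroes the log; this can lose the most recent transactions, so treat it
# as a last resort
xfs_repair -L /dev/sda1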

There is a lesson here: the file system matters when it comes to resiliency in a virtualized environment.

I did some searching around for discussions on file system types, and of course there are many. This one in particular, I found interesting:  ext4-vs-xfs-vs-btrfs-vs-zfs-for-nas


Monday, April 26, 2021

Tenancy is Critical on a Cloud Platform

With this new VMware platform, it was ultimately decided to go with ESXi hypervisors managed by vCenter, plus NSX-T.

During the POC, it was pointed out that this combination of solutions had some improvements and enhancements over OpenStack (DRS, vMotion, et al.). But one thing seemed to be overlooked, and we pointed it out: tenancy.

VMware attempts to address tenancy with vertical-stack point solutions like vCloud Director (positioned at service providers) or vRealize Automation - the latter of which is going through a complete transformation in its latest version. These solutions are also expensive. And if you don't have the budget, what are your options?

One option is to set up Resource Pools and Folders in vCenter. Not the cleanest solution because you cannot set policies, workflows, etc.

What else can you do? Well, you can use a Cloud Management solution.

We had Cloudify as an orchestrator, and we evaluated it as a Cloud Management solution. What we found in the end was that Cloudify excelled at complex orchestration, but it was not designed and built, ground-up, to be a Cloud Management Platform.

It seemed that this lack of tenancy became apparent to everyone all at once - once the platform came up on VMware. And with Cloudify, we lacked the blueprint development for the scores to hundreds of tasks that we needed; it required integrations with NSX-T, vCenter, and a host of other solutions.

We looked at a couple of other solutions, and settled on a solution called Morpheus.

I will blog a bit more about Morpheus in upcoming posts. I have been very hands-on with it lately. 

Tuesday, May 28, 2019

Palo Alto Firewall VM Series - Integration and Evaluation - Part III

Okay, this is just a short post to discuss where we are in the integration process.
  1. I have a Python script that generates XML. You pass in parameters, and the script uses Python's ElementTree (ETree) library to build the XML.
  2. I have some bash scripts that take the XML files, and invoke the Python XML API Wrapper, which in turn does the legwork to send the data to the API Server on the Firewall.
Normally one might create the Management Profile, Zones, and Security Policies first, and THEN add or assign interfaces, routers, routes on those routers, etc.

This is the basic process I am following thus far:
  1. Create Management Profile
  2. Load Interface(s) - the management profile in #1 is included.
  3. Create Zone(s)
  4. Create Security Policies - the interfaces included
  5. Assign interface to default router
  6. Load Static Route on default router - include interface
It seems to be working okay, although the process needs to be tightened up a bit so that you are not using one Python program to generate the XML and another to call the API.
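To make step 2 of the integration concrete, the wrapper ultimately ends up issuing calls along these lines against the XML API (hypothetical firewall address and API key; the xpath follows the zone layout described in the XML API Guide, as I understand it):

# Hypothetical firewall address and API key for illustration
FW=192.168.122.50
KEY='PASTE-API-KEY-HERE'

# Create a zone named "untrust" and place ethernet1/1 in it via a config/set call
curl -skG "https://$FW/api/" \
  --data-urlencode "type=config" \
  --data-urlencode "action=set" \
  --data-urlencode "key=$KEY" \
  --data-urlencode "xpath=/config/devices/entry[@name='localhost.localdomain']/vsys/entry[@name='vsys1']/zone/entry[@name='untrust']" \
  --data-urlencode "element=<network><layer3><member>ethernet1/1</member></layer3></network>"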

But it's good enough to load and test and see if I can get a firewall operational.

Friday, May 17, 2019

Palo Alto Firewall VM Series - Integration and Evaluation - Part II


After a couple of days of playing around with the Palo Alto VM-Series Firewall (running the VM on a KVM / LibvirtD virtualization platform on a CentOS7 host), I felt I was comfortable enough with it to explore the API.

I asked a Palo Alto engineer how they bootstrap these things. He told me they use CloudInit and use a boot.xml file to change the default password. From there, they use their management platform, Panorama, to push configurations to the devices.

I don't happen to have Panorama anywhere. And I presume like everything else, it needs licenses. So, I started looking at the facilities to interface/integrate with the device; meaning APIs.

There are actually several APIs:

  • Command Line Interface (CLI)
  • WildFire API
  • AutoFocus API
  • PAN-OS Licensing API
  • Panorama XML API (requires Panorama of course)
  • Pan-OS XML API

I located, downloaded, and glanced through the XML API Guide, which actually does a nice job of getting you acquainted with the API. There is nothing really unusual: you need to authenticate and get a token (they call it a key), and with that key you can go to work (I won't cover the details of the API here).

Next it was time to examine the API firsthand. Is it running? Do I need a license? I used Postman for this. I don't know if there are better tools for poking at APIs, but Postman is definitely one of the most popular. Making add/modify changes is always risky when you are learning a new API, so it makes sense to start with some "get" calls to understand the structure of the data. I was able to hit the VM on standard SSL port 443, get back a key, and, with the key, run a few get commands based on examples in the API Guide. The API works, it appears!
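Translated out of Postman, the first two calls look roughly like this with curl (hypothetical management IP and credentials; -k mirrors Postman's disabled certificate validation):

# Ask the firewall for an API key
curl -skG "https://172.22.0.10/api/" \
  --data-urlencode "type=keygen" \
  --data-urlencode "user=admin" \
  --data-urlencode "password=admin123"
# The response contains <key>...</key>

# A harmless read-only "op" call to prove the API is alive
curl -skG "https://172.22.0.10/api/" \
  --data-urlencode "type=op" \
  --data-urlencode "cmd=<show><system><info></info></system></show>" \
  --data-urlencode "key=PASTE-API-KEY-HERE"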

One noteworthy comment: the API would not work until I turned off certificate validation in Postman's settings!

Next, I considered starting to write some Python client code, but as Palo Alto is a pretty popular firewall from a large company, there had to be folks who had broken ground on that already, right? A quick Google search for a Python API client turned up a project from Kevin Steves, who has clients for ALL of the APIs in Python. It is on GitHub with a free-use license.

https://github.com/PaloAltoNetworks/pandevice/

After cloning this, I noticed you can run setup. I elected not to run setup and just tried to invoke the API directly, using the panxapi.py file. Examining the source code, you can supply an exhaustive list of options to the main() function of the Python file, which parses them and invokes the API accordingly.

Immediately, however, I ran into the same certificate validation error I experienced with Postman. In Postman I could just go into settings and disable certificate validation; figuring out how to do this with the API client was more difficult. Eventually, I found an issue recorded on the project that discusses this same problem, which can be found at this link: Certificate Validation Issue

The issue discusses versions of Python on CentOS that do certificate checking. Rather than fool with upgrading Python, one poster pointed out that you can, in fact, disable certificate checking in Python by setting an environment variable: "export PYTHONHTTPSVERIFY=0". Bingo. That's all I need right now to experiment with the API.
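For the record, the workaround plus a first key-generation call looked something like this (hypothetical host and credentials; the panxapi.py flags are from memory of the project's documentation, so treat them as approximate):

# Lab-only workaround: tell Python's HTTPS machinery to skip certificate checks
export PYTHONHTTPSVERIFY=0

# Generate an API key with the cloned client
./panxapi.py -h 192.168.122.50 -l admin:admin123 -k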

Tuesday, May 14, 2019

Palo Alto Firewall VM Series - Integration and Evaluation - Part I

This week, I am evaluating the Palo Alto VM-Series Firewall.

I will ramble a bit about what I am doing, learning, etc.

First off, this VM-Series Firewall can run as a qcow2 image on KVM, it can load as an OVF onto the VMWare vSphere ESXi platform, and I have seen some evidence of people loading it on VirtualBox also. I am using it on a KVM (libvirtd) host.

The main thing about a virtual firewall appliance is how it plumbs into the virtual networking, using virtual adaptors, or host NICs.

On the VM I just installed, I set up 4 adaptors.

4 NICs on the Palo Alto VM
If we assume 4 NICs on the virtual machine, the very first NIC is designated as the management NIC. What is confusing is that you might expect this NIC to show up in the list of interfaces. It doesn't. You have to "know" that the first NIC is the management NIC, and that it does NOT show up in the list of Interfaces.

If we look at the screenshot below, you will see a Management IP Address on a 172.22.0.0/24 network. This is shown on the "Dashboard" tab of the Palo Alto user interface.

The Management Interface connects to a VM NIC that does not show up in the list of Interfaces

Yet, if we look at the list of Interfaces (Network tab), we will see that this Management Interface (the first NIC in the KVM list of adaptors) does not appear.


I would have to go back and see how well this is documented, as I admittedly dive in without reading documentation sometimes. But it was NOT very intuitive. I would prefer the interfaces to "line up" with the VM adaptors, and if one is a Management Interface, perhaps grey it out or handle it in a clearly distinct way.

I understand why Palo Alto did this: the Management Interface is generally considered a unique pipe into the product, used for administration of the device itself, and is generally not part of the traffic plane. But it did make things difficult at first, because I did not know which bridge the first interface was actually connected to - br0 (which has a 172.22.0.0/24 network on it) or br1 (which has a 172.21.0.0/24 network on it).

Palo Alto, like FirewallD, is a zone-based firewall. So while you may initially be tempted to fixate on your interfaces (trying to get them to light up green), the first thing you really SHOULD do is a bit of forethought and planning: create your Zones.

I created two zones (currently):

  • Trusted L3
  • UnTrusted L3
Two Zones Initially Created - L3 Trusted, and L3 Untrusted
The Untrusted Zone contains the interface Ethernet1/1, which is connected to a host adaptor via a bridge. I considered this interface untrusted because, in my thinking, it connects up to my router much like a firewall at the edge of a premises might connect to an ISP.

The Trusted Zone contains two interfaces Ethernet1/2 and Ethernet1/3. 

Ethernet1/2 is mapped to an adaptor on the virtual machine that is connected to the "default" KVM network, which has a CIDR of 192.168.122.0/24. But this is a NAT network! Packets that leave the KVM host are source NAT'ed. How this works with the firewall, I don't know yet - I have not tested extensively with this type of network.

Ethernet1/3 is mapped to an adaptor on the virtual machine that is connected to an internal KVM network with a CIDR of 192.168.124.0/24. This network, though, is NOT a NAT network; it is a routed network. A routed network is routed between KVM internal networks but is generally not reachable from outside the KVM host, because KVM creates iptables rules that drop inbound packets coming from any host other than the KVM host itself (I tested this - pings from another host get dropped by a FORWARD chain rule before they ever reach the network). I suppose that, theoretically, if you hacked the iptables rules properly on the KVM host, these internal networks could be made reachable. Maybe there is a facility designed to accommodate strange circumventions like that, but messing with iptables rules, and especially the order of those rules, is prone to issues.
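You can see the rules libvirt installs for these networks directly on the KVM host; a quick way to eyeball why the external pings die is sketched below (the exact chain layout varies a bit by libvirt version):

# List the FORWARD chain with packet counters; the libvirt-installed REJECT/DROP
# rules scoped to the virbr bridges are what eat packets arriving from outside
iptables -L FORWARD -n -v

# Narrow it down to the routed 192.168.124.0/24 network
iptables -S FORWARD | grep 192.168.124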

So in summary, it does appear that the Palo Alto Firewall VM-Series, on KVM, will work with KVM Internal networks in a respectful way. You would just want to classify these as "Trusted" networks when it comes to zones and security policies.

Friday, November 9, 2018

There are other container platforms besides Docker? Like LXC?


I'm relatively new to containers technology. I didn't even realize there were alternatives to Docker (although I hadn't really thought about it).

A colleague of mine knew this, though, and sent me this interesting link.

https://robin.io/blog/linux-containers-comparison-lxc-docker/

This link is a discussion about a more powerful container platform called LXC, which could be used as an alternative to Docker.

I'm still in the process of learning about it. Will update the blog later.

Wednesday, October 31, 2018

Data Plane Development Kit (DPDK)


I kept noticing that a lot of the carrier OEMs are implementing their "own" Virtual Switches.

I wasn't really sure why, and decided to look into the matter. After all, there is already a fast-performing OpenVSwitch which, while fairly complex, is powerful, flexible, and, well, open source.

As it turns out, there is actually a faster way to do networking than with native OpenVSwitch.

OpenVSwitch tries to minimize the context switching between user space and kernel space involved in taking packets from a physical port and forwarding them to virtualized network functions (VNFs) and back.

But DPDK provides a means to circumvent the kernel and have practically everything run in user space, interacting directly with the hardware (bypassing the kernel).

This is fast indeed, if you can do it. But it bypasses everything a kernel network stack provides, so there has to be some sacrifice (which I need to look into and understand better). One of the ways it bypasses the kernel is through Direct Memory Access (DMA), based on some limited reading (frankly, reading, digesting, and understanding this material usually requires several passes and a bit of concentration, as this stuff gets very complex very fast).

The other question I have is: if DPDK is bypassing the kernel en route to a physical NIC, what about other kernel-based networking services that are using that same NIC? How does that work?

I've got questions. More questions.

But up to now, I was unaware of DPDK and its role in the new generation of virtual switches coming out. Even OpenVSwitch itself has a DPDK-enabled version.
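If you want to check whether a given Open vSwitch deployment is actually using DPDK, the standard knobs look roughly like this (a sketch based on the OVS-with-DPDK install docs; exact support depends on how OVS was built and its version):

# Does this build of ovs-vswitchd advertise DPDK support?
ovs-vswitchd --version

# On DPDK-enabled builds, this turns on the userspace datapath
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true

# Physical DPDK ports are then attached by PCI address rather than by
# kernel interface name
ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
    options:dpdk-devargs=0000:01:00.0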

Sunday, October 28, 2018

Service Chaining and Service Function Forwarding

I had read about the concepts of service chaining and service function forwarding early on, in an SD-WAN / NFV book that was, at the time, ahead of its time. I hadn't actually SEEN this, or implemented it, until just recently on my latest project.

Now, we have two "Cloud" initiatives going on at the moment, plus one that's been in play for a while.
  1. Ansible - chosen over Puppet, and Chef in a research initiative, this technology is essentially used to automate the deployment and configurations of VMs (LibVirt KVMs to be accurate). 
    • But there is no service chaining or service function forwarding in this.
  2. OpenStack / OpenBaton - this is a project to implement Service Orchestration - using ETSI MANO descriptors to "describe" Network Functions, Services, etc.
    • But we only implemented a single VNF, and did not chain them together with chaining rules, or forwarding rules. 
  3. Kubernetes - this is a current project to deploy technology into containers. And while there is reliance and dependencies between the containers, including scaling and autoscaling, I would not say that we have implemented Service Chaining or Service Function Forwarding the way it was conceptualized academically and in standards.
The latest project I was involved with DID make use of Service Chaining and Service Function Forwarding.  We had to deploy a VNF onto a Ciena 3906mvi device, which had a built-in Network Virtualization module that ran on a Linux operating system. This ran "on top" of an underlying Linux operating system that dealt with the more physical aspects of the box (fiber ports, ethernet ports both 1G and 100G, et al).

It's my understanding that the terms Service Chaining and Service Function Forwarding have their roots in the YANG reference model. https://en.wikipedia.org/wiki/YANG

This link has a short primer on YANG. 

YANG is supposed to extend a base set of network operations that are spelled out in a standard called NETCONF (feel free to research this - it and YANG are both topics in and of themselves).

In summary, it was rather straightforward to deploy the VNF - you had to know how to do it on this particular box, but it was straightforward. What was NOT straightforward was figuring out how you wanted your traffic to flow, and configuring the Service Chaining and Service Function Forwarding rules accordingly.

What really hit home for me is that the virtual switch (fabric) is the epicenter of the technology. Without knowing how these switches are configured and how they inter-operate, you can't do squat - automated, manual, with descriptors, or not. And that includes troubleshooting them.

Now, with Ciena, the virtual switch on this box was proprietary. So you were configuring Flooding Domains, Ports, Logical Ports, Traffic Classifiers, VLANs, etc. This is the only way to make sure your traffic hop-scotches around the box the way you want it to, based on the rules you specify.

Here is another link on Service Chaining and Service Function Forwarding that's worth a read.


Monday, October 15, 2018

Kubernetes Part VI - Helm Package Manager

This past week, a colleague introduced me to something called Helm, which is sort of like "pip" for Python: it manages Kubernetes packages (it is a Kubernetes package manager).

The reason this was introduced:
We found SEVERAL github repos with Prometheus Metrics in them, and they were not at all consistent.
  • Kubernetes had one
  • There was another one at a stefanprod project
  • There was yet a third called "incubator"
My colleague, through relentless research, figured out (or decided) that the one installed through Helm was the best one.

This meant I had to understand what Helm is. Helm is divided into a client (Helm), a server (Tiller), and the packages you install (Charts). I guess it's a maritime-themed concept, although I don't know why they can't just call a package a package (copyright reasons, maybe?).

So I installed Helm, and that went smoothly enough. I installed it on my Kubernetes master. I also downloaded a bunch of charts from the stable repository on GitHub (Prometheus is one of these). They all sit in a /stable directory (after the git clone, ./charts/stable).

When I came back and wanted to pick things back up, I wasn't sure whether I had installed Prometheus or not. So I ran "helm list" and got the following error:

Error: configmaps is forbidden: User "system:serviceaccount:kube-system:default" cannot list configmaps in the namespace "kube-system"

Yikes. For a newbie, this looked scary. Fortunately Google had a fix for this on a StackOverflow page.

I had to run these commands:
kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'      
helm init --service-account tiller --upgrade
 
These seemed to work. The "helm list" command showed no results, though, so I guess I need to install the Prometheus package (sorry... chart) after all, with Helm, now.
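Since Tiller now has its service account, the install itself should be a one-liner. A sketch using the Helm v2 syntax that matches the Tiller setup above (the release name and namespace are arbitrary choices of mine):

helm install stable/prometheus --name prometheus --namespace monitoring
helm list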

But, more importantly, I really need to take some time to understand what we just ran above - the service account, the cluster role binding, et al.

Friday, September 21, 2018

Docker: Understanding Docker Images and Containers


So this week the emphasis has been on understanding how Docker works.

First, I learned about Docker images - how to create images, etc.

There is a pretty good blog that can be used to get going on this topic, which can be found here:
https://osric.com/chris/accidental-developer/2017/08/running-centos-in-a-docker-container/

Docker images are actually created in layers. So you generally start off by pulling in a Docker image for CentOS, and then running it.

This is done as follows:
# docker pull centos
# docker run centos
# docker image ls

Note: If you run "docker container ls" at this point, you won't see anything, because the image isn't running and therefore no container exists yet. Once you run the image, THEN you have a container, and "docker container ls" will show it.

# docker run -it centos

Once you run the image, you are now "in" the container, and you get a new prompt with a new container ID, as shown below:

[root@4f0b435cbdb6 /]#

Now you can make changes to this container as you see fit: yum install packages, or copy things into the running container using the "docker cp" command.

Once you get a container the way you want it, you can exit it, and then use that container's ID (don't lose it!) to "commit" it as a new image.
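Using the container ID from the prompt above, the commit step looks roughly like this (the image name and tag match the push example further down):

# docker commit 4f0b435cbdb6 kubernetes-master:5000/centos-revised-k8s:10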

Once committed, you need to push it to a registry.

Registries are another topic. If you use Docker Hub (https://hub.docker.com), you need to create an account, and you can create a public or a private repository. If you use a private one, you need to authenticate to use it. If you use a public one, anyone can see, take, or use whatever you upload.

JFrog Artifactory is another artifact repository that can be used.

Ultimately, what we wound up doing is creating a local registry in a container, using the following command:
# docker run -d -p 5000:5000 --restart=always --name registry registry:2

Then, you can push your newly saved container images to this registry by using a command such as:
# docker push kubernetes-master:5000/centos-revised-k8s:10

So essentially we pushed to a specific host (kubernetes-master) on port 5000, and gave the image a name and a new tag.
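From any other host that can reach that registry (and, since this simple registry is not using TLS, has it listed under insecure-registries in its Docker daemon config), the image can be pulled back down with:

# docker pull kubernetes-master:5000/centos-revised-k8s:10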

