
Friday, April 4, 2025

Optimizing for NUMA in a Virtualized Environment

Back in January, we had a platform tenant come to us requesting that the vcpu.numa.preferHT setting be set on all of their VMs. As we weren't completely familiar with this setting, it naturally concerned us, since this is a shared on-prem cloud platform that runs VNFs. This particular tenant had some extremely latency-sensitive workloads, so they were interested in reducing the latency and jitter that one of their customers had apparently complained about.

Diving into NUMA in a virtualized environment is not for lightweights. It requires you to understand SMP and the latest advances in NUMA architecture, and on top of that, how those concepts map and apply to VMs running operating systems like Linux on top of what is already a proprietary, Unix-like OS of its own (VMware's VMkernel).

I have found this diagram extremely helpful for showing how NUMA works to platform tenants who don't understand NUMA, or don't understand it well.


In this diagram, we have a theoretical dual-socket CPU system with 24 cores on each socket - 48 total physical cores. Each socket is a physical NUMA node, so there is a NUMA node 0 and a NUMA node 1. Now, it is not a "law" that a socket is equivalent to a physical NUMA node, but in this scenario that is indeed the case (and on many low- to mid-grade servers, depending on the chipset and architecture, you will usually see sockets equal to NUMA nodes).

Each NUMA node usually gets its own "memory real estate". To use a kitchen analogy, the idea is that processes (VMs) that access memory should be able to hit their local cache (i.e. the kitchen table), but if they get a cache miss, they can pull from their local memory allotment (i.e. run to the basement pantry). If this is not possible, there is an expensive "trip to the store" - remote memory - which takes longer to retrieve from. Ideally, in a high-performing environment, you want 100% of memory requests to be local.

If one were to go into the BIOS and enable Hyperthreading, contrary to what people seem to think, this does NOT double the number of processor cores you have! What happens on each socket is that each physical core presents two hardware threads, so the scheduler sees 48 "logical" cores instead of 24 physical cores.

Now, hyperthreading does give some advantages - especially for certain kinds of workloads. There is a high (and always growing) amount of non-blocking parallelism in SMT processing, so having 48 logical cores can certainly boost performance - although that is an in-depth topic not covered here. The important takeaway is that going from 24 physical cores to 48 logical cores is not a linear increase. There is considerably more overhead in managing 48 cores than 24, for starters; then context switching and everything else factors in.

A virtual machine (that is NOT NUMA aware) provisioned with 52 vCPUs does NOT fit onto a single NUMA node (which has 24 cores, or 48 with HT enabled). It may be provisioned by default with 1 core per socket, resulting in 52 sockets, which allows the hypervisor to "sprinkle them around" to available slots on all of the NUMA nodes, at the expense of managing all of that. Or perhaps the provisioner steps in, overrides the default behavior, and specifies 52 cores on a single socket. Neither situation will contain the VM on a single NUMA node, for the simple reason that it does not fit: there are only 24 physical cores - 48 logical if Hyperthreading is enabled, as it is in this particular example.

Now to elaborate on this example, there really aren't even 24 usable cores, because the hypervisor's OS is going to reserve cores for itself - perhaps two physical cores on each socket. So in reality, one may think they can fit a 48 core VM (people like to provision in powers of 2 in a binary world) onto a single NUMA node in this example, only to discover after deployment that this isn't the case - they in fact needed to provision 44 cores in order for the NUMA placement algorithm to "home" the VM on one NUMA home node (0) versus the adjacent one (1).
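To make the arithmetic in this example concrete, here is a small sketch. The two-physical-cores-per-socket reserve is the assumption from the paragraph above, not a fixed rule - check what your own hypervisor actually holds back:

```shell
# Sizing arithmetic for the example host above:
# 24 physical cores per socket, hyperthreading on,
# and an assumed reserve of 2 physical cores per socket.
phys_per_node=24        # physical cores on one socket / NUMA node
threads_per_core=2      # logical threads per physical core with HT
reserve=2               # assumed physical cores held back per socket

logical_per_node=$(( phys_per_node * threads_per_core ))            # 48
usable_logical=$(( (phys_per_node - reserve) * threads_per_core ))  # 44

echo "logical cores per NUMA node:            $logical_per_node"
echo "largest VM that homes on a single node: $usable_logical vCPU"
```

This is why a 48 vCPU VM in this example spills across nodes, while a 44 vCPU VM can be homed on one.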

So "rightsizing" a VM to 44 cores on a single socket will give some performance benefit in most cases, because the VM will be scheduled onto one NUMA node or the other in this example. If you provisioned 2 VMs, one might get NUMA node 0 while the next gets NUMA node 1. But they should stay put on their home nodes - UNLESS more VMs are provisioned and contention begins. In that case, the NUMA scheduler may decide to shift a VM from one NUMA node to another. When this happens, it is known as a "NUMA migration".

As stated earlier, when a VM is on a specific NUMA node, it is best to have its memory localized. But it IS possible to provision a 44 core VM that sits on a NUMA home node (i.e. 0) with its memory only 90% localized instead of fully localized. In that case, the VM might have more memory provisioned than it should to be properly optimized. This is why it is super important to ensure that the memory is actually being utilized - and not just reserved! And frankly, the same goes for the vCPU resources. But in general, making a VM "NUMA aware" by sizing it to fit on a NUMA node will cut down on migrations and, in most cases (but not all), improve performance.

VMware has a "Rightsizing" report in Aria Operations, that figures this out - and exposes VMs that were over-provisioned with memory and cores that are not being used.  This is a big problem for a shared environment because the net effect of having scores of VMs on a hypervisor that are underutilized, is that the CPU Ready percentage starts to go through the roof. You will look at the stats, and see low CPU utilization and low memory utilization, and think "everything looks under control". But then you will look at the latency on datastores (read and write - write especially), or the CPU Ready percentage (which is showing how much time a process waits for CPU runtime - or, time on the die), and see that they are unsatisfactory to downright abysmal. This is why the Rightsizing report exists - because instead of buying more hardware, there does exist the option to downsize (rightsize) the virtual machines without making additional capital investments in hardware servers.

The last topic to discuss is NUMA node affinity. In Advanced Properties, a VM can be given a NUMA node affinity. This is a very dangerous setting, because with it set, the VM literally stays put, potentially at the expense of everything else in a shared infrastructure, because the NUMA load balancer can't touch it or move it.
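For reference, this knob ends up as a key/value pair in the VM's .vmx file - shown here pinning a VM to node 0 (an illustrative value; given the caveat above, verify against VMware's documentation before ever using it on a shared platform):

```
numa.nodeAffinity = "0"
```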

As I understand it, the vCPU "Hot Add" feature also nullifies what the NUMA load balancer is trying to do: enabling Hot Add disables virtual NUMA (vNUMA) for the VM, and the elasticity of vCPUs creates an "all bets are off" situation that the scheduler simply can't manage effectively. In this day and age of automation and auto-scaling, all of these engines that want to monitor CPU and add it "hot" (on the fly, without shutting a VM down and restarting it) look super cool. But they can result in severe performance degradation.

Lastly, the vcpu.numa.preferHT setting: what this does is align the virtual machine's vCPU topology with the underlying physical topology. This gets into the topic of VPDs (Virtual Proximity Domains) and PPDs (Physical Proximity Domains). With preferHT, the NUMA scheduler takes more aggressive advantage of hyperthreads when making core placements than it would using physical cores as the basis for placement (which prioritizes memory locality).

If a VM is cache / memory intensive, memory locality is important - say you are running a VM with a TimesTen in-memory database. But for a packet-pumping, CPU-intensive VM that doesn't do a lot of memory access (reads and writes), the computational boost of hyperthreading might give it more of an advantage.

Sunday, January 19, 2025

NUMA PreferHT VM setting on a Hyperthread-Enabled ESXi Hypervisor

This could be a long post, because things like NUMA can get complicated.

For background, we are running servers - hypervisors - that have 24 cores. There are two chips - wafers as I like to refer to them - each with 12 cores, giving a total of 24 physical cores.

When you enable hyperthreading, you get 48 logical cores, and this is what is presented to the operating system and CPU scheduler (somewhat - more on this later). But you don't get an effective doubling of cores when you enable hyperthreading. What is really happening is that each of the 24 physical cores presents two hardware threads, "fitting in" another 24 cores and giving you 48 logical cores.

Worth mentioning also is that each core has a "sibling" - and this matters from a scheduling perspective when you see things like CPU pinning used, because if you pin something to a specific core, then that "sibling" cannot be used for something else. For example, with hyperthreading enabled, the cores would pair up like:

0 | 1

2 | 3

4 | 5

... and so on. So if someone pinned to core 4, core 5 is also "off the table" now from a scheduling perspective because pinning is a physical core concept, not a logical core concept.
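Under this adjacent-pair numbering, a logical CPU's sibling is just its number with the low bit flipped. A tiny sketch, assuming the 0|1, 2|3, 4|5 pairing shown above holds on your host:

```shell
# Sibling of logical CPU n, assuming HT siblings are numbered
# in adjacent pairs (0|1, 2|3, 4|5, ...) as in the table above.
sibling() { echo $(( $1 ^ 1 )); }

sibling 4   # -> 5: pin work to 4 and 5 is off the table too
sibling 5   # -> 4: the pairing is symmetric
```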

So with this background, we had a tenant who wanted to enable a "preferHT" setting. This setting can be applied to an entire hypervisor by setting numa.PreferHT=1, affecting all VMs deployed on it.

Or, one can selectively add this setting to a specific virtual machine by going into its Advanced Settings and configuring numa.vcpu.preferHT=TRUE.
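As a reference point, the per-VM advanced setting is just a key/value pair that lands in the VM's .vmx file (the host-wide equivalent lives in the hypervisor's advanced NUMA options rather than in any one .vmx):

```
numa.vcpu.preferHT = "TRUE"
```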

In our case, it was the VM setting being requested - not the hypervisor setting. Now, this tenant is the "anchor tenant" on the platform, and their workloads are very latency sensitive, so it was important to jump through this hoop when it was requested. First, we tested the setting by powering a VM off, adding the setting, and powering the VM back on. No problems with this. We then migrated the VM to another hypervisor, and had no issues with that either. Aside from that, though, how do you know that the VM setting "took" - meaning that it was picked up and recognized?

It turns out, that there are a couple of ways to do this:

1. esxtop

When you load esxtop, it shows CPU statistics by default. Hit the "m" key to switch to the memory view, then hit the "f" key, and a list of fields will show up. One of them is NUMA Statistics. By selecting this, you get a ton of interesting information about NUMA. The fields you are most interested in are going to be:

NHN - Current home node for the virtual machine or resource pool - in our case, this was 0 or 1 (we had two numa nodes, as there is usually one per physical cpu socket).

NMIG - Number of NUMA migrations between two snapshot samples

NRMEM - (NUMA Remote Memory): Amount of remote memory allocated to the virtual machine, in MB

NLMEM (NUMA Local Memory) - Amount of local memory allocated to the virtual machine, in MB

N%L - the percentage of the VM's memory that is local. You want this number to be 100%, but seeing it in the 90s is probably okay too; anything much lower means a significant share of memory accesses are traversing the NUMA interconnect, which adds latency.

GST_NDx (Guest Node x): Guest memory being allocated for the VM on NUMA node x, where x is the node number

MEMSZ (Memory Size): Total amount of physical memory allocated to a virtual machine

2. vmdumper command

I found this command on a blog post - which I will list in my sources at the end of this blog post. This useful command can show you a lot of interesting information about how NUMA is working "under the hood". It can show you a logical-processor-to-NUMA-node map, how many home nodes are utilized for a given VM, and the assignment of NUMA clients to their respective NUMA nodes.

One of the examples covered in this blog post refers to the situation where a VM has 12 vCPUs on a 10 core system, and then goes down and shows what it would look like if the VM had 10 vCPU instead.
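The version of the command I have seen circulated (via the sources below) is roughly the following - it only runs in an ESXi shell, and the egrep pattern is just a filter you can adjust, so treat this as a sketch rather than gospel:

```shell
# For each running VM, pull the NUMA-related lines out of its
# vmware.log (paths come from vmdumper's listing of live VMs).
vmdumper -l | cut -d \/ -f 2-5 | while read path; do
  egrep -oi "DICT.*(numa|cores|vcpu|memsize|affinity).*= .*|numa:.*|numaHost:.*" \
    "/$path/vmware.log"
  echo
done
```

The numaHost lines in the output show the VPD/PPD layout and which home node(s) each NUMA client was assigned, which is how you confirm the preferHT setting actually took.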


Sources:

http://www.staroceans.org/ESXi_VMkernel_NUMA_Constructs.htm

https://frankdenneman.nl/2010/02/03/sizing-vms-and-numa-nodes/

https://frankdenneman.nl/2010/10/07/numa-hyperthreading-and-numa-preferht/

https://docs.pexip.com/server_design/vmware_numa_affinity.htm

https://docs.pexip.com/server_design/numa_best_practices.htm#hyperthreading

https://knowledge.broadcom.com/external/article?legacyId=2003582


 
