Back in January, we had a platform tenant come to us requesting that the vcpu.numa.preferHT setting be set on all of their VMs. As we weren't completely familiar with this setting, it naturally concerned us, since this is a shared on-prem cloud platform that runs VNFs. This particular tenant had some extremely latency-sensitive workloads, so they were interested in reducing the latency and jitter that one of their customers had apparently complained about.
Diving into NUMA in a virtualized environment is not for lightweights. It requires you to understand SMP, the latest advances in NUMA architecture, and, on top of that, how all of it maps and applies to VMs running operating systems like Linux on top of what is itself a proprietary, POSIX-like operating system (PhotonOS).
I have found this diagram to be extremely helpful in showing how NUMA works to platform tenants who don't understand it, or don't understand it well.
In this diagram, we have a theoretical dual-socket CPU system with 24 cores per socket - 48 physical cores in total. Each socket is a physical NUMA node, so there is a NUMA Node 0 and a NUMA Node 1. Now, it is not a "law" that a socket is equivalent to a physical NUMA node, but in this scenario that is indeed the case (and on many low- to mid-grade servers, depending on the chipset and architecture, you will usually see sockets map one-to-one to NUMA nodes).
Each NUMA node usually gets its own "memory real estate". To use a kitchen analogy, the idea is that processes (VMs) accessing memory should be able to hit their local cache (the kitchen table); on a cache miss, they can pull from their local memory allotment (a run to the basement pantry). If that is not possible, there is an expensive "trip to the store" - remote memory - which takes longer to retrieve from. So ideally, in a high-performing environment, you want 100% of the memory requests to be local.
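To put rough numbers on the analogy, here is a minimal sketch of the weighted-average math. The latency figures are made-up assumptions, purely to illustrate why you want that local percentage as close to 100% as possible:

```python
# Illustrative only: rough effective-latency math for local vs. remote memory access.
# The latency numbers below are assumptions for the sake of the example, not measurements.

LOCAL_ACCESS_NS = 80     # assumed latency to the home node's memory ("basement pantry")
REMOTE_ACCESS_NS = 140   # assumed latency to the other node's memory ("trip to the store")

def effective_latency_ns(local_fraction: float) -> float:
    """Weighted-average access latency for a given fraction of local accesses."""
    return local_fraction * LOCAL_ACCESS_NS + (1 - local_fraction) * REMOTE_ACCESS_NS

for pct in (1.00, 0.90, 0.50):
    print(f"{pct:.0%} local -> {effective_latency_ns(pct):.0f} ns average access latency")
```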
If one were to go into the BIOS and enable Hyperthreading, contrary to what people seem to think, this does NOT double the number of processor cores you have! What happens on each socket is that every physical core presents two logical processors, so the socket now exposes 48 "logical" cores on top of the same 24 physical cores.
Now, hyperthreading does give some advantages - especially for certain kinds of workloads. SMT can exploit a good (and growing) amount of non-blocking parallelism, so having 48 logical cores can certainly boost throughput - although that is a deep topic this post won't cover in detail. The important takeaway is that going from 24 physical cores to 48 logical cores is not a pure linear increase. There is considerably more scheduling overhead in managing 48 logical cores than 24 physical ones, for starters, and then context switching and contention for each core's shared execution resources factor in as well.
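To make the "not a linear increase" point concrete, here is a tiny sketch. The ~25% SMT yield is an assumed, workload-dependent number I'm using only for illustration, not a measured figure:

```python
# Sketch of why 48 logical cores != 2x the throughput of 24 physical cores.
# The SMT yield below is an assumption chosen purely for illustration.

PHYSICAL_CORES_PER_SOCKET = 24
SMT_THREADS_PER_CORE = 2
ASSUMED_SMT_YIELD = 0.25   # assumption: a sibling hyperthread adds ~25% extra throughput

logical_cores = PHYSICAL_CORES_PER_SOCKET * SMT_THREADS_PER_CORE
effective_core_equivalents = PHYSICAL_CORES_PER_SOCKET * (1 + ASSUMED_SMT_YIELD)

print(f"Logical cores presented: {logical_cores}")                                      # 48
print(f"Rough effective capacity: ~{effective_core_equivalents:.0f} core-equivalents")  # ~30
```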
A virtual machine (that is NOT NUMA-aware), provisioned with 52 vCPUs, does NOT fit onto a single NUMA node (which has 24 physical cores, or 48 logical cores with HT enabled). By default it may be provisioned as 1 core per socket, resulting in 52 sockets, which allows the hypervisor to "sprinkle" the vCPUs around to available slots across all of the NUMA nodes - at the expense of managing that spread. Or perhaps the provisioner steps in, overrides the default behavior, and specifies 52 cores on a single socket. Neither situation will contain the VM on a single NUMA node, for the simple reason that it does not fit. There are only 24 physical cores per node - 48 logical cores with Hyperthreading enabled, as it is in this particular example.
Now, to elaborate on this example, there really aren't even 24 cores available, because the hypervisor's OS is going to reserve cores for itself - perhaps two physical cores on each socket. So in reality, one may think they can fit a 48-core VM (people like to provision in powers of 2 in a binary world) onto a single NUMA node in this example, only to discover after deployment that this isn't the case - they in fact needed to provision 44 cores in order for the NUMA placement algorithm to "home" the VM on one NUMA home node (0) versus the adjacent one (1).
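Here is a quick sketch of that fit check, using the example's numbers (24 physical cores per node, HT enabled, and the assumed two reserved cores per socket):

```python
# Does a VM of a given vCPU count fit on one NUMA node in this post's example?
# 24 physical cores per node, HT enabled, and an assumed 2 physical cores per
# socket held back for the hypervisor (an example figure, not a fixed rule).

CORES_PER_NODE = 24
HT_ENABLED = True
HYPERVISOR_RESERVED_CORES = 2

def max_vcpus_per_node(count_hyperthreads: bool) -> int:
    usable_physical = CORES_PER_NODE - HYPERVISOR_RESERVED_CORES
    return usable_physical * 2 if count_hyperthreads else usable_physical

def fits_on_one_node(vcpus: int, count_hyperthreads: bool = HT_ENABLED) -> bool:
    return vcpus <= max_vcpus_per_node(count_hyperthreads)

for vcpus in (52, 48, 44):
    print(f"{vcpus} vCPUs fits on one node: {fits_on_one_node(vcpus)}")
# 52 -> False, 48 -> False, 44 -> True
```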
So "rightsizing" a VM to 44 cores on a single socket will give some performance benefit in most cases, because the VM can be scheduled onto one NUMA node or the other in this example. If you provisioned 2 such VMs, one might get NUMA node 0 while the next one gets assigned NUMA node 1. But they should stay put on that home node - UNLESS more VMs are provisioned and contention begins. In that case, the NUMA scheduler may decide to shift them from one NUMA node to another. When this happens, it is known as a "NUMA migration".
As stated earlier, when a VM is on a specific NUMA node, it is best to have its memory localized. But it IS possible to provision a 44-core VM that sits on a NUMA home node (i.e. 0) while its memory is only 90% localized instead of fully localized. In that case, the VM probably has more memory provisioned than it should for proper optimization. This is why it is super important to ensure that the memory is actually being utilized - and not just reserved! And frankly, the same goes for the vCPU resources. But in general, making a VM "NUMA aware" by sizing it to fit on a NUMA node will cut down on migrations and, in most cases (but not all), improve performance.
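A similar back-of-the-envelope check applies to memory. The 384 GB per-node figure below is just an assumption for illustration - substitute your host's actual per-NUMA-node memory:

```python
# Sketch: flag a VM whose provisioned memory cannot all be local to its home node.
# MEMORY_PER_NODE_GB is an assumed figure for illustration only.

MEMORY_PER_NODE_GB = 384

def locality_estimate(vm_memory_gb: int) -> float:
    """Best-case fraction of the VM's memory that can live on its home node."""
    return min(1.0, MEMORY_PER_NODE_GB / vm_memory_gb)

for vm_gb in (256, 384, 428):
    pct = locality_estimate(vm_gb)
    flag = "OK" if pct == 1.0 else "will spill to remote memory"
    print(f"{vm_gb} GB VM -> best case {pct:.0%} local ({flag})")
```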
VMware has a "Rightsizing" report in Aria Operations that figures this out - and exposes VMs that were over-provisioned with memory and cores they are not using. This is a big problem in a shared environment, because the net effect of having scores of over-provisioned but underutilized VMs on a hypervisor is that the CPU Ready percentage starts to go through the roof. You will look at the stats, see low CPU utilization and low memory utilization, and think "everything looks under control". But then you will look at the latency on datastores (read and write - write especially), or at the CPU Ready percentage (which shows how much time a process waits for CPU runtime - time on the die), and see that they are unsatisfactory to downright abysmal. This is why the Rightsizing report exists: instead of buying more hardware, there is the option to downsize (rightsize) the virtual machines without making additional capital investments in servers.
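For reference, converting the raw CPU Ready "summation" counter (reported in milliseconds) into a percentage is just arithmetic. The sketch below uses the 20-second real-time chart interval and hypothetical numbers, purely to show the math:

```python
# Convert a CPU Ready "summation" value (milliseconds, from a real-time chart)
# into a percentage. Real-time charts sample every 20 seconds; dividing by the
# vCPU count gives a per-vCPU figure. The input numbers are hypothetical.

SAMPLE_INTERVAL_MS = 20 * 1000   # 20-second real-time chart interval

def cpu_ready_percent(ready_summation_ms: float, num_vcpus: int = 1) -> float:
    return (ready_summation_ms / (SAMPLE_INTERVAL_MS * num_vcpus)) * 100

# A 44-vCPU VM showing 44,000 ms of ready time in one 20 s sample:
print(f"{cpu_ready_percent(44_000, num_vcpus=44):.1f}% ready per vCPU")  # 5.0%
```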
The last topic to discuss is NUMA node affinity. In Advanced Properties, a VM can be given a NUMA node affinity. This is a very, very dangerous setting, because once it is set the VM literally stays put, potentially at the expense of everything else in a shared infrastructure, since the NUMA load balancer can't touch it or move it.
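If you need to audit a shared environment for VMs carrying this kind of override, a minimal pyVmomi sketch like the following works. It assumes you already have a handle to the VM object (connection and error handling omitted), and only looks for the two advanced keys discussed in this post:

```python
# Minimal pyVmomi sketch: audit a VM's advanced settings for NUMA-related overrides.
# Assumes `vm` is a vim.VirtualMachine object you have already looked up.

from pyVmomi import vim  # pip install pyvmomi

NUMA_KEYS = ("numa.nodeAffinity", "vcpu.numa.preferHT")

def numa_overrides(vm: vim.VirtualMachine) -> dict:
    """Return any NUMA-related advanced settings present in the VM's extraConfig."""
    return {opt.key: opt.value
            for opt in vm.config.extraConfig
            if opt.key in NUMA_KEYS}

# Example usage (vm obtained elsewhere):
# print(numa_overrides(vm))  # e.g. {'numa.nodeAffinity': '0'}
```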
As I understand it, the vCPU "Hot Add" feature also nullifies what the NUMA load balancer is trying to do (enabling it disables virtual NUMA, so the guest no longer sees the underlying NUMA topology), because the elasticity of vCPUs creates an "all bets are off" situation that it simply can't manage effectively. In this day and age of automation and auto-scaling, all of these auto-scaling engines that want to monitor CPU and add it "hot" (on the fly, without shutting a VM down and restarting it) look super cool. But they can result in severe performance degradation.
Lastly, the vcpu.numa.preferHT setting. What this does is tell the NUMA scheduler to count hyperthreads (logical processors), rather than only physical cores, when it sizes and places the virtual machine's vCPU topology onto the underlying physical topology - which gets into the topic of VPDs (Virtual Proximity Domains) and PPDs (Physical Proximity Domains). With preferHT, the scheduler will consolidate the VM onto a single NUMA node using hyperthreads where it would otherwise have spanned nodes to give each vCPU a full physical core - in other words, it prioritizes memory locality over dedicated physical cores.
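Here is a toy model of that sizing decision as I understand it - not VMware's actual scheduler logic, just the core idea that preferHT lets logical processors count toward fitting the VM on one node:

```python
# Toy model of the sizing decision vcpu.numa.preferHT influences -- not VMware's
# actual scheduler, just the core idea: with preferHT, logical processors count
# toward fitting the VM on one NUMA node; without it, only physical cores do.

PHYSICAL_CORES_PER_NODE = 24
THREADS_PER_CORE = 2

def spans_nodes(vcpus: int, prefer_ht: bool) -> bool:
    capacity = PHYSICAL_CORES_PER_NODE * (THREADS_PER_CORE if prefer_ht else 1)
    return vcpus > capacity

for prefer_ht in (False, True):
    print(f"preferHT={prefer_ht}: 40-vCPU VM spans NUMA nodes? {spans_nodes(40, prefer_ht)}")
# preferHT=False -> True  (splits across nodes, favoring full physical cores)
# preferHT=True  -> False (consolidates on one node, favoring memory locality)
```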
If a VM is cache / memory intensive, memory locality is important - say you are running an Oracle TimesTen in-memory database in the VM. But if the VM is a packet-pumping, CPU-intensive workload that doesn't need to do a lot of memory access (reads and writes), the extra logical cores that hyperthreading exposes might give it more of an advantage.