Monday, October 4, 2021

The first Accelerated VNF on our NFV platform

 I haven't posted anything since April but that isn't because I haven't been busy.

We have our new NFV Platform up and running, and it is NOT on OpenStack. It is NOT on VMWare VIO. It also, is NOT on VMWare Telco Cloud!

We are using ESXi, vCenter, NSX-T for the SD-WAN, and Morpheus as a Cloud Management solution. Morpheus has a lot of different integrations, and a great user interface that gives tenants a place to log in and call home and self-manage their resources.

The diagram below depicts what this looks like from a Reference Architecture perspective.

The OSS, which is not covered in the diagram, is a combination of Zabbix and VROPS, both working in tandem to ensure that the clustered hosts and management functions are behaving properly.

The platform is optimized with E-NVDS, which is also referred to commonly as Enhanced Datapath which requires special DPDK drivers to be loaded on the ESXi hosts, for starters, as well as some configuration in the hypervisors. There are also settings to be made in the hypervisors to ensure that the E-NVDS is configured properly (separate upcoming post).

Now that the platform is up and running, it is time to start discussing workload types. There are a number of Workload Categories that I tend to use:

  1. Enterprise Workloads - Enterprise Applications, 3-Tier Architectures, etc.
  2. Telecommunications Workloads
    • Control Plane Workloads
    • Data Plane Workloads

Control Plane workloads are have more tolerances for latency and system resources than Data Plane Workloads do. 

Why? Because Control Plane workloads are typically TCP-based,  frequently use APIs (RESTful),  and tend to be more periodic in their behavior (periodic updates).  Most of the time, when you see issues related to Control Plane, it is related to back-hauling a lot of measurements and statistics (Telemetry Data). But generally speaking, this data in of itself does not have stringent requirements.

From a VM perspective, there are a few key things you need to do to ensure your VNF behaves as a true VNF and not as a standard workload VM. These include:

  • setting Latency Sensitivity to High, which turns off interrupts and ensures that poll mode drivers are used.
  • Enable Huge Pages on the VM by going into VM Advanced Settings and adding the parameter: sched.mem.lpage.enable1GHugePage = TRUE

Note: Another setting worth checking, although we did not actually set this parameter ourselves, is: sched.mem.pin = TRUE

Note: Another setting, sched.mem.maxmemctl ensures that ballooing is turned off. We do NOT have this setting, but it was mentioned to us, and we are researching this setting.

One issue we seemed to continually run into, was a vCenter alert called Virtual Machine Memory Usage, displaying in vCenter as a red banner with "Acknowledge and Reset to Green" links. The VM was in fact running, but vCenter seemed to have issues with it. The latest change we made that seems to have fixed this error, was to check the "Reserve all guest memory (All locked)" option checkbox.

This checkbox to Reserve all guest memory seemed intimidating at first, because the concern was that the VM could reserve all memory on the host. That is NOT what this setting does!!! What it does, is allow the VM to reserve all of its memory up-front - but just the VM memory that is specified (i.e. 24G). If the VM has has HugePages enabled, it makes sense that one would want the entire allotment of VM memory to  memory to be reserved up front and be contiguous. When we enabled this, our vCenter alerts disappeared.

Lastly, we decided to change DRS to Manual in VM Overrides. To find this setting amongst the huge number of settings hidden in vCenter, you go to the Cluster (not the Host, not the VM, not the Datacenter) and the option for VM Overrides is there, and you have four options:

  • None
  • Manual
  • Partial
  • Full

The thinking here, is that VMs with complex settings may not play well with vMotion. I will be doing more research on DRS for VNFs before considering setting this (back) to Partial or Full.

No comments:

Fixing Clustering and Disk Issues on an N+1 Morpheus CMP Cluster

I had performed an upgrade on Morpheus which I thought was fairly successful. I had some issues doing this upgrade on CentOS 7 because it wa...