Showing posts with label Morpheus. Show all posts

Wednesday, September 18, 2024

Fixing Clustering and Disk Issues on an N+1 Morpheus CMP Cluster

I had performed an upgrade on Morpheus which I thought was fairly successful. I had some issues doing this upgrade on CentOS 7 because it was designated EOL and the repositories were archived, but I worked through that and it seemed everyone was using the system just fine.

Today, however, I had someone contact me to tell me that they provisioned a virtual machine, but it was stuck in an incomplete "Provisioning" state (a state that has a blue icon with a rocketship in it). The VM was provisioned on vCenter and working, but the state in Morpheus never set to "Finalized".

I couldn't figure this out, so I went to the Morpheus help site and discovered that I myself had logged a ticket on this issue quite a while back. It turned out that, in that case, the state never flipped because the clustering wasn't working properly.

So I checked RabbitMQ. It looked fine.

I checked MySQL and Percona, and I suspected that perhaps the clustering wasn't working properly. In the process of restarting the VMs, one of the virtual machines wouldn't start. I had to do a bunch of advanced Percona troubleshooting to figure out that I needed to run a wsrep recovery (to recover the last committed position) before I could start the system and have it properly join the cluster.

The NEXT problem was that Zabbix was screeching about these Morpheus VMs using too much disk space. It turned out that the /var file system was 100% full because of Elasticsearch. Fortunately I had an oversized /home file system, and was able to rsync the elasticsearch directory over to /home and re-link it from its original location.
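The move-and-re-link step can be sketched in Python. This is only an illustration of the idea, not the exact procedure I ran (I used rsync). The function name `relocate_directory` and the paths in the demo are hypothetical. It assumes the service that owns the directory has been stopped first.

```python
import os
import shutil
import tempfile


def relocate_directory(src: str, dst: str) -> str:
    """Copy src to dst, then leave a symlink at src that points to dst.

    Returns the path of the renamed original, which is kept as a safety
    net and must be deleted by hand to actually free the space.
    """
    src = os.path.abspath(src)
    dst = os.path.abspath(dst)
    if os.path.islink(src):
        raise ValueError(f"{src} is already a symlink")
    # copytree refuses to overwrite an existing dst, which protects
    # existing data. Note that it does NOT preserve file ownership,
    # so a chown may be needed afterwards (rsync -a handles that as root).
    shutil.copytree(src, dst, symlinks=True)
    backup = src + ".old"
    os.rename(src, backup)  # keep the original until the service is verified
    os.symlink(dst, src)    # old path now resolves to the new location
    return backup


if __name__ == "__main__":
    # Demo in a throwaway directory; no real system paths are touched.
    with tempfile.TemporaryDirectory() as tmp:
        old = os.path.join(tmp, "var", "elasticsearch")
        new = os.path.join(tmp, "home", "elasticsearch")
        os.makedirs(old)
        with open(os.path.join(old, "segment"), "w") as fh:
            fh.write("index data")
        kept = relocate_directory(old, new)
        print(old, "->", os.readlink(old), "(original kept at", kept + ")")
```

Keeping the renamed original around means the space on the full file system is not released until that copy is removed, so this is a two-step operation: verify, then delete.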

But this gets to the topic of system administration with respect to disks.

First let's start with some KEY commands you MUST know:

>df -Th 

This command (df = disk free) shows how much space is used on each mounted file system. The -h flag prints sizes in human-readable units, and -T adds the file system type alongside the mountpoint. This tells you NOTHING about the physical disks though!

>lsblk -f

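This command (lsblk = list block devices) will give you the physical disk and its partitions, along with the file system type, the mountpoint, the UUID and any labels. It is a device-specific command and, at least in the version that ships with CentOS 7, doesn't show you space consumption.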

>fdisk -l

I don't really like this command that much because of the output formatting. But it does list disk partitions and related statistics.

Some other commands you can use are:

>sudo file -sL /dev/sda3

The -s flag enables reading of block or character special files, and -L enables following of symlinks.

>blkid /dev/sda3

Similar command to lsblk -f above.
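If you want to script the kind of check that df gives you (for example, to catch a full /var before Zabbix does), here is a minimal Python sketch. The function names and the 80% threshold are my own choices for illustration. The percentage is computed as used / (used + available), which should approximate the Use% column of df.

```python
import os
import shutil


def use_percent(path: str) -> float:
    """Percent of space used on the file system that holds path."""
    usage = shutil.disk_usage(path)
    denominator = usage.used + usage.free  # free = space available to non-root
    return 100.0 * usage.used / denominator if denominator else 0.0


def nearly_full(paths, threshold: float = 90.0):
    """Return (path, percent) pairs for paths at or above the threshold."""
    report = []
    for path in paths:
        pct = use_percent(path)
        if pct >= threshold:
            report.append((path, round(pct, 1)))
    return report


if __name__ == "__main__":
    # Only check mountpoints that exist on this machine.
    candidates = [p for p in ("/", "/var", "/home") if os.path.isdir(p)]
    for path, pct in nearly_full(candidates, threshold=80.0):
        print(f"{path}: {pct}% used")
```

Like df, this reports on file systems, not physical disks, so you still need lsblk to see which device sits underneath.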

Thursday, February 16, 2023

Morpheus API - pyMorpheus Python API Wrapper

I have been working on some API development in the Morpheus CMP tool.

The first thing I do when I need to use an API is to see if there is a good API wrapper. I found one out on GitHub, called pyMorpheus.

With this wrapper, I was up and running in absolutely no time, making calls to the API, parsing JSON responses, etc.

The use case I am working on is a "reconciliator" that will do two things:

  • Remove Orphaned VMs
    Find, and delete (upon user confirmation), those VMs that have had the "rug pulled out" from under Morpheus (deleted in vCenter but still sitting in Morpheus as an Instance).
  • Convert Certain Discovered VMs to Morpheus

This part sorta kinda worked. The call to https://<applianceurl>/servers/id/make-managed did take a Discovered VM and convert it to an instance, with a "VMWare" logo on it.

But I was unable to set advanced attributes of the VMs (Instance Type, Layout, Plan, etc.), and this made it only a partial success.

Maybe if we can get the API fixed up a bit, we can get this to work.

One issue is the "Cloud Sync". When we call the API, we do a cloud sync to find Discovered VMs. We do the same cloud sync to determine whether any of a VM's fields in Morpheus change state when someone deletes that VM in vCenter (such a state change is our indicator that the VM is, in fact, now an orphan).

The Cloud Sync is an asynchronous call. You have to wait for an indefinite amount of time to ensure that the results you are looking for in vCenter are reflected in Morpheus. It's basically polling, which is not an exact science. For this reason, the reconciliator tool needs to be run manually, as an operations tool, as opposed to some kind of scheduled batch job.
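The wait-and-check pattern this forces on you can be sketched as a small polling helper with a timeout. This is a generic sketch, not pyMorpheus code: `poll_until` and the `check` callable are hypothetical names. In a real tool, `check` would wrap whatever API call tells you the sync results have landed.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    check: Callable[[], Optional[T]],
    timeout_s: float = 300.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[T]:
    """Call check() repeatedly until it returns a truthy value.

    Returns that value, or None if timeout_s elapses first.
    """
    deadline = clock() + timeout_s
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            return None  # gave up: the caller decides what to do next
        sleep(interval_s)


if __name__ == "__main__":
    # Stand-in for "has the sync shown up yet?": succeeds on the third poll.
    attempts = {"count": 0}

    def fake_sync_check() -> Optional[str]:
        attempts["count"] += 1
        return "synced" if attempts["count"] >= 3 else None

    print(poll_until(fake_sync_check, timeout_s=5.0, interval_s=0.1))
```

The timeout only bounds how long you wait. It cannot tell you whether the sync has really finished, which is why I still run the tool by hand.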


Monday, October 4, 2021

The first Accelerated VNF on our NFV platform

 I haven't posted anything since April but that isn't because I haven't been busy.

We have our new NFV Platform up and running, and it is NOT on OpenStack. It is NOT on VMware VIO. It is also NOT on VMware Telco Cloud!

We are using ESXi, vCenter, NSX-T for the SDN, and Morpheus as a Cloud Management solution. Morpheus has a lot of different integrations, and a great user interface that gives tenants a place to log in, call home, and self-manage their resources.

The diagram below depicts what this looks like from a Reference Architecture perspective.

The OSS, which is not covered in the diagram, is a combination of Zabbix and vROps, working in tandem to ensure that the clustered hosts and management functions are behaving properly.

The platform is optimized with E-NVDS, commonly referred to as Enhanced Datapath. For starters, this requires special DPDK drivers to be loaded on the ESXi hosts, along with settings in the hypervisors to ensure that the E-NVDS is configured properly (separate upcoming post).

Now that the platform is up and running, it is time to start discussing workload types. There are a number of Workload Categories that I tend to use:

  1. Enterprise Workloads - Enterprise Applications, 3-Tier Architectures, etc.
  2. Telecommunications Workloads
    • Control Plane Workloads
    • Data Plane Workloads

Control Plane workloads are more tolerant of latency and of limited system resources than Data Plane workloads are.

Why? Because Control Plane workloads are typically TCP-based, frequently use APIs (RESTful), and tend to be more periodic in their behavior (periodic updates). Most of the time, when you see issues related to the Control Plane, they are related to back-hauling a lot of measurements and statistics (Telemetry Data). But generally speaking, this data in and of itself does not have stringent requirements.

From a VM perspective, there are a few key things you need to do to ensure your VNF behaves as a true VNF and not as a standard workload VM. These include:

  • Set Latency Sensitivity to High, which turns off interrupts and ensures that poll mode drivers are used.
  • Enable Huge Pages on the VM by going into VM Advanced Settings and adding the parameter: sched.mem.lpage.enable1GHugePage = TRUE

Note: Another setting worth checking, although we did not actually set this parameter ourselves, is: sched.mem.pin = TRUE

Note: Another setting, sched.mem.maxmemctl, turns ballooning off when it is set to 0. We do NOT have this setting, but it was mentioned to us, and we are researching it.

One issue we seemed to continually run into was a vCenter alert called Virtual Machine Memory Usage, displaying in vCenter as a red banner with "Acknowledge" and "Reset to Green" links. The VM was in fact running, but vCenter seemed to have issues with it. The latest change we made, which seems to have fixed this error, was to check the "Reserve all guest memory (All locked)" checkbox.

This checkbox to Reserve all guest memory seemed intimidating at first, because the concern was that the VM could reserve all memory on the host. That is NOT what this setting does!!! What it does is allow the VM to reserve all of its memory up front, but just the VM memory that is specified (e.g. 24G). If the VM has HugePages enabled, it makes sense that one would want the entire allotment of VM memory to be reserved up front and be contiguous. When we enabled this, our vCenter alerts disappeared.

Lastly, we decided to change DRS to Manual in VM Overrides. To find this setting amongst the huge number of settings hidden in vCenter, you go to the Cluster (not the Host, not the VM, not the Datacenter) and the option for VM Overrides is there, and you have four options:

  • None
  • Manual
  • Partial
  • Full

The thinking here is that VMs with complex settings may not play well with vMotion. I will be doing more research on DRS for VNFs before considering setting this (back) to Partial or Full.

Monday, April 26, 2021

Tenancy is Critical on a Cloud Platform

With this new VMware platform, it was ultimately decided to go with ESXi hypervisors, managed by vCenter, and NSX-T.

During the POC, it was pointed out that this combination of solutions had some improvements and enhancements over OpenStack (DRS, vMotion, et al). But one thing seemed to be overlooked, and we pointed it out: Tenancy.

VMware attempts to address Tenancy with Vertical Stack point solutions, like vCloud Director (positioned at Service Providers), or vRealize Automation. The latter is going through a complete transformation in its latest version. These solutions are also expensive. And if you don't have the budget, what are your options?

One option is to set up Resource Pools and Folders in vCenter. This is not the cleanest solution, because you cannot set policies, workflows, etc.

What else can you do? Well, you can use a Cloud Management solution.

We had Cloudify as an Orchestrator, and we evaluated it as a Cloud Management solution. But what we found in the end was that Cloudify excelled at complex orchestration, but it was not designed and built, from the ground up, to be a Cloud Management Platform.

This (lack of) Tenancy seemed to become apparent to everyone all at once, once the platform came up on VMware. And with Cloudify, we lacked the Blueprint development to do the scores to hundreds of tasks that we needed. It needed integrations with NSX-T, vCenter, and a host of other solutions.

We looked at a couple of other solutions, and settled on a solution called Morpheus.

I will blog a bit more about Morpheus in upcoming posts. I have been very hands-on with it lately. 
