
Wednesday, September 18, 2024

Fixing Clustering and Disk Issues on an N+1 Morpheus CMP Cluster

I had performed an upgrade on Morpheus that I thought was fairly successful. I had some issues doing the upgrade because CentOS 7 had been designated EOL and its repositories were archived, but I worked through that, and it seemed everyone was using the system just fine.

Today, however, someone contacted me to tell me that they had provisioned a virtual machine, but it was stuck in an incomplete "Provisioning" state (a state that has a blue icon with a rocketship in it). The VM was provisioned on vCenter and working, but the state in Morpheus never flipped to "Finalized".

I couldn't figure this out, so I went to the Morpheus help site, where I discovered that I myself had logged a ticket on this very issue quite a while back. In that earlier case, the state never flipped because the clustering wasn't working properly.

So I checked RabbitMQ. It looked fine.

I then checked MySQL (Percona XtraDB Cluster), suspecting that the clustering wasn't working properly there. In the process of restarting the VMs, one of them wouldn't start. It took a fair amount of advanced Percona troubleshooting to figure out that I needed to do a wsrep recovery before I could start the node and have it properly rejoin the cluster.
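For anyone hitting the same thing: which Percona/Galera node may bootstrap the cluster hinges on the grastate.dat state file. A minimal sketch of the inspect-and-mark step (the file contents below are a fabricated sample; on a real node the file lives at /var/lib/mysql/grastate.dat):

```shell
# Sample of the Galera state file consulted during crash recovery.
cat > /tmp/grastate-demo.dat <<'EOF'
# GALERA saved state
version: 2.1
uuid:    6b1e7a2e-0000-0000-0000-000000000000
seqno:   -1
safe_to_bootstrap: 0
EOF
# seqno -1 means an unclean shutdown. Recover the real position first
# (Percona ships a galera_recovery helper; mysqld_safe --wsrep-recover
# logs a "Recovered position" line), then mark the most-advanced node:
sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /tmp/grastate-demo.dat
grep safe_to_bootstrap /tmp/grastate-demo.dat
```

Only the node with the highest recovered seqno should be marked safe_to_bootstrap and started first; the others then rejoin normally.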

The NEXT problem was that Zabbix was screeching about these Morpheus VMs using too much disk space. It turned out that the /var file system was 100% full - because of Elasticsearch. Fortunately I had an oversized /home directory, and was able to rsync the elasticsearch directory over to /home and symlink it back into place.
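The move itself was nothing exotic - copy, remove, symlink. A sketch against a throwaway directory tree (the real paths were under /var/lib and /home; stop the service before doing this on a live host):

```shell
# Demo tree standing in for /; on a real host these would be absolute paths.
demo=/tmp/var-move-demo
rm -rf "$demo"
mkdir -p "$demo/var/lib/elasticsearch" "$demo/home"
echo index-data > "$demo/var/lib/elasticsearch/shard0"
# Copy preserving ownership and permissions (rsync -a is equivalent here),
# then replace the original directory with a symlink to the new location.
cp -a "$demo/var/lib/elasticsearch" "$demo/home/elasticsearch"
rm -rf "$demo/var/lib/elasticsearch"
ln -s "$demo/home/elasticsearch" "$demo/var/lib/elasticsearch"
cat "$demo/var/lib/elasticsearch/shard0"   # data still reachable via the old path
```

The symlink keeps every hard-coded /var path in the application working while the bytes actually live on the roomier file system.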

But this gets to the topic of system administration with respect to disks.

First let's start with some KEY commands you MUST know:

>df -Th 

This command (df = disk free) shows how much space is used, in human-readable format, along with the mountpoint and file system type. It tells you NOTHING about the physical disks though!

>lsblk -f

This command (list block devices) gives you the physical disk, the mountpoint, the UUID, and any labels. It is a device-oriented command and doesn't show space consumption.

>fdisk -l

I don't really like this command much because of its output formatting, but it does list disk partitions and related statistics.

Some other commands you can use are:

>sudo file -sL /dev/sda3

The -s flag enables reading of block or character special files, and -L follows symlinks.

>blkid /dev/sda3

A similar command to lsblk -f above; it prints each device's UUID, label, and file system type.

Monday, September 16, 2024

Recovering a Corrupted RPM Database

I got this scary error when trying to run an upgrade on a cloud management system.

Here is what caused it:

1. The OS was CentOS 7.

2. The repositories for CentOS 7 were removed because CentOS 7 reached End of Life (EOL).

The repos were moved to an archive; I covered how to update a CentOS 7 system from those archived repos in a previous post.

3. The upgrade was running Chef scripts that in turn were making yum update calls.

What effectively happened was that the RPM database got corrupted:

We were getting the error DB_RUNRECOVERY: Fatal error, run database recovery.

Sounds frightening. The RPM database is where all of the package information is stored on a Linux operating system. Without this database intact, you cannot really update or install anything. And there are numerous things that will invoke dnf, yum, or some other package manager, which triggers a check of this database's integrity.

As it turns out, a post I found saved the day. Apparently rebuilding the rpm database is simple.

From this link, to give credit where credit is due: rebuilding the rpm database

$ mv /var/lib/rpm/__db* /tmp/
$ rpm --rebuilddb
$ yum clean all

Friday, August 16, 2024

Pinephone Pro - Unboxing and Use Part II

I picked up the Pinephone Pro, which I had attached to a standard USB-C charger. It indeed was sitting at 100%. So it looks like the charging works okay.

The OS asked me for a pin code to unlock the screen. Yikes. I wasn't prompted to set up a pin code! 

I rebooted the phone to see if I could figure out what OS was on it from the boot messages. I figured out that the phone was running the Pinephone Manjaro OS. 

https://github.com/manjaro-pinephone/phosh/releases

Since the Manjaro OS has a default pincode, I attempted that pin code and got lucky - it hadn't been changed, and it worked. I (re)connected to WiFi, and noticed that the OS prompts for my WiFi password every single time and doesn't seem to remember it from before. Secure? Yes. Annoying? Also yes.

The form-factor issue I ran into using the Firefox browser seemed to be more related to Firefox than to the OS. The browser is sized past the phone's form factor, and you need to scroll left and right, which is a major hassle. The browser doesn't auto-size itself to the screen dimensions.

I played with the Terminal app, and noticed that the user when I launched the Terminal app was pico-xxxx (I don't remember what the suffix is). I tried to sudo to root, but didn't know what the password was for this user. 

Lastly, I played a video from YouTube, and the sound was very tinny. So the speaker on this phone is not high-end. I have not yet attempted to use headphones on this device.

Since the Linux-Mobile apps are so limited, many apps you typically run from a dedicated icon app/client on a mobile phone will need to be run from a browser.

I am not sure Manjaro is the "right" OS to use on this phone, or if the version of the OS running is current or stale. I ordered the Docking Hub and a Micro SD Card and when those arrive, maybe I will try flashing a new/different OS on this phone.

Friday, August 9, 2024

Pinephone Pro - Unboxing and First Use

I ordered a Linux Pinephone that just arrived.

In the United States, trying to get off of Google, Apple, and even Samsung is nigh on impossible. Carriers make a ton of money selling and promoting phones, and have locked Linux phones out of their stores and off of their networks because there's no money in them for anyone - no device sales for the carriers, and no data siphoning or default-browser deals for the OS vendors.

There are probably numerous videos that show the unboxing of a Pinephone, so I will skip that and just make some general comments on my first experience.

When I unboxed the phone, there was no charger included. I bought this phone used on eBay, and while it came in the box, I wasn't sure if they come standard with a charger or not. The phone uses USB-C as a charger, though, and I had plenty of these. The phone had some weight to it. The screen seemed quality, but the back cover looked like a cheap piece of plastic and I could feel something pushing against the back cover (battery? dip or kill switches?). As I don't yet have a SIM for it, I have not yet opened the back.

The phone did not boot up at first. I wasn't sure of the button sequences, so I downloaded the Pinephone User Guide to get going. I decided the phone probably needed to be charged and plugged it into my USB-C charger, and immediately I got a Linux boot sequence on the screen. Linux boot sequences are intimidating to just about anyone who isn't Linux-savvy.

When the boot sequence finished, the phone shut itself down again - presumably because it didn't have enough juice to boot and stay running. I left the phone on the charger, and returned to it 3-4 hours later.

When I came in and picked the phone up and powered it on, I got the boot sequence again and it booted up to the operating system. The OS was reasonably intuitive. I don't have a SIM in the phone yet, so I configured it for WiFi as a first step. Then I tried to set the clock, and I added my city but it is using UTC as the default. Next I went looking to see what apps were installed. It took me a few minutes to realize that the "Discover" app is the app for finding, updating and installing applications.  The first time I tried to run Discover, it crashed. When I re-launched it, it showed me some apps and I tried to update a couple of them, and got a repository error. I finally was able to update Firefox, though. Then I launched Firefox. 

Right away with Firefox, I had issues with screen real estate and positioning. The browser didn't fit on the screen, and I didn't see a way to shrink it down to fit properly. After closing the 2nd tab I had opened, I was able to use my finger to "grab" the browser and pull it around, but clearly the browser window's fit - and the lack of a gyroscope to re-orient the browser when the phone is turned sideways - are going to make this browser a bit of a hassle, unless I can solve this.

I want to test out the sound quality. That's next.


Wednesday, June 26, 2024

Rocky Generic Cloud Image - Image Prep, Cloud-Init and VMware Tools

 

The process I have been using up to now has been to download the generic cloud images from various Linux distro sites (CentOS, now Rocky). These images are pre-baked for clouds, meaning they're smaller, more efficient, and generally have cloud packages (e.g. cloud-init) installed on them.

It is easier (and more efficient) to use one of these images, in my thinking, than to try and take an ISO and build an image "from scratch".

The problem, though, is that "cloud images" are generally public cloud images: AWS, Azure, GKE, et al.  If you are running your own private cloud on VMware, you will run into problems using these cloud images.

Today, I am having issues with the Rocky 9.5 generic cloud image.

I am downloading the qcow2, using qemu-img convert to convert the qcow2 to a vmdk, then running ovftool with a templatized template.vmx file. Everything works fine, but when I load the image into our CMP (which initializes VMs with cloud-init), the VM boots up fine, yet cloud-init never runs - so you cannot log into the VM.

Here is the template.vmx.parameterized file I am using. I use sed to replace the parameters, then the file is renamed template.vmx before running ovftool on it.

.encoding = "UTF-8"
config.version = "8"
virtualHW.version = "11"
vmci0.present = "TRUE"
floppy0.present = "FALSE"
svga.vramSize = "16777216"
tools.upgrade.policy = "manual"
sched.cpu.units = "mhz"
sched.cpu.affinity = "all"
scsi0.virtualDev = "lsilogic"
scsi0.present = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.fileName = "PARM_VMDK"
sched.scsi0:0.shares = "normal"
sched.scsi0:0.throughputCap = "off"
scsi0:0.present = "TRUE"
ide0:0.present ="true"
ide0:0.startConnected = "TRUE"
ide0:0.fileName = "/opt/images/nfvcloud/imagegen/rocky9/cloudinit.iso"
ide0:0.deviceType = "cdrom-image"
displayName = "PARM_DISPLAYNAME"
guestOS = "PARM_GUESTOS"
vcpu.hotadd = "TRUE"
mem.hotadd = "TRUE"
bios.hddOrder = "scsi0:0"
bios.bootOrder = "cdrom,hdd"
sched.cpu.latencySensitivity = "normal"
svga.present = "TRUE"
RemoteDisplay.vnc.enabled = "FALSE"
RemoteDisplay.vnc.keymap = "us"
monitor.phys_bits_used = "42"
softPowerOff = "TRUE"
sched.cpu.min = "0"
sched.cpu.shares = "normal"
sched.mem.shares = "normal"
sched.mem.minsize = "1024"
memsize = "PARM_MEMSIZE"
migrate.encryptionMode = "opportunistic"

I have tried using cdrom,hdd and just hdd on the boot order. Neither makes a difference.

When I run the ovftool program, it generates the following files, which look correct.

Rocky-9-5-GenericCloud-LVM-disk1.vmdk
Rocky-9-5-GenericCloud-LVM-file1.iso
Rocky-9-5-GenericCloud-LVM.mf
Rocky-9-5-GenericCloud-LVM.ovf

I have inspected the OVF file. It does reference both the vmdk and the iso file, as it should.

I ran a utility against the iso file, and it seems to look okay as well. The two directories user_data and meta_data seem to be on there (their names truncated to ISO9660 8.3 form), although the listing shows both of them as empty.

$ isoinfo  -i Rocky-9-5-GenericCloud-LVM-file1.iso -l

Directory listing of /
d---------   0    0    0            2048 Dec 18 2024 [     28 02]  .
d---------   0    0    0            2048 Dec 18 2024 [     28 02]  ..
d---------   0    0    0            2048 Dec 18 2024 [     30 02]  META_DAT
d---------   0    0    0            2048 Dec 18 2024 [     29 02]  USER_DAT

Directory listing of /META_DAT/
d---------   0    0    0            2048 Dec 18 2024 [     30 02]  .
d---------   0    0    0            2048 Dec 18 2024 [     28 02]  ..

Directory listing of /USER_DAT/
d---------   0    0    0            2048 Dec 18 2024 [     29 02]  .
d---------   0    0    0            2048 Dec 18 2024 [     28 02]  ..
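One thing worth knowing about cloud-init's NoCloud datasource (and worth checking against the listing above): it expects plain *files* named user-data and meta-data at the ISO root, on a volume labelled cidata. A sketch of building such a seed ISO - the instance-id and hostname values below are illustrative:

```shell
# Stage the two NoCloud files (names are part of the spec; values are samples).
seed=/tmp/seed-demo
rm -rf "$seed"; mkdir -p "$seed"
cat > "$seed/meta-data" <<'EOF'
instance-id: rocky9-001
local-hostname: rocky9
EOF
cat > "$seed/user-data" <<'EOF'
#cloud-config
ssh_pwauth: true
EOF
# Build the ISO (requires genisoimage/mkisofs); the "cidata" volume label
# is what the NoCloud datasource looks for.
command -v genisoimage >/dev/null &&
  genisoimage -output "$seed/cloudinit.iso" -volid cidata -joliet -rock \
    "$seed/user-data" "$seed/meta-data" ||
  echo "genisoimage not installed; skipping ISO build"
```

If the seed ISO carries directories instead of files, or the wrong volume label, cloud-init will boot without ever finding its configuration.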

This Rocky generic cloud image does NOT have VMware Tools (the open-vm-tools package) installed - I checked into that. But you shouldn't need VMware Tools for cloud-init to initialize properly.

I am perplexed as to why cloud-init won't load properly, and I am about to drop kick this image and consider alternative ways of generating an image for this platform. I don't understand why these images work fine on public clouds, but not VMware. 

I may need to abandon this generic cloud image altogether and use another process. I am going to examine this Packer process. 

https://docs.rockylinux.org/guides/automation/templates-automation-packer-vsphere/

 

Tuesday, April 16, 2024

What is an Application Binary Interface (ABI)?

After someone mentioned Alma Linux to me, it seemed similar to Rocky Linux, and I wondered why there would be two Linux distros doing the same thing (picking up from CentOS and remaining RHEL compatible).

I read that "Rocky Linux is a 1-to-1 binary to RHEL while AlmaLinux is Application Binary Interface-compatible with RHEL".

Wow. Now, not only did I learn about a new Linux distro, but I also have to run down what an Application Binary Interface, or ABI is.

Referring to this, Stack Exchange post: https://stackoverflow.com/questions/2171177/what-is-an-application-binary-interface-abi, I liked this "oversimplified summary":

API: "Here are all the functions you may call."

ABI: "This is how to call a function."

Friday, March 1, 2024

I thought MacOS was based on Linux - and apparently I was wrong!

I came across this link, which discusses some things I found interesting to learn:

  • Linux is a Monolithic Kernel - I thought because you could load and unload kernel modules, that the Linux kernel had morphed into more of a Microkernel architecture because of this. But apparently not?
  • The macOS kernel is officially known as XNU, which stands for "X is Not Unix."
According to Apple's GitHub page:

"XNU is a hybrid kernel combining the Mach kernel developed at Carnegie Mellon University with components from FreeBSD and C++ API for writing drivers."

Very interesting. I stand corrected on macOS being based on Linux.

Thursday, February 8, 2024

Linux Phones Are Mature - But US Carriers Won't Allow Them

Today I looked into the status of some of the Linux phones, which are mature now.

Librem is one of the ones most people have heard about, but the price point on it is out of reach for most people daring enough to jump in the pool and start swimming with a Linux phone.

Pinephone looks like it has a pretty darn nice Linux phone now, but after watching a few reviews, it is pretty clear that you need to go with the Pinephone Pro, and put a fast(er) Linux OS on it. 

The main issue with performance on these phones, has to do with the graphics rendering. If you are running the Gnome Desktop for example, the GUI is going to take up most of the cycles and resources that you want for your applications. I learned this on regular Linux running on desktop servers years ago, and got into the habit of installing a more lightweight KDE desktop to try and get some of my resources back under my control.

Today, I found a German phone that apparently is really gaining in popularity in Europe - especially Germany. It is called Volla Phone.  Super nice phone, and they have done some work selecting the hardware components and optimizing the Linux distro for you, so that you don't have to spend hours and hours tweaking, configuring, and putting different OS images on the phone to squeeze performance out of it.

Volla Phone - Linux Privacy Phone

 

Problem is - United States carriers don't allow these phones! They are not on the "Compatibility List". Now, I understand there might be an FCC cost to certifying devices on a cellular network (I have not verified this). The frequencies matter of course, but the SIM cards also matter. Volla Phone will, for instance, apparently work on T-Mobile, but only if you have an older SIM card. If you are on T-Mobile and have a new SIM card, then it won't work because of some fields that aren't exchanged (if I understand correctly).

Carriers that are in bed with Google and Apple, such as AT&T and Verizon, are going to do everything they can to prevent a Linux BYOD (Bring Your Own Device) phone from hitting their network. They make too much $$$$$$$$$$$$ off of Apple and Android. T-Mobile is German-owned, of course, so maybe they have a bit more of the European mindset. Those are your three nationwide networks in the United States, and all of the mom-and-pop cellular plays (i.e. Spectrum Mobile, Cricket, et al) are just MVNOs riding on that infrastructure.

So if you have one of these Linux phones, you can use it in your home. On WiFi. But if you carry it outdoors, it's a brick apparently. Here we are in 2024, and that STILL seems to be the case.

Friday, August 18, 2023

The Linux XFS File System - How Resilient Is It?

We are using VMware datastores over NFS version 3.x. The storage was routed, which is never a good thing to do - because let's face it, if your VMs all lose their storage simultaneously, that constitutes a disaster. Having a dependency on a router, which can lose its routing prefixes due to a maintenance or configuration problem, is architecturally deficient (a polite way of putting it). To solve this, you need to make sure there are no routing hops (keep the storage on the same segment as the hypervisor's storage interface).

So, after our storage routers went AWOL due to a maintenance event, I noticed some VMs came back and appeared to be fine. They had rebooted and were at a login prompt.  Other VMs, however, did not come back, and had some nasty things printing on the console (you could not log into these VMs).


What we noticed was that any Linux virtual machine running the XFS file system on boot or root (/boot or /) had this issue of being unrecoverable. VMs using ext3 or ext4 seemed able to recover and start running their services - although some were still echoing messages to the console.

There is a lesson here: the file system matters when it comes to resiliency in a virtualized environment.

I did some searching around for discussions on file system types, and of course there are many. This one in particular, I found interesting:  ext4-vs-xfs-vs-btrfs-vs-zfs-for-nas


Wednesday, April 19, 2023

Colorizing Text in Linux

I went hunting today for a package I had used to colorize text. There are tons of those out there, of course. But what if you want to filter the text and colorize it based on a set of rules?

There's probably a lot of stuff out there for that, too. Colord, for example, runs as a daemon in Linux (though it is really for managing device color profiles rather than terminal text).

Another package, is grc, found at this GitHub site: https://github.com/garabik/grc

Use Case: 

I had a log that was printing information related to exchanges with different servers. I decided to color these so that messages from Server A were green, Server B blue, etc. In this way, I could do really cool things like suppress a server's messages entirely (no colorization), or take Control Plane messages from, say, Server C, and highlight those yellow.

This came in very handy during a Demo, where people were watching the messages display in rapid succession on a large screen.
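grc drives this kind of rule-based coloring with per-command configuration files; the same effect can be sketched with plain sed and ANSI escape codes (the server names below are illustrative):

```shell
# Green for ServerA lines, yellow for ServerC lines, ServerB left uncolored.
esc=$(printf '\033')
printf 'ServerA: hello\nServerB: noise\nServerC: control-plane msg\n' |
  sed -e "s/^ServerA.*/${esc}[32m&${esc}[0m/" \
      -e "s/^ServerC.*/${esc}[33m&${esc}[0m/"
```

Piping a tail -f of a live log through the same sed rules works nicely for demos.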

Friday, October 28, 2022

Moving a LVM file system to a new disk in Linux

I had to dive back into Linux disk partitioning, file systems, and volumes when I got an alert from Zabbix that a cluster of 3 VMs was running out of space. As the alert said disk usage was greater than 88 percent, I grew concerned and took a look.

In the labs, we had 3 x CentOS7 Virtual Machines, each deployed with a 200G VMDK file.  But inside the VM, in the Linux OS, there were logical volumes (centos-root, centos-swap, centos-home) that were mounted as XFS file systems on a 30G partition. There was no separate volume for /var (centos-var). And /var was the main culprit of the disk space usage. 

The decision was made to put /var on a separate disk as a good practice, because the var file system was used to store large virtual machine images.

The following steps were taken to move the /var file system to the new disk:

1. Add new Disk in vCenter to VM - create new VMDK file (100G in this particular case)

2. If the disk is seen, a /dev/sdb will be present in the Linux OS of the virtual machine. We need to create a partition on it (/dev/sdb1).
 
# fdisk /dev/sdb

n is the option to create a new partition, then p to select primary; fdisk then asks a few questions that don't matter for this case (partition number, first and last sector) - just accept the defaults.
This creates a Linux primary partition. You will then need to use the t command to change the partition type to 8e (Linux LVM).
Finally, w writes everything to the disk and exits fdisk.
# fdisk -l /dev/sdb

Will return something like this:

Device Boot Start End Sectors Size Id Type
/dev/sdb1 2048 20971519 20969472 10G 8e Linux LVM

3. Initialize the partition as an LVM physical volume
# pvcreate /dev/sdb1

NOTE: to delete a device from a physical volume, use vgreduce first, then pvremove!
vgreduce centos /dev/sdb1
pvremove /dev/sdb1

4. display volume group
# vgdisplay

--- Volume group ---
VG Name centos
[... more detail …]

5. display physical volumes in volume group
 
pvdisplay -C --separator '  |  ' -o pv_name,vg_name

6. Extend the volume group so it can contain the new disk (partition)

# vgextend centos /dev/sdb1

You will get info like this:
VG Size 29.75 GiB
PE Size 4.00 MiB
Total PE 7617
Alloc PE / Size 5058/ 19.75 GiB
Free PE / Size 2559 / 10 GiB

7. Create new logical volume

NOTE: this command can be tricky. You either need to know about extents and the semantics, or you can keep it simple, such as:
# lvcreate -n var -l 100%FREE centos

8. Create file system - NOTE that XFS is the preferred type, not ext4!
# mkfs -t xfs /dev/centos/var

9. Mount the new target var directory as newvar
# mkdir /mnt/newvar
# mount /dev/centos/var /mnt/newvar

10. Copy the files

NOTE: Lots of issues can occur during this, depending on what directory you are copying (i.e. var is notorious because of run and lock dirs).

I found this command to work:
# cp -apxv /var/* /mnt/newvar

Another one people seem to like is rsync, but when I attempted the command below, it hung:
# rsync -avHPSAX /var/ /mnt/newvar/

11. You can do a diff (or try to) to sanity-check the copy:
# diff -r /var /mnt/newvar/

12. Update fstab for reboot
/dev/mapper/centos-var /var xfs defaults 0 0

Note that we used the logical volume centos-var here, not centos (the volume group). LVM calls the volumes centos-swap, centos-home, etc.

13. Move the old /var aside on the root file system
# mv /var /varprev

14. Create a new empty /var mount point and mount it (the fstab entry from step 12 makes this work)
# mkdir /var
# mount /var

15. Use the df command to list all the mounts
# df -h | grep /dev/

16. Decide whether you want to remove the old var file system and reclaim that disk space.

NOTE: Do not do this until you’re damned sure the new one is working fine. I recommend rebooting the system, inspecting all services that need to be running, etc.  

Now, the only thing left to consider is that after moving /var to a new 100G VMDK disk, we still have a 200G boot/swap/root disk that is using only a small fraction of its space. Well, shrinking disks is even MORE daunting, and is not the topic of this post. But if I decide to reclaim some space, expect another post that documents how I tackled that effort (or attempted to).

For now, no more alerts about running out of space on a root file system is good news, and this VM can now run peacefully for quite a while.

Friday, March 4, 2022

ESXi is NOT Linux

ESXi is not built upon the Linux kernel; it uses its own proprietary VMware kernel (the VMkernel) and software, and it lacks most of the applications and components commonly found in Linux distributions.

Because ESXi uses *nix-style commands (Unix, Linux, POSIX), it "looks and smells" like Linux. But in fact, these commands are provided much the way the Cygwin package provides a Linux-like terminal and command interpreter on Windows. ESXi does not use Cygwin, however; it runs something called BusyBox.

BusyBox is used on a lot of small-form-factor home networking gear. pfSense, for example, runs Berkeley Unix (BSD). But many small routers (Ubiquiti EdgeMax comes to mind) use different chipsets and different OS kernels, and then use BusyBox to abstract the kernel away from users by providing a common interface - meaning users don't need to learn a whole slew of new OS commands.

 ESXi has a LOT of things that Linux does NOT have:

1. File systems - VMFS6, for example, is the newest revision of VMFS.

2. Process Scheduler - and algorithms

3. Kernel hooks that tools like esxtop use (think system activity reporting in Unix and Linux) 

 

This article (the source for this post), discusses some nice facts in comparing ESXi to Linux:

ESXi-is-not-based-on-Linux

I learned some interesting things from this article, such as:

ESXi even uses the same binary format for executables (ELF) than Linux does, so it is really not a big surprise anymore that you can run some Linux binaries in an ESXi shell - provided that they are statically linked or only use libraries that are also available in ESXi! (I exploited this "feature" when describing how to run HP's hpacucli tool in ESXi and when building the ProFTPD package for ESXi).

...You cannot use binary Linux driver modules in ESXi. Lots of Linux device drivers can be adapted to ESXi though by modifying their source code and compiling them specifically for ESXi. That means that the VMkernel of ESXi implements a sub-set of the Linux kernel's driver interfaces, but also extends and adapts them to its own hypervisor-specific needs.

In my opinion this was another very clever move of the VMware ESXi architects and developers, because it makes it relatively easy to port an already existing Linux driver of a hardware device to ESXi. So the partners that produce such devices do not need to develop ESXi drivers from scratch. And it also enables non-commercial community developers to write device drivers for devices that are not supported by ESXi out-of-the-box!

There is a PDF download of the ESXi architecture, which can be downloaded here:

 https://www.vmware.com/techpapers/2007/architecture-of-vmware-esxi-1009.html

Thursday, August 13, 2020

NIC Teaming vs Active-Active NIC Bonding - differences - which is better?

I just went down a path of discovery trying to fully understand the differences between Bonding and NIC Teaming.

Bonding, of course, has a concept of "bonding modes" that allow you to use NICs together for a failover purpose in active-standby mode, and even active-active failover. When using these modes, however, the focus is not gluing the NICs together to achieve linear increases in bandwidth (i.e. 10G + 10G = 20G). To get into true link aggregation, you need to use different bonding modes that are specifically for that purpose. I will include a link that discusses in detail the bonding modes in Linux:

Linux Bonding Modes 

So what is the difference between using Bonding Mode 4 (LACP link aggregation) or Bonding Mode 6 (adaptive load balancing), and NIC Teaming?

I found a great link that covers the performance differences between the two.

https://www.redhat.com/en/blog/if-you-bonding-you-will-love-teaming

At the end of the day, it comes down to the drivers and how well they're written of course. But for Red Hat 7 at least, we can see the following.

The performance is essentially "six of one, half a dozen of the other" on RHEL 7, on a smaller machine with a 10G fiber interface. But if you look carefully, while NIC Teaming gives you gains on smaller packet sizes, as packet sizes get large (64KB or higher), Bonding starts to give you some gains.

I'll include the link and the screen snapshot. Keep in mind, I did not run this benchmark myself; I am citing the external source found at the link above.
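For reference, here is roughly what a mode 4 (802.3ad/LACP) bond looks like on RHEL 7 using ifcfg files - a sketch only; the device names (bond0, em1) and option values are illustrative, and the switch ports must also be configured for LACP:

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0  (sketch)
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=802.3ad miimon=100 lacp_rate=fast"
BOOTPROTO=none
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-em1  (repeat for each slave NIC)
DEVICE=em1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
```

Teaming uses the same two-file pattern but with TYPE=Team and a TEAM_CONFIG JSON blob instead of BONDING_OPTS.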

Monday, May 11, 2020

DPDK Hands-On - Part II - Poll Mode Drivers and IOMMU


In our last post Hands On with DPDK - Part I we chose a box to try and install DPDK on.

This box was a circa-2015 Dell T1700. A bit long in the tooth (it is now 2020), and it is not, and never was, a data-center-grade server.

Looking forward, this will bite us - but it will also help us learn a LOT about DPDK, thanks to the persistence and troubleshooting required.

So - to get started, I did something rather unconventional. Rather than read all of the documentation (there is a LOT of documentation), I took a cursory look at the dpdk.org site (Getting Started), and then went looking for a couple of blogs where someone else tried to get DPDK working with OVS.

Poll Mode Drivers

Using DPDK requires using a special type of network interface card driver known as a poll mode driver. This means that the driver has to be available (custom compiled and installed with rpm, or perhaps pre-compiled and installed with package managers like yum). 

Poll mode drivers continuously poll for packets, as opposed to using the classic interrupt-driven approach of the standard vendor drivers. Using interrupts to process packets is considered less efficient than polling for them. But polling for packets continuously is CPU-intensive, so there is a trade-off!

There are two poll mode drivers listed on the dpdk.org website:
https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html

  1. UIO (legacy)
    1. uio_pci_generic
    2. igb_uio
  2. VFIO (current recommended driver)

The DPDK website has this to say about the two driver families (UIO and VFIO). 

"VFIO is the new or next-gen poll mode driver, that is a more robust and secure driver in comparison to the UIO driver, relying on IOMMU protection". 

So perhaps it makes sense to discuss IOMMU, as it will need to be disabled for UIO drivers, and enabled for VFIO drivers.

IOMMU

Covering IOMMU would be a blog series in its own right. So I will simply list the Wikipedia site on IOMMU.  Wikipedia IOMMU Link

What does IOMMU have to do with DPDK? The DPDK documentation has this to say in its up-front prerequisites.

"An input-output memory management unit (IOMMU) is required for safely driving DMA-capable hardware from userspace and because of that it is a prerequisite for using VFIO. Not all systems have one though, so you’ll need to check that the hardware supports it and that it is enabled in the BIOS settings (VT-d or Virtualization Technology for Directed I/O on Intel systems)"
 

So there you have it. It took getting down to the poll mode drivers, but IOMMU provides memory protection for the newer-generation VFIO drivers. Without this protection, one rogue NIC could affect the memory of all the NICs, or jeopardize the memory of the system in general.

So - how do you enable IOMMU?

Well, first you need to make sure your system even supports IOMMU.

To do this, you can do one of two things (suggested: do both) - Linux system assumed here.
  1. Check that the directory /sys/class/iommu exists and is populated
  2. Type (as root) dmesg | grep IOMMU
On #2, you should see something like this:
[    0.000000] DMAR: IOMMU enabled
[    0.049734] DMAR-IR: IOAPIC id 8 under DRHD base  0xfbffc000 IOMMU 0
[    0.049735] DMAR-IR: IOAPIC id 9 under DRHD base  0xfbffc000 IOMMU 0
Now in addition to this, you will need to edit your kernel command line so that two IOMMU directives can be passed in:  iommu=pt intel_iommu=on
 
The typical way these directives are added is using the grub2 utility.

NOTE: Many people forget that once they add the parameters, they need to run grub2-mkconfig to regenerate grub.cfg and actually apply these parameters!!!
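As a concrete sketch (assuming a CentOS/RHEL-style /etc/default/grub and BIOS boot - adjust the grub.cfg path for EFI systems), the edit plus regeneration looks like this. Here we operate on a temporary copy of the file so the sed expression can be seen in action:

```shell
# Work on a temporary copy of /etc/default/grub for illustration.
grub_file=$(mktemp)
cat > "$grub_file" <<'EOF'
GRUB_TIMEOUT=5
GRUB_CMDLINE_LINUX="ro rhgb quiet"
EOF

# Append the IOMMU directives inside the existing quotes.
sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/GRUB_CMDLINE_LINUX="\1 iommu=pt intel_iommu=on"/' "$grub_file"
grep '^GRUB_CMDLINE_LINUX' "$grub_file"

# On the real file, as root, you would then regenerate grub.cfg and reboot:
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```

On the real system the same sed (or a manual edit) would be applied to /etc/default/grub itself before running grub2-mkconfig.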

After adding these kernel parameters, you can check your kernel command line by running the following command:

# cat /proc/cmdline

And you should see your iommu parameters showing up:

BOOT_IMAGE=/vmlinuz-3.10.0-1127.el7.x86_64 root=UUID=4102ab69-f71a-4dd0-a14e-8695aa230a0d ro rhgb quiet iommu=pt intel_iommu=on

Next Step: Part III - Huge Pages

Saturday, May 9, 2020

DPDK Hands-On - Part I - Getting Started


I decided to try and enable DPDK on my computer.

This computer is a Dell T1700 Precision, circa 2015, which is a very very nice little development workstation server.

The VERY FIRST thing anyone needs to do, with DPDK, is ensure that their server has supported NICs. It all starts with the NIC cards. You cannot do DPDK without DPDK-compatible NICs.

There is a link at the DPDK website, which shows the list of NICs that are (or should be, as it always comes down to the level of testing, right?) compatible with DPDK.
That website is: DPDK Supported NICs

This T1700 has an onboard NIC, and two ancillary NIC cards that ARE listed as DPDK-compatible. These NICs show up as:
82571EB/82571GB Gigabit Ethernet Controller and are part of the Intel e1000e family of NICs.
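A quick way to check your own hardware is to grep the PCI device list for Ethernet controllers and match them against the DPDK supported-NICs list. A sketch - the sample lspci output below is abridged and stored in a variable for illustration; on a live box you would just run lspci directly:

```shell
# Abridged sample of `lspci` output (device names will vary per machine):
pci_list='00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-LM
01:00.0 Ethernet controller: Intel Corporation 82571EB/82571GB Gigabit Ethernet Controller (rev 06)
02:00.0 Ethernet controller: Intel Corporation 82571EB/82571GB Gigabit Ethernet Controller (rev 06)'

# Count the DPDK-compatible 82571 controllers:
printf '%s\n' "$pci_list" | grep -c '82571'
```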

I was excited that I could use this server without having to invest in and install new NIC cards!

Let's first start, with specs on the computer. First, our CPU specifications.

CPU:
# lscpu
Architecture:         x86_64
CPU op-mode(s):       32-bit, 64-bit
Byte Order:           Little Endian
CPU(s):               4
On-line CPU(s) list:  0-3
Thread(s) per core:   1
Core(s) per socket:   4
Socket(s):            1
NUMA node(s):         1
Vendor ID:            GenuineIntel
CPU family:           6
Model:                60
Model name:           Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz
Stepping:             3
CPU MHz:              1183.471
CPU max MHz:          3900.0000
CPU min MHz:          800.0000
BogoMIPS:             6983.91
Virtualization:       VT-x
L1d cache:            32K
L1i cache:            32K
L2 cache:             256K
L3 cache:             6144K
NUMA node0 CPU(s):    0-3
Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d

Let's take a look at the NUMA capabilities on this box. The lscpu output above says we have one NUMA node. There is a utility called numactl on Linux, and we will run it with the "-H" option to get more information.

# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 16019 MB
node 0 free: 7554 MB
node distances:
node   0
  0:  10

From this, we see we have 1 NUMA node. NUMA nodes equate to CPU sockets, and since we have one CPU socket, we have one NUMA node. All 4 cores of the CPU are on this node (node 0, per above). Having just one NUMA node is not an optimal scenario for DPDK testing, but as long as we are NUMA-capable, we can proceed. 
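Related sketch: on multi-socket boxes, DPDK performance depends on keeping NICs and worker cores on the same NUMA node. Each PCI device reports its node via sysfs (standard Linux paths assumed; a value of -1 means the device reports no NUMA locality):

```shell
# Print the NUMA node of every PCI device (NICs included).
for dev in /sys/bus/pci/devices/*; do
    if [ -f "$dev/numa_node" ]; then
        echo "$dev -> $(cat "$dev/numa_node")"
    fi
done
```

On this single-socket T1700 every device would report node 0 (or -1), so there is nothing to tune; on a dual-socket box you would pin DPDK lcores to the node each NIC reports.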

Next, we will look at Memory.


Memory:
# lsmem --summary
Memory block size:      128M
Total online memory:     16G
Total offline memory:     0B

16G memory. Should be more than enough for this exercise.

So how to get started?

Obviously the right way would be to sit and read reams of documentation from both DPDK and OpenVSwitch. But, what fun is that? Booooring. I am one of those people who likes to hit the ground running - and run my head into the wall.

So, I did some searching, and found a couple of engineers who had scripts that enabled DPDK. I decided to study these, pick them apart, and use them as a basis to get started. I saw a lot of stuff in these scripts that had me googling - IOMMU, HugePages, CPU masking, PCI, Poll Mode Drivers, etc.

In order to fully comprehend what was needed to enable DPDK, I would have to familiarize myself with these concepts. Then, hopefully, I could tweak this script, or even write new scripts, and get DPDK working on my box. That's the strategy.

I did realize, as time went on, that the scripts were essentially referring back to the DPDK and OpenVSwitch websites, albeit at different points in time as the content on these sites changes release by release.

Saturday, April 25, 2020

Configuring Persistent Bans with Fail2Ban


Someone gave me a network to put a Virtual Machine on, and I thought that network was a NAT. It wasn't. I was extremely lucky the VM did not get hacked. I immediately shut down the public facing interface, and installed FirewallD, allowing only key authentication through ssh.

That is NOT enough. In examining logs, this VM was getting pounded on all day, every day.

So, I took an extra measure of installing Fail2Ban. Initially, I configured a 24 hour jail time. But after seeing the same IPs come after the VM time and time again, I decided to reconfigure for a permanent ban.

To configure a permanent ban, I used -1 for the ban time (which in the old days was specified in seconds, but newer versions also accept the "365d", "52w", and "1y" formats).
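For reference, a sketch of what the jail.local entries might look like (values here are illustrative, not my exact production config):

```ini
[DEFAULT]
# Default ban across all jails
bantime   = 52w

[sshd]
enabled   = true
# -1 = permanent ban
bantime   = -1
banaction = iptables-multiport
```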

Now from there, things get more interesting. Wanting to get this configured quickly, I took the measures explained in this blog post for configuring Persistent Bans on Fail2Ban.

Configuring Persistent Bans with Fail2Ban

First, let's discuss what he assumes. He assumes that you are configuring your jail to use the iptables-multiport actions. Indeed, I have read (in another blog) that using the iptables-multiport actions might be a bit safer than using the firewallcmd-multiport rules, even though you might be running FirewallD!

So that is exactly what I did. My jail.local file has a default ban of 52w. My ssh-specific rules use a -1 value on ban time (permanent ban), and use the iptables-multiport action rules.

I backed up this iptables-multiport file, and added a line on "action start" to loop through all of the hosts (ip addresses) in the /etc/fail2ban/persistent.bans file, and block them (refer to blog link above for specific rule). Then, on action ban, a simple print statement will echo the action of a permanent ban to a log file, so that we can see incrementally, who is banned.

Now later, I did check out the firewallcmd-multiport file, which would essentially attempt the same things that iptables-multiport does, except with firewall-cmd statements instead.

To do that, I would do the same thing. I would back up the firewallcmd-multiport file, and make the following changes.

1. The action to ban an IP is: firewall-cmd --direct --add-rule <family> filter f2b-<name> 0 -s <ip> -j <blocktype>

So I would take this, and add in the actionstart section, a loop rule that looks like this:
cat /etc/fail2ban/persistent.bans | awk '/^fail2ban-<name>/ {print $2}' | while read IP; do \
firewall-cmd --direct --add-rule <family> filter f2b-<name> 0 -s $IP -j <blocktype>; done

2. Then, I would add in the actionban section, the same print statement that resides in the iptables-multiport.conf file, so that as new bans are added, they get logged:

echo "fail2ban-<name>  <ip>" >> /etc/fail2ban/persistent.bans

Of course, fail2ban needs to be restarted for these changes to kick in, and this needs to be verified before you walk away after the change!

The only thing that has me wondering now is that as the list of banned IPs grows, your rules will grow, and this could have performance impacts on packet processing. But protecting your box is imperative, and should be the first priority! If your list grows too long, you could periodically release some prisoners from jail, I suppose, and see if they behave - or perhaps have moved on to better things.

Thursday, March 5, 2020

Mounting a Linux Volume over a File System - the bind mount trick

Logged into a VM today trying to help troubleshoot issues. There was nothing in /var/log! No Syslog!

It turns out that Linux will indeed let you mount on top of pretty much any directory, because after all, a directory is just a mount point as far as Linux is concerned.

But what happens to the files in the original directory? I used to think they were lost. They're not. They're there, but shielded. They can be recovered, with a neat trick called a bind mount!

All described here! Learn something new every day.

A snippet of dialog from the link below:
https://unix.stackexchange.com/questions/198542/what-happens-when-you-mount-over-an-existing-folder-with-contents

Q. Right now /tmp has some temporary files in it. When I mount my hard drive (/dev/sdc1) on top of /tmp, I can see the files on the hard drive. What happens to the actual content of /tmp when my hard drive is mounted? 

 A. Pretty much nothing. They're just hidden from view, not reachable via normal filesystem traversal.

Q. Is it possible to perform r/w operations on the actual content of /tmp while the hard drive is mounted?

A. Yes. Processes that had open file handles inside your "original" /tmp will continue to be able to use them. You can also make them "reappear" somewhere else by bind-mounting / elsewhere.

# mount -o bind / /somewhere/else
# ls /somewhere/else/tmp  

 

Friday, November 15, 2019

How LibVirt Networking Works - Under the Hood

This is the best link on this topic that I have found.

Lots of great pictures. Pictures are worth a thousand words.

https://www.redhat.com/en/blog/introduction-virtio-networking-and-vhost-net

OpenContrail - Part 1

When I came to this shop and found out that they were running OpenStack but were not running Neutron, I about panicked. Especially when I found out they were running OpenContrail.

OpenContrail uses BGP and XMPP as its control plane protocols for route advertisements/exchanges. And it uses MPLS over GRE/UDP to direct packets. The documentation says it CAN use VXLAN - which Neutron also seems to favor (over GRE tunneling). But here at least, it is being run the way the designers of OpenContrail intended it to run - which is as an MPLS L3VPN.

I am going to drop some links in here real quick and come back and flesh this blog entry out.

Here is an Architectural Guide on OpenContrail. Make sure you have time to digest this.

https://www.juniper.net/us/en/local/pdf/whitepapers/2000535-en.pdf

Once you read the architecture, here is a Gitbook on OpenContrail that can be used to get more familiarity.

https://sureshkvl.gitbooks.io/opencontrail-beginners-tutorial/content/

Perhaps the real stash of gold was a 2013 video from one of the developers of vRouter itself. It turns out most of the stuff in this video is still relevant to OpenContrail several years later. I could not find the slides anywhere, so I made my own slide deck that highlights the important discussions that took place in this video, as well as some of the key concepts shown.

https://www.youtube.com/watch?v=xhn7AYvv2Yg

If you read these, you are halfway there. Maybe more than halfway actually.

High Packet Loss in the Tx of TAP Interfaces



I was seeing some bond interfaces that had high dropped counts, but these were all Rx drops.

I noticed that the tap interfaces on OpenStack compute hosts - which were hooked to OpenContrail's vRouter - had drops on the Tx.

So, in trying to understand why we would be dropping packets on Tap interfaces, I did some poking around and found this link.

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/ovs-dpdk_end_to_end_troubleshooting_guide/high_packet_loss_in_the_tx_queue_of_the_instance_s_tap_interface

From this article, an excerpt:
"TX drops occur because of interference between the instance’s vCPU and other processes on the hypervisor. The TX queue of the tap interface is a buffer that can store packets for a short while in case that the instance cannot pick up the packets. This would happen if the instance’s CPU is prevented from running (or freezes) for a long enough time."

The article goes on and elaborates on diagnosis, and how to fix by adjusting the Tx Queue Length.
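To see whether a given tap interface is experiencing Tx drops, you can parse the output of `ip -s link show`. A sketch - the interface name and statistics below are hypothetical sample output, stored in a variable so the awk extraction is visible; on a live host you would pipe `ip -s link show tap0` directly:

```shell
# Sample `ip -s link show tap0` statistics block (hypothetical numbers).
stats='8: tap0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast
RX:  bytes packets errors dropped overrun mcast
     99999 1234    0      0       0       0
TX:  bytes packets errors dropped carrier collsns
     88888 5678    0      42      0       0'

# The line after the "TX:" header holds the counters; "dropped" is column 4.
tx_dropped=$(printf '%s\n' "$stats" | awk '/^TX:/ {getline; print $4; exit}')
echo "tap0 TX drops: $tx_dropped"
```

If that count keeps climbing, the article's remedy is to raise the queue length, e.g. (as root, interface name assumed): ip link set dev tap0 txqueuelen 10000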
