Wednesday, June 26, 2024

Rocky Generic Cloud Image - Image Prep, Cloud-Init and VMware Tools

 

The process I have been using up to now has been to download the generic cloud images from the various Linux distro sites (CentOS, now Rocky). These images are pre-baked for clouds, meaning that they're smaller, more efficient, and they generally have cloud packages (e.g. cloud-init) already installed on them.

It is easier (and more efficient), in my thinking, to use one of these images than to try to take an ISO and build an image "from scratch".

The problem, though, is that "cloud images" are generally built for the public clouds: AWS, Azure, GCP, et al. If you are running your own private cloud on VMware, you will run into problems using these cloud images.

Today, I am having issues with the Rocky 9.5 generic cloud image.

I am downloading the qcow2, using qemu-img convert to convert the qcow2 to a vmdk, then running ovftool using a templatized template.vmx file. Everything works fine until I load the image into our CMP, which initializes VMs with cloud-init: the VM boots up fine, but cloud-init never runs, so you cannot log into the VM.

Here is the template.vmx.parameterized file I am using. I use sed to replace the parameters, and the file is then renamed to template.vmx before running ovftool on it.

.encoding = "UTF-8"
config.version = "8"
virtualHW.version = "11"
vmci0.present = "TRUE"
floppy0.present = "FALSE"
svga.vramSize = "16777216"
tools.upgrade.policy = "manual"
sched.cpu.units = "mhz"
sched.cpu.affinity = "all"
scsi0.virtualDev = "lsilogic"
scsi0.present = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.fileName = "PARM_VMDK"
sched.scsi0:0.shares = "normal"
sched.scsi0:0.throughputCap = "off"
scsi0:0.present = "TRUE"
ide0:0.present ="true"
ide0:0.startConnected = "TRUE"
ide0:0.fileName = "/opt/images/nfvcloud/imagegen/rocky9/cloudinit.iso"
ide0:0.deviceType = "cdrom-image"
displayName = "PARM_DISPLAYNAME"
guestOS = "PARM_GUESTOS"
vcpu.hotadd = "TRUE"
mem.hotadd = "TRUE"
bios.hddOrder = "scsi0:0"
bios.bootOrder = "cdrom,hdd"
sched.cpu.latencySensitivity = "normal"
svga.present = "TRUE"
RemoteDisplay.vnc.enabled = "FALSE"
RemoteDisplay.vnc.keymap = "us"
monitor.phys_bits_used = "42"
softPowerOff = "TRUE"
sched.cpu.min = "0"
sched.cpu.shares = "normal"
sched.mem.shares = "normal"
sched.mem.minsize = "1024"
memsize = "PARM_MEMSIZE"
migrate.encryptionMode = "opportunistic"
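
For reference, the whole pipeline looks roughly like the sketch below. This is just an illustration with placeholder file names and parameter values (the streamOptimized subformat and the rhel9-64 guest OS value are my assumptions here, not something taken from the template above):

# Rough sketch of the qcow2 -> vmdk -> OVF pipeline described above.
# File names and parameter values are placeholders.
import subprocess

QCOW2 = "Rocky-9-GenericCloud-LVM.qcow2"             # placeholder name
VMDK = "Rocky-9-5-GenericCloud-LVM-disk1.vmdk"

# 1. Convert the qcow2 cloud image to a vmdk that ESXi will accept.
subprocess.run(
    ["qemu-img", "convert", "-f", "qcow2", "-O", "vmdk",
     "-o", "subformat=streamOptimized", QCOW2, VMDK],
    check=True,
)

# 2. Fill in the parameterized vmx (the sed step, done here in Python).
params = {
    "PARM_VMDK": VMDK,
    "PARM_DISPLAYNAME": "Rocky-9-5-GenericCloud-LVM",
    "PARM_GUESTOS": "rhel9-64",    # assumption - use whatever your platform expects
    "PARM_MEMSIZE": "4096",
}
with open("template.vmx.parameterized") as f:
    vmx = f.read()
for key, value in params.items():
    vmx = vmx.replace(key, value)
with open("template.vmx", "w") as f:
    f.write(vmx)

# 3. Package the vmx (plus the disk and the attached ISO) into an OVF.
subprocess.run(
    ["ovftool", "template.vmx", "Rocky-9-5-GenericCloud-LVM.ovf"],
    check=True,
)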

I have tried using cdrom,hdd and just hdd on the boot order. Neither makes a difference.

When I run the ovftool program, it generates the following files, which look correct.

Rocky-9-5-GenericCloud-LVM-disk1.vmdk
Rocky-9-5-GenericCloud-LVM-file1.iso
Rocky-9-5-GenericCloud-LVM.mf
Rocky-9-5-GenericCloud-LVM.ovf

I have inspected the ovf file. It does have references to both the vmdk and the iso file in it, as it should.

I ran a utility against the iso file, and it seems to look okay as well. The two directories, user_data and meta_data, appear to be on there as they should be.

$ isoinfo  -i Rocky-9-5-GenericCloud-LVM-file1.iso -l

Directory listing of /
d---------   0    0    0            2048 Dec 18 2024 [     28 02]  .
d---------   0    0    0            2048 Dec 18 2024 [     28 02]  ..
d---------   0    0    0            2048 Dec 18 2024 [     30 02]  META_DAT
d---------   0    0    0            2048 Dec 18 2024 [     29 02]  USER_DAT

Directory listing of /META_DAT/
d---------   0    0    0            2048 Dec 18 2024 [     30 02]  .
d---------   0    0    0            2048 Dec 18 2024 [     28 02]  ..

Directory listing of /USER_DAT/
d---------   0    0    0            2048 Dec 18 2024 [     29 02]  .
d---------   0    0    0            2048 Dec 18 2024 [     28 02]  ..
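
For comparison, when the cloud-init documentation builds a NoCloud seed ISO, user-data and meta-data are plain files at the root of the ISO (not directories), and the ISO gets the volume label cidata. Here is a sketch of that, with placeholder contents - not what I actually feed our CMP:

# Sketch of a NoCloud seed ISO per the cloud-init docs: files named
# user-data and meta-data at the ISO root, volume label "cidata".
# The contents below are placeholders.
import subprocess

META_DATA = """\
instance-id: rocky9-test-001
local-hostname: rocky9-test
"""

USER_DATA = """\
#cloud-config
password: changeme        # placeholder only
chpasswd:
  expire: false
ssh_pwauth: true
"""

with open("meta-data", "w") as f:
    f.write(META_DATA)
with open("user-data", "w") as f:
    f.write(USER_DATA)

subprocess.run(
    ["genisoimage", "-output", "cloudinit.iso", "-volid", "cidata",
     "-joliet", "-rock", "user-data", "meta-data"],
    check=True,
)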

This Rocky generic cloud image does NOT have VMware Tools (the open-vm-tools package) installed on it - I checked into that. But you shouldn't need VMware Tools for cloud-init to initialize properly.

I am perplexed as to why cloud-init won't run properly, and I am about to drop-kick this image and consider alternative ways of generating an image for this platform. I don't understand why these images work fine on public clouds, but not on VMware.

I may need to abandon this generic cloud image altogether and use another process. I am going to examine the Packer process documented here:

https://docs.rockylinux.org/guides/automation/templates-automation-packer-vsphere/

 

Thursday, June 20, 2024

New AI Book Arrived - Machine Learning for Algorithmic Trading

This thing is like 900 pages long.

You want to take a deep breath and make sure you're committed before you even open it.

I did check the Table of Contents and scrolled quickly through, and I see it's definitely a hands-on applied technology book using the Python programming language.

I will be blogging more about it when I get going.

 




Tuesday, June 4, 2024

What Makes an AI Chip?

I haven't been able to understand why the original chip pioneers, like Intel and AMD, have not been able to pivot in order to compete with NVidia (Stock Symbol: NVDA).

I know a few things, like the fact that when gaming became popular, NVidia made the graphics chips that had graphics acceleration and such. Graphics rendering tends to draw polygons, and drawing polygons is geometric and trigonometric - which requires floating point arithmetic (non-integer mathematics). Floating point is difficult for a CPU to do, so much so that classical CPUs either offloaded these computations (to a math coprocessor) or employed other tricks to handle them.

Now, these graphics chips are all the "rage" for AI. And Nvidia stock has gone through the roof while Intel and AMD have been left behind.

So what does an AI chip have, that is different from an older CPU?

  • Graphics processing units (GPUs) - used mainly for training AI models
  • Field-programmable gate arrays (FPGAs) - used mainly for inference
  • Application-specific integrated circuits (ASICs) - used in various capacities of AI

CPUs use all three of these in some form or another, but an AI chip has all three in a highly optimized and accelerated design - things like prediction (such as branch prediction), parallelism, etc. They're simply better at running "algorithms".

This link, by the way, from NVidia, discusses the distinction between Training and Inference:
https://blogs.nvidia.com/blog/difference-deep-learning-training-inference-ai/

The CPU makers were so bent on running Microsoft for so long, and on emulating continuous revisions of instructions to run Windows (286-->386-->486-->Pentium--> and on and on), that they just never went back and "rearchitected" or came up with new chip architectures. They sat back and collected money, along with Microsoft, giving you incremental versions of the same thing - for YEARS.

When you are doing training for an AI model, and you are running algorithmic loops millions upon millions of times, the efficiency and time start to add up - and make a huge difference in $$$ (MONEY). 

So the CPU companies, in order to "catch up" with NVidia, I think, would need to come up with a whole bunch of chip design software. Then there are the software kits necessary to develop for the chips. You also have the foundry (which uses manufacturing equipment, much of it custom to the design), etc. Meanwhile, NVidia has its rocket off the ground, with decreasing G forces (so to speak), which accelerates its orbit. It is easy to see why an increasing gap would occur.

But - when you have everyone (China, Russia, Intel, AMD, ARM, et al) all racing to catch up, they will at some point, catch up. I think. When NVidia slows down. We shall see.

Tuesday, April 16, 2024

What is an Application Binary Interface (ABI)?

After someone mentioned Alma Linux to me, it seemed similar to Rocky Linux, and I wondered why there would be two Linux distros doing the same thing (picking up from CentOS and remaining RHEL compatible).

I read that "Rocky Linux is a 1-to-1 binary to RHEL while AlmaLinux is Application Binary Interface-compatible with RHEL".

Wow. Now, not only did I learn about a new Linux distro, but I also have to run down what an Application Binary Interface, or ABI, is.

Referring to this Stack Overflow post: https://stackoverflow.com/questions/2171177/what-is-an-application-binary-interface-abi, I liked this "oversimplified summary":

API: "Here are all the functions you may call."

ABI: "This is how to call a function."

Friday, March 1, 2024

I thought MacOS was based on Linux - and apparently I was wrong!

I came across this link, which discusses some things I found interesting to learn:

  • Linux is a Monolithic Kernel - I thought that because you could load and unload kernel modules, the Linux kernel had morphed into more of a Microkernel architecture. But apparently not?
  • The macOS kernel is officially known as XNU, which stands for "XNU is Not Unix." According to Apple's GitHub page: "XNU is a hybrid kernel combining the Mach kernel developed at Carnegie Mellon University with components from FreeBSD and C++ API for writing drivers."

Very interesting. I stand corrected now on MacOS being based on Linux.

Neural Network Architecture - Sizing and Dimensioning the Network

In my last blog post, I posed the question of how many hidden layers should be in a neural network, and how many hidden neurons should be in each hidden layer. This is related to the Neural Network Design, or Neural Network Architecture.

Well, I found the answer, I think, in the book entitled An Introduction to Neural Networks for Java, authored by Jeff Heaton. I noticed, incidentally, that Jeff was doing AI and writing about it as early as 2008 - fifteen years prior to the current AI firestorm we see today - and possibly before that, using languages like Java and C# (C Sharp), and the Encog framework (which I am unfamiliar with).

In this book, in Chapter 5 (where Table 5.1 appears), Jeff states (quoted):

"Problems that require two hidden layers are rarely encountered. However, neural networks with two hidden layers can represent functions with any kind of shape. There is currently no theoretical reason to use neural networks with any more than two hidden layers. In fact, for many practical problems, there is no reason to use any more than one hidden layer. Table 5.1 summarizes the capabilities of neural network architectures with various hidden layers." 

Jeff then follows the table with rules of thumb for the number of hidden neurons...

"There are many rule-of-thumb methods for determining the correct number of neurons to use in the hidden layers, such as the following:

  • The number of hidden neurons should be between the size of the input layer and the size of the output layer.
  • The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
  • The number of hidden neurons should be less than twice the size of the input layer."

Simple - and useful! Now, this is obviously a general rule of thumb, a starting point.
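
Just to make those rules of thumb concrete, here is a tiny sketch of my own (not from the book) that computes all three for a given input/output size:

# Heaton's three rules of thumb for hidden-neuron counts, as rough
# starting points only (n_in = input neurons, n_out = output neurons).
def hidden_neuron_estimates(n_in: int, n_out: int) -> dict:
    return {
        "between input and output size": (min(n_in, n_out), max(n_in, n_out)),
        "2/3 of input size plus output size": round(2 * n_in / 3) + n_out,
        "less than twice the input size": f"fewer than {2 * n_in}",
    }

# Example: an MNIST-style network with 784 inputs and 10 outputs
# suggests somewhere between 10 and 784, around 533, and fewer than 1568.
print(hidden_neuron_estimates(784, 10))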

There is a Goldilocks aspect to choosing the right size for a Neural Network. If the number of neurons is too small, you get higher bias and underfitting. If you choose too many, you get the opposite problem of overfitting - not to mention the issue of wasting precious and expensive computational cycles on floating point processors (GPUs).

In fact, the process of calibrating a Neural Network leads to the concept of Pruning, where you examine which Neurons affect the total output, and prune out those whose contribution does not make a significant difference to the end result.
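
As an aside - this is my own illustration, not from either book - one very simple version of pruning is magnitude-based pruning, where the weights with the smallest absolute values are assumed to contribute the least and are zeroed out:

# Minimal sketch of magnitude-based pruning on a toy weight matrix:
# zero out the 50% of weights with the smallest absolute values.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 8))                  # toy weight matrix
threshold = np.quantile(np.abs(weights), 0.5)      # median absolute weight
mask = np.abs(weights) >= threshold                # keep only the larger half
pruned = weights * mask

print(f"kept {mask.sum()} of {mask.size} weights")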

AI - Neural Networks and Deep Learning - Nielsen - Chap 5 - Vanishing and Exploding Gradient

When training a Neural Net, it is important to have what is referred to as a Key Performance Indicator - a KPI. This is an objective, often numerical, way of "scoring" the aggregate output so that you can actually tell that the model is learning - that it is trained - and that the act of training the model is improving the output. This seems almost innate, but it is important to always step back and keep this in mind.

Chapter 5 discusses the effort that goes into training a Neural Net, but from the perspective of efficiency. How well is the Neural Net actually learning as you run through a specified number of Epochs, with whatever batch sizes you choose, etc.?

In this chapter, Michael Nielsen discusses the Vanishing Gradient. He graphs the "speed of learning" on each Hidden Layer, and it is super interesting to notice that these Hidden Layers do not learn at the same rate! 

In fact, the Hidden Layer closest to the Output always outperforms the preceding Hidden Layer in terms of speed of learning.
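
A back-of-the-envelope way to see why - my own illustration, not Nielsen's code - is that with sigmoid activations, backpropagation multiplies in a sigmoid derivative for every layer the gradient passes through, and that derivative is at most 0.25. Ignoring the weight factors, the gradient reaching the earlier layers shrinks geometrically:

# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) peaks at 0.25 (at z = 0).
# Each additional layer multiplies in another factor of at most 0.25,
# so the layers furthest from the output see the smallest gradients.
MAX_SIGMOID_DERIVATIVE = 0.25

for layers_back in range(1, 6):
    bound = MAX_SIGMOID_DERIVATIVE ** layers_back
    print(f"{layers_back} layer(s) back from the output: upper bound {bound:.6f}")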

So after reading this, the next questions in my mind - ones that I don't believe Michael Nielsen addresses head-on in his book - are:

  • how many Hidden Layers does one need?
  • how many Neurons are needed in a Hidden Layer?

I will go back and re-scan, but I don't think there are any Rules of Thumb or general guidance tossed out in this regard - in either book I have covered thus far. I believe that in the examples chosen in the books, the decisions about how to size (dimension) the Neural Network are more or less arbitrary.

So my next line of inquiry and research will be on the topic of how to "design" a Neural Network, at least from the outset, with respect to the sizing and dimensions.  That might well be my next post on this topic.

SLAs using Zabbix in a VMware Environment

 Zabbix 7 introduced some better support for SLAs. It also had better support for VMware. VMware, of course now owned by Broadcom, has prio...